
Efficiently computes an F1 score for each element of a character vector of keywords, or of a list of dictionaries, when used in the DDR method.

Usage

get_many_F1s(
  words,
  model,
  df,
  reference,
  text_field = "text",
  replace_na = c("mean-sd", "min", 0, F)
)

Arguments

words

A character vector containing keywords, or a list of character vectors containing dictionaries.

model

A fastText model, loaded by fastrtext::load_model().

df

A data.frame containing one annotated document per row.

reference

Name of the binary reference column in df (character).

text_field

Name of the column in df that contains the text of the documents. Default is "text".

replace_na

Specifies the value used to replace NAs in the DDR measurement. Default is 'mean-sd'. Accepted values:

  • 'mean-sd' (character): replace NAs with the mean minus one standard deviation. Default.

  • 'min' (character): replace NAs with the minimum.

  • 0 (numeric): replace NAs with 0.

  • FALSE (logical): do not replace NAs.
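The four replace_na options can be illustrated with a minimal base R sketch. The helper replace_na_sketch() is hypothetical and not part of dictvectoR. It assumes that the mean, standard deviation, and minimum are computed over the non-missing values of the measurement, which the documentation above does not state explicitly.

```r
# Hypothetical helper, NOT part of dictvectoR: mimics the documented
# replace_na options on a plain numeric vector.
# Assumption: summary statistics are taken over the non-missing values.
replace_na_sketch <- function(x, replace_na = "mean-sd") {
  if (isFALSE(replace_na)) {
    return(x)                                    # FALSE: leave NAs untouched
  }
  fill <- if (identical(replace_na, "mean-sd")) {
    mean(x, na.rm = TRUE) - sd(x, na.rm = TRUE)  # mean minus one sd
  } else if (identical(replace_na, "min")) {
    min(x, na.rm = TRUE)                         # smallest observed value
  } else {
    as.numeric(replace_na)                       # a fixed number, e.g. 0
  }
  x[is.na(x)] <- fill
  x
}

scores <- c(0.2, 0.4, NA, 0.6)
replace_na_sketch(scores, "mean-sd")  # NA is filled with 0.4 - 0.2
replace_na_sketch(scores, "min")      # NA is filled with 0.2
replace_na_sketch(scores, 0)          # NA is filled with 0
replace_na_sketch(scores, FALSE)      # NA is kept
```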

Details

A numeric F1 score is returned for each element (i.e. each word or dictionary) of the vector or list. The F1 score indicates how well that word or dictionary predicts a binary coding when used in the DDR method. The continuous measure produced by the DDR method is passed to a logistic regression as the predictor, with the binary coding as the dependent variable. Binary predictions are derived from this logistic model and compared with the binary coding. The F1 score is the harmonic mean of precision and recall (1).
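The scoring steps described here can be sketched in base R for a single continuous measure. This is only an illustration under stated assumptions, not the package's implementation: the toy vectors ddr_score and coding are invented, and the 0.5 probability cut-off used to obtain binary predictions is an assumption, since the threshold is not documented above.

```r
# Toy data (invented for illustration): a continuous DDR-style score and
# a binary human coding for ten documents.
ddr_score <- c(0.10, 0.20, 0.30, 0.35, 0.40, 0.45, 0.50, 0.60, 0.70, 0.80)
coding    <- c(0,    0,    0,    1,    0,    1,    0,    1,    1,    1)

# Logistic regression with the binary coding as dependent variable
fit <- glm(coding ~ ddr_score, family = binomial())

# Binary predictions; the 0.5 cut-off is an assumption
pred <- as.integer(fitted(fit) > 0.5)

# Confusion-matrix counts for the positive class
tp <- sum(pred == 1 & coding == 1)  # true positives
fp <- sum(pred == 1 & coding == 0)  # false positives
fn <- sum(pred == 0 & coding == 1)  # false negatives

precision <- tp / (tp + fp)
recall    <- tp / (tp + fn)

# F1: harmonic mean of precision and recall
f1 <- 2 * precision * recall / (precision + recall)
f1
```

get_many_F1s() returns one such score per keyword or dictionary, so that candidates can be compared on the same annotated data.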

References

(1) Chinchor, N. (1992). MUC-4 evaluation metrics. Proceedings of the 4th Conference on Message Understanding, 22–29. https://doi.org/10.3115/1072064.1072067

Examples

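# Load the small demo fastText model shipped with the package
model <- fastrtext::load_model(system.file("extdata",
                                           "tw_demo_model_sml.bin",
                                           package = "dictvectoR"))
# Clean the text of the annotated demo data
# (explicit assignment, equivalent to the magrittr pipe `%<>%`)
tw_annot <- clean_text(tw_annot, text_field = "full_text")
# Three candidate dictionaries, stored as a list column
dict_df <- data.frame(id = 1:3)
dict_df$combis <- list(c("mehrheit deutschen", "merkel", "skandal"),
                       c("steuerzahler", "bundesregierung",
                         "komplett gescheitert"),
                       c("arbeitnehmer", "groko", "wahnsinn"))
# Compute one F1 score per dictionary against the binary column "pop"
dict_df$F1 <- get_many_F1s(dict_df$combis,
                           model = model,
                           df = tw_annot,
                           reference = "pop")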