
Efficiently computes Recall, Precision, and F1 scores for each keyword in a character column, or each dictionary in a list column, of a data.frame. The scores are added to the data.frame as the columns recall, precision, and F1.

Usage

get_many_RPFs(
  keyword_df,
  keyword_field = "words",
  model,
  text_df,
  reference,
  text_field = "text",
  replace_na = c("mean-sd", "min", 0, F)
)

Arguments

keyword_df

A data.frame, containing a column with a character vector of words, or with a list of dictionaries.

keyword_field

character. The name of the column in keyword_df that is either a character vector of single keywords, or a list of dictionaries, stored as separate character vectors with one element per word.

model

A fastText model, loaded by load_model.

text_df

A data.frame containing one annotated document per row.

reference

character. Name of the binary reference column in text_df.

text_field

character. Name of the column in text_df that contains the text of the documents. Default is "text".

replace_na

Specifies the value used to replace NAs in the DDR measurement. Default is 'mean-sd'. Can take values:

  • 'mean-sd' (character): replace NAs by the mean minus one standard deviation. Default.

  • 'min' (character): replace NAs by the minimum.

  • 0 (numerical): replace NAs by 0.

  • FALSE (logical): do not replace NAs.
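The four options can be illustrated with a minimal base R sketch. The helper replace_na_sketch() and the toy scores are hypothetical and are not part of dictvectoR; the sketch also assumes that the mean, standard deviation, and minimum are taken over the non-missing values, which this page does not state.

```r
# Hypothetical helper, not the package's implementation.
# Takes a single replace_na value, mirroring the four documented options.
replace_na_sketch <- function(x, replace_na = "mean-sd") {
  if (identical(replace_na, FALSE)) {
    return(x)  # FALSE: leave NAs untouched
  }
  fill <- if (identical(replace_na, "mean-sd")) {
    # Assumption: statistics are computed over the non-missing values
    mean(x, na.rm = TRUE) - stats::sd(x, na.rm = TRUE)
  } else if (identical(replace_na, "min")) {
    min(x, na.rm = TRUE)
  } else if (is.numeric(replace_na)) {
    replace_na  # e.g. 0
  } else {
    stop("unsupported replace_na value")
  }
  x[is.na(x)] <- fill
  x
}

# Toy scores, invented for illustration
scores <- c(0.2, NA, 0.4, 0.4, 0.6)

res_meansd <- replace_na_sketch(scores, "mean-sd")
res_min    <- replace_na_sketch(scores, "min")
res_zero   <- replace_na_sketch(scores, 0)
res_keep   <- replace_na_sketch(scores, FALSE)
```

With 'mean-sd', a document without a usable score is placed below the average rather than at the extreme low end, whereas 'min' and 0 push it to the bottom of the scale.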

Details

Recall, Precision, and F1 are computed for each element (i.e. word or dictionary) of the character vector or list in the data.frame. The scores indicate how well each word or dictionary predicts a binary coding when used in the DDR method. The continuous measure produced by DDR is passed to a logistic regression with the binary coding as the dependent variable. Binary predictions are calculated from this logistic model and compared with the binary coding. The F1 score is the harmonic mean of Recall and Precision (Chinchor, 1992).
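The evaluation steps described above can be sketched in base R as follows. This is an illustration under stated assumptions, not the package's code: the toy data are invented, the score column stands in for a DDR measurement, and the 0.5 cut-off for turning fitted probabilities into binary predictions is an assumption that this page does not confirm.

```r
# Toy data, invented for illustration: one continuous score per document
# (standing in for a DDR measurement) and a binary coding.
toy <- data.frame(
  score = c(0.10, 0.15, 0.20, 0.25, 0.30, 0.35,
            0.40, 0.45, 0.50, 0.55, 0.60, 0.65),
  label = c(0, 0, 0, 1, 0, 0, 1, 1, 0, 1, 1, 1)
)

# Logistic regression with the binary coding as dependent variable
fit <- glm(label ~ score, data = toy, family = binomial())

# Binary predictions; the 0.5 threshold is an assumption of this sketch
pred <- as.integer(predict(fit, type = "response") > 0.5)

# Confusion counts against the binary coding
tp <- sum(pred == 1 & toy$label == 1)
fp <- sum(pred == 1 & toy$label == 0)
fn <- sum(pred == 0 & toy$label == 1)

recall    <- tp / (tp + fn)
precision <- tp / (tp + fp)
# F1 is the harmonic mean of recall and precision
f1 <- 2 * precision * recall / (precision + recall)

data.frame(recall = recall, precision = precision, F1 = f1)
```

get_many_RPFs reports these three quantities once per keyword or dictionary, so each row of keyword_df corresponds to one such model fit.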

References

Chinchor, N. (1992). MUC-4 evaluation metrics. Proceedings of the 4th Conference on Message Understanding, 22–29. https://doi.org/10.3115/1072064.1072067

Examples

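library(dictvectoR)
library(magrittr)  # provides the %<>% assignment pipe used below

# Load the small demo fastText model shipped with the package
model <- fastrtext::load_model(system.file("extdata",
                                           "tw_demo_model_sml.bin",
                                           package = "dictvectoR"))

# Clean the text of the annotated example tweets
tw_annot %<>% clean_text(text_field = "full_text")

# One dictionary per row, stored in a list column
dict_df <- data.frame(id = 1:3)
dict_df$combis <- list(c("mehrheit deutschen", "merkel", "skandal"),
                       c("steuerzahler", "bundesregierung",
                         "komplett gescheitert"),
                       c("arbeitnehmer", "groko", "wahnsinn"))

# Score every dictionary against the binary reference column "pop"
get_many_RPFs(keyword_df = dict_df,
              keyword_field = "combis",
              model = model,
              text_df = tw_annot,
              reference = "pop")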
#> Joining, by = "rowid"
#>   id                                              combis     recall precision
#> 1  1                 mehrheit deutschen, merkel, skandal 0.08796296 0.5428571
#> 2  2 steuerzahler, bundesregierung, komplett gescheitert 0.16666667 0.6101695
#> 3  3                       arbeitnehmer, groko, wahnsinn 0.05092593 0.6111111
#>           F1
#> 1 0.15139442
#> 2 0.26181818
#> 3 0.09401709
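As a quick consistency check, the F1 column printed above can be recomputed from the recall and precision columns using the harmonic-mean definition. The values below are copied from the output; the tolerance only reflects the rounding of the printed numbers.

```r
# Values copied from the printed example output
recall     <- c(0.08796296, 0.16666667, 0.05092593)
precision  <- c(0.5428571, 0.6101695, 0.6111111)
f1_printed <- c(0.15139442, 0.26181818, 0.09401709)

# Harmonic mean of recall and precision
f1 <- 2 * precision * recall / (precision + recall)

# Compare with the printed F1 values, allowing for rounding
all.equal(f1, f1_printed, tolerance = 1e-5)
```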