Efficiently computes F1 scores for all elements of a vector containing keywords, or a list containing dictionaries, when used in the DDR (Distributed Dictionary Representation) method.
Usage
get_many_F1s(
  words,
  model,
  df,
  reference,
  text_field = "text",
  replace_na = c("mean-sd", "min", 0, F)
)
Arguments
- words
A character vector containing keywords, or a list of character vectors containing dictionaries.
- model
A fastText model, loaded by fastrtext::load_model().
- df
A data.frame containing one annotated document per row.
- reference
Name of the binary reference column in df (character).
- text_field
Name of the column in df that contains the text of the documents. Default is "text".
- replace_na
Specifies the value used to replace NAs in the DDR measurement. Default is 'mean-sd'. Can take values:
'mean-sd' (character): replace NAs by the mean minus one standard deviation. Default.
'min' (character): replace NAs by the minimum.
0 (numerical): replace NAs by 0.
FALSE (logical): do not replace NAs.
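The four replace_na options can be illustrated on a plain numeric vector. The following is a minimal base R sketch, not the package's internal code: the helper name replace_na_sketch and the example scores are hypothetical, and it assumes the mean, standard deviation and minimum are taken over the non-missing scores.

```r
# Hypothetical helper illustrating the four replace_na options on a
# numeric vector of DDR scores; not the package's internal implementation.
replace_na_sketch <- function(x, replace_na = "mean-sd") {
  if (identical(replace_na, FALSE)) {
    return(x)                                  # leave NAs untouched
  }
  if (identical(replace_na, "mean-sd")) {
    # assumption: statistics are computed over the non-missing scores
    fill <- mean(x, na.rm = TRUE) - sd(x, na.rm = TRUE)
  } else if (identical(replace_na, "min")) {
    fill <- min(x, na.rm = TRUE)
  } else if (identical(replace_na, 0)) {
    fill <- 0
  } else {
    stop("unsupported value for replace_na")
  }
  x[is.na(x)] <- fill
  x
}

scores <- c(0.2, 0.4, 0.6, 0.8, NA)  # made-up scores with one missing value
replace_na_sketch(scores, "mean-sd")
replace_na_sketch(scores, "min")
replace_na_sketch(scores, 0)
replace_na_sketch(scores, FALSE)
```

Replacing by mean minus one standard deviation or by the minimum treats a document without a score as weakly matching, whereas FALSE keeps the missing value visible to later steps.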
Details
A numerical F1 score is returned for each element (i.e. word or dictionary) of the vector or list. The F1 scores indicate how well these words or dictionaries predict a binary coding when used in the DDR method. The continuous measure produced by the DDR method is passed to a logistic regression, with the binary coding as the dependent variable. Binary predictions are calculated from this logistic model and compared with the binary coding. The F1 score is the harmonic mean of recall and precision (1).
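The evaluation steps described above can be sketched in base R. This is an illustration under stated assumptions, not the package's code: the data are simulated, the variable names are hypothetical, and the 0.5 probability cut-off is an assumption, since the description does not state which threshold is used.

```r
set.seed(1)
n <- 200
ddr_score <- rnorm(n)                          # simulated stand-in for DDR scores
coding <- rbinom(n, 1, plogis(2 * ddr_score))  # simulated binary coding

# Logistic regression with the binary coding as dependent variable
fit <- glm(coding ~ ddr_score, family = binomial())

# Binary predictions from the fitted probabilities (assumed cut-off: 0.5)
pred <- as.integer(fitted(fit) > 0.5)

# Confusion counts against the binary coding
tp <- sum(pred == 1 & coding == 1)
fp <- sum(pred == 1 & coding == 0)
fn <- sum(pred == 0 & coding == 1)

precision <- tp / (tp + fp)
recall <- tp / (tp + fn)

# F1 as the harmonic mean of precision and recall
f1 <- 2 * precision * recall / (precision + recall)
f1
```

Because it is a harmonic mean, F1 is pulled towards the lower of precision and recall, so a word or dictionary only scores well if it does reasonably on both.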
References
(1) Chinchor, N. (1992). MUC-4 evaluation metrics. Proceedings of the 4th Conference on Message Understanding, 22–29. https://doi.org/10.3115/1072064.1072067
Examples
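library(dictvectoR)
library(magrittr)  # provides the %<>% assignment pipe

# Load the small demo fastText model shipped with the package
model <- fastrtext::load_model(system.file("extdata",
                                           "tw_demo_model_sml.bin",
                                           package = "dictvectoR"))

# Clean the text in the full_text column of the annotated example data
tw_annot %<>% clean_text(text_field = "full_text")

# Three candidate dictionaries, stored as a list column
dict_df <- data.frame(id = 1:3)
dict_df$combis <- list(c("mehrheit deutschen", "merkel", "skandal"),
                       c("steuerzahler", "bundesregierung",
                         "komplett gescheitert"),
                       c("arbeitnehmer", "groko", "wahnsinn"))

# One F1 score per dictionary, evaluated against the binary column "pop"
dict_df$F1 <- get_many_F1s(dict_df$combis,
                           model = model,
                           df = tw_annot,
                           reference = "pop")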