Computes the cosine similarity between the average word vector representation of each document in a data frame and the average word vector representation of a dictionary, using a fasttext word vector model.
Usage
cossim2dict(
df,
dictionary,
model,
text_field = "text",
replace_na = c("mean-sd", "min", 0, F)
)
Arguments
- df
A dataframe containing one document per row.
- dictionary
A character vector containing the keywords of your dictionary.
- model
A fasttext model as loaded by load_model.
- text_field
Name of the column in df that contains the text of the documents. Default is "text".
- replace_na
Specifies the value used to replace NAs. Default is 'mean-sd'. Can take the values:
'mean-sd' (character): replace NAs by the mean minus one standard deviation. Default.
'min' (character): replace NAs by the minimum.
0 (numeric): replace NAs by 0.
FALSE (logical): do not replace NAs.
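As a rough illustration of the four replace_na options, the following base R sketch applies each rule to a toy vector of scores. It assumes that the NAs are missing similarity scores and that mean, standard deviation and minimum are taken over the non-missing scores. The helper fill_na_sketch is hypothetical and not part of the package, so the actual implementation may differ.

```r
# Hypothetical helper, NOT part of dictvectoR: mimics the documented options.
fill_na_sketch <- function(x, replace_na = "mean-sd") {
  if (isFALSE(replace_na)) {
    return(x)  # FALSE: leave NAs untouched
  }
  fill <- if (identical(replace_na, "mean-sd")) {
    # mean minus one standard deviation of the non-missing values
    mean(x, na.rm = TRUE) - sd(x, na.rm = TRUE)
  } else if (identical(replace_na, "min")) {
    min(x, na.rm = TRUE)
  } else if (identical(replace_na, 0)) {
    0
  } else {
    stop("unsupported replace_na value")
  }
  x[is.na(x)] <- fill
  x
}

scores <- c(0.2, 0.4, NA, 0.6)  # invented scores, third one missing

filled_mean_sd <- fill_na_sketch(scores, "mean-sd")
filled_min     <- fill_na_sketch(scores, "min")
filled_zero    <- fill_na_sketch(scores, 0)
untouched      <- fill_na_sketch(scores, FALSE)
```

With these toy scores the mean is 0.4 and the standard deviation is 0.2, so 'mean-sd' and 'min' both fill the gap with 0.2. On real data the two options will usually differ.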
Value
Numeric. Cosine similarity, ranging (theoretically) from -1 to +1, indicating the similarity between the average fasttext word vector of all words in dictionary and the average fasttext word vector of each document in df.
Details
Implements the method called 'Distributed Dictionary Representation' (DDR), introduced by Garten et al. (2018).
The average dictionary vector is calculated as the mean vector of all words in a dictionary, stored as a character vector. The document vectors are calculated as the mean vectors of all words per observation in the text column of a dataframe (named 'text' by default). One row in this dataframe represents one document. Both the average dictionary vector and the document vectors are L2 normalized. The function returns the cosine similarity to the dictionary vector for each document in the dataframe.
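The computation described above can be sketched in base R with a toy lookup table standing in for the fasttext model. This is a minimal sketch under stated assumptions, not the package's implementation: the embedding values are invented, the helpers l2_normalize and mean_vector are hypothetical, and words missing from the toy table are simply skipped (a real fasttext model can build vectors for unseen words from subword units).

```r
# Toy 3-dimensional "embeddings"; the values are invented for illustration.
emb <- rbind(
  skandal      = c(1, 0, 0),
  steuerzahler = c(0, 1, 0),
  wetter       = c(0, 0, 1)
)

# Scale a vector to unit (L2) length.
l2_normalize <- function(v) v / sqrt(sum(v^2))

# Mean vector of all words found in the toy table, L2 normalized.
mean_vector <- function(words, emb) {
  known <- words[words %in% rownames(emb)]
  if (length(known) == 0) {
    return(rep(NA_real_, ncol(emb)))  # no usable word: similarity will be NA
  }
  l2_normalize(colMeans(emb[known, , drop = FALSE]))
}

dict <- c("skandal", "steuerzahler")
docs <- c("skandal steuerzahler", "wetter", "skandal wetter")

dict_vec <- mean_vector(dict, emb)

# For unit-length vectors the cosine similarity reduces to the dot product.
ddr <- vapply(strsplit(docs, " ", fixed = TRUE), function(tokens) {
  sum(mean_vector(tokens, emb) * dict_vec)
}, numeric(1))
ddr
```

In this toy setup the three documents score approximately 1, 0 and 0.5: the first uses exactly the dictionary words, the second shares no direction with the dictionary vector, and the third shares half of it.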
References
Garten, J., Hoover, J., Johnson, K. M., Boghrati, R., Iskiwitch, C., & Dehghani, M. (2018). Dictionaries and distributions: Combining expert knowledge and large scale textual data content analysis. Behavior Research Methods, 50(1), 344–361. https://doi.org/10.3758/s13428-017-0875-9
Examples
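library(dictvectoR)
library(magrittr)  # provides the %<>% and %>% pipes used below

# Load the small demo fasttext model shipped with the package
model <- fastrtext::load_model(system.file("extdata",
  "tw_demo_model_sml.bin", package = "dictvectoR"))
# Keep the first 100 rows and clean the text column
tw_annot %<>% head(100) %>% clean_text(remove_stopwords = TRUE,
  text_field = "full_text")
dict <- c("skandal", "deutschland", "steuerzahler")
# Cosine similarity of each document to the dictionary
tw_annot$ddr <- cossim2dict(tw_annot, dict, model)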