Similarity of documents to a dictionary — cossim2dict • dictvectoR

Computes the cosine similarity between the average word vector representation of each document in a data frame and the average word vector representation of a dictionary, using a fastText word vector model.

Usage

cossim2dict(
  df,
  dictionary,
  model,
  text_field = "text",
  replace_na = c("mean-sd", "min", 0, F)
)

Arguments

df

A data frame containing one document per row.

dictionary

A character vector containing the keywords of your dictionary.

model

A fastText model, as loaded by fastrtext::load_model.

text_field

Name of the column in df that contains the text of the documents. Default is "text".

replace_na

Specifies the value used to replace NAs. Can take the following values:

'mean-sd' (character): replace NAs with the mean minus one standard deviation. Default.
'min' (character): replace NAs with the minimum.
0 (numeric): replace NAs with 0.
FALSE (logical): do not replace NAs.
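
The four options can be illustrated with a small base R sketch. The helper replace_na_scores below is hypothetical and not part of dictvectoR. It assumes that a single option is passed and that the mean, standard deviation, and minimum are computed over the non-missing similarity scores; the package's internal handling may differ.

```r
# Hypothetical helper, NOT part of dictvectoR: mirrors the documented options.
# Assumption: statistics are computed over the non-missing scores.
replace_na_scores <- function(x, replace_na = "mean-sd") {
  if (isFALSE(replace_na)) {
    return(x)  # FALSE: leave NAs untouched
  }
  fill <- if (identical(replace_na, "mean-sd")) {
    mean(x, na.rm = TRUE) - sd(x, na.rm = TRUE)
  } else if (identical(replace_na, "min")) {
    min(x, na.rm = TRUE)
  } else if (identical(replace_na, 0)) {
    0
  } else {
    stop("unsupported replace_na value")
  }
  x[is.na(x)] <- fill
  x
}

scores <- c(0.2, 0.4, NA, 0.6)   # made-up similarity scores
replace_na_scores(scores, "mean-sd")
replace_na_scores(scores, "min")
replace_na_scores(scores, 0)
replace_na_scores(scores, FALSE)
```

Filling with a low value such as the mean minus one standard deviation keeps unscorable documents in the data while ranking them below the typical document, whereas FALSE leaves the decision to the caller.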

Value

Numeric, one value per document in df. The cosine similarity, ranging (theoretically) from -1 to +1, between the average fastText word vector of all words in dictionary and the average fastText word vector of each document in df.

Details

Implements the method called 'Distributed Dictionary Representation' (DDR), introduced by Garten et al. (2018).

The average dictionary vector is calculated as the mean vector of all words in a dictionary, which is stored as a character vector. The document vectors are calculated as the mean vectors of all words per observation in the column of the data frame named by text_field ('text' by default). One row in this data frame represents one document. Both the average dictionary vector and the document vectors are L2-normalized. The function returns the cosine similarity to the dictionary vector for each document in the data frame.
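
The calculation described above can be sketched in base R with toy data. Everything in this block is illustrative: the matrix word_vecs holds made-up 3-dimensional vectors rather than fastText output, the documents are given as pre-tokenized word lists, and the helpers l2_normalize and avg_vector are hypothetical names, not functions of dictvectoR.

```r
# Toy 3-dimensional word vectors (made-up values, for illustration only)
word_vecs <- rbind(
  skandal      = c(0.9, 0.1, 0.0),
  steuerzahler = c(0.7, 0.3, 0.1),
  wetter       = c(0.0, 0.2, 0.9),
  sonne        = c(0.1, 0.1, 0.8)
)

# Scale a vector to unit (L2) length
l2_normalize <- function(v) v / sqrt(sum(v^2))

# Mean vector of the given words, L2-normalized
avg_vector <- function(words) {
  l2_normalize(colMeans(word_vecs[words, , drop = FALSE]))
}

# Average dictionary vector
dict_vec <- avg_vector(c("skandal", "steuerzahler"))

# Two toy documents, already split into words
docs <- list(
  c("skandal", "wetter"),
  c("wetter", "sonne")
)

# For unit-length vectors, the cosine similarity equals the dot product
ddr <- vapply(docs, function(words) sum(avg_vector(words) * dict_vec),
              numeric(1))
ddr
```

In this toy setup the first document shares a word with the dictionary and therefore scores higher than the second, which is the behaviour the DDR score is meant to capture.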

References

Garten, J., Hoover, J., Johnson, K. M., Boghrati, R., Iskiwitch, C., & Dehghani, M. (2018). Dictionaries and distributions: Combining expert knowledge and large scale textual data content analysis. Behavior Research Methods, 50(1), 344–361. https://doi.org/10.3758/s13428-017-0875-9

Examples

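library(dictvectoR)
library(magrittr)  # provides the %>% and %<>% pipe operators

# Load the small demo fastText model shipped with the package
model <- fastrtext::load_model(system.file("extdata", "tw_demo_model_sml.bin",
                                           package = "dictvectoR"))
# Keep the first 100 annotated tweets and clean their text
tw_annot %<>% head(100) %>% clean_text(remove_stopwords = TRUE,
                                       text_field = "full_text")
dict <- c("skandal", "deutschland", "steuerzahler")
# Score each document against the dictionary
tw_annot$ddr <- cossim2dict(tw_annot, dict, model)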