Compares the word-vector representations of words with the representations of the texts in an annotated data.frame. The aim is to detect words that distinctively characterize a concept.

Usage

find_distinctive(
  df,
  concept_field,
  text_field = "text",
  word_df,
  word_field = "words",
  model
)

Arguments

df

A data.frame containing one annotated document per row.

concept_field

character. Name of the column in df that contains a binary (hand-coded) indicator of the presence or absence of a concept.

text_field

character. Name of the column in df that contains the text of the documents. Default is "text".

word_df

A data.frame containing a column with words or multi-word expressions.

word_field

character. Name of the column in word_df that contains the words. Default is "words".

model

A fastText model, loaded by fastrtext::load_model().

Details

Takes an annotated data.frame df of texts as input. The variable specified by concept_field in df indicates the presence or absence of a theoretical concept in each text. Two average word-vector representations are computed using a fastText model: one for all texts that contain the concept, and one for all texts that do not. A second data.frame, word_df, contains one word or multi-word expression per row in the column named by word_field. Three new columns are added to word_df. The first, ending with '_possim', gives the cosine similarity between the word and the positive concept corpus. The second, ending with '_negsim', gives the similarity between the word and the remaining corpus. The third, ending with '_distinctive', is the difference between the two (possim minus negsim, as in the example output), so positive values mark words that are closer to the texts containing the concept.
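The arithmetic behind the three columns can be sketched in base R. This is an illustration only, not the package's implementation: the three-dimensional vectors are invented stand-ins for fastText output, cosine_sim() is a hypothetical helper, and the "pop" prefix is borrowed from the example further down.

```r
# Invented toy vectors standing in for fastText representations (assumption).
pos_vec <- c(0.9, 0.1, 0.3)  # average representation of texts coded as containing the concept
neg_vec <- c(0.2, 0.8, 0.4)  # average representation of the remaining texts

# One row per candidate word; the values are made up for illustration.
word_vecs <- rbind(
  word_a = c(0.8, 0.2, 0.3),
  word_b = c(0.1, 0.9, 0.5)
)

# Hypothetical helper: cosine similarity between two numeric vectors.
cosine_sim <- function(a, b) {
  sum(a * b) / (sqrt(sum(a^2)) * sqrt(sum(b^2)))
}

# Similarity of every word to each of the two corpus representations.
possim <- apply(word_vecs, 1, cosine_sim, b = pos_vec)
negsim <- apply(word_vecs, 1, cosine_sim, b = neg_vec)

# Distinctiveness is the difference between the two similarities.
result <- data.frame(
  words = rownames(word_vecs),
  pop_possim = possim,
  pop_negsim = negsim,
  pop_distinctive = possim - negsim,
  row.names = NULL
)
result
```

In this toy setup word_a lies closer to pos_vec and word_b closer to neg_vec, so their distinctiveness scores have opposite signs.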

See also

get_corpus_representation(), find_distinctive()

Examples

model <- fastrtext::load_model(system.file("extdata",
                                           "tw_demo_model_sml.bin",
                                           package = "dictvectoR"))
tw_annot %<>% clean_text(text_field = "full_text")
word_df <- data.frame(words = c("skandal", "deutschland", "wundervoll"))
find_distinctive(tw_annot,
                 "pop",
                 word_df = word_df,
                 model = model)
#>         words pop_possim pop_negsim pop_distinctive
#> 1     skandal  0.5809499  0.5490425      0.03190745
#> 2 deutschland  0.7616619  0.7402692      0.02139269
#> 3  wundervoll  0.6919564  0.7408727     -0.04891631
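One possible follow-up is to rank the candidate words by their distinctiveness score. The sketch below rebuilds the output printed above as a plain data.frame so that it runs without the model; the object names res and ranked are chosen here for illustration.

```r
# Values copied from the example output above.
res <- data.frame(
  words = c("skandal", "deutschland", "wundervoll"),
  pop_possim = c(0.5809499, 0.7616619, 0.6919564),
  pop_negsim = c(0.5490425, 0.7402692, 0.7408727),
  pop_distinctive = c(0.03190745, 0.02139269, -0.04891631),
  stringsAsFactors = FALSE
)

# Sort so that the words most distinctive of the concept come first.
ranked <- res[order(res$pop_distinctive, decreasing = TRUE), ]
ranked
```

Words with a negative score, such as "wundervoll" here, are closer to the texts that do not contain the concept.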