Compares word-vector representations of words to the representations of an annotated data.frame of texts. Aims to detect words that distinctively characterize a concept.
Usage
find_distinctive(
  df,
  concept_field,
  text_field = "text",
  word_df,
  word_field = "words",
  model
)
Arguments
- df
A data.frame containing one annotated document per row.
- concept_field
character. Name of the column that contains a binary (hand-coded) indicator of the presence or absence of a concept.
- text_field
character. Name of the column in df that contains the text of the documents. Default is "text".
- word_df
A data.frame containing a column with words or multi-word expressions.
- word_field
character. The name of the column in word_df that contains the words.
- model
A fastText model, loaded by fastrtext::load_model().
Details
Takes an annotated data.frame df of texts as input. The variable specified by concept_field in this df indicates the presence or absence of a theoretical concept in the text. Two average word-vector representations are computed using a fastText model: one for all texts that contain the concept, and one for those that do not. A second data.frame, word_df, contains one (multi-)word per row in word_field. Three new columns are created in word_df: the first, ending with '_possim', gives the cosine similarity between the word and the positive concept corpus; the second, ending with '_negsim', gives the similarity to the remaining corpus; the third, ending with '_distinctive', is the difference between the two.
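As a rough illustration of how these scores relate (a minimal sketch under assumptions, not the package's internal implementation; it assumes a loaded fastText model and an annotated data.frame df with columns "text" and "pop"):
# Illustrative only: average the embeddings of concept vs. non-concept texts,
# then compare a candidate word to both centroids.
cosine <- function(a, b) sum(a * b) / (sqrt(sum(a^2)) * sqrt(sum(b^2)))
pos_centroid <- colMeans(fastrtext::get_sentence_representation(model, df$text[df$pop == 1]))
neg_centroid <- colMeans(fastrtext::get_sentence_representation(model, df$text[df$pop == 0]))
word_vec <- as.numeric(fastrtext::get_sentence_representation(model, "skandal"))
distinctive <- cosine(word_vec, pos_centroid) - cosine(word_vec, neg_centroid)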
See also
get_corpus_representation(), find_distinctive()
Examples
library(magrittr)  # provides the %<>% pipe used below
model <- fastrtext::load_model(system.file("extdata",
                                            "tw_demo_model_sml.bin",
                                            package = "dictvectoR"))
tw_annot %<>% clean_text(text_field = "full_text")
word_df <- data.frame(words = c("skandal", "deutschland", "wundervoll"))
find_distinctive(tw_annot, "pop", word_df = word_df, model = model)
#> words pop_possim pop_negsim pop_distinctive
#> 1 skandal 0.5809499 0.5490425 0.03190745
#> 2 deutschland 0.7616619 0.7402692 0.02139269
#> 3 wundervoll 0.6919564 0.7408727 -0.04891631
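The returned data.frame can then be sorted on the '_distinctive' column to shortlist candidate dictionary words, for example (a usage sketch based on the columns shown above):
res <- find_distinctive(tw_annot, "pop", word_df = word_df, model = model)
res[order(-res$pop_distinctive), ]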