Remove too similar terms — remove_similar

Removes terms that are too similar to one another from the data.frame word_df.

Usage

remove_similar_words(
  word_df,
  model,
  word_field = "words",
  compare_by = NULL,
  compare_hits = TRUE,
  min_simil = 0.7,
  win_threshold = 0.5
)

Arguments

word_df: A data.frame containing a column with words or multi-word expressions.
model: A fastText model, loaded by load_model.
word_field: character. The name of the column in word_df that contains the words.
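compare_by: character. Default NULL. The name of a column in word_df whose values are used to compare similar words. If NULL, no such comparison is made.
compare_hits: logical. Default TRUE. If TRUE, counts how often a word 'beats' its similar counterparts with regard to the number of occurrences stored in the column hits.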
min_simil: Numerical (0-1). Default .7. Similarity threshold. Word pairs below this threshold are considered dissimilar.
win_threshold: Numerical (0-1). Default .5. Threshold for dropping words, defined as the proportion of pairwise comparisons a word wins. If both compare_by and compare_hits are set, the mean of both win proportions is used.

Value

A data.frame: word_df with overly similar terms removed, containing a unique identifier column word_id.

Details

Detects similar words and multi-word expressions in word_df, using a fastText model. The cosine similarity threshold is set by min_simil. If requested, similar words are compared along the variable specified in compare_by and/or, if compare_hits is TRUE, by the number of occurrences stored in the variable named hits.
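
As a rough illustration of the similarity step, the sketch below computes cosine similarities between a few made-up word vectors in base R and flags pairs at or above min_simil. The vectors and the helper cosine_sim are hypothetical; the function itself obtains its vectors from the fastText model, and its internals may differ.

```r
# Hypothetical 3-dimensional word vectors (real fastText vectors are much longer)
vecs <- rbind(
  steuern      = c(0.9, 0.1, 0.2),
  steuerzahler = c(0.8, 0.2, 0.3),
  unsere       = c(0.1, 0.9, 0.1)
)

# Cosine similarity: dot product divided by the product of the vector norms
cosine_sim <- function(a, b) {
  sum(a * b) / (sqrt(sum(a^2)) * sqrt(sum(b^2)))
}

min_simil <- 0.7

# All unordered word pairs, one pair per row
pairs <- t(combn(rownames(vecs), 2))
simil <- apply(pairs, 1, function(p) cosine_sim(vecs[p[1], ], vecs[p[2], ]))

# Pairs below min_simil are treated as dissimilar
pair_df <- data.frame(word1   = pairs[, 1],
                      word2   = pairs[, 2],
                      simil   = simil,
                      similar = simil >= min_simil)
pair_df
```

With these toy vectors, only the pair steuern / steuerzahler exceeds the threshold, so only that pair would enter the subsequent comparison.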

The threshold for dropping terms is set by win_threshold. It refers to the proportion of pairwise comparisons that are 'won' by the word in question. If compare_hits = TRUE and compare_by is set, it refers to the mean of both proportions. For example, if a word 'wins' all comparisons regarding frequency (wins_hits1 == 1) but loses all comparisons regarding the set score (wins_score1 == 0), this value is .5.
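
The arithmetic can be sketched in base R as follows. The column names wins_hits1 and wins_score1 are taken from the text above; the toy values, the column names wins_mean and keep, and the assumption that terms falling below win_threshold are the ones dropped are illustrative and not confirmed behaviour of the function.

```r
# Toy win proportions for three hypothetical terms
wins <- data.frame(
  words       = c("term_a", "term_b", "term_c"),
  wins_hits1  = c(1, 0.5, 0),    # share of comparisons won on occurrences
  wins_score1 = c(0, 0.5, 0.5)   # share of comparisons won on the compare_by score
)

# With both comparisons active, the deciding value is the mean of both proportions
wins$wins_mean <- (wins$wins_hits1 + wins$wins_score1) / 2

win_threshold <- 0.5

# Assumption: terms whose mean falls below the threshold are dropped
wins$keep <- wins$wins_mean >= win_threshold
wins
```

Here term_a reproduces the case described above: it wins every frequency comparison and loses every score comparison, so its mean is .5.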

Ensures that word_df contains a unique identifier called word_id; if a variable named 'id' is already such a unique identifier, the first one is re-used.
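
Whether an existing column qualifies as a unique identifier can be checked up front. The sketch below assumes that 'unique identifier' means no duplicated and no missing values, and it uses a hypothetical helper is_unique_id; it is not the package's own check.

```r
# Hypothetical helper: TRUE if x has no missing and no duplicated values
is_unique_id <- function(x) {
  !anyNA(x) && anyDuplicated(x) == 0
}

word_df <- data.frame(id    = c(10, 20, 30),
                      group = c("a", "a", "b"),
                      words = c("unsere", "steuern", "steuerzahler"))

is_unique_id(word_df$id)     # TRUE: every row has its own value
is_unique_id(word_df$group)  # FALSE: "a" occurs twice
```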

Examples

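# Load the small demo fastText model shipped with dictvectoR
model <- fastrtext::load_model(system.file("extdata",
                                           "tw_demo_model_sml.bin",
                                           package = "dictvectoR"))
# Fix the seed so the random scores are reproducible
set.seed(1)
word_df <- data.frame(words = c("unsere steuern",
                                "steuerzahler",
                                "unsere",
                                "steuern"),
                      hits = c(2, 3, 15, 4),
                      score = rnorm(4))
# Compare similar terms by 'score' only and drop terms using a win threshold of .4
remove_similar_words(word_df,
                     model,
                     compare_by = "score",
                     compare_hits = FALSE,
                     win_threshold = .4)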
#>            words hits      score word_id
#> 1 unsere steuern    2 -0.6264538       1
#> 2        steuern    4  1.5952808       4