Determine which similar terms to drop — drop_which

Takes the pairwise similarity table returned from detect_similar_words. Returns a data.frame of terms that should be dropped according to the specified decision rules. The function can compare the words in the similarity table along two aspects:

The number of comparison wins on the score specified by 'compare_by' in detect_similar_words or remove_similar_words. This variable must be stored in the similarity table as 'wins_score1'.
The number of comparison wins on the frequency of occurrence, enabled with 'compare_hits = T' in detect_similar_words or remove_similar_words. This variable must be stored in the similarity table as 'wins_hits1'.

Both variables can be compared at the same time. If 'wins_score1' is missing, the function will compare 'hits'.

Usage

drop_which(simil_table, compare_hits = T, win_threshold = 0.5)

Arguments

simil_table: data.frame. A pairwise similarity table returned from detect_similar_words.
compare_hits: logical. Default TRUE. If TRUE, will consider 'hits' in comparison.
win_threshold: numeric (0-1). Default 0.5. Threshold for dropping words, defined as the proportion of pairwise comparisons won. If both hits and scores are compared, the mean of the two proportions is used. Words are suggested for dropping if the computed value is smaller than this threshold.
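
The threshold rule described for win_threshold can be sketched in base R as follows. This is a minimal illustration, not the package's implementation: the terms and proportions are made up, the column names are borrowed from the printed output in the Examples section, and it assumes that both hits and scores are compared.

```r
# Hypothetical win proportions for three terms. Column names mirror the
# drop_which output; the values are invented for illustration only.
wins <- data.frame(
  words          = c("term a", "term b", "term c"),
  wins_score_pct = c(0.5, 0.0, 1.0),
  wins_hits_pct  = c(0.0, 0.0, 1.0),
  stringsAsFactors = FALSE
)

# Assumption: when both aspects are compared, the combined value is the
# mean of the two win proportions.
wins$win <- rowMeans(wins[, c("wins_score_pct", "wins_hits_pct")])

# Terms whose combined value is smaller than the threshold are
# suggested for dropping.
win_threshold <- 0.4
to_drop <- wins[wins$win < win_threshold, ]
print(to_drop$words)
```

With these made-up values, "term a" (mean 0.25) and "term b" (mean 0) fall below the threshold of 0.4, while "term c" (mean 1) is kept.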

Value

A data.frame with words suggested for dropping.

Examples

model <- fastrtext::load_model(system.file("extdata",
                                          "tw_demo_model_sml.bin",
                                           package = "dictvectoR"))
set.seed(1)
word_df <- data.frame(words = c("unsere steuern", "steuerzahler",
                                "unsere", "steuern"),
                      hits = c(2, 3, 15, 4),
                      score = rnorm(4))
sim_t <- detect_similar_words(word_df, model, compare_by = "score")
drop_which(sim_t, compare_hits = TRUE, win_threshold = .4)
#> # A tibble: 2 × 6
#>   word_id1 words          word_id wins_score_pct wins_hits_pct   win
#>   <chr>    <chr>          <chr>            <dbl>         <dbl> <dbl>
#> 1 1        unsere steuern 1                  0.5             0  0.25
#> 2 2        steuerzahler   2                  0               0  0
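
One way to use the result is to filter the suggested terms out of the original word list. The sketch below is a stand-alone base R illustration, not part of the package: it hard-codes the two terms from the 'words' column of the printed result above and rebuilds word_df without the random score column. The names dropped and kept are hypothetical.

```r
# Word list from the example above (score column omitted for simplicity)
word_df <- data.frame(
  words = c("unsere steuern", "steuerzahler", "unsere", "steuern"),
  hits  = c(2, 3, 15, 4),
  stringsAsFactors = FALSE
)

# Terms copied by hand from the 'words' column of the printed result;
# in practice these would come from the data.frame drop_which returns.
dropped <- c("unsere steuern", "steuerzahler")

# Keep only the rows whose term was not suggested for dropping
kept <- word_df[!word_df$words %in% dropped, ]
print(kept$words)
```

In this hand-built example, the rows for "unsere" and "steuern" remain.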