Takes the pairwise similarity table returned from detect_similar_words.
Returns a data.frame of terms that should be dropped according to the specified
decision rules.
The function can compare the words in the similarity table along two aspects:
- The number of comparison wins regarding the score specified by 'compare_by' in detect_similar_words or remove_similar_words. This variable must be stored in the similarity table as 'wins_score1'.
- The number of comparison wins regarding the frequency of occurrences, enabled by 'compare_hits = TRUE' in detect_similar_words or remove_similar_words. This variable must be stored in the similarity table as 'wins_hits1'.
Both variables can be compared at the same time. If 'wins_score1' is missing, the function will compare 'hits'.
Arguments
- simil_table
data.frame. A pairwise similarity table returned from detect_similar_words.
- compare_hits
logical. Default TRUE. If TRUE, will consider 'hits' in comparison.
- win_threshold
Numeric (0-1). Default 0.5. Sets the threshold for dropping words, defined as the proportion of pairwise comparisons a word has won. If both hits and scores are compared, the value is the mean of the two proportions. A word is suggested for dropping if its computed value is smaller than the threshold set here.
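The threshold rule described for win_threshold can be sketched in base R. This is only an illustration under stated assumptions, not the package's actual implementation: the column names wins_score_pct, wins_hits_pct and win are borrowed from the example output, and the words and proportions are made up.

```r
# Hypothetical win proportions per word (illustrative values, not package output).
wins <- data.frame(
  words          = c("a", "b", "c"),
  wins_score_pct = c(0.5, 0.0, 1.0),
  wins_hits_pct  = c(0.0, 0.5, 1.0),
  stringsAsFactors = FALSE
)

compare_hits  <- TRUE
win_threshold <- 0.5

# If hits are considered, average both proportions; otherwise use the score only.
wins$win <- if (compare_hits) {
  rowMeans(wins[, c("wins_score_pct", "wins_hits_pct")])
} else {
  wins$wins_score_pct
}

# Words whose combined win proportion falls below the threshold are drop candidates.
to_drop <- wins[wins$win < win_threshold, ]
to_drop$words
```

With these made-up values, "a" and "b" each average 0.25 and fall below the threshold, while "c" is kept.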
Examples
model <- fastrtext::load_model(system.file("extdata",
"tw_demo_model_sml.bin",
package = "dictvectoR"))
set.seed(1)
word_df <- data.frame(words = c("unsere steuern", "steuerzahler",
"unsere", "steuern"),
hits = c(2, 3, 15, 4), score = rnorm(4))
sim_t <- detect_similar_words(word_df, model, compare_by = "score")
drop_which(sim_t, compare_hits = TRUE, win_threshold = .4)
#> # A tibble: 2 × 6
#>   word_id1 words          word_id wins_score_pct wins_hits_pct   win
#>   <chr>    <chr>          <chr>            <dbl>         <dbl> <dbl>
#> 1 1        unsere steuern 1                  0.5             0  0.25
#> 2 2        steuerzahler   2                  0               0  0
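Since the function only returns the terms suggested for dropping, a typical follow-up is to filter them out of the original word list. The sketch below uses hand-copied stand-ins for word_df and for the returned table so that it runs without the package or model; the object names dropped and kept are hypothetical.

```r
# Stand-in for the input word list from the example above.
word_df <- data.frame(
  words = c("unsere steuern", "steuerzahler", "unsere", "steuern"),
  hits  = c(2, 3, 15, 4),
  stringsAsFactors = FALSE
)

# Stand-in for the drop_which() result: only the 'words' column is needed here.
dropped <- data.frame(
  words = c("unsere steuern", "steuerzahler"),
  stringsAsFactors = FALSE
)

# Keep every term that was not suggested for dropping.
kept <- word_df[!word_df$words %in% dropped$words, ]
kept$words
```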