Removes similar terms from a data.frame word_df.
Usage
remove_similar_words(
word_df,
model,
word_field = "words",
compare_by = NULL,
compare_hits = T,
min_simil = 0.7,
win_threshold = 0.5
)
Arguments
- word_df
A data.frame containing a column with words or multi-word expressions.
- model
A fastText model, loaded by load_model.
- word_field
character. The name of the column in word_df that contains the words.
- compare_by
character. Default NULL. The name of a column that should be compared.
- compare_hits
logical. Default TRUE. If TRUE, counts how often a word 'beats' its similar counterpart in terms of occurrences.
- min_simil
Numerical (0-1). Default .7. Similarity threshold. Word pairs below this threshold are considered dissimilar.
- win_threshold
Numerical (0-1). Default .5. Determines the threshold to drop words, defined as the proportion of pairwise comparisons won; or, if both compare_by and compare_hits are set, the mean of both win proportions.
Details
Detects similar words and multiword expressions in word_df, using a fastText model.
The cosine similarity threshold is set by min_simil.
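To make the role of min_simil concrete, here is a minimal base R sketch of a cosine similarity check. It is not the package's implementation: the helper cosine_sim and the toy vectors are made up for illustration and stand in for real fastText embeddings, and treating a pair exactly at the threshold as similar is an assumption.

```r
# Hypothetical helper: cosine similarity of two numeric vectors.
cosine_sim <- function(a, b) {
  sum(a * b) / (sqrt(sum(a^2)) * sqrt(sum(b^2)))
}

# Toy vectors standing in for word embeddings (made-up values).
v1 <- c(1, 0, 1)
v2 <- c(1, 0.5, 1)
v3 <- c(0, 1, 0)

min_simil <- 0.7

sim_12 <- cosine_sim(v1, v2)  # close in direction
sim_13 <- cosine_sim(v1, v3)  # orthogonal

# Assumption: pairs at or above the threshold count as similar.
similar_12 <- sim_12 >= min_simil
similar_13 <- sim_13 >= min_simil
```

With these toy values, the first pair clears the threshold and the second does not, so only the first pair would enter the pairwise comparisons.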
If requested, similar words are compared along a variable specified in compare_by, and/or by the number of occurrences stored in the variable named hits.
The threshold for dropping terms is set by win_threshold.
This is the proportion of pairwise comparisons 'won' by the word in question. If compare_hits = TRUE and compare_by is set, it is the mean of both proportions. If a word 'wins' all comparisons regarding frequency (wins_hits1 == 1) but loses all comparisons regarding the set score (wins_score1 == 0), this value is .5.
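The arithmetic behind this value can be sketched in base R as follows. This illustrates the description above rather than the package's code: all numbers are made up, the variable names are hypothetical, and both the handling of ties and the use of `>=` at the threshold are assumptions.

```r
# A hypothetical word and three words found to be similar to it.
word_hits   <- 10
other_hits  <- c(2, 5, 7)
word_score  <- 0.1
other_score <- c(0.4, 0.6, 0.9)

# Proportion of pairwise comparisons won on each criterion.
wins_hits  <- mean(word_hits > other_hits)    # wins every frequency comparison
wins_score <- mean(word_score > other_score)  # loses every score comparison

# With both criteria active, the two proportions are averaged.
win_share <- mean(c(wins_hits, wins_score))

# Assumption: a word is kept when its share reaches win_threshold.
win_threshold <- 0.5
keep <- win_share >= win_threshold
```

Here wins_hits is 1 and wins_score is 0, so win_share is .5, matching the case described above.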
Forces word_df to contain a unique identifier called 'word_id'; re-uses the first variable named 'id' that is such a unique identifier.
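One way such an identifier could be ensured is sketched below in base R. The helper ensure_word_id is hypothetical and not part of the package; matching only columns named exactly "id" and falling back to a running row index are assumptions about behaviour that the text does not spell out.

```r
# Hypothetical helper: guarantee a unique 'word_id' column.
ensure_word_id <- function(df) {
  # Nothing to do if the column already exists.
  if ("word_id" %in% names(df)) return(df)

  # Re-use the first column named 'id' that has no duplicated values.
  for (j in which(names(df) == "id")) {
    if (anyDuplicated(df[[j]]) == 0) {
      df$word_id <- df[[j]]
      return(df)
    }
  }

  # Assumed fallback: a running row index.
  df$word_id <- seq_len(nrow(df))
  df
}

with_id    <- ensure_word_id(data.frame(words = c("a", "b"), id = c(7, 9)))
without_id <- ensure_word_id(data.frame(words = c("a", "b")))
```

In this sketch, with_id takes its word_id from the existing id column, while without_id receives the row index.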
Examples
model <- fastrtext::load_model(system.file("extdata",
                                           "tw_demo_model_sml.bin",
                                           package = "dictvectoR"))
set.seed(1)
word_df <- data.frame(words = c("unsere steuern",
                                "steuerzahler",
                                "unsere",
                                "steuern"),
                      hits = c(2, 3, 15, 4),
                      score = rnorm(4))
remove_similar_words(word_df,
                     model,
                     compare_by = "score",
                     compare_hits = FALSE,
                     win_threshold = .4)
#>            words hits      score word_id
#> 1 unsere steuern    2 -0.6264538       1
#> 2        steuern    4  1.5952808       4