Detects similar words in a data.frame, using a fastText model.
If requested, compares similar words along a variable specified in compare_by, and by the number of occurrences stored in the variable named hits.
Counts how often one word 'beats' its similar other in pairwise comparison.
This count is returned in a column whose name starts with 'wins_' (wins_hits1 in the example output).
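The pairwise 'wins' idea can be sketched in base R. This is only an illustration, not the package's implementation: the data frame `pairs` and the helper `count_wins()` are hypothetical, and treating a tie as "no win" is an assumption.

```r
# Hypothetical pairs of similar words with their occurrence counts.
pairs <- data.frame(
  word1 = c("unsere", "steuern", "unsere steuern"),
  hits1 = c(15, 4, 2),
  word2 = c("unsere steuern", "steuerzahler", "steuern"),
  hits2 = c(2, 3, 4),
  stringsAsFactors = FALSE
)

# Assumption: word1 'wins' when its value is strictly greater than word2's.
count_wins <- function(value1, value2) {
  as.integer(value1 > value2)
}

# One 0/1 flag per pair, named after the compared variable.
pairs$wins_hits1 <- count_wins(pairs$hits1, pairs$hits2)
```

Each row holds one ordered pair, so a word's total number of wins would be the sum of its flags over all rows in which it appears as word1.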
Usage
detect_similar_words(
word_df,
model,
word_field = "words",
compare_by = NULL,
compare_hits = T,
min_simil = 0.7
)
Arguments
- word_df
A data.frame containing a column with words or multi-word expressions.
- model
A fastText model, loaded by load_model.
- word_field
Character. The name of the column in word_df that contains the words.
- compare_by
Character. Default NULL. The name of a column that should be compared.
- compare_hits
Logical. Default TRUE. If TRUE, counts how often one word 'beats' its similar other with regard to occurrences.
- min_simil
Numeric (0-1). Default 0.7. Similarity threshold. Word pairs below this threshold are considered dissimilar.
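How a threshold such as min_simil separates similar from dissimilar pairs can be sketched with toy vectors. The numbers below are made up and do not come from any fastText model, and the use of cosine similarity is an assumption for illustration; the function may compute similarity differently.

```r
# Toy word vectors (made-up numbers, not taken from a fastText model).
vecs <- rbind(
  steuern      = c(0.9, 0.1, 0.3),
  steuerzahler = c(0.8, 0.2, 0.4),
  wetter       = c(-0.1, 0.9, -0.2)
)

# Assumed measure: cosine similarity between all rows.
unit <- vecs / sqrt(rowSums(vecs^2))  # scale each row to unit length
simil <- unit %*% t(unit)

# Keep only pairs of distinct words at or above the threshold.
min_simil <- 0.7
keep <- which(simil >= min_simil & row(simil) != col(simil), arr.ind = TRUE)
similar_pairs <- data.frame(
  word1 = rownames(simil)[keep[, "row"]],
  word2 = rownames(simil)[keep[, "col"]],
  simil = simil[keep],
  stringsAsFactors = FALSE
)
```

With these toy vectors only the steuern/steuerzahler pair passes the threshold, and it appears once in each direction, which mirrors the mirrored rows in the example output.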
Details
Ensures that 'word_df' has a unique identifying variable called 'word_id'. If 'word_df' already contains a variable named 'id' with unique values, that variable is re-used.
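The identifier handling could look roughly like the following base R sketch. The helper `ensure_word_id()` is hypothetical and not part of the package; falling back to row numbers when no unique 'id' exists is an assumption.

```r
# Hypothetical helper: make sure a data.frame has a unique 'word_id' column.
ensure_word_id <- function(df) {
  # An existing, unique 'word_id' is kept as it is.
  if ("word_id" %in% names(df) && !anyDuplicated(df$word_id)) {
    return(df)
  }
  if ("id" %in% names(df) && !anyDuplicated(df$id)) {
    # Re-use a unique 'id' column.
    df$word_id <- df$id
  } else {
    # Assumption: otherwise number the rows.
    df$word_id <- seq_len(nrow(df))
  }
  df
}

with_id    <- ensure_word_id(data.frame(words = c("unsere", "steuern"),
                                        id = c(10, 20)))
without_id <- ensure_word_id(data.frame(words = c("unsere", "steuern")))
```

Here `with_id$word_id` takes over the values of 'id', while `without_id$word_id` is a plain row index.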
Examples
model <- fastrtext::load_model(system.file("extdata",
                                           "tw_demo_model_sml.bin",
                                           package = "dictvectoR"))
word_df <- data.frame(words = c("unsere steuern",
                                "steuerzahler",
                                "unsere",
                                "steuern"),
                      hits = c(2, 3, 15, 4))
detect_similar_words(word_df, model)
#> Warning: undefined subclass "unpackedMatrix" of class "mMatrix"; definition not updated
#> Warning: undefined subclass "unpackedMatrix" of class "replValueSp"; definition not updated
#>       simil word_id1          word1 hits1 word_id2          word2 hits2
#> 1 0.8014612        3         unsere    15        1 unsere steuern     2
#> 2 0.8014612        4        steuern     4        1 unsere steuern     2
#> 3 0.8445002        4        steuern     4        2   steuerzahler     3
#> 4 0.8014612        1 unsere steuern     2        3         unsere    15
#> 5 0.8014612        1 unsere steuern     2        4        steuern     4
#> 6 0.8445002        2   steuerzahler     3        4        steuern     4
#>   wins_hits1
#> 1          1
#> 2          1
#> 3          1
#> 4          0
#> 5          0
#> 6          0
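One possible follow-up step is to total the wins per word. The sketch below rebuilds the two relevant columns by hand from the output above so that it runs without the model; in practice `res` would be the value returned by detect_similar_words(). The names `res` and `total_wins` are illustrative only.

```r
# Columns word1 and wins_hits1, copied from the example output above.
res <- data.frame(
  word1 = c("unsere", "steuern", "steuern",
            "unsere steuern", "unsere steuern", "steuerzahler"),
  wins_hits1 = c(1, 1, 1, 0, 0, 0),
  stringsAsFactors = FALSE
)

# Sum the pairwise wins for each word.
total_wins <- aggregate(wins_hits1 ~ word1, data = res, FUN = sum)

# Words with the most wins first.
total_wins <- total_wins[order(-total_wins$wins_hits1), ]
```

Words with many wins are the more frequent members of their similarity clusters, which can help when deciding which of several near-duplicate terms to keep.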