Detect similar words — detect_similar

Detects similar words in a data.frame using a fastText model. If requested, compares the two words of each similar pair along the variable named in compare_by, and by the number of occurrences stored in the variable named hits. Counts how often one word 'beats' its similar counterpart in pairwise comparison. This count is returned in columns prefixed with 'wins_' (for example, 'wins_hits1' for the hits comparison).
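The pairwise 'win' logic can be illustrated without a model. The following is a minimal base R sketch, not the package's implementation: it assumes that a 'win' means a strictly larger value of hits, and the per-word tally with aggregate() is added purely for illustration. The pair values are copied from the example output on this page.

```r
# Similar pairs as in the example output (each pair appears in both directions).
pairs <- data.frame(
  word1 = c("unsere", "steuern", "steuern",
            "unsere steuern", "unsere steuern", "steuerzahler"),
  hits1 = c(15, 4, 4, 2, 2, 3),
  word2 = c("unsere steuern", "unsere steuern", "steuerzahler",
            "unsere", "steuern", "steuern"),
  hits2 = c(2, 2, 3, 15, 4, 4)
)

# Assumption: word1 "wins" a pair when it has strictly more hits than word2.
pairs$wins_hits1 <- as.integer(pairs$hits1 > pairs$hits2)

# Illustrative tally: total wins per word across all of its similar pairs.
wins_per_word <- aggregate(wins_hits1 ~ word1, data = pairs, FUN = sum)
print(wins_per_word)
```

How ties are handled by the function itself is not documented here, so the strict comparison above should be read as one possible choice.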

Usage

detect_similar_words(
  word_df,
  model,
  word_field = "words",
  compare_by = NULL,
  compare_hits = TRUE,
  min_simil = 0.7
)

Arguments

word_df: A data.frame containing a column with words or multi-word expressions.
model: A fastText model, loaded with fastrtext::load_model.
word_field: Character. The name of the column in word_df that contains the words.
compare_by: Character. Default NULL. The name of a column in word_df whose values are compared between the two words of each similar pair.
compare_hits: Logical. Default TRUE. If TRUE, counts how often one word 'beats' its similar counterpart with regard to occurrences (the column named hits).
min_simil: Numeric (0-1). Default 0.7. Similarity threshold. Word pairs with a similarity below this threshold are considered dissimilar and are not returned.
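To make the role of min_simil concrete, here is a small base R sketch of threshold filtering. It is not the package's code: it assumes cosine similarity as the similarity measure, and the three-dimensional vectors are hypothetical toy values rather than output from a fastText model.

```r
# Toy 3-dimensional "embeddings" (hypothetical values, not from a real model).
vecs <- rbind(
  steuern      = c(0.9, 0.1, 0.2),
  steuerzahler = c(0.8, 0.3, 0.1),
  wetter       = c(0.0, 0.2, 0.9)
)

# Cosine similarity between all rows: scale rows to unit length, then multiply.
unit <- vecs / sqrt(rowSums(vecs^2))
simil <- unit %*% t(unit)

# Keep each unordered pair once (upper triangle) and apply the threshold.
min_simil <- 0.7
idx <- which(upper.tri(simil) & simil >= min_simil, arr.ind = TRUE)
similar_pairs <- data.frame(
  word1 = rownames(simil)[idx[, "row"]],
  word2 = colnames(simil)[idx[, "col"]],
  simil = simil[idx]
)
print(similar_pairs)
```

Raising min_simil shrinks the table of similar pairs; lowering it admits looser matches.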

Value

A data.frame containing a pairwise similarity table of all similar words.

Details

Ensures that 'word_df' has a unique identifying variable called 'word_id'. An existing variable named 'id' is re-used if its values are unique.
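The identifier handling might look roughly like the sketch below. The helper ensure_word_id is hypothetical and not part of the package; it assumes that a unique 'id' column is copied into 'word_id' and that the row position is used as a fallback.

```r
# Hypothetical helper, not part of the package API.
ensure_word_id <- function(word_df) {
  # Nothing to do if a unique 'word_id' already exists.
  if ("word_id" %in% names(word_df) && !anyDuplicated(word_df$word_id)) {
    return(word_df)
  }
  has_unique_id <- "id" %in% names(word_df) && !anyDuplicated(word_df$id)
  if (has_unique_id) {
    # Assumption: an existing unique 'id' column is copied into 'word_id'.
    word_df$word_id <- word_df$id
  } else {
    # Otherwise fall back to the row position.
    word_df$word_id <- seq_len(nrow(word_df))
  }
  word_df
}

# 'id' is unique here, so it is re-used.
word_df <- data.frame(words = c("unsere steuern", "steuerzahler"), id = c(10, 20))
ensure_word_id(word_df)$word_id

# 'id' has duplicates here, so row positions are used instead.
dup_df <- data.frame(words = c("unsere", "steuern"), id = c(1, 1))
ensure_word_id(dup_df)$word_id
```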

Examples

model <- fastrtext::load_model(
  system.file("extdata", "tw_demo_model_sml.bin", package = "dictvectoR")
)
word_df <- data.frame(
  words = c("unsere steuern", "steuerzahler", "unsere", "steuern"),
  hits = c(2, 3, 15, 4)
)
detect_similar_words(word_df, model)
#>       simil word_id1          word1 hits1 word_id2          word2 hits2
#> 1 0.8014612        3         unsere    15        1 unsere steuern     2
#> 2 0.8014612        4        steuern     4        1 unsere steuern     2
#> 3 0.8445002        4        steuern     4        2   steuerzahler     3
#> 4 0.8014612        1 unsere steuern     2        3         unsere    15
#> 5 0.8014612        1 unsere steuern     2        4        steuern     4
#> 6 0.8445002        2   steuerzahler     3        4        steuern     4
#>   wins_hits1
#> 1          1
#> 2          1
#> 3          1
#> 4          0
#> 5          0
#> 6          0
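In the example output, every similar pair appears twice, once in each direction. If only one row per unordered pair is needed, a filter on the identifiers is one option. The sketch below rebuilds an abridged copy of the result table by hand so that it runs without the model; the filter itself is a suggestion and not a feature of the function.

```r
# Abridged copy of the example result (hits and wins columns omitted).
res <- data.frame(
  simil    = c(0.8014612, 0.8014612, 0.8445002, 0.8014612, 0.8014612, 0.8445002),
  word_id1 = c(3, 4, 4, 1, 1, 2),
  word1    = c("unsere", "steuern", "steuern",
               "unsere steuern", "unsere steuern", "steuerzahler"),
  word_id2 = c(1, 1, 2, 3, 4, 4),
  word2    = c("unsere steuern", "unsere steuern", "steuerzahler",
               "unsere", "steuern", "steuern")
)

# Keep one row per unordered pair by requiring word_id1 < word_id2.
unique_pairs <- res[res$word_id1 < res$word_id2, ]
print(unique_pairs)
```

Note that dropping the mirrored rows also drops the rows in which the first word wins, so the 'wins_' columns should be summarised before filtering if they are needed.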