Find multi-word expressions. — add_multiwords • dictvectoR

Adds multi-word expressions found in a quanteda tokens object to a data.frame of words.

Usage

add_multiwords(
  word_df,
  tokens,
  min_hits = 3,
  word_field = "words",
  levels = c(1, 2)
)

Arguments

word_df: A data.frame containing words.
tokens: A word-tokens object, returned by quanteda::tokens.
min_hits: Numerical. Default is 3. Minimum number of times a multi-word expression must occur in tokens to be added.
word_field: Character. Default is "words". Name of the column in word_df that contains the words.
levels: Numerical (1, 2, or both). Default is c(1, 2). The window size(s) used when searching for multi-word expressions.
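
To illustrate how a window of 1 and min_hits might interact, here is a minimal base-R sketch. It is not the dictvectoR implementation: find_pairs() is a hypothetical helper written for this illustration, and it assumes that a window of 1 means "one directly adjacent word on either side", which matches the output shown in the Examples section.

```r
# Hypothetical helper, NOT part of dictvectoR.
# Assumption: window = 1 means adjacent word pairs that contain the target word.
find_pairs <- function(tokens, target, min_hits = 1) {
  n     <- length(tokens)
  left  <- tokens[-n]   # first word of each adjacent pair
  right <- tokens[-1]   # second word of each adjacent pair

  # Keep only pairs in which the target word appears on either side
  keep  <- left == target | right == target
  pairs <- paste(left[keep], right[keep])

  # Count each distinct pair and drop those below the threshold
  counts <- table(pairs)
  counts <- counts[counts >= min_hits]

  data.frame(
    words = as.character(names(counts)),
    from  = rep(target, length(counts)),
    hits  = as.integer(counts),
    stringsAsFactors = FALSE
  )
}

# Toy token vector (made up for illustration)
toks_demo <- c("der", "deutschen", "wirtschaft", "und",
               "der", "deutschen", "politik")

res_all  <- find_pairs(toks_demo, "deutschen", min_hits = 1)
res_freq <- find_pairs(toks_demo, "deutschen", min_hits = 2)
print(res_all)
print(res_freq)
```

With min_hits = 1 every pair containing the target is kept; raising the threshold to 2 leaves only the pair that occurs twice in the toy vector.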

Value

A data.frame containing everything in word_df, with the multi-word expressions found added as rows. The returned data.frame also contains the following columns:

'orig_id' with a unique identifier for each original word, re-used from word_df or newly created if not present
'from' indicating the word of origin
'word_id' (character) with a unique identifier for all words and multi-words
'hits' indicating the number of occurrences of the word or multi-word
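
The sketch below shows one way these columns could be used downstream. It rebuilds a few rows by hand from the example output on this page (a subset, for illustration only) and relies on an observation from that output, not a documented guarantee: rows for original words have identical 'words' and 'from' values.

```r
# A subset of rows copied from the example output on this page
res <- data.frame(
  words   = c("der deutschen", "deutschen", "deutschen bundestag",
              "millionen", "millionen euro"),
  from    = c("deutschen", "deutschen", "deutschen",
              "millionen", "millionen"),
  word_id = c("1_001", "1", "1_007", "2", "2_005"),
  orig_id = c(1, 1, 1, 2, 2),
  hits    = c(3, 6, 1, 3, 1),
  stringsAsFactors = FALSE
)

# Assumption: an original word is a row where 'words' equals 'from'
is_original <- res$words == res$from

# Separate the added multi-word expressions from the original words
multiwords <- res[!is_original, ]

# Keep only multi-word expressions that occur at least 3 times
frequent <- multiwords[multiwords$hits >= 3, ]
print(frequent)
```

Filtering on 'hits' after the call is an alternative to raising min_hits up front when you want to inspect rare expressions before discarding them.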

Examples

tw_data %<>% head(100) %>% clean_text(text_field = 'full_text')
#> Warning: undefined subclass "unpackedMatrix" of class "mMatrix"; definition not updated
#> Warning: undefined subclass "unpackedMatrix" of class "replValueSp"; definition not updated
toks <- quanteda::tokens(tw_data$text)
data.frame(words = c("deutschen", "millionen")) %>%
  add_multiwords(tokens = toks,
                 min_hits = 1,
                 levels = 1)
#> [1] "Finding multi-word expressions in window = 1..."
#> Joining, by = "from"
#> [1] "Adding missing count of original words:"
#> [1] "Counting word occurrences..."
#>                               words      from word_id orig_id hits
#> 3                     der deutschen deutschen   1_001       1    3
#> 1                         deutschen deutschen       1       1    6
#> 7             deutschen autokonzern deutschen   1_005       1    1
#> 9               deutschen bundestag deutschen   1_007       1    1
#> 11 deutschen kommissionspräsidentin deutschen   1_009       1    1
#> 12                deutschen politik deutschen   1_010       1    1
#> 10            deutschen unternehmen deutschen   1_008       1    1
#> 8              deutschen wirtschaft deutschen   1_006       1    1
#> 6                   einer deutschen deutschen   1_004       1    1
#> 5                      im deutschen deutschen   1_003       1    1
#> 4                  keinen deutschen deutschen   1_002       1    1
#> 2                         millionen millionen       2       2    3
#> 17                   millionen euro millionen   2_005       2    1
#> 16                millionen pendler millionen   2_004       2    1
#> 15                    millionen von millionen   2_003       2    1
#> 13                   sind millionen millionen   2_001       2    1
#> 14                vierzig millionen millionen   2_002       2    1