Adds multi-word expressions found in a quanteda tokens object to a data.frame of words.
Usage
add_multiwords(
word_df,
tokens,
min_hits = 3,
word_field = "words",
levels = c(1, 2)
)
Arguments
- word_df
A data.frame containing words.
- tokens
A word-tokens object, returned by quanteda::tokens.
- min_hits
Numerical. Default is 3. Minimum number of occurrences a multi-word expression must have to be added.
- word_field
Character. Default is "words". Name of the column in word_df that contains the words.
- levels
Numerical (1 or 2). Default is c(1, 2). The window size for multi-word expressions.
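To make min_hits and a window size of 1 more concrete, here is a minimal base-R sketch that counts two-word expressions around a target word in a toy token vector. It is an illustration only: the helper count_neighbors() and the toy tokens are hypothetical and not part of the package, and reading a window of 1 as "one directly adjacent token on either side" is an assumption based on the example output, not a description of how add_multiwords() is implemented.

```r
# Hypothetical helper (not part of the package): counts two-word expressions
# formed by a target word and one directly adjacent token.
count_neighbors <- function(tokens, target, min_hits = 1) {
  idx <- which(tokens == target)
  left  <- idx[idx > 1]               # positions that have a left neighbour
  right <- idx[idx < length(tokens)]  # positions that have a right neighbour
  expressions <- c(
    paste(tokens[left - 1], tokens[left]),
    paste(tokens[right], tokens[right + 1])
  )
  counts <- table(expressions)
  counts <- counts[counts >= min_hits]  # drop expressions seen too rarely
  data.frame(
    words = as.character(names(counts)),
    from  = rep(target, length(counts)),
    hits  = as.integer(counts),
    stringsAsFactors = FALSE
  )
}

# Toy tokens, made up for this sketch
toy_tokens <- c("der", "deutschen", "wirtschaft", "und",
                "der", "deutschen", "politik")

result <- count_neighbors(toy_tokens, "deutschen", min_hits = 2)
print(result)
```

With min_hits = 2, only "der deutschen" survives, because the other two expressions occur once each. Lowering min_hits to 1 keeps all three.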
Value
A data.frame with all information from word_df, but with added multi-word expressions in rows.
Additionally, the returned data.frame contains the columns:
- 'orig_id' with a unique identifier for each original word, re-used from word_df or newly created if not present
- 'from' indicating the word of origin
- 'word_id' (character) with a unique identifier for all words and multi-words
- 'hits' indicating the number of occurrences of the word or multi-word
Examples
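library(magrittr)  # provides %<>% and %>%
# Keep the first 100 rows and clean the 'full_text' column
tw_data %<>% head(100) %>% clean_text(text_field = 'full_text')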
#> Warning: undefined subclass "unpackedMatrix" of class "mMatrix"; definition not updated
#> Warning: undefined subclass "unpackedMatrix" of class "replValueSp"; definition not updated
toks <- quanteda::tokens(tw_data$text)
data.frame(words = c("deutschen", "millionen")) %>%
  add_multiwords(tokens = toks,
                 min_hits = 1,
                 levels = 1)
#> [1] "Finding multi-word expressions in window = 1..."
#> Joining, by = "from"
#> [1] "Adding missing count of original words:"
#> [1] "Counting word occurrences..."
#>                               words      from word_id orig_id hits
#> 3                     der deutschen deutschen   1_001       1    3
#> 1                         deutschen deutschen       1       1    6
#> 7             deutschen autokonzern deutschen   1_005       1    1
#> 9               deutschen bundestag deutschen   1_007       1    1
#> 11 deutschen kommissionspräsidentin deutschen   1_009       1    1
#> 12                deutschen politik deutschen   1_010       1    1
#> 10            deutschen unternehmen deutschen   1_008       1    1
#> 8              deutschen wirtschaft deutschen   1_006       1    1
#> 6                   einer deutschen deutschen   1_004       1    1
#> 5                      im deutschen deutschen   1_003       1    1
#> 4                  keinen deutschen deutschen   1_002       1    1
#> 2                         millionen millionen       2       2    3
#> 17                   millionen euro millionen   2_005       2    1
#> 16                millionen pendler millionen   2_004       2    1
#> 15                    millionen von millionen   2_003       2    1
#> 13                   sind millionen millionen   2_001       2    1
#> 14                vierzig millionen millionen   2_002       2    1
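The returned data.frame can be filtered with ordinary base-R subsetting. The sketch below rebuilds a few rows of the output above by hand so that it runs without the package. It assumes that original words are exactly the rows where 'words' equals 'from', which holds in the output shown but is not stated in the documentation.

```r
# A few rows copied by hand from the example output above
res <- data.frame(
  words   = c("der deutschen", "deutschen", "deutschen bundestag",
              "millionen", "millionen euro"),
  from    = c("deutschen", "deutschen", "deutschen",
              "millionen", "millionen"),
  word_id = c("1_001", "1", "1_007", "2", "2_005"),
  orig_id = c(1, 1, 1, 2, 2),
  hits    = c(3, 6, 1, 3, 1),
  stringsAsFactors = FALSE
)

# Assumption: rows where `words` equals `from` are the original words,
# so everything else is an added multi-word expression
multiwords <- res[res$words != res$from, ]

# Keep only multi-word expressions seen at least 3 times
frequent <- multiwords[multiwords$hits >= 3, ]
print(frequent$words)

# Total hits of the added multi-word expressions per word of origin
hits_per_origin <- tapply(multiwords$hits, multiwords$from, sum)
print(hits_per_origin)
```

Filtering after the call, rather than raising min_hits, keeps the rare expressions available for inspection before they are discarded.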