Helper function to prepare text for training a fastText model. (Caution: tailored to German-language texts.)
Arguments
- df
A data frame.
- text_field
Name of the column in df that contains the text to be cleaned. Default is 'text'.
- seed
Used for random shuffling. Default is 1.
Value
A character vector that can be used directly to train a fastText model with build_vectors.
Details
Takes a data.frame containing text. First checks the length of the texts specified in text_field using nsentence.
Texts with more than 3 sentences are tokenized into sentences by tokens.
All texts are passed to clean_text with the fixed settings:
tolower = T
remove_punct = T
replace_emojis = T
replace_numbers = T
remove_stopwords = F
store_uncleaned = F
count = T
The cleaned, short texts are shuffled and returned as a character vector.
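The steps above can be sketched roughly in base R. This is only an illustration, not the package's implementation: split_sentences, simple_clean and prepare_sketch are hypothetical names, the regex splitter stands in for nsentence and tokens, and the cleaner only lowercases and strips punctuation instead of applying the full clean_text settings.

```r
# Hypothetical stand-in for sentence tokenization: a naive split on
# whitespace that follows '.', '!' or '?'.
split_sentences <- function(x) {
  trimws(unlist(strsplit(x, "(?<=[.!?])\\s+", perl = TRUE)))
}

# Hypothetical stand-in for clean_text: lowercases and removes punctuation only.
simple_clean <- function(x) {
  gsub("[[:punct:]]", "", tolower(x))
}

# Sketch of the overall flow: split long texts, clean everything, shuffle.
prepare_sketch <- function(texts, seed = 1, max_sentences = 3) {
  pieces <- unlist(lapply(texts, function(txt) {
    sents <- split_sentences(txt)
    # Only texts with more than max_sentences sentences are split up.
    if (length(sents) > max_sentences) sents else txt
  }), use.names = FALSE)
  cleaned <- simple_clean(pieces)
  set.seed(seed)  # makes the shuffle reproducible
  cleaned[sample.int(length(cleaned))]
}

texts <- c("Eins. Zwei. Drei. Vier.", "Kurz und gut!")
result <- prepare_sketch(texts, seed = 1)
```

Here the first text has four sentences and is split, while the second is kept whole, so result holds five cleaned strings in shuffled order. Calling the sketch again with the same seed gives the same order.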
Examples
texts <- prepare_train_data(head(tw_data, 10), text_field = 'full_text')