Helper function that prepares text for training a fastText model. Caution: tailored to German-language texts.
Arguments
- df
A data frame.
- text_field
Name of the column in df that contains the text to be cleaned. Default is 'text'.
- seed
Seed used for random shuffling. Default is 1.
Value
A character vector that can be used directly to train a fastText model
with build_vectors.
Details
Takes a data frame containing text. It first counts the sentences in each
text in text_field using nsentence.
Texts with more than 3 sentences are split into individual sentences
using tokens.
All texts are passed to clean_text with the fixed settings:
- tolower = T
- remove_punct = T
- replace_emojis = T
- replace_numbers = T
- remove_stopwords = F
- store_uncleaned = F
- count = T
The cleaned, short texts are shuffled and returned as a character vector.
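The steps above can be approximated in base R. This is only a rough sketch and not the package's implementation. The names split_sentences and prepare_sketch are hypothetical. Sentence boundaries are guessed with a regular expression instead of nsentence and tokens, and clean_text is replaced by plain lowercasing and punctuation removal, so the emoji and number replacement steps are left out.

```r
# Hypothetical stand-in for sentence tokenization:
# split after '.', '!' or '?' when followed by whitespace.
split_sentences <- function(x) {
  parts <- unlist(strsplit(x, "(?<=[.!?])\\s+", perl = TRUE))
  parts[nzchar(parts)]
}

# Hypothetical sketch of the pipeline described in Details.
prepare_sketch <- function(df, text_field = "text", seed = 1) {
  texts <- as.character(df[[text_field]])

  # Split every text into sentences so they can be counted.
  sentences <- lapply(texts, split_sentences)

  # Texts with more than 3 sentences are replaced by their sentences;
  # shorter texts are kept whole.
  pieces <- Map(
    function(txt, sents) if (length(sents) > 3) sents else txt,
    texts, sentences
  )
  short_texts <- unlist(pieces, use.names = FALSE)

  # Simplified cleaning (assumption): lowercase and strip punctuation.
  cleaned <- trimws(gsub("[[:punct:]]", "", tolower(short_texts)))

  # Shuffle reproducibly with the given seed.
  set.seed(seed)
  sample(cleaned)
}

demo <- data.frame(
  text = c("Eins. Zwei. Drei. Vier.", "Kurzer Text."),
  stringsAsFactors = FALSE
)
prepare_sketch(demo)
```

In this sketch the first text has four sentences and is therefore split, while the second is kept as a single entry. Calling the function again with the same seed returns the entries in the same order.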
Examples
texts <- prepare_train_data(head(tw_data, 10), text_field = 'full_text')
