Prepare text for fastText model training

Helper function to prepare text for training a fastText model. Caution: tailored to German-language texts.

Usage

prepare_train_data(df, text_field = "text", seed = 1)

Arguments

df: A data frame.
text_field: Name of the column in df that contains the text to be cleaned. Default is 'text'.
seed: Random seed used for shuffling. Default is 1.

Value

A character vector that can be used directly to train a fastText model with build_vectors.

Details

Takes a data.frame containing text. It first counts the sentences of each text in text_field using nsentence. Texts with more than 3 sentences are split into individual sentences using tokens. All texts are then passed to clean_text with the following fixed settings:

tolower = T
remove_punct = T
replace_emojis = T
replace_numbers = T
remove_stopwords = F
store_uncleaned = F
count = T

The cleaned, short texts are shuffled and returned as a character vector.
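
The steps above can be approximated in base R to make the flow easier to follow. This is a minimal sketch under stated assumptions, not the package's implementation: sketch_prepare and its helpers are hypothetical names, a simple regular expression stands in for nsentence and tokens, and the cleaning step only lowercases and strips punctuation (emoji and number replacement, as done by clean_text, are omitted).

```r
# Illustrative sketch only: base-R stand-ins, not the package's actual code.
sketch_prepare <- function(df, text_field = "text", seed = 1) {
  texts <- as.character(df[[text_field]])

  # Naive sentence splitter (stand-in for nsentence / tokens):
  # put a line break after ., ! or ? followed by whitespace, then split on it.
  split_sentences <- function(x) {
    marked <- gsub("([.!?])\\s+", "\\1\n", x)
    trimws(unlist(strsplit(marked, "\n", fixed = TRUE)))
  }

  # Simplified cleaner (stand-in for clean_text): lowercase, drop punctuation,
  # collapse repeated whitespace.
  clean <- function(x) {
    x <- tolower(x)
    x <- gsub("[[:punct:]]+", " ", x)
    trimws(gsub("\\s+", " ", x))
  }

  sentences <- lapply(texts, split_sentences)

  # Texts with more than 3 sentences are replaced by their sentences;
  # shorter texts are kept whole.
  pieces <- unlist(
    Map(function(txt, sents) if (length(sents) > 3) sents else txt,
        texts, sentences),
    use.names = FALSE
  )

  cleaned <- clean(pieces)
  cleaned <- cleaned[nzchar(cleaned)]

  # Shuffle reproducibly.
  set.seed(seed)
  sample(cleaned)
}

# Usage with a small made-up data frame
demo <- data.frame(
  full_text = c("Eins. Zwei! Drei? Vier. Fuenf.", "Kurzer Text, nur ein Satz."),
  stringsAsFactors = FALSE
)
out <- sketch_prepare(demo, text_field = "full_text")
```

Here the first text has five sentences and is split, while the second is kept whole, so out holds six cleaned strings in shuffled order. Calling the function again with the same seed gives the same order.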

Examples

texts <- prepare_train_data(head(tw_data, 10), text_field = 'full_text')
