Helper function that prepares text for training a fastText model. Caution: tailored to German-language texts.

Usage

prepare_train_data(df, text_field = "text", seed = 1)

Arguments

df

A data frame.

text_field

Name of the column in df that contains the text to be cleaned. Default is 'text'.

seed

Random seed used when shuffling the texts. Default is 1.

Value

A character vector that can be used directly to train a fastText model with build_vectors.

Details

Takes a data frame containing text. It first counts the sentences in each text of text_field using nsentence. Texts with more than 3 sentences are split into individual sentences using tokens. All texts are then passed to clean_text with the following fixed settings:

  • tolower = T

  • remove_punct = T

  • replace_emojis = T

  • replace_numbers = T

  • remove_stopwords = F

  • store_uncleaned = F

  • count = T

The cleaned short texts are then shuffled (reproducibly, according to seed) and returned as a character vector.
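The steps above can be illustrated with a rough base-R sketch. This is not the package's implementation: `sketch_prepare_train_data` is a hypothetical name, the regex-based sentence splitter only approximates nsentence and tokens, and the cleaning step covers only lowercasing and punctuation removal (emoji and number replacement, which clean_text handles, are left out).

```r
# Hypothetical stand-in for illustration only; not the package's code.
sketch_prepare_train_data <- function(df, text_field = "text", seed = 1) {
  texts <- as.character(df[[text_field]])

  # Naive sentence splitter: break on whitespace that follows ., ! or ?
  # (assumption: a crude approximation of a proper sentence tokenizer).
  split_sentences <- function(x) {
    trimws(unlist(strsplit(x, "(?<=[.!?])\\s+", perl = TRUE)))
  }

  # Texts with more than 3 sentences are broken up; shorter ones are kept whole.
  pieces <- lapply(texts, function(x) {
    s <- split_sentences(x)
    if (length(s) > 3) s else x
  })
  short <- as.character(unlist(pieces))

  # Minimal cleaning: lowercase, strip punctuation, collapse whitespace.
  cleaned <- tolower(short)
  cleaned <- gsub("[[:punct:]]+", " ", cleaned)
  cleaned <- trimws(gsub("\\s+", " ", cleaned))

  # Reproducible shuffle controlled by the seed.
  set.seed(seed)
  cleaned[sample.int(length(cleaned))]
}

demo <- data.frame(
  text = c("Eins. Zwei. Drei. Vier.", "Hallo Welt!"),
  stringsAsFactors = FALSE
)
out <- sketch_prepare_train_data(demo)
print(out)
```

In this sketch the first text has four sentences and is therefore split, while the second is kept as a single entry, so the result has five elements in shuffled order. Calling the function again with the same seed gives the same order.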

Examples

texts <- prepare_train_data(head(tw_data, 10), text_field = 'full_text')