Returns combinations of keywords.

Usage

get_combis(
  word_df,
  word_field = "words",
  dims = NULL,
  min_per_dim = 1,
  max_overall,
  limit = NULL,
  seed = 1,
  save_settings = T,
  save_input = F
)

Arguments

word_df

A data.frame containing a column with words or multi-word expressions.

word_field

character. Default is 'words'. The name of the column in word_df that contains the words.

dims

character. Default is NULL. The name of the column in word_df that groups the words into dimensions. Optional.

min_per_dim

numerical. Default is 1. Minimum number of words returned per dimension. Values below 1 are replaced by 1.

max_overall

numerical. Maximum number of words per returned combination.

limit

numerical. Default is NULL. Limits the number of combinations of equal length per dimension by randomization.

seed

numerical. Default is 1. Input to make randomization reproducible.

save_settings

logical. Default is TRUE. Saves the randomization settings in the returned data.frame so that the sampling can be reproduced.

save_input

logical. Default is FALSE. Saves the input words and dims in the returned data.frame as a character string.

Value

A data.frame of combinations. The combinations are stored as a list of character vectors in combs_split. Pass this column to get_many_F1s or get_many_RPFs.

The data.frame also includes string variables named combs that hold the words of the combinations for each dimension. Do not use these columns when passing the dictionaries on to get_many_F1s: because of how representations for multi-word expressions are queried, this results in a faulty average representation.

Additionally, the data.frame includes a rowid, a count variable for the number of words overall (sum_nterms), and counts for each dimension, which can be used to remove imbalanced dictionaries. If requested, settings stores the randomization settings, and input the words and dimensions used as input.

Details

Takes a data.frame word_df with a character column specified by word_field as input. By default, it returns all combinations of various lengths of these words. Additionally, the function can account for conceptual dimensions, identified by a categorical (character, numerical, or factor) column in word_df specified by dims. If dimensions are specified, the function finds all combinations for each dimension and returns all combinations of these combinations. CAUTION: This can quickly lead to an extremely large number of combinations.
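To get a feel for how quickly the number of combinations grows, the following base-R sketch counts them for a made-up case. It is an illustration only, not the package's internal code: the word counts and the range of lengths are hypothetical, and the effect of max_overall is ignored.

```r
# Illustration only: hypothetical sizes, not taken from get_combis itself.
n_words <- c(a = 10, b = 10)  # assumed number of words per dimension
sizes   <- 1:5                # assumed combination lengths per dimension

# Combinations per dimension, summed over all lengths
per_dim <- sapply(n_words, function(n) sum(choose(n, sizes)))
per_dim

# Crossing the dimensions multiplies the per-dimension counts
total <- prod(per_dim)
total
```

Even with only ten words per dimension, crossing two dimensions yields several hundred thousand candidate dictionaries under these assumptions.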

The number of combinations can be limited in several ways:

Firstly, the minimum number of words returned per dimension is specified by min_per_dim, and the maximum number of words overall is set by max_overall.

Secondly, the function allows for random sampling of combinations. This is recommended, as it drastically reduces the number of returned combinations, reduces the computational load, and speeds up the process, at the cost of completeness. Random sampling is implemented using the comboSample function. Setting a limit caps the number of combinations of equal length for each dimension.

E.g., if limit = 5, min_per_dim = 2, max_overall = 6 is set for a word_df containing two dimensions a and b, the function will pick at most five combinations of length 2 for a, five of length 2 for b, five of length 3 for a, and five of length 3 for b, and will return all combinations of these combinations.
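The idea of capping combinations of equal length can be sketched in base R. This is a simplified stand-in, assuming combn and sample in place of comboSample; sample_combos and words_a are hypothetical names introduced for illustration, and the sketch covers a single dimension only.

```r
# Simplified sketch; get_combis itself is documented to use comboSample.
set.seed(1)                 # makes the random draw reproducible
words_a <- letters[1:6]     # hypothetical words of dimension a
limit   <- 5

# Hypothetical helper: keep at most `limit` combinations of length k
sample_combos <- function(words, k, limit) {
  all_k <- combn(words, k, simplify = FALSE)  # all combinations of length k
  if (length(all_k) <= limit) return(all_k)   # nothing to cap
  all_k[sample(length(all_k), limit)]         # random subset of size `limit`
}

# Cap the combinations of length 2 and of length 3 separately
picked <- lapply(2:3, function(k) sample_combos(words_a, k, limit))
lengths(picked)
```

Enumerating every combination before sampling, as done here, is only feasible for small inputs; a dedicated sampler avoids building the full set.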

Examples

test_df <- data.frame(words = letters[1:8],
                      dim = rep(paste0("c_", 1:2), 4))
t0 <- get_combis(test_df,
                 word_field = "words",
                 dims = "dim",
                 max_overall = 5)
#> Joining, by = "rowid"
t1 <- get_combis(test_df,
                 word_field = "words",
                 dims = "dim",
                 max_overall = 5,
                 limit = 5)
#> Joining, by = "rowid"