Returns combinations of keywords.
Usage
get_combis(
  word_df,
  word_field = "words",
  dims = NULL,
  min_per_dim = 1,
  max_overall,
  limit = NULL,
  seed = 1,
  save_settings = TRUE,
  save_input = FALSE
)
Arguments
- word_df
A data.frame containing a column with words or multi-word expressions.
- word_field
character. Default is 'words'. The name of the column in word_df that contains the words.
- dims
character. Default is NULL. The name of the column in word_df that groups the words into dimensions. Can be omitted.
- min_per_dim
numeric. Default is 1. Minimum number of words returned per dimension. Values below 1 are replaced by 1.
- max_overall
numeric. Maximum number of words per returned combination.
- limit
numeric. Default is NULL. Caps the number of combinations of equal length per dimension by random sampling.
- seed
numeric. Default is 1. Seed that makes the random sampling reproducible.
- save_settings
logical. Default is TRUE. Saves the randomization settings in the returned data.frame so that the result can be reproduced.
- save_input
logical. Default is FALSE. Saves the input words and dims in the returned data.frame as a character string.
Value
A data.frame of combinations.
The combinations are stored as a list of character vectors in combs_split. Use this column to pass the combinations to get_many_F1s or get_many_RPFs.
The data.frame also includes string variables for the words in the combinations, called combs, for each dimension. Do not use these columns when passing the dictionaries on to get_many_F1s, as this will result in a faulty average representation, caused by how representations for multi-word expressions are queried.
Additionally, the data.frame includes a rowid, a count variable for the number of words overall (sum_nterms), and counts for each dimension, which can be used to remove imbalanced dictionaries.
If requested, settings stores the randomization settings, and input stores the words and dimensions used as input.
Details
Takes a data.frame word_df with a character column specified by word_field as input. By default, it returns all combinations of various lengths of these words.
Additionally, the function can account for conceptual dimensions, identified by a categorical (character, numeric, or factor) column in word_df specified by dims. If dimensions are specified, the function finds all combinations for each dimension and returns all combinations of these combinations. CAUTION: This can quickly lead to an extremely large number of combinations.
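To get a feel for how fast the number of combinations grows, the following base R sketch counts them instead of generating them. It is not the package's implementation: count_combis is a hypothetical helper, and it assumes that every dimension contributes at least min_per_dim words and that the total number of words per combination may not exceed max_overall.

```r
# Hypothetical helper (not part of the package): counts the dictionaries that
# result if every dimension contributes at least `min_per_dim` words and the
# total number of words does not exceed `max_overall`.
count_combis <- function(dim_sizes, min_per_dim = 1, max_overall) {
  stopifnot(all(min_per_dim <= dim_sizes))
  # Admissible number of words per dimension, e.g. 1:4 for four words
  lengths_per_dim <- lapply(dim_sizes, function(n) seq(min_per_dim, n))
  # Every way of assigning one length to each dimension ...
  grid <- expand.grid(lengths_per_dim)
  # ... keeping only assignments within the overall cap
  grid <- grid[rowSums(grid) <= max_overall, , drop = FALSE]
  # For each assignment, multiply the number of word subsets per dimension
  per_row <- vapply(
    seq_len(nrow(grid)),
    function(i) prod(choose(dim_sizes, unlist(grid[i, ]))),
    numeric(1)
  )
  sum(per_row)
}

# Same layout as test_df in the Examples: eight words in two dimensions
test_df <- data.frame(words = letters[1:8],
                      dim = rep(paste0("c_", 1:2), 4))
dim_sizes <- as.vector(table(test_df$dim))  # 4 words per dimension

count_combis(dim_sizes, max_overall = 5)    # 188
count_combis(dim_sizes, max_overall = 8)    # 225, i.e. (2^4 - 1)^2
count_combis(c(10, 10), max_overall = 20)   # 1046529, i.e. (2^10 - 1)^2
```

Under these assumptions, going from four to ten words per dimension raises the count from a few hundred to over a million, which is why capping or sampling is usually needed.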
The number of combinations can be limited in several ways:
Firstly, the minimum number of words returned per dimension is specified by min_per_dim, and the maximum number of words overall is set by max_overall.
Secondly, the function allows for random sampling of combinations. This is recommended, as it drastically reduces the number of returned combinations, reduces the computational load, and speeds up the process. (Of course, this comes at the cost of completeness.) Random sampling is implemented using the comboSample function. Setting a limit will cap the number of combinations of equal length for each dimension.
E.g., if limit = 5, min_per_dim = 2, and max_overall = 6 are set for a word_df containing two dimensions a and b, the function will pick at most five combinations of length 2 for a, five of length 2 for b, five of length 3 for a, and five of length 3 for b, and will return all combinations of these combinations.
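The example above can be retraced with a short base R sketch. This is an illustration under stated assumptions, not the function's actual code: sample_combis is a hypothetical helper, base sample() stands in for comboSample, the toy word_df is made up, and only the lengths 2 and 3 named in the text are considered (so every result has between 4 and 6 words, within max_overall = 6).

```r
# Made-up input: two dimensions, "a" and "b", with six words each
word_df <- data.frame(words = letters[1:12],
                      dim = rep(c("a", "b"), each = 6),
                      stringsAsFactors = FALSE)

# Hypothetical helper: draw at most `limit` combinations of length `m`
sample_combis <- function(words, m, limit) {
  all_combis <- combn(words, m, simplify = FALSE)
  keep <- sample(length(all_combis), min(limit, length(all_combis)))
  all_combis[keep]
}

set.seed(1)            # plays the role of the `seed` argument
limit <- 5
combi_lengths <- 2:3   # the lengths named in the example above

# Per dimension: up to five combinations of length 2 and five of length 3
per_dim <- lapply(split(word_df$words, word_df$dim), function(w) {
  unlist(lapply(combi_lengths, function(m) sample_combis(w, m, limit)),
         recursive = FALSE)
})

# Combinations of combinations: pair every pick for a with every pick for b
idx <- expand.grid(a = seq_along(per_dim$a), b = seq_along(per_dim$b))
dictionaries <- Map(function(i, j) c(per_dim$a[[i]], per_dim$b[[j]]),
                    idx$a, idx$b)

length(per_dim$a)      # 10 picks for dimension a
length(dictionaries)   # 10 * 10 = 100 dictionaries
```

Without the cap, the same lengths would give (15 + 20)^2 = 1225 dictionaries for this toy input, so the limit trades completeness for a much smaller result.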
Examples
# Eight words, alternating between two dimensions
test_df <- data.frame(words = letters[1:8],
                      dim = rep(paste0("c_", 1:2), 4))
# All combinations with at most five words overall
t0 <- get_combis(test_df,
                 word_field = "words",
                 dims = "dim",
                 max_overall = 5)
#> Joining, by = "rowid"
# As above, but randomly capped via limit = 5
t1 <- get_combis(test_df,
                 word_field = "words",
                 dims = "dim",
                 max_overall = 5,
                 limit = 5)
#> Joining, by = "rowid"