This page holds the details for the formula preprocessing blueprint. This is the blueprint used by default from mold() if x is a formula.

default_formula_blueprint(
  intercept = FALSE,
  allow_novel_levels = FALSE,
  indicators = TRUE
)

# S3 method for formula
mold(formula, data, ..., blueprint = NULL)

Arguments

intercept

A logical. Should an intercept be included in the processed data? This information is used by the process function in the mold and forge function list.

allow_novel_levels

A logical. Should novel factor levels be allowed at prediction time? This information is used by the clean function in the forge function list, and is passed on to scream().

indicators

A logical. Should factors be expanded into dummy variables?

formula

A formula specifying the predictors and the outcomes.

data

A data frame or matrix containing the outcomes and predictors.

...

Not used.

blueprint

A preprocessing blueprint. If left as NULL, then a default_formula_blueprint() is used.

Value

For default_formula_blueprint(), a formula blueprint.

Details

Although it matches base R, the way factors are expanded into dummy variables when an intercept is not present is worth documenting.

  • When an intercept is present, factors are expanded into K-1 new columns, where K is the number of levels in the factor.

  • When an intercept is not present, factors are expanded into all K columns (one-hot encoding).
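The two cases above can be reproduced directly with stats::model.matrix(). This is a minimal base R sketch, not hardhat's own code; the data frame df and its factor f are made up for illustration.

```r
# Hypothetical toy data: one factor with K = 3 levels.
df <- data.frame(f = factor(c("a", "b", "c", "a")))

# With an intercept, the first level is absorbed into the intercept,
# leaving K - 1 = 2 dummy columns.
with_intercept <- model.matrix(~ f, df)

# With the intercept removed (`+ 0`), all K = 3 levels get a column.
without_intercept <- model.matrix(~ f + 0, df)

colnames(with_intercept)
colnames(without_intercept)
```

The first call yields an intercept column plus one column per non-reference level, while the second yields one column per level.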

Offsets can be included in the formula method through the use of the inline function stats::offset(). These are returned as a tibble with 1 column named ".offset" in the $extras$offset slot of the return value.
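For intuition, base R's stats::model.offset() shows the underlying behavior that the offset handling builds on. The sketch below uses only base R and a made-up data frame df; wrapping the result in a one-column data frame named .offset is an assumption meant to mirror the shape described above, not a call into hardhat.

```r
# Hypothetical toy data with two columns used as offsets.
df <- data.frame(
  y = c(1, 2, 3),
  x = c(10, 20, 30),
  z = c(0.5, 0.5, 0.5)
)

# offset() terms are kept in the model frame and flagged in its terms.
frame <- model.frame(y ~ offset(x) + offset(z), df)

# model.offset() extracts the offsets; multiple offsets are summed.
off <- model.offset(frame)

# Mimic the one-column ".offset" layout (base data frame, not a tibble).
offset_df <- data.frame(.offset = off)
offset_df
```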

Mold

When mold() is used with the default formula blueprint:

  • Predictors

    • The RHS of the formula is isolated, and converted to its own one-sided formula: ~ RHS.

    • Runs stats::model.frame() on the RHS formula and uses data.

    • If indicators = TRUE, it then runs stats::model.matrix() on the result.

    • If indicators = FALSE, factors are removed before model.matrix() is run, and then added back afterwards. No interactions or inline functions involving factors are allowed.

    • If any offsets are present from using offset(), then they are extracted with model_offset().

    • If intercept = TRUE, adds an intercept column.

    • Coerces the result of the above steps to a tibble.

  • Outcomes

    • The LHS of the formula is isolated, and converted to its own one-sided formula: ~ LHS.

    • Runs stats::model.frame() on the LHS formula and uses data.

    • Coerces the result of the above steps to a tibble.
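The predictor and outcome steps above can be approximated in base R. This is a rough sketch under simplifying assumptions, not hardhat's actual implementation: it covers only the indicators = TRUE path with an intercept, skips validation and offsets, and leaves the results as base R objects rather than tibbles. The variable names are illustrative.

```r
train <- iris[1:100, ]
formula <- log(Sepal.Width) ~ Sepal.Length + Species

# Split the two-sided formula into two one-sided formulas.
# For a two-sided formula, element 2 is the LHS and element 3 is the RHS.
outcome_formula <- eval(call("~", formula[[2]]))
predictor_formula <- eval(call("~", formula[[3]]))

# Predictors: model.frame() evaluates the terms against the data,
# then model.matrix() expands factors and builds the numeric matrix.
# Base R adds the intercept here, which corresponds to intercept = TRUE.
predictor_frame <- model.frame(predictor_formula, train)
predictors <- model.matrix(predictor_formula, predictor_frame)

# Outcomes: only model.frame() is run, so inline functions such as
# log() are evaluated but no dummy expansion happens.
outcomes <- model.frame(outcome_formula, train)

colnames(predictors)
names(outcomes)
```

Keeping the two sides separate is what lets the outcome side use the same formula syntax as the predictor side without the two interfering.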

Forge

When forge() is used with the default formula blueprint:

  • It calls shrink() to trim new_data to only the required columns and coerce new_data to a tibble.

  • It calls scream() to perform validation on the structure of the columns of new_data.

  • Predictors

    • It runs stats::model.frame() on new_data using the stored terms object corresponding to the predictors.

    • If, in the original mold() call, indicators = TRUE was set, it then runs stats::model.matrix() on the result.

    • If, in the original mold() call, indicators = FALSE was set, it runs stats::model.matrix() on the result without the factor columns, and then adds them on afterwards.

    • If any offsets are present from using offset() in the original call to mold(), then they are extracted with model_offset().

    • If intercept = TRUE in the original call to mold(), then an intercept column is added.

    • It coerces the result of the above steps to a tibble.

  • Outcomes

    • It runs stats::model.frame() on new_data using the stored terms object corresponding to the outcomes.

    • Coerces the result to a tibble.
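The core idea behind the forge steps, reusing the terms object captured at training time on new data, can be sketched in base R. This is an illustration under assumptions rather than hardhat's code: the column subset below is only a stand-in for what shrink() does, the validation performed by scream() is omitted, and the variable names are made up.

```r
train <- iris[1:100, ]
test <- iris[101:150, ]

# "Mold" time: build the model frame and keep its terms object.
train_frame <- model.frame(~ Sepal.Length + Species, train)
predictor_terms <- attr(train_frame, "terms")

# "Forge" time: keep only the required columns (a crude stand-in
# for shrink(); no structural validation is done here).
required <- c("Sepal.Length", "Species")
new_data <- test[, required, drop = FALSE]

# Re-run model.frame() and model.matrix() with the stored terms so the
# new data is processed the same way as the training data.
test_frame <- model.frame(predictor_terms, new_data)
test_predictors <- model.matrix(predictor_terms, test_frame)

head(test_predictors)
```

Because the terms come from the training data, the test matrix has the same columns in the same order, even though rows 101 to 150 of iris contain only the virginica species.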

Differences From Base R

There are a number of differences from base R regarding how formulas are processed by mold() that require some explanation.

Multivariate outcomes can be specified on the LHS using syntax that is similar to the RHS (for example, outcome_1 + outcome_2 ~ predictors). If any complex calculations are done on the LHS and they return matrices (like stats::poly()), then those matrices are flattened into multiple columns of the tibble after the call to model.frame(). While this is possible, it is not recommended, and if a large amount of preprocessing is required on the outcomes, then you are better off using a recipes::recipe().

Global variables are not allowed in the formula. An error will be thrown if they are included. All terms in the formula should come from data.

By default, intercepts are not included in the predictor output from the formula. To include an intercept, set blueprint = default_formula_blueprint(intercept = TRUE). The rationale for this is that many packages either always require or never allow an intercept (for example, the earth package), and they do a large amount of extra work to keep the user from supplying one or removing it. This interface standardizes all of that flexibility in one place.

Examples

# ---------------------------------------------------------------------------
# Setup

train <- iris[1:100,]
test <- iris[101:150,]

# ---------------------------------------------------------------------------
# Formula Example

# Call mold() with the training data
processed <- mold(
  log(Sepal.Width) ~ Sepal.Length + Species,
  train,
  blueprint = default_formula_blueprint(intercept = TRUE)
)

# Then, call forge() with the blueprint and the test data
# to have it preprocess the test data in the same way
forge(test, processed$blueprint)
#> $predictors
#> # A tibble: 50 x 4
#>    `(Intercept)` Sepal.Length Speciesversicolor Speciesvirginica
#>            <dbl>        <dbl>             <dbl>            <dbl>
#>  1             1          6.3                 0                1
#>  2             1          5.8                 0                1
#>  3             1          7.1                 0                1
#>  4             1          6.3                 0                1
#>  5             1          6.5                 0                1
#>  6             1          7.6                 0                1
#>  7             1          4.9                 0                1
#>  8             1          7.3                 0                1
#>  9             1          6.7                 0                1
#> 10             1          7.2                 0                1
#> # … with 40 more rows
#>
#> $outcomes
#> NULL
#>
#> $extras
#> $extras$offset
#> NULL
#>
#>

# Use `outcomes = TRUE` to also extract the preprocessed outcome
forge(test, processed$blueprint, outcomes = TRUE)
#> $predictors
#> # A tibble: 50 x 4
#>    `(Intercept)` Sepal.Length Speciesversicolor Speciesvirginica
#>            <dbl>        <dbl>             <dbl>            <dbl>
#>  1             1          6.3                 0                1
#>  2             1          5.8                 0                1
#>  3             1          7.1                 0                1
#>  4             1          6.3                 0                1
#>  5             1          6.5                 0                1
#>  6             1          7.6                 0                1
#>  7             1          4.9                 0                1
#>  8             1          7.3                 0                1
#>  9             1          6.7                 0                1
#> 10             1          7.2                 0                1
#> # … with 40 more rows
#>
#> $outcomes
#> # A tibble: 50 x 1
#>    `log(Sepal.Width)`
#>                 <dbl>
#>  1              1.19
#>  2              0.993
#>  3              1.10
#>  4              1.06
#>  5              1.10
#>  6              1.10
#>  7              0.916
#>  8              1.06
#>  9              0.916
#> 10              1.28
#> # … with 40 more rows
#>
#> $extras
#> $extras$offset
#> NULL
#>
#>

# ---------------------------------------------------------------------------
# Factors without an intercept

# No intercept is added by default
processed <- mold(Sepal.Width ~ Species, train)

# So factor columns are completely expanded
# into all `K` columns (the number of levels)
processed$predictors
#> # A tibble: 100 x 3
#>    Speciessetosa Speciesversicolor Speciesvirginica
#>            <dbl>             <dbl>            <dbl>
#>  1             1                 0                0
#>  2             1                 0                0
#>  3             1                 0                0
#>  4             1                 0                0
#>  5             1                 0                0
#>  6             1                 0                0
#>  7             1                 0                0
#>  8             1                 0                0
#>  9             1                 0                0
#> 10             1                 0                0
#> # … with 90 more rows

# ---------------------------------------------------------------------------
# Global variables

y <- rep(1, times = nrow(train))

# In base R, global variables are allowed in a model formula
frame <- model.frame(Species ~ y + Sepal.Length, train)
head(frame)
#>   Species y Sepal.Length
#> 1  setosa 1          5.1
#> 2  setosa 1          4.9
#> 3  setosa 1          4.7
#> 4  setosa 1          4.6
#> 5  setosa 1          5.0
#> 6  setosa 1          5.4

# mold() does not allow them, and throws an error
tryCatch(
  expr = mold(Species ~ y + Sepal.Length, train),
  error = function(e) print(e$message)
)
#> The following predictors were not found in `data`: 'y'.

# ---------------------------------------------------------------------------
# Dummy variables and interactions

# By default, factor columns are expanded
# and interactions are created, both by
# calling model.matrix(). Some models (like
# tree based models) can take factors directly
# but still might want to use the formula method.
# In those cases, set `indicators = FALSE` to not
# run model.matrix() on factor columns. Interactions
# are still allowed and are run on numeric columns.

blueprint_no_indicators <- default_formula_blueprint(indicators = FALSE)

processed <- mold(
  ~ Species + Sepal.Width:Sepal.Length,
  train,
  blueprint = blueprint_no_indicators
)

processed$predictors
#> # A tibble: 100 x 2
#>    `Sepal.Width:Sepal.Length` Species
#>                         <dbl> <fct>
#>  1                       17.8 setosa
#>  2                       14.7 setosa
#>  3                       15.0 setosa
#>  4                       14.3 setosa
#>  5                       18   setosa
#>  6                       21.1 setosa
#>  7                       15.6 setosa
#>  8                       17   setosa
#>  9                       12.8 setosa
#> 10                       15.2 setosa
#> # … with 90 more rows

# An informative error is thrown when `indicators = FALSE` and
# factors are present in interaction terms or in inline functions
try(mold(Sepal.Width ~ Sepal.Length:Species, train, blueprint = blueprint_no_indicators))
#> Error : Interaction terms involving factors have been detected on the RHS of `formula`. These are not allowed when `indicators = FALSE`. Interactions involving factors were detected for the following columns: 'Species'.

try(mold(Sepal.Width ~ paste0(Species), train, blueprint = blueprint_no_indicators))
#> Error : Functions involving factors have been detected on the RHS of `formula`. These are not allowed when `indicators = FALSE`. Functions involving factors were detected for the following columns: 'Species'.

# ---------------------------------------------------------------------------
# Multivariate outcomes

# Multivariate formulas can be specified easily
processed <- mold(Sepal.Width + log(Sepal.Length) ~ Species, train)
processed$outcomes
#> # A tibble: 100 x 2
#>    Sepal.Width `log(Sepal.Length)`
#>          <dbl>               <dbl>
#>  1         3.5                1.63
#>  2         3                  1.59
#>  3         3.2                1.55
#>  4         3.1                1.53
#>  5         3.6                1.61
#>  6         3.9                1.69
#>  7         3.4                1.53
#>  8         3.4                1.61
#>  9         2.9                1.48
#> 10         3.1                1.59
#> # … with 90 more rows

# Inline functions on the LHS are run, but any matrix
# output is flattened (like what happens in `model.matrix()`)
# (essentially this means you don't wind up with columns
# in the tibble that are matrices)
processed <- mold(poly(Sepal.Length, degree = 2) ~ Species, train)
processed$outcomes
#> # A tibble: 100 x 2
#>    `poly(Sepal.Length, degree = 2).1` `poly(Sepal.Length, degree = 2).2`
#>                                 <dbl>                              <dbl>
#>  1                            -0.0581                            -0.0377
#>  2                            -0.0894                             0.0147
#>  3                            -0.121                              0.0846
#>  4                            -0.136                              0.126
#>  5                            -0.0738                            -0.0137
#>  6                            -0.0111                            -0.0837
#>  7                            -0.136                              0.126
#>  8                            -0.0738                            -0.0137
#>  9                            -0.168                              0.222
#> 10                            -0.0894                             0.0147
#> # … with 90 more rows

# TRUE
ncol(processed$outcomes) == 2
#> [1] TRUE

# Multivariate formulas specified in mold()
# carry over into forge()
forge(test, processed$blueprint, outcomes = TRUE)
#> $predictors
#> # A tibble: 50 x 3
#>    Speciessetosa Speciesversicolor Speciesvirginica
#>            <dbl>             <dbl>            <dbl>
#>  1             0                 0                1
#>  2             0                 0                1
#>  3             0                 0                1
#>  4             0                 0                1
#>  5             0                 0                1
#>  6             0                 0                1
#>  7             0                 0                1
#>  8             0                 0                1
#>  9             0                 0                1
#> 10             0                 0                1
#> # … with 40 more rows
#>
#> $outcomes
#> # A tibble: 50 x 2
#>    `poly(Sepal.Length, degree = 2).1` `poly(Sepal.Length, degree = 2).2`
#>                                 <dbl>                              <dbl>
#>  1                             0.130                              0.0137
#>  2                             0.0515                            -0.0839
#>  3                             0.255                              0.397
#>  4                             0.130                              0.0137
#>  5                             0.161                              0.0833
#>  6                             0.333                              0.777
#>  7                            -0.0894                             0.0147
#>  8                             0.286                              0.536
#>  9                             0.192                              0.170
#> 10                             0.271                              0.464
#> # … with 40 more rows
#>
#> $extras
#> $extras$offset
#> NULL
#>
#>

# ---------------------------------------------------------------------------
# Offsets

# Offsets are handled specially in base R, so they deserve special
# treatment here as well. You can add offsets using the inline function
# offset()
processed <- mold(Sepal.Width ~ offset(Sepal.Length) + Species, train)

processed$extras$offset
#> # A tibble: 100 x 1
#>    .offset
#>      <dbl>
#>  1     5.1
#>  2     4.9
#>  3     4.7
#>  4     4.6
#>  5     5
#>  6     5.4
#>  7     4.6
#>  8     5
#>  9     4.4
#> 10     4.9
#> # … with 90 more rows

# Multiple offsets can be included, and they get added together
processed <- mold(
  Sepal.Width ~ offset(Sepal.Length) + offset(Petal.Width),
  train
)

identical(
  processed$extras$offset$.offset,
  train$Sepal.Length + train$Petal.Width
)
#> [1] TRUE

# Forging test data will also require
# and include the offset
forge(test, processed$blueprint)
#> $predictors
#> # A tibble: 50 x 0
#>
#> $outcomes
#> NULL
#>
#> $extras
#> $extras$offset
#> # A tibble: 50 x 1
#>    .offset
#>      <dbl>
#>  1     8.8
#>  2     7.7
#>  3     9.2
#>  4     8.1
#>  5     8.7
#>  6     9.7
#>  7     6.6
#>  8     9.1
#>  9     8.5
#> 10     9.7
#> # … with 40 more rows
#>
#>

# ---------------------------------------------------------------------------
# Intercept only

# Because `1` and `0` are intercept modifying terms, they are
# not allowed in the formula and are controlled by the
# `intercept` argument of the blueprint. To use an intercept
# only formula, you should supply `NULL` on the RHS of the formula.
mold(~ NULL, train, blueprint = default_formula_blueprint(intercept = TRUE))
#> $predictors
#> # A tibble: 100 x 1
#>    `(Intercept)`
#>            <dbl>
#>  1             1
#>  2             1
#>  3             1
#>  4             1
#>  5             1
#>  6             1
#>  7             1
#>  8             1
#>  9             1
#> 10             1
#> # … with 90 more rows
#>
#> $outcomes
#> # A tibble: 100 x 0
#>
#> $blueprint
#> Formula blueprint:
#>
#> # Predictors: 0
#>   # Outcomes: 0
#>    Intercept: TRUE
#> Novel Levels: FALSE
#>   Indicators: TRUE
#>
#> $extras
#> $extras$offset
#> NULL
#>
#>