# A tibble: 2 × 2
     No   Yes
  <dbl> <dbl>
1 0.843 0.157
2 0.830 0.170

If the response variable is imbalanced, the split should be stratified. This keeps the distribution of the response variable similar in each of the resulting splits.

# A tibble: 2 × 2
     No   Yes
  <dbl> <dbl>
1 0.839 0.161
2 0.837 0.163
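A minimal sketch of a stratified split, assuming the rsample package from tidymodels. The data frame `df` and its columns `x` and `response` are synthetic and hypothetical; they are not the data behind the tables above.

```r
library(rsample)

set.seed(123)

# Hypothetical imbalanced data: 84% "No", 16% "Yes"
df <- data.frame(
  x = rnorm(1000),
  response = factor(rep(c("No", "Yes"), times = c(840, 160)))
)

# strata = response samples within each class, so both splits
# keep roughly the same class proportions
split <- initial_split(df, prop = 0.7, strata = response)
train <- training(split)
test  <- testing(split)

# Class proportions in each split
prop.table(table(train$response))
prop.table(table(test$response))
```

Comparing the two proportion tables is a quick way to confirm that the stratification worked.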

“Down-sampling balances the dataset by reducing the size of the abundant class(es) to match the frequencies in the least prevalent class. This method is used when the quantity of data is sufficient. By keeping all samples in the rare class and randomly selecting an equal number of samples in the abundant class.”

“On the contrary, up-sampling is used when the quantity of data is insufficient. It tries to balance the dataset by increasing the size of rarer samples. Rather than getting rid of abundant samples, new rare samples are generated by using repetition or bootstrapping”
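Both ideas can be sketched in base R. This is only an illustration on synthetic data (the names `df`, `x`, and `response` are hypothetical), and up-sampling is done here by simple repetition of rare rows.

```r
set.seed(123)

# Hypothetical imbalanced data: 170 "No" rows and 30 "Yes" rows
df <- data.frame(
  x = rnorm(200),
  response = factor(rep(c("No", "Yes"), times = c(170, 30)))
)

# Row indices grouped by class
rows_by_class <- split(seq_len(nrow(df)), df$response)
n_min <- min(lengths(rows_by_class))
n_max <- max(lengths(rows_by_class))

# Down-sampling: keep n_min randomly chosen rows from every class
down_rows <- unlist(
  lapply(rows_by_class, function(i) i[sample.int(length(i), n_min)]),
  use.names = FALSE
)
down <- df[down_rows, ]

# Up-sampling: keep every row and repeat randomly chosen rows
# of the smaller classes until each class has n_max rows
up_rows <- unlist(
  lapply(rows_by_class, function(i) {
    extra <- i[sample.int(length(i), n_max - length(i), replace = TRUE)]
    c(i, extra)
  }),
  use.names = FALSE
)
up <- df[up_rows, ]

table(down$response)
table(up$response)
```

Either technique should be applied to the training data only, so that the test data keeps the original class distribution.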

Training data

Training data are “used to develop feature sets, train our algorithms, tune hyperparameters, compare models, and all of the other activities required to choose a final model (e.g., the model we want to put into production).”

flowchart LR
  id1[(DataBase)] --> A((Train))
  subgraph Training
    direction TB
    subgraph Resampling
      B[resample 1]
      C[resample 2]
      D[resample 3]
    end
    subgraph Model_1
      E[Develop] --> F[Evaluate]
      G[Develop] --> H[Evaluate]
      I[Develop] --> J[Evaluate]
    end
    subgraph Model_2
      K[Develop] --> L[Evaluate]
      M[Develop] --> N[Evaluate]
      O[Develop] --> P[Evaluate]
    end
    subgraph Model_n
      Q[Develop] --> R[Evaluate]
      S[Develop] --> T[Evaluate]
      U[Develop] --> V[Evaluate]
    end
  end
  A -- Data into samples --> Resampling
  Resampling -- Create --> Model_1
  Model_1 -- Tune --> Model_2
  Model_2 -- Tune --> Model_n

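Once the best model is selected, it is time to evaluate it on the test data. Typical splits put 60% to 80% of the data into training and the remaining 40% to 20% into testing. It is important to stay within these limits: with too much data in training, too little is left to assess performance, and a model that fits the training data very well may turn out not to generalize (overfitting).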

Test data

Test data: “having chosen a final model, these data are used to estimate an unbiased assessment of the model’s performance, which we refer to as the generalization error.”

2. Modelling in R

There are different ways to specify a model formula, depending on the engine used. To assess a model during development we should not use the test data; instead, the training data should be split further using resampling methods.

3. Resampling methods

“Provide an alternative approach by allowing us to repeatedly fit a model of interest to parts of the training data and test its performance on other parts. The two most commonly used resampling methods include k-fold cross validation and bootstrapping.”

The main idea of k-fold cross-validation is to divide the training data into k folds: the model is trained on k − 1 folds and tested on the remaining one, so it can be evaluated without touching the test data. This procedure is repeated k times, so that each fold is used once for testing. In practice, k = 5 or k = 10 is common.

“Although using k ≥ 10 helps to minimize the variability in the estimated performance, k-fold CV still tends to have higher variability than bootstrapping (discussed next). Kim (2009) showed that repeating k-fold CV can help to increase the precision of the estimated generalization error. Consequently, for smaller data sets (say n<10,000), 10-fold CV repeated 5 or 10 times will improve the accuracy of your estimated performance and also provide an estimate of its variability.”
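A minimal sketch of repeated k-fold cross-validation, assuming the rsample package. The training data frame `train` is synthetic and hypothetical.

```r
library(rsample)

set.seed(123)

# Hypothetical training data
train <- data.frame(y = rnorm(200), x = rnorm(200))

# 10-fold cross-validation repeated 5 times: 50 resamples in total
folds <- vfold_cv(train, v = 10, repeats = 5)

# Each resample holds an analysis part (used to fit the model)
# and an assessment part (used to evaluate it)
first_split <- folds$splits[[1]]
fit_rows  <- analysis(first_split)
eval_rows <- assessment(first_split)

nrow(folds)
nrow(fit_rows)
nrow(eval_rows)
```

A model would then be fit on each analysis part and scored on the matching assessment part, and the scores averaged across resamples.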

Bootstrapping draws random samples of the data with replacement. “Since observations are replicated in bootstrapping, there tends to be less variability in the error measure compared with k-fold CV (Efron 1983). However, this can also increase the bias of your error estimate. This can be problematic with smaller data sets; however, for most average-to-large data sets (say n≥1,000) this concern is often negligible.”
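A minimal sketch of bootstrap resampling, again assuming the rsample package and a synthetic, hypothetical data frame `train`.

```r
library(rsample)

set.seed(123)

# Hypothetical training data
train <- data.frame(y = rnorm(200), x = rnorm(200))

# 25 bootstrap resamples, each drawn with replacement
boots <- bootstraps(train, times = 25)

first_boot <- boots$splits[[1]]

# The analysis part has as many rows as the original data (with repeats);
# the assessment part holds the rows that were never drawn (out-of-bag)
boot_rows <- analysis(first_boot)
oob_rows  <- assessment(first_boot)

nrow(boot_rows)
nrow(oob_rows)
```

Because rows are drawn with replacement, some observations appear several times in the analysis part and others end up only in the out-of-bag part.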

I will be using the recipes package from the tidymodels framework.

1. Target Engineering

Some models, for example parametric ones, assume that the response variable and the errors are normally distributed. It is therefore important to review the distribution of the response before modelling; transforming it might improve the predictions.

One way to correct a non-normal distribution is a log or Box-Cox transformation. “However, we should think of the preprocessing as creating a blueprint to be re-applied strategically. For this, you can use the recipe package or something similar (e.g., caret::preProcess()). This will not return the actual log transformed values but, rather, a blueprint to be applied later.”

# Log transformation applied to all outcomes
ames_recipe <- recipe(Sale_Price ~ ., data = ames_train) %>%
  step_log(all_outcomes())
# OR
# step_BoxCox(all_outcomes())

If the response variable has negative values, the previous approach produces NaN values. In that case, step_YeoJohnson() can be applied instead.
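A minimal sketch of the blueprint idea with a Yeo-Johnson transformation, assuming the recipes package. The data frames `train` and `test` are synthetic and hypothetical: the recipe is estimated on the training data with prep() and then applied to both sets with bake().

```r
library(recipes)

set.seed(123)

# Hypothetical skewed outcome that includes negative values
train <- data.frame(y = rexp(100) - 0.5, x = rnorm(100))
test  <- data.frame(y = rexp(30) - 0.5, x = rnorm(30))

# A plain log() cannot handle the negative values
n_nan <- sum(is.nan(suppressWarnings(log(train$y))))

# Blueprint: nothing is transformed yet
yj_recipe <- recipe(y ~ x, data = train) %>%
  step_YeoJohnson(all_outcomes())

# Estimate the transformation parameter on the training data only
yj_prepped <- prep(yj_recipe, training = train)

# Apply the same estimated transformation to both data sets
train_yj <- bake(yj_prepped, new_data = NULL)
test_yj  <- bake(yj_prepped, new_data = test)

n_nan
summary(train_yj$y)
```

The test data never influences the estimated parameter, which is the point of keeping the transformation inside the recipe.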

2. Dealing with missingness

I strongly recommend using the naniar package to check for missing values in the data frame, for example with naniar::vis_miss().

Some missing values might be errors introduced when the data was built, so they need to be investigated first. If the data is well built, the missing values can be imputed. Please check the following:
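As one hedged illustration of imputation inside a recipe, here is a sketch that assumes the recipes package and uses median imputation for numeric predictors and mode imputation for categorical ones. The data frame `df` and its columns are hypothetical, and other imputation steps (for example step_impute_knn()) may suit real data better.

```r
library(recipes)

# Hypothetical data with missing values in both predictors
df <- data.frame(
  y  = c(1, 2, 3, 4, 5, 6),
  x1 = c(10, NA, 30, 40, NA, 60),
  x2 = factor(c("a", "b", NA, "a", "a", "b"))
)

impute_recipe <- recipe(y ~ ., data = df) %>%
  step_impute_median(all_numeric_predictors()) %>% # median of observed x1
  step_impute_mode(all_nominal_predictors())       # most frequent level of x2

# Estimate the imputation values, then apply them to the same data
imputed <- bake(prep(impute_recipe, training = df), new_data = NULL)

imputed
```

Here the missing x1 values are filled with the median of the observed values, and the missing x2 value with the most frequent level.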

3. Feature filtering

Important

For some models, adding more features does not always improve the output; instead, it can increase processing time and computational cost.

The following images are taken from the book referenced at the beginning of the blog. The left one shows the performance metric against the number of features, and the right one shows the processing time taken to train a model.

Zero and near-zero variance variables are prime candidates for elimination as features. Such a feature has only a single value (zero variance) or is dominated by a single value (near-zero variance), so it gives the model little useful information.

To remove zero or near-zero variance variables, use the following functions from the recipes package:

ames_recipe %>%
  recipes::step_nzv(recipes::all_predictors()) %>% # remove near-zero variance predictors
  recipes::step_zv(recipes::all_predictors())      # remove zero variance predictors
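To see what these filters do, here is a small sketch on synthetic data, assuming the recipes package. The column names `constant`, `rare`, and `useful` are hypothetical, and step_nzv() is used with its default thresholds.

```r
library(recipes)

set.seed(123)

# Hypothetical data: one constant column, one nearly constant column,
# and one informative column
df <- data.frame(
  y        = rnorm(100),
  constant = rep(1, 100),
  rare     = c(rep(0, 99), 1),
  useful   = rnorm(100)
)

filter_recipe <- recipe(y ~ ., data = df) %>%
  step_zv(all_predictors()) %>%  # drops `constant`
  step_nzv(all_predictors())     # drops `rare`

filtered <- bake(prep(filter_recipe, training = df), new_data = NULL)

names(filtered)
```

Only the informative predictor and the outcome remain after baking.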

To correct skewness, apply a normalizing transformation: use Box-Cox for strictly positive features and Yeo-Johnson when the features include negative values.

ames_recipe %>%
  recipes::step_BoxCox(recipes::all_numeric_predictors()) # strictly positive features
# recipes::step_YeoJohnson(recipes::all_numeric_predictors()) # also handles negative features

“Standardization includes centering and scaling so that numeric variables have zero mean and unit variance, which provides a common comparable unit of measure across all the variables”

“Models that incorporate smooth functions of input features are sensitive to the scale of the inputs. Many algorithms use linear functions within their algorithms, some more obvious (e.g., GLMs and regularized regression) than others (e.g., neural networks, support vector machines, and principal components analysis). Other examples include algorithms that use distance measures such as the Euclidean distance (e.g., k nearest neighbor, k-means clustering, and hierarchical clustering).”

Important

However, you should standardize your variables within the recipe blueprint so that both training and test data standardization are based on the same mean and variance. This helps to minimize data leakage.
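A minimal sketch of standardizing inside a recipe, assuming the recipes package and synthetic, hypothetical data frames `train` and `test`. The mean and standard deviation are estimated on the training data only and then reused for the test data.

```r
library(recipes)

set.seed(123)

# Hypothetical predictors on very different scales
train <- data.frame(y = rnorm(80), x1 = rnorm(80, 50, 10), x2 = runif(80, 0, 1000))
test  <- data.frame(y = rnorm(20), x1 = rnorm(20, 50, 10), x2 = runif(20, 0, 1000))

# step_normalize() centers and scales in one step
std_recipe <- recipe(y ~ ., data = train) %>%
  step_normalize(all_numeric_predictors())

# Means and standard deviations come from the training data
std_prepped <- prep(std_recipe, training = train)

train_std <- bake(std_prepped, new_data = NULL)
test_std  <- bake(std_prepped, new_data = test)

# Test values are standardized with the training mean and sd
manual_x1 <- (test$x1 - mean(train$x1)) / sd(train$x1)
all.equal(test_std$x1, manual_x1)
```

The standardized test columns will generally not have exactly zero mean and unit variance, which is expected: they are scaled with the training statistics.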

Some models require all features to be numeric.

Lumping

In some cases, a few levels of a categorical variable have very few observations, so we can group them into a single level with step_other(). However, lumping should be used sparingly as there is often a loss in model performance (Kuhn, Max, and Kjell Johnson. 2013. Applied Predictive Modeling. Vol. 26. Springer.).
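A minimal sketch of lumping with step_other(), assuming the recipes package. The data frame `df`, the column `neighborhood`, and the 10% threshold are hypothetical choices for illustration.

```r
library(recipes)

# Hypothetical factor with two common levels (A, B) and two rare ones (C, D)
df <- data.frame(
  y = 1:20,
  neighborhood = factor(c(rep("A", 10), rep("B", 8), "C", "D"))
)

# Levels seen in fewer than 10% of the training rows are pooled into "other"
lump_recipe <- recipe(y ~ neighborhood, data = df) %>%
  step_other(neighborhood, threshold = 0.1, other = "other")

lumped <- bake(prep(lump_recipe, training = df), new_data = NULL)

table(lumped$neighborhood)
```

The threshold is a tuning choice: a higher value pools more levels and discards more information.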

One-hot & dummy encoding

A categorical column can be converted into a set of binary variables. However, some models, such as ordinary linear regression and neural networks, might have problems with collinearity, that is, correlation between predictor variables (or independent variables) such that they express a linear relationship in a regression model. When predictor variables in the same regression model are correlated, they cannot independently predict the value of the dependent variable. A full one-hot encoding is collinear by construction, because the binary columns always sum to one. Therefore, the dummy step removes one binary variable to avoid creating collinearity.
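The difference between the two encodings can be sketched with step_dummy(), assuming the recipes package. The data frame `df` and its column `colour` are hypothetical.

```r
library(recipes)

# Hypothetical data with one three-level categorical predictor
df <- data.frame(
  y = c(1, 2, 3, 4, 5, 6),
  colour = factor(c("red", "green", "blue", "red", "green", "blue"))
)

# Dummy encoding: one level is dropped and acts as the reference
dummy <- recipe(y ~ colour, data = df) %>%
  step_dummy(colour) %>%
  prep(training = df) %>%
  bake(new_data = NULL)

# One-hot encoding: every level gets its own binary column
one_hot <- recipe(y ~ colour, data = df) %>%
  step_dummy(colour, one_hot = TRUE) %>%
  prep(training = df) %>%
  bake(new_data = NULL)

names(dummy)
names(one_hot)

# The one-hot columns always sum to 1, which is the source of collinearity
one_hot_cols <- one_hot[, grepl("^colour_", names(one_hot))]
rowSums(one_hot_cols)
```

With three levels, dummy encoding yields two binary columns and one-hot encoding yields three.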