What is Cross Validation in Machine Learning?



What is Cross Validation?

Cross validation is a statistical technique used in machine learning to evaluate how well a model generalises to unseen (new) data. The basic idea is to split the dataset into multiple subsets, or folds. The model is trained on some of these folds and then evaluated on the remaining fold(s). This process is repeated multiple times, with each fold serving as the evaluation set exactly once, while the rest of the folds are used for training.

Why implement Cross Validation?

  • More Reliable Performance Estimation
    By repeatedly splitting the dataset into training and validation sets and evaluating the model’s performance across multiple iterations, the technique provides a more reliable estimate of how well the model will generalise to unseen data compared to a single train-test split. This reduces the risk of overfitting to a specific subset of data.
  • Utilisation of Available Data
    The statistical method allows you to make the most efficient use of your available data. Instead of just partitioning it into one training set and one validation set, you use the entire dataset for both training and validation across multiple iterations. This can be particularly advantageous when working with limited data.
  • Parameter Tuning and Model Selection
    Commonly used for hyperparameter tuning and model selection. By evaluating different hyperparameters or even different models, you can choose the best performing model configuration.
  • Detection of Data Drift and Model Instability
    By repeatedly evaluating the model’s performance on different subsets of data, you detect issues such as data drift or model instability over time. If the model’s performance varies significantly across different folds, it may indicate such issues.
  • Bias and Variance Analysis
    Cross validation can also help in understanding the bias-variance tradeoff of the model. By analysing the performance across different folds, you can gain insights into whether the model is underfitting (high bias) or overfitting (high variance) the data.
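The last two points above can be made concrete: the spread of scores across folds is itself a useful signal. A minimal sketch using scikit-learn's `cross_val_score` on a synthetic dataset (the dataset and the decision-tree model are just illustrative choices, not part of the original text):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary classification data for illustration
X, y = make_classification(n_samples=200, random_state=0)

# One accuracy score per fold (5-fold by default here via cv=5)
scores = cross_val_score(DecisionTreeClassifier(random_state=0), X, y, cv=5)

print("Fold scores:", scores)
# A large standard deviation across folds can indicate model instability
print("Mean:", scores.mean(), "Std:", scores.std())
```

A single train-test split would produce just one of these numbers; the fold-to-fold spread is what lets you reason about stability and variance.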

Common types of cross-validation

There are multiple types of cross validation, each with its own characteristics and implementation. We briefly cover some of the most common types here.

K-Fold Cross Validation

We divide the dataset into k subsets of equal size. The model is trained k times, each time using k-1 folds for training and the remaining fold for validation.
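A small sketch of the splitting itself, using scikit-learn's `KFold` on a toy 10-sample dataset (the data here is arbitrary, chosen only to make the fold sizes easy to see):

```python
import numpy as np
from sklearn.model_selection import KFold

X = np.arange(10).reshape(-1, 1)  # 10 samples, one feature each

# k = 5: each fold holds 10/5 = 2 validation samples
kf = KFold(n_splits=5, shuffle=True, random_state=42)

folds = list(kf.split(X))  # each entry is (train_indices, validation_indices)
for i, (train_idx, val_idx) in enumerate(folds):
    print(f"Fold {i}: train size={len(train_idx)}, validation size={len(val_idx)}")
```

Each sample appears in exactly one validation fold, so across the k iterations the whole dataset is used for validation once and for training k-1 times.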

Leave-One-Out Cross Validation (LOOCV)

Each data point serves as a validation set once, and the model is trained on all other data points. This is essentially the k-fold type with k equal to the number of data points.
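The k = n relationship is easy to verify with scikit-learn's `LeaveOneOut` (again on arbitrary toy data):

```python
import numpy as np
from sklearn.model_selection import LeaveOneOut

X = np.arange(5).reshape(-1, 1)  # 5 data points

loo = LeaveOneOut()
folds = list(loo.split(X))

# One fold per data point: each validation set holds exactly one sample
print("Number of folds:", len(folds))
```

Note that LOOCV trains n models, so it can be expensive on large datasets.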

Stratified K-Fold Cross Validation

This method ensures that each fold has approximately the same class distribution as the whole dataset, which is particularly useful for imbalanced datasets where certain classes are underrepresented.
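A sketch with scikit-learn's `StratifiedKFold`, using a deliberately imbalanced toy label vector (9 samples of class 0, 3 of class 1) to show that each fold preserves the 3:1 ratio:

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

X = np.zeros((12, 1))                 # features are irrelevant to the split
y = np.array([0] * 9 + [1] * 3)       # imbalanced: 75% class 0, 25% class 1

skf = StratifiedKFold(n_splits=3)
folds = list(skf.split(X, y))

for _, val_idx in folds:
    # each validation fold contains 3 samples of class 0 and 1 of class 1
    print(np.bincount(y[val_idx]))
```

A plain `KFold` on the same data could easily produce a fold with no class-1 samples at all, making the validation score for that fold meaningless.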

Time Series Cross Validation

For time series data, it’s important to maintain the temporal order of the data. In this approach, each fold contains contiguous segments of time.
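scikit-learn implements this as `TimeSeriesSplit`: each training set is an initial contiguous segment, and the validation set is the segment that immediately follows it. A minimal sketch on toy data:

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

X = np.arange(10).reshape(-1, 1)  # 10 time-ordered observations

tscv = TimeSeriesSplit(n_splits=4)
folds = list(tscv.split(X))

for train_idx, val_idx in folds:
    # every training index precedes every validation index
    print("train:", train_idx, "validation:", val_idx)
```

Unlike standard k-fold, the model is never validated on data that precedes its training window, which would leak future information into the past.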

Example

We have a dataset of students’ exam scores. The goal is to build a machine learning model that predicts whether a student will pass or fail, based on their previous exam scores. The features (independent variables) are the exam scores and the target (dependent variable) is the pass/fail outcome.

  1. Data preparation – first, we load the dataset. Then we split it into the features and the target variable.
  2. Model selection – we choose the appropriate ML model (e.g. logistic regression).
  3. Validation setup – we choose the validation type (e.g. 5 fold). As a result, we split the dataset into 5 equal parts.
  4. Training and evaluation – we use 4 parts as the training set and the remaining part as the validation set. We train the model on the training data, then evaluate its performance on the validation set (e.g. accuracy, precision, recall, F1, etc.). We repeat this whole step for each fold.
  5. Performance aggregation – in the end, we aggregate the model’s performance of all folds. In other words, we take the average of all folds and estimate the model’s overall performance.
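The five steps above can be sketched end to end with scikit-learn. The exam-score data here is synthetic and the pass/fail rule is invented purely for illustration; `cross_val_score` handles steps 3–5 (splitting, per-fold training/evaluation, and collecting the scores):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Step 1: data preparation (hypothetical data: 3 previous exam scores per student)
rng = np.random.default_rng(0)
X = rng.uniform(0, 100, size=(100, 3))   # features: previous exam scores
y = (X.mean(axis=1) > 55).astype(int)    # target: pass (1) / fail (0)

# Step 2: model selection
model = LogisticRegression()

# Steps 3-4: 5-fold split, then train and evaluate on each fold
scores = cross_val_score(model, X, y, cv=5, scoring="accuracy")
print("Per-fold accuracy:", scores)

# Step 5: performance aggregation
print("Mean accuracy:", scores.mean())
```

The mean of the five fold accuracies is the overall performance estimate; the individual scores also show how much the estimate varies from fold to fold.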
