What Are Training, Validation and Test Data Sets?

When you start working in data science and engineering, you will hear three terms for sample data: 1) Training Data Set, 2) Validation Data Set, 3) Test Data Set. Let’s see what these three terms mean:

Training Data Set: The training data set is generally a sample of historical data that is used to build a predictive model with one or more algorithms, using machine learning software (e.g. R, Python, SAS, Stata, SPSS). Training several algorithms gives you a set of candidate models to compare and tune during the validation phase. Each type of algorithm has its own parameter options (the number of layers in a neural network, the number of trees in a random forest, etc.). Generally 60-70% of the original data set is taken as the training data set.
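As a rough illustration of carving out the training portion, here is a minimal Python sketch. The function name split_train_validation, the 70% fraction, the seed and the placeholder records are all assumptions made for this example; they are not prescribed by any particular tool.

```python
import random

def split_train_validation(rows, train_fraction=0.7, seed=42):
    """Shuffle rows and split them into (training, validation) lists."""
    if not 0.0 < train_fraction < 1.0:
        raise ValueError("train_fraction must be between 0 and 1")
    shuffled = list(rows)                  # copy so the caller's data is untouched
    random.Random(seed).shuffle(shuffled)  # fixed seed makes the split repeatable
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

# 100 placeholder records standing in for historical data (invented)
records = list(range(100))
training, validation = split_train_validation(records, train_fraction=0.7)
print(len(training), len(validation))
```

Shuffling before cutting keeps the two samples comparable when the original records are stored in some order, such as by date or by customer.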

Validation Data Set: The validation data set is generally a sample of historical data that is used to compare the performance of the predictive models built on the training data set. From the collection of candidate algorithms, we pick the one that performs best on the validation data. Generally the remaining 30-40% of the original data set is taken as the validation data set.
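Picking the best performer on the validation data can be sketched as follows. This is only a toy example under stated assumptions: the two candidate "algorithms" (a mean baseline and a least-squares straight line), the mean squared error metric and the data points are all invented for illustration.

```python
def fit_mean(train):
    """Baseline candidate: always predict the mean target seen in training."""
    mean_y = sum(y for _, y in train) / len(train)
    return lambda x: mean_y

def fit_line(train):
    """Candidate: least-squares straight line y = a*x + b."""
    n = len(train)
    mean_x = sum(x for x, _ in train) / n
    mean_y = sum(y for _, y in train) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in train)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in train)
    a = sxy / sxx
    b = mean_y - a * mean_x
    return lambda x: a * x + b

def mean_squared_error(model, rows):
    """Average squared gap between predictions and actual values."""
    return sum((model(x) - y) ** 2 for x, y in rows) / len(rows)

# Toy (x, y) data, invented for illustration: y is roughly 2*x + 1
train = [(0, 1.1), (1, 2.9), (2, 5.2), (3, 6.8), (4, 9.1), (5, 11.0)]
validation = [(6, 13.2), (7, 14.9), (8, 17.1)]

# Fit every candidate on the training data, score it on the validation data
candidates = {"mean baseline": fit_mean, "straight line": fit_line}
scores = {name: mean_squared_error(fit(train), validation)
          for name, fit in candidates.items()}
best_name = min(scores, key=scores.get)  # lowest validation error wins
print(best_name, scores)
```

The point of the design is that every candidate is fitted on the training data only and scored on data it has not seen, so the comparison is not flattered by memorization.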

Test Data Set: The test data set is generally a sample of current data that is used to measure the model’s performance during the piloting period. During the test phase, the purpose is to see how our final model will perform in the real world, so if its performance is very poor we should repeat the whole process starting from the training phase. When the test data is instead held out from the original data set, generally 15-30% is taken, and the training and validation shares shrink accordingly so that the three add up to 100%.
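The go or repeat decision at the end of the test phase might look like the sketch below. Everything here is hypothetical: the final_model function, the pilot records, the mean absolute error metric and the max_error limit of 1.0 are assumptions chosen for the example, and in practice the acceptable limit depends on your own business case.

```python
def mean_absolute_error(model, rows):
    """Average absolute gap between predictions and actual values."""
    return sum(abs(model(x) - y) for x, y in rows) / len(rows)

def passes_pilot(model, test_rows, max_error):
    """Return True when the test error stays within the agreed limit."""
    return mean_absolute_error(model, test_rows) <= max_error

def final_model(x):
    """Hypothetical model chosen in the validation phase."""
    return 2 * x + 1

# Invented "current" (x, y) records collected during the pilot
test_rows = [(10, 21.4), (11, 22.7), (12, 25.3)]

if passes_pilot(final_model, test_rows, max_error=1.0):
    decision = "go live"
else:
    decision = "repeat from the training phase"
print(decision)
```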

Some scientists use only the training and validation data sets and go live without piloting. The decision is yours. If you are confident enough and there is no risk factor, you can go live after the validation phase. But I would suggest piloting before implementing the model on the current target data set.

 
