Azure Auto ML: Missing Validation Data and Model Comparison

The Importance of Validation Data for Azure Auto ML

Question

You are using Azure's Auto ML functionality to train models on your dataset containing around 15,000 observations.

The child runs need to validate the model by comparing the predictions made by the model with the labels in the validation data.

Therefore, the Auto ML needs to be provided with both training and validation data.

You have provided the necessary training data, but no validation data.

What do you expect to happen?
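For context, here is a minimal sketch of a configuration matching this scenario, assuming the Azure ML Python SDK v1 (azureml-train-automl); the dataset name, label column, compute target, and experiment name are hypothetical placeholders:

    # Minimal sketch of the scenario: training data is supplied, but neither
    # validation_data nor n_cross_validations is specified.
    from azureml.core import Workspace, Dataset, Experiment
    from azureml.train.automl import AutoMLConfig

    ws = Workspace.from_config()
    training_data = Dataset.get_by_name(ws, name="my-15k-row-dataset")  # hypothetical dataset name

    automl_config = AutoMLConfig(
        task="classification",
        primary_metric="AUC_weighted",
        training_data=training_data,
        label_column_name="label",       # hypothetical label column
        compute_target="cpu-cluster",    # hypothetical compute target
        # No validation_data and no n_cross_validations: Auto ML will fall
        # back to its default validation strategy.
    )

    experiment = Experiment(ws, "automl-validation-demo")  # hypothetical experiment name
    run = experiment.submit(automl_config)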

Answers

Explanations


A. Train/validation split is applied automatically.
B. A "Missing validation data" exception is thrown and the execution stops.
C. Cross-validation is applied automatically.
D. An error message is written to the log of the Run, which can be retrieved by RunDetails().show().

Answer: C.

Option A is incorrect because Auto ML applies default methods for validation when no validation data is explicitly provided.

The method to be applied depends on the number of rows (observations) in the input dataset.

The train/validation split method is used automatically if the dataset contains more than 20,000 rows.

Since your dataset has fewer than 20,000 rows, this option is incorrect.
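For comparison, a train/validation split can also be requested explicitly through the validation_size parameter of AutoMLConfig. A minimal sketch (Azure ML Python SDK v1; dataset name and label column are hypothetical):

    from azureml.core import Workspace, Dataset
    from azureml.train.automl import AutoMLConfig

    ws = Workspace.from_config()
    training_data = Dataset.get_by_name(ws, name="my-15k-row-dataset")  # hypothetical dataset name

    automl_config = AutoMLConfig(
        task="classification",
        primary_metric="AUC_weighted",
        training_data=training_data,
        label_column_name="label",   # hypothetical label column
        validation_size=0.2,         # hold out 20% of the training rows for validation
    )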

Option B is incorrect because Auto ML applies default methods for validation when no validation data is explicitly provided.

No exception occurs simply by the lack of explicit validation data.

Option C is CORRECT because Auto ML applies default methods for validation when no validation data is explicitly provided.

The method to be applied depends on the number of rows (observations) in the training dataset.

If the dataset contains fewer than 20,000 rows, cross-validation with a default number of folds (which depends on the number of rows) is selected and applied automatically, i.e. parts of the original training data are used to validate the performance of the child runs.
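To make this behaviour explicit rather than relying on the default fold count, n_cross_validations can be set directly. A minimal sketch (Azure ML Python SDK v1; dataset name and label column are hypothetical):

    from azureml.core import Workspace, Dataset
    from azureml.train.automl import AutoMLConfig

    ws = Workspace.from_config()
    training_data = Dataset.get_by_name(ws, name="my-15k-row-dataset")  # hypothetical dataset name

    automl_config = AutoMLConfig(
        task="classification",
        primary_metric="AUC_weighted",
        training_data=training_data,
        label_column_name="label",   # hypothetical label column
        n_cross_validations=5,       # explicit fold count; if omitted, Auto ML picks a default
    )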

Option D is incorrect because the log of the Run object is primarily used to log metrics during experiment runs.

In addition, no error occurs in this case, so there are no errors to log.
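For reference, a sketch of how the Run mentioned in option D can be inspected: RunDetails renders a notebook widget with run status and metrics, while get_metrics() and get_details() retrieve the same information programmatically (the experiment name is a hypothetical placeholder):

    from azureml.core import Workspace, Experiment
    from azureml.widgets import RunDetails

    ws = Workspace.from_config()
    experiment = Experiment(ws, "automl-validation-demo")  # hypothetical experiment name
    run = next(experiment.get_runs())                      # most recent run of the experiment

    RunDetails(run).show()       # notebook widget with status, metrics, and child runs
    print(run.get_metrics())     # metrics logged during the run
    print(run.get_details())     # status and error information, if any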

Reference:

If you are using Azure's Auto ML functionality to train models on a dataset containing around 15,000 observations and you have not provided any validation data, Auto ML will automatically fall back to cross-validation, because the training data contains fewer than 20,000 rows. A train/validation split (option A) is only applied automatically for datasets with more than 20,000 rows.

Therefore, the correct answer is C: cross-validation is applied automatically.

Azure Auto ML uses a technique called k-fold cross-validation to train and validate the model in this case. The training data is split into k subsets, or folds; by default, Auto ML chooses the number of folds based on the size of the training data (10 folds if the data has fewer than 1,000 rows, 3 folds if it has between 1,000 and 20,000 rows). Auto ML then trains the model on k-1 folds and validates it on the remaining fold. This process is repeated k times, with each fold serving as the validation set exactly once. The performance metrics are then averaged over the k folds to provide a more reliable estimate of the model's performance.
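As an illustration of the technique itself (not Azure's internal implementation), the following scikit-learn sketch runs 3-fold cross-validation on a synthetic 15,000-row classification dataset and averages the metric over the folds:

    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import cross_val_score

    # Synthetic stand-in for a ~15,000-row classification dataset.
    X, y = make_classification(n_samples=15_000, n_features=20, random_state=0)
    model = LogisticRegression(max_iter=1000)

    # Each fold serves as the validation set exactly once; the scores are averaged.
    scores = cross_val_score(model, X, y, cv=3, scoring="roc_auc")
    print(scores, scores.mean())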

No exception is thrown when validation data is missing, so option B, "A 'Missing validation data' exception is thrown and the execution stops," is incorrect: Auto ML silently falls back to its default validation strategy instead of stopping the execution.

Option C, "Cross-validation is applied automatically," is correct: for a dataset of this size, Auto ML performs k-fold cross-validation with a default number of folds.

Option D, "An error message is written to the log of the Run which can be retrieved by RunDetails().show()," is not correct, as no error message is generated in this scenario. Auto ML simply proceeds with cross-validation, as described in option C.