Separating Data for Training and Testing in ML Pipeline using Python SDK | Exam DP-100

Separating Data for Training and Testing in ML Pipeline

Question

You are developing an ML pipeline with the Azure Machine Learning Python SDK, and you need to separate your data into a set for training the model and a set for testing the trained model.

A teammate has shared a code snippet that would be a great help, if it works.

You have to verify that it does the job.

According to its description, the script loads data from the default datastore and, using the scikit-learn package, reserves 70% of the observations for training and the remaining 30% for testing, in a reproducible way.

from azureml.core import Run  # needed for Run.get_context(); assumed to be imported elsewhere in the original script
from sklearn.model_selection import train_test_split

# Get the experiment run context
run = Run.get_context()

# Load data
print("Loading Data...")
diabetes_data = run.input_datasets['diabetes_train'].to_pandas_dataframe()

# Separate features and labels
X, y = diabetes_data[['Pregnancies', 'PlasmaGlucose',
                      'DiastolicBloodPressure', 'BMI', 'Age']].values, \
       diabetes_data['Diabetic'].values

# Split data into training set (70%) and test set (30%)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=0)
After reviewing the code, do you think it does its job as described?

Answers

Explanations


A. Yes B. No

Answer: A.

Option A is CORRECT because the code does its job exactly as described.

Option B is incorrect for the same reason: the code works as described.


Based on the given code snippet, it appears that the code is designed to load data from a default datastore and separate 70% of the observations for training and the remaining 30% for testing. It is using the scikit-learn package to split the data in a reproducible way.

The code imports the train_test_split function from scikit-learn, which is commonly used for splitting datasets in machine learning. It then retrieves the run context, loads the diabetes data from the run's 'diabetes_train' input dataset into a pandas DataFrame, and separates the features from the label. Finally, it calls train_test_split to divide the data into training and test sets, with a test size of 30% and a random state of 0.

Based on the description and code provided, the code does indeed split the data into training and test sets as described: the random_state parameter ensures the split is reproducible, and the test_size parameter controls the proportion of the data held out for testing.
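As a quick illustration of those two parameters, the sketch below runs train_test_split on a small toy array (hypothetical data, not the Azure ML diabetes dataset) and checks both the 70/30 proportions and the fact that a fixed random_state reproduces the identical split:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data: 10 observations with 2 features each, plus 10 labels
X = np.arange(20).reshape(10, 2)
y = np.arange(10)

# Same parameters as in the script under review
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=0)

# 70% of 10 observations -> 7 for training, 3 for testing
print(X_train.shape, X_test.shape)  # (7, 2) (3, 2)

# Re-running with the same random_state reproduces the exact same split
X_train2, X_test2, _, _ = train_test_split(X, y, test_size=0.30, random_state=0)
print(np.array_equal(X_train, X_train2))  # True
```

Without random_state (or with a different value), the rows assigned to each set would vary between runs, which is why fixing it matters for reproducibility.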

Therefore, the answer is A. Yes, the code does its job as described.