Separating Data for Training and Testing in ML Pipeline using Python SDK | Exam DP-100

Separating Data for Training and Testing in ML Pipeline

Question

You are developing an ML pipeline with the Azure Machine Learning Python SDK, and you need to separate your data into a set for training the model and a set for testing the trained model.

A teammate has shared a code snippet that would be a great help, if it works.

You have to verify that it does the job.

According to its description, the script loads data from the default datastore and, using the scikit-learn package, reserves 70% of the observations for training and the remaining 30% for testing, in a reproducible way.

from azureml.core import Run  # needed for Run.get_context(); assumed to be imported elsewhere in the original script
from sklearn.model_selection import train_test_split

# Get the experiment run context
run = Run.get_context()

# Load data
print("Loading Data...")
diabetes_data = run.input_datasets['diabetes_train'].to_pandas_dataframe()

# Separate features and labels
X, y = diabetes_data[['Pregnancies', 'PlasmaGlucose',
                      'DiastolicBloodPressure', 'BMI', 'Age']].values, \
       diabetes_data['Diabetic'].values

# Split data into training set (70%) and test set (30%)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=0)
After reviewing the code, do you think it does its job as described?

Answers

Explanations


A. Yes B. No

Answer: A.

Option A is CORRECT because the code does its job exactly as described.

Option B is incorrect for the same reason: the code works as described.


Based on the given code snippet, it appears that the code is designed to load data from a default datastore and separate 70% of the observations for training and the remaining 30% for testing. It is using the scikit-learn package to split the data in a reproducible way.

The code imports the train_test_split function from scikit-learn, which is commonly used for splitting datasets in machine learning. It then retrieves the run context, loads the diabetes data from the run's 'diabetes_train' input dataset into a pandas DataFrame, and separates the features from the label. Finally, it calls train_test_split to divide the data into training and test sets, with a test size of 30% and a random state of 0.

Based on the description and code provided, the code does indeed split the data into training and test sets as described: the random_state parameter ensures the split is reproducible, and the test_size parameter controls the proportion of the data held out for testing.
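As a quick illustration of those two parameters, the sketch below runs train_test_split on a small toy array (hypothetical data, not the Azure ML diabetes dataset) and checks both the 70/30 proportions and the fact that a fixed random_state reproduces the identical split:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data: 10 observations with 2 features each, plus 10 labels
X = np.arange(20).reshape(10, 2)
y = np.arange(10)

# Same parameters as in the script under review
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=0)

# 70% of 10 observations -> 7 for training, 3 for testing
print(X_train.shape, X_test.shape)  # (7, 2) (3, 2)

# Re-running with the same random_state reproduces the exact same split
X_train2, X_test2, _, _ = train_test_split(X, y, test_size=0.30, random_state=0)
print(np.array_equal(X_train, X_train2))  # True
```

Without random_state (or with a different value), the rows assigned to each set would vary between runs, which is why fixing it matters for reproducibility.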

Therefore, the answer is A. Yes, the code does its job as described.