You are developing an ML pipeline using the Python SDK, and you have to split your data into one set for training the model and another for testing the trained model.
You got a code snippet from your teammate, which is a great help, if it works.
You have to check if it does the job.
According to its description, the script loads data from the default datastore and, using the scikit-learn package, reproducibly assigns 70% of the observations to training and the remaining 30% to testing.
from azureml.core import Run
from sklearn.model_selection import train_test_split

# Get the experiment run context
run = Run.get_context()

# Load data
print("Loading Data...")
diabetes_data = run.input_datasets['diabetes_train'].to_pandas_dataframe()

# Separate features and labels
X, y = diabetes_data[['Pregnancies','PlasmaGlucose','DiastolicBloodPressure','BMI','Age']].values, diabetes_data['Diabetic'].values

# Split data into training set and test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=0)

After reviewing the code, do you think it does its job as described?
A. Yes
B. No
Answer: A.
Option A is CORRECT because the code is correct.
It does its job exactly as it is described.
Option B is incorrect because the code is correct.
Based on the code snippet provided, it seems that the script does separate the data for training and testing as described.
Here's a detailed explanation of the code:
The script imports the necessary packages including "train_test_split" from "sklearn.model_selection" to split the data into training and testing sets.
The script then gets the experiment run context by calling the "Run.get_context()" method.
The script loads the dataset passed to the run under the input name diabetes_train by using "run.input_datasets['diabetes_train'].to_pandas_dataframe()", which converts it into a pandas dataframe.
The script separates the features and labels with the line "X, y = diabetes_data[['Pregnancies','PlasmaGlucose','DiastolicBloodPressure','BMI','Age']].values, diabetes_data['Diabetic'].values", so X holds the five feature columns and y holds the Diabetic label column, both as NumPy arrays.
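The feature and label separation can be tried without an Azure ML run. The following is a minimal sketch that assumes a tiny hand-made dataframe in place of the real dataset; the column names come from the snippet, while the two rows of values and the feature_columns helper name are invented purely for illustration.

```python
import pandas as pd

# Hypothetical two-row sample; only the column names come from the snippet.
diabetes_data = pd.DataFrame({
    "Pregnancies": [0, 8],
    "PlasmaGlucose": [171, 92],
    "DiastolicBloodPressure": [80, 93],
    "BMI": [43.5, 21.2],
    "Age": [21, 23],
    "Diabetic": [0, 1],
})

feature_columns = ["Pregnancies", "PlasmaGlucose", "DiastolicBloodPressure", "BMI", "Age"]

# Selecting a list of columns yields a 2-D array of features;
# selecting a single column yields a 1-D array of labels.
X = diabetes_data[feature_columns].values
y = diabetes_data["Diabetic"].values

print(X.shape, y.shape)
```

The features come back with one row per observation and one column per selected feature, and the labels come back with one entry per observation, which is the shape train_test_split expects.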
Finally, the script splits the data into training and test sets with "train_test_split(X, y, test_size=0.30, random_state=0)". Setting test_size=0.30 assigns 30% of the observations to testing and the remaining 70% to training, and fixing random_state at 0 seeds the shuffle so that the same split is produced on every run.
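Both claims about the split, the 70/30 proportion and the reproducibility, can be checked locally. The sketch below assumes synthetic NumPy data (100 rows, 5 feature columns) standing in for the diabetes dataset; only the train_test_split call itself is taken from the snippet.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the real data: 100 observations, 5 features, binary labels.
rng = np.random.default_rng(42)
X = rng.normal(size=(100, 5))
y = rng.integers(0, 2, size=100)

# Run the same split twice with the same random_state.
first = train_test_split(X, y, test_size=0.30, random_state=0)
second = train_test_split(X, y, test_size=0.30, random_state=0)

X_train, X_test, y_train, y_test = first

# Compare every returned array from the first call with its counterpart from the second.
same_split = all(np.array_equal(a, b) for a, b in zip(first, second))

print("train rows:", len(X_train))
print("test rows:", len(X_test))
print("identical across calls:", same_split)
```

If random_state were omitted, each call would shuffle differently and the two splits would generally not match, which is why the fixed seed matters for the "reproducible" part of the description.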
Therefore, based on the above analysis, we can conclude that the code does its job as described.