Designing and Implementing a Data Science Solution on Azure: Exam DP-100 - Question Answered

Step-by-Step Guide for Implementing ML Classification on Azure with SDK

Question

You are working for a medical center where patients are tested five days a week and their data, including the test results, is collected in multiple CSV files.

Data collected during the week should be fed into an ML model for classification, in order to determine which patients are at risk of COVID-19 infection.

Your task is to implement this process using the SDK, choosing from the following steps:

  1. Upload CSV files and register them as a file dataset
  2. Create ParallelRunStep object
  3. Reference input dataset in the ParallelRunConfig object
  4. Reference dataset in the ParallelRunStep object
  5. Use ParallelRunStep object in a Pipeline
  6. Upload CSV files and register them as a tabular dataset
  7. Create ParallelRunConfig object

Which of the steps above should you include, and in what logical order?

Answers

Explanations



Answer: B.

Option A is incorrect because the reference to the dataset (input data) must be added to the ParallelRunStep object, not to the ParallelRunConfig object.

Option B is CORRECT because, to process multiple data files for inference in batch mode, a ParallelRunStep can be used to work on the data in parallel.

Parameters of parallel processing can be set via the ParallelRunConfig object.
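As a sketch of how those parameters might be set (Azure ML SDK v1; the environment name, compute cluster, and scoring script are assumptions, and the code requires an Azure ML workspace to actually run):

```python
from azureml.core import Environment, Workspace
from azureml.pipeline.steps import ParallelRunConfig

ws = Workspace.from_config()  # assumes a local config.json for the workspace

parallel_run_config = ParallelRunConfig(
    source_directory='./scripts',       # folder holding the scoring script
    entry_script='score_covid.py',      # hypothetical batch scoring script
    mini_batch_size='5',                # number of files per mini-batch
    error_threshold=10,                 # failures tolerated before aborting
    output_action='append_row',         # gather results into a single file
    environment=Environment.get(ws, 'batch-scoring-env'),  # assumed to exist
    compute_target='cpu-cluster',       # assumed compute cluster name
    node_count=2)                       # number of nodes to run on
```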

While a ParallelRunStep can consume either a tabular or a file dataset, in this case the file dataset type is the right choice.
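For illustration, the week's CSV files might be uploaded and registered as a file dataset along these lines (a sketch using the Azure ML SDK v1; the paths, datastore folder, and dataset name are assumptions):

```python
from azureml.core import Dataset, Workspace

ws = Workspace.from_config()  # assumes a local config.json for the workspace
datastore = ws.get_default_datastore()

# Upload the local CSV files to the workspace's default datastore
datastore.upload(src_dir='./test_results', target_path='covid-tests')

# Register every uploaded CSV as a single file dataset
file_ds = Dataset.File.from_files(path=(datastore, 'covid-tests/*.csv'))
file_ds = file_ds.register(workspace=ws, name='covid-test-files',
                           create_new_version=True)
```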

Option C is incorrect because, when working with a large number of data files, a file dataset is the practical choice.

In addition, creating the ParallelRunConfig must precede the creation of the ParallelRunStep.

Option D is incorrect because the reference to the dataset (input data) must be added to the ParallelRunStep object.
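A minimal sketch of how the dataset reference is passed to the ParallelRunStep (Azure ML SDK v1; here `file_ds` and `parallel_run_config` stand for the registered file dataset and the configuration created in the earlier steps):

```python
from azureml.data import OutputFileDatasetConfig
from azureml.pipeline.steps import ParallelRunStep

# Output location for the aggregated predictions
output_dir = OutputFileDatasetConfig(name='inferences')

parallel_run_step = ParallelRunStep(
    name='covid-batch-scoring',
    parallel_run_config=parallel_run_config,        # created beforehand
    inputs=[file_ds.as_named_input('covid_data')],  # the dataset reference
    output=output_dir,
    allow_reuse=True)
```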

Reference:

The task at hand is to implement a data processing pipeline to classify patients based on COVID-19 risk using multiple CSV files containing patient data. The pipeline should be implemented using the Azure SDK. The steps that need to be taken to implement the pipeline are listed below, along with their order.

  1. Upload CSV files and register them as a file dataset: The first step is to upload the CSV files to a datastore and register them as a file dataset. This enables the pipeline to access the data during processing. This step is listed as the first step in all answer choices.

  2. Create a ParallelRunConfig object: A ParallelRunConfig object is a configuration object that defines the compute and other settings for the parallel run. This object is created using the Azure SDK. This step is listed as the second step in all answer choices.

  3. Reference input dataset in the ParallelRunConfig object: This candidate step would reference the file dataset registered in step 1 in the ParallelRunConfig object. However, the input dataset must be referenced in the ParallelRunStep object, not in the ParallelRunConfig, so this step does not belong in the correct sequence.

  4. Reference dataset in the ParallelRunStep object: A ParallelRunStep object is the step in the pipeline that performs the parallel processing. When the ParallelRunStep is created, the input dataset is referenced in it, together with the ParallelRunConfig object created in step 2. This step is listed as the fourth step in answer choice B and is not present in any other answer choices.

  5. Use ParallelRunStep object in a Pipeline: After creating the ParallelRunStep object, it needs to be used in a pipeline to perform the parallel processing. This step is listed as the fifth step in all answer choices.

  6. Upload CSV files and register them as a tabular dataset: This step involves registering the CSV files as a tabular dataset instead of a file dataset. Because a file dataset is the appropriate type for this scenario, this step is not part of the correct sequence.
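Putting the correct sequence together end to end, the flow might be sketched as follows (Azure ML SDK v1; the workspace, environment, compute target, and script names are all assumptions, and the code needs an Azure ML workspace to run):

```python
from azureml.core import Dataset, Environment, Experiment, Workspace
from azureml.data import OutputFileDatasetConfig
from azureml.pipeline.core import Pipeline
from azureml.pipeline.steps import ParallelRunConfig, ParallelRunStep

ws = Workspace.from_config()
datastore = ws.get_default_datastore()

# 1. Upload CSV files and register them as a file dataset
datastore.upload(src_dir='./test_results', target_path='covid-tests')
file_ds = Dataset.File.from_files((datastore, 'covid-tests/*.csv'))
file_ds = file_ds.register(ws, name='covid-test-files')

# 7. Create the ParallelRunConfig object
config = ParallelRunConfig(
    source_directory='./scripts',
    entry_script='score_covid.py',          # hypothetical scoring script
    mini_batch_size='5',
    error_threshold=10,
    output_action='append_row',
    environment=Environment.get(ws, 'batch-scoring-env'),  # assumed
    compute_target='cpu-cluster',           # assumed cluster name
    node_count=2)

# 2./4. Create the ParallelRunStep, referencing the dataset in it
step = ParallelRunStep(
    name='covid-batch-scoring',
    parallel_run_config=config,
    inputs=[file_ds.as_named_input('covid_data')],
    output=OutputFileDatasetConfig(name='inferences'))

# 5. Use the ParallelRunStep in a Pipeline and submit it
pipeline = Pipeline(workspace=ws, steps=[step])
Experiment(ws, 'covid-batch-scoring').submit(pipeline)
```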

Based on the above analysis, we can conclude that the correct answer is option B: upload the CSV files and register them as a file dataset, create the ParallelRunConfig object, create the ParallelRunStep object, reference the dataset in the ParallelRunStep object, and use the ParallelRunStep in a Pipeline. This order includes all the necessary steps in the correct sequence to implement the pipeline using the Azure SDK.