You have 100 CSV files in Azure Blob Storage that you must use to train your ML model. The files contain measurement data collected from manufacturing machines in order to analyse the causes of malfunctions.
Each row in the files is a snapshot of machine parameters at a given point in time. Using the ML Designer, you have to use the data in the CSV files as input for your machine learning pipeline, ensuring reusability and versioning of the data and minimizing load time when running experiments. What should you do?
A. Register the files as a File Dataset in your ML workspace; add the Dataset module to your pipeline.
B. Add an Import Data module to your pipeline and configure it for accessing the files; set Regenerate output = Yes.
C. Register the files as a Tabular Dataset in your ML workspace; add the Dataset module to your pipeline.
D. Add an Import Data module to your pipeline and configure it for accessing the files; set Regenerate output = No.

Answer: C.
Option A is incorrect because structured files (like CSVs) should be registered as Tabular datasets; the File dataset type is not suitable for structured data. In addition, the ML Designer only supports processing Tabular datasets.
Option B is incorrect because the Import Data module imports data directly, without registering a dataset in the ML workspace.
The Import Data module also reloads the data on every pipeline run, consuming extra time and resources.
Option C is CORRECT because the recommended practice for getting data into an ML pipeline without repeating the input operation on each run is to register the data as a Dataset.
Structured files (like CSVs) have to be registered as Tabular datasets.
The registered datasets appear in the module palette under Datasets and can be used like any other module.
Once a dataset is registered, additional features such as versioning and data monitoring become available.
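For reference, the registration the Designer UI performs can also be scripted with the Azure ML Python SDK (azureml-core v1). The following is a minimal sketch, not the exam's prescribed steps; the datastore name "machine_blob", the blob path "machine-data/*.csv", and the dataset name "machine-measurements" are placeholder assumptions.

```python
from azureml.core import Workspace, Datastore, Dataset

# Connect to the workspace (reads the config.json downloaded from the portal)
ws = Workspace.from_config()

# Datastore that points at the Blob container holding the 100 CSV files
# ("machine_blob" is a placeholder name)
datastore = Datastore.get(ws, "machine_blob")

# Create a Tabular dataset from all CSV files under the assumed folder
tabular_ds = Dataset.Tabular.from_delimited_files(
    path=(datastore, "machine-data/*.csv")
)

# Register it so it appears under Datasets in the Designer module palette;
# create_new_version=True adds a new version if the name already exists
tabular_ds = tabular_ds.register(
    workspace=ws,
    name="machine-measurements",
    description="Snapshots of machine parameters for malfunction analysis",
    create_new_version=True,
)
```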
Option D is incorrect because the Import Data module imports data directly, without registering a dataset in the ML workspace, so the reusability requirement is not satisfied.
The correct answer for this scenario is C: Register the files as a Tabular Dataset in your ML workspace; add the Dataset module to your pipeline.
Explanation: To use the data in CSV files as input for your machine learning pipeline, you need to register them as datasets in your Azure Machine Learning workspace. Dataset registration allows you to reuse data in multiple experiments, version data over time, and ensure data consistency across experiments.
In this scenario, the CSV files contain tabular data, so you should register them as Tabular Datasets. Tabular Datasets are ideal for structured data in various formats, including CSV, TSV, and Parquet. You can create a Tabular Dataset from a single file, multiple files, or a folder containing multiple files.
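As an illustration of that flexibility, the sketch below uses the v1 Python SDK, where the path argument of from_delimited_files can point to a single file, an explicit list of files, or a wildcard over a folder. The datastore name and file paths are assumptions for this example only.

```python
from azureml.core import Workspace, Datastore, Dataset

ws = Workspace.from_config()
# Same placeholder datastore name as above
datastore = Datastore.get(ws, "machine_blob")

# A single CSV file
single_file_ds = Dataset.Tabular.from_delimited_files(
    path=(datastore, "machine-data/machine-001.csv")
)

# An explicit list of files
multi_file_ds = Dataset.Tabular.from_delimited_files(
    path=[(datastore, "machine-data/machine-001.csv"),
          (datastore, "machine-data/machine-002.csv")]
)

# Every CSV in a folder
folder_ds = Dataset.Tabular.from_delimited_files(
    path=(datastore, "machine-data/*.csv")
)
```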
Once you have registered the Tabular Dataset, you can add the Dataset module to your machine learning pipeline. The Dataset module provides a way to reference the registered dataset in your pipeline. By using the Dataset module, you can ensure that your pipeline uses the correct version of the dataset.
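Pinning a run to a particular dataset version is the same idea the SDK exposes through Dataset.get_by_name. A brief sketch, again assuming the placeholder dataset name "machine-measurements":

```python
from azureml.core import Workspace, Dataset

ws = Workspace.from_config()

# Latest registered version of the (assumed) dataset
latest = Dataset.get_by_name(ws, name="machine-measurements")

# Pin an experiment to a specific version for reproducibility
v1 = Dataset.get_by_name(ws, name="machine-measurements", version=1)

# Load into a pandas DataFrame for a quick inspection
df = v1.to_pandas_dataframe()
print(df.shape)
```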
Option A, Register the files as a File Dataset in your ML workspace; add the Dataset module to your pipeline, is not the best choice because File Datasets are used for unstructured data such as images, audio, and video. Since the data in the CSV files is structured, you should use Tabular Datasets instead.
Option B, Add an Import Data module to your pipeline and configure it for accessing the files; set the Regenerate output = Yes, is not the best choice because the Import Data module ingests data directly from the source without registering a dataset. In this scenario, the data must be reusable and versioned, which requires registering it as a Tabular Dataset and referencing it through the Dataset module; re-importing the files on every run also wastes time and resources.
Option D, Add an Import Data module to your pipeline and configure it for accessing the files; set the Regenerate output = No, is not the best choice either. The Regenerate output option only controls whether the Import Data module re-imports the data each time the pipeline runs; it still does not register a dataset, so the reusability and versioning requirements are not met. You should use a registered dataset with the Dataset module instead.