For your machine learning experiments, you need to get CSV data files from a web location and you need to use them as a dataset in your ML workspace.
There are ten files to be imported, and each of them contains different columns of a large table.
Above the column header, each file has 6 rows containing unstructured data like dates, separator lines etc.
You want to use ML Studio to complete the work.
Among others, you should set the following options: Dataset type, Column headers, and Skip rows / Skip n rows. Which combination of settings should you use?
A. B. C. D.
Answer: C.
Option A is incorrect because File datasets are designed for unstructured training data, such as images.
For CSV sources, the Tabular type should be selected.
Option B is incorrect because, as stated, each file holds different columns of the whole data structure, so the column headers from all files must be combined.
Option C is CORRECT because a tabular dataset should be defined for structured data files and, since the files contain vertical slices of a large table, the column headers from all files have to be combined.
The first relevant row (the column header) is located in row 7, i.e. the Skip rows setting is 6.
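To make the row arithmetic concrete, here is a minimal local sketch in plain Python. It does not use the Azure ML API; the function name `read_after_skipping` and the sample file content are made up for illustration. It shows that skipping 6 rows makes row 7 the header.

```python
import csv
import io

def read_after_skipping(text, skip_rows):
    """Drop the first `skip_rows` lines, then treat the next line as the header."""
    lines = text.splitlines()[skip_rows:]
    reader = csv.DictReader(io.StringIO("\n".join(lines)))
    return reader.fieldnames, list(reader)

# Hypothetical file: 6 unstructured rows, header in row 7, then data rows.
sample = "\n".join([
    "Export date: 2021-01-01",
    "----------",
    "Source: sensor A",
    "----------",
    "Generated by export tool",
    "==========",
    "id,temperature",
    "1,20.5",
    "2,21.0",
])

header, rows = read_after_skipping(sample, skip_rows=6)
```

With a skip value of 5, the last separator line would be read as the header instead, which is why the setting has to match the number of unstructured rows exactly.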
Option D is incorrect because File type datasets are used for unstructured data, and column headers from all files must be used.
Diagram - ML Studio.
To import the CSV data files from a web location and use them as a dataset in an Azure ML Studio workspace, while skipping the unstructured data above the column headers, the correct combination of settings is:
C. 1 - Tabular; 2 - Combine headers from all files; 3 - From all files / 6.
Explanation:
Dataset Type: The first option asks for the type of the dataset. There are two options available: File and Tabular. Since we are working with CSV files, the correct option is Tabular.
Column Headers: The second option is about the column headers. The two choices relevant here are "All files have the same headers" and "Combine headers from all files". Since each file contains different columns of a large table, we need to use the "Combine headers from all files" option.
Skip Rows: The third option is about skipping rows from the top of the files. We need to skip the first 6 rows of each file, since they contain unstructured data like dates and separator lines. The correct option is From all files / 6.
Therefore, the correct combination of settings is: 1 - Tabular; 2 - Combine headers from all files; 3 - From all files / 6.
Option A is incorrect because it uses the wrong dataset type (File instead of Tabular). Option B is incorrect because it assumes that all files have the same headers, which is not the case. Option D is incorrect because it skips 5 rows only from the first file, while we need to skip 6 rows from all files.
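The combined effect of the three settings can be sketched locally in plain Python. This is only an approximation of the idea under stated assumptions, not the Azure ML implementation: the helper `combine_files`, the placeholder rows, and the column names are hypothetical, and the way missing columns are filled with `None` is an assumption of this sketch.

```python
import csv
import io

def combine_files(file_texts, skip_rows):
    """Skip `skip_rows` lines in every file, then merge rows under the union of all headers."""
    combined_header = []
    records = []
    for text in file_texts:
        body = "\n".join(text.splitlines()[skip_rows:])
        reader = csv.DictReader(io.StringIO(body))
        # Collect each new column name once, preserving first-seen order.
        for name in reader.fieldnames or []:
            if name not in combined_header:
                combined_header.append(name)
        records.extend(reader)
    # Columns that a file did not provide are filled with None (sketch assumption).
    rows = [{name: rec.get(name) for name in combined_header} for rec in records]
    return combined_header, rows

# Two hypothetical files, each with 6 unstructured rows above a different header.
preamble = "\n".join(["unstructured row"] * 6)
file_a = preamble + "\nid,temperature\n1,20.5\n"
file_b = preamble + "\nid,humidity\n1,40\n"

header, rows = combine_files([file_a, file_b], skip_rows=6)
```

Here `header` holds the union of the column names from both files, which mirrors why "All files have the same headers" would not fit files that carry different columns.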