Convert CSV to Parquet for Efficient ML Processing | DP-100 Exam Question Solution

Optimizing ML Experiments: Converting CSV to Parquet Format

Question

For your ML experiments, you need to process CSV data files.

Each file is about 10 GB in size.

Your training script loads the ingested data to a pandas dataframe object.

During the runs, you get an “Out of memory” error.

You decide to convert the files to Parquet format and process them partially, i.e. load only the columns that are relevant from the modelling point of view.

Does it solve the problem?

Answers

Explanations


A. Yes

B. No

Answer: A.

Option A is CORRECT because data read from a CSV file can expand significantly once it is held in an in-memory dataframe.
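
The expansion can be observed on a small scale. The following is a minimal sketch, assuming pandas and NumPy are installed; the column names and the row count are made up for illustration and are not part of the exam question. It compares the size of the CSV text with the memory the resulting dataframe occupies.

```python
import io

import numpy as np
import pandas as pd

# Build a small synthetic CSV in memory (hypothetical columns, illustration only).
n_rows = 10_000
rng = np.random.default_rng(0)
source = pd.DataFrame({
    "id": np.arange(n_rows),
    "flag": rng.integers(0, 2, size=n_rows),
    "category": rng.choice(["a", "b", "c"], size=n_rows),
})
csv_text = source.to_csv(index=False)
csv_bytes = len(csv_text.encode("utf-8"))

# Load it the way a training script typically would.
df = pd.read_csv(io.StringIO(csv_text))

# deep=True also counts the storage behind string columns, not just pointers.
mem_bytes = int(df.memory_usage(deep=True).sum())

print(f"CSV text size    : {csv_bytes:,} bytes")
print(f"DataFrame in RAM : {mem_bytes:,} bytes")
print(f"Expansion factor : {mem_bytes / csv_bytes:.1f}x")
```

Short integers take one or two characters in CSV but eight bytes each as int64, which is one reason the in-memory size exceeds the file size here. The exact factor depends on the data and the pandas version.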

Converting the data to the columnar Parquet format is a viable solution because it lets you load only the columns that the training process needs.

Option B is incorrect because switching from CSV to the columnar Parquet format and reading only the required columns reduces memory consumption, so the proposed solution is a good one.


Converting the CSV files to Parquet format and loading only the relevant columns can resolve the “Out of memory” error. However, there are a few things to consider before concluding that this solution is the right fit:

  1. Parquet format: Parquet is a columnar storage format that compresses data, so converting CSV files to it usually reduces file size and speeds up reads of individual columns. However, the conversion must preserve the column names and data types that your training script expects, and the script must be able to read the resulting files.

  2. Relevant columns: Loading only the relevant columns reduces the memory footprint and can avoid the “Out of memory” error. However, the selection must still include every column the training script needs to produce accurate models.

  3. Partial processing: Processing only part of the data reduces the memory the training script requires. However, dropping too much data to save memory can cost model accuracy, so the trade-off should be checked.

  4. Hardware: The memory available on the machine that runs the training script also determines whether the proposed solution works. If even the selected columns do not fit in memory, the proposed solution alone will not resolve the “Out of memory” error.

In conclusion, converting the CSV files to Parquet format and loading only the relevant columns can resolve the “Out of memory” error. The conversion must be done correctly, the loaded columns must still be sufficient to produce accurate models, and the hardware must have enough memory for the data that is actually loaded.