You work as a machine learning specialist for a financial services firm.
Your firm contracts with market data generation services that deliver 5 TB of market activity record data every minute.
To prepare this data for your machine learning models, your team queries the data using Athena.
However, the queries perform poorly because they are operating on such a large data stream.
You need to find a more performant option.
Which file format for your market data records on S3 will give you the best performance?
A. TSV files
B. Compressed LZO files
C. Parquet files
D. CSV files
Correct Answer: C.
Option A is incorrect.
The TSV file format is row-based and uses tabs to separate attributes.
When Athena reads a row-based file, it must scan every row in full, even when the query needs only one attribute; with a columnar format, it reads only the columns the query references.
Columnar formats are therefore much more efficient for queries over large datasets.
Also, although Athena can partition TSV data by S3 prefix, partitioning only limits which files are read; within each file, every row is still scanned in full.
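The difference between row-based and columnar reads can be illustrated with a toy sketch. This is not how Athena or Parquet are implemented; the records, the column names, and the helper functions below are all hypothetical, and the "columnar" layout is just one byte string per column. The point is only to show that a single-column aggregate touches far fewer bytes when each column is stored contiguously.

```python
import csv
import io

# Hypothetical market records: (symbol, price, volume). Values are made up.
records = [
    ("AAPL", "189.25", "1200"),
    ("MSFT", "411.10", "800"),
    ("AMZN", "178.40", "1500"),
    ("GOOG", "141.80", "950"),
]


def row_layout(rows):
    """Serialize rows as TSV: each record's fields are stored side by side."""
    buf = io.StringIO()
    csv.writer(buf, delimiter="\t", lineterminator="\n").writerows(rows)
    return buf.getvalue().encode("utf-8")


def column_layout(rows):
    """Store each column contiguously, as a toy stand-in for a columnar file."""
    names = ("symbol", "price", "volume")
    return {
        name: "\n".join(row[i] for row in rows).encode("utf-8")
        for i, name in enumerate(names)
    }


def total_volume_from_rows(blob):
    """Every byte of every row is scanned to extract one field."""
    reader = csv.reader(io.StringIO(blob.decode("utf-8")), delimiter="\t")
    total = sum(int(row[2]) for row in reader)
    return total, len(blob)


def total_volume_from_columns(columns):
    """Only the bytes of the 'volume' column are scanned."""
    blob = columns["volume"]
    total = sum(int(v) for v in blob.decode("utf-8").split("\n"))
    return total, len(blob)


row_total, row_bytes = total_volume_from_rows(row_layout(records))
col_total, col_bytes = total_volume_from_columns(column_layout(records))
print(f"row-based: total={row_total}, bytes scanned={row_bytes}")
print(f"columnar:  total={col_total}, bytes scanned={col_bytes}")
```

Both paths compute the same total, but the columnar path scans only the bytes that belong to the one column the query needs.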
Option B is incorrect.
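Compressed LZO files are still row-based, so they do not support columnar processing: compression reduces the bytes stored, but Athena must still decompress and scan every row in full.
Therefore, they will perform poorly compared to columnar file formats like Parquet.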
Option C is correct.
The Parquet file format is columnar, so Athena reads only the columns a query references, and it works well with partitioned data.
The other columnar-based file format supported by Athena is ORC.
These columnar file formats outperform row-based formats such as CSV and TSV when Athena works with very large datasets.
Option D is incorrect.
The CSV file format is row-based and uses commas to separate attributes.
When Athena reads a row-based file, it must scan every row in full, even when the query needs only one attribute; with a columnar format, it reads only the columns the query references.
Columnar formats are therefore much more efficient for queries over large datasets.
Also, although Athena can partition CSV data by S3 prefix, partitioning only limits which files are read; within each file, every row is still scanned in full.
References:
Amazon Athena FAQs, under the question “How do I improve the performance of my query?” (https://aws.amazon.com/athena/faqs/#:~:text=Amazon%20Athena%20supports%20a%20wide,%2C%20LZO%2C%20and%20GZIP%20formats.)
AWS Big Data blog, Top 10 Performance Tuning Tips for Amazon Athena (https://aws.amazon.com/blogs/big-data/top-10-performance-tuning-tips-for-amazon-athena/)
Amazon Athena user guide, Compression Formats (https://docs.aws.amazon.com/athena/latest/ug/compression-formats.html)
In this scenario, the firm receives a very large volume of market activity records every minute, and the team queries that data with Athena, an interactive query service for analyzing data in Amazon S3 using standard SQL. The queries perform poorly because of the amount of data they must scan, so the team needs a more performant way to prepare the data for its machine learning models.
One way to address this issue is by using a more performant file format for the market data records on S3. Different file formats have different performance characteristics, and the choice of file format can impact the query performance significantly.
Among the given options, Parquet files are generally considered to provide the best performance for big data analytics use cases.
Parquet is a columnar storage format that is optimized for large-scale data processing. It is designed to store and process large amounts of data efficiently and supports advanced features like compression, encoding, and predicate pushdown.
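Predicate pushdown can be sketched in a few lines. Parquet keeps per-column statistics such as minimum and maximum values for each row group, which lets a reader skip row groups that cannot satisfy a filter. The `RowGroup` class, the `price` column, and the sample values below are hypothetical stand-ins for illustration, not the Parquet or Athena API.

```python
from dataclasses import dataclass


@dataclass
class RowGroup:
    """Toy stand-in for a row group with min/max statistics for one column."""
    min_price: float
    max_price: float
    prices: list


def make_row_group(prices):
    """Compute the statistics once, at write time."""
    return RowGroup(min(prices), max(prices), list(prices))


def scan_price_above(row_groups, threshold):
    """Return matching prices and the number of row groups actually read."""
    matches, groups_read = [], 0
    for group in row_groups:
        # Skip the group when its statistics show that no value can match.
        if group.max_price <= threshold:
            continue
        groups_read += 1
        matches.extend(p for p in group.prices if p > threshold)
    return matches, groups_read


row_groups = [
    make_row_group([10.0, 12.5, 11.0]),
    make_row_group([50.0, 55.5, 52.0]),
    make_row_group([98.0, 101.5, 99.0]),
]
matches, groups_read = scan_price_above(row_groups, 60.0)
print(matches, groups_read)
```

With a filter of `price > 60.0`, only the last row group needs to be read; the first two are skipped on their statistics alone.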
Compared to row-based file formats like CSV and TSV, Parquet is more efficient because Athena reads only the columns that a particular query requires, reducing the amount of data scanned from S3. Also, since the columns are stored in a compressed and encoded form, the data requires less storage and less network bandwidth.
LZO-compressed files may store data efficiently, but compression alone does not provide columnar processing, so every row must still be read in full. Thus, LZO is not the optimal choice for this scenario.
Therefore, the file format that gives the best performance for the market data records on S3 is option C, Parquet files.
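One common way to move existing row-based data to Parquet is an Athena CREATE TABLE AS SELECT (CTAS) query. The sketch below only builds the SQL string; it does not submit it. The table names, the S3 location, and the `trade_date` partition column are hypothetical, and the helper `build_parquet_ctas` is an illustration rather than part of any AWS library.

```python
def build_parquet_ctas(source_table, target_table, target_location,
                       partition_column, columns):
    """Build an Athena CTAS statement that rewrites a table as partitioned Parquet."""
    # Athena CTAS expects partition columns to come last in the SELECT list.
    select_list = ", ".join([*columns, partition_column])
    return (
        f"CREATE TABLE {target_table}\n"
        f"WITH (\n"
        f"  format = 'PARQUET',\n"
        f"  external_location = '{target_location}',\n"
        f"  partitioned_by = ARRAY['{partition_column}']\n"
        f")\n"
        f"AS SELECT {select_list}\n"
        f"FROM {source_table}"
    )


# All names below are placeholders for illustration.
sql = build_parquet_ctas(
    source_table="market_records_tsv",
    target_table="market_records_parquet",
    target_location="s3://example-bucket/market-records-parquet/",
    partition_column="trade_date",
    columns=["symbol", "price", "volume"],
)
print(sql)
```

The resulting statement could then be run from the Athena console or submitted through an SDK; queries against the new table would scan only the referenced columns and the matching partitions.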