AWS Certified Big Data - Specialty Exam: Ideal Data Store for Efficient Storage of Sparse Data

Question

A company needs to have a data store in AWS.

The company is responsible for collecting weather data and then performing the required analysis on it.

The amount of data can reach petabytes.

The data store must store sparse data efficiently.

Which of the following would be the MOST ideal data store?

Answers

A. AWS Redshift

B. AWS EMR with HBase

C. AWS RDS

D. AWS DynamoDB

Explanations

Answer - B.

This is covered in the AWS whitepaper comparing DynamoDB and HBase, linked below.

For large datasets, such as log data, weather data, product catalogs, and so on, you might already have large amounts of historical data that you want to maintain for historical trend analysis, but need to ingest and batch process current data for predictive purposes.

For these types of workloads, Apache HBase is a good choice because of its high read and write throughput and efficient storage of sparse data.

Option A is incorrect since Redshift is used for data warehousing.

Option C is incorrect since RDS is used for OLTP workloads.

Option D is partially correct, but EMR with HBase is the better fit here.

For more information on DynamoDB vs HBase, refer to the URL below.

https://d1.awsstatic.com/whitepapers/AWS_Comparing_the_Use_of_DynamoDB_and_HBase_for_NoSQL.pdf

Based on the requirements mentioned in the question, the MOST ideal data store for storing and analyzing large volumes of sparse weather data in AWS would be AWS EMR with HBase.

Here's why:

AWS Redshift is a data warehousing solution, best suited for analyzing structured data. It scales to petabytes, so volume alone does not rule it out. However, data must be loaded into tables with a predefined schema, which is a poor fit for sparse data where each record populates only a few of many possible attributes.

AWS RDS is a relational database service designed for OLTP workloads. Like Redshift, it requires a predefined schema, so sparse records leave many columns empty. It is also not designed for analysis at petabyte scale, since the storage available to a single database instance is limited to far less than that.

AWS DynamoDB is a fully managed NoSQL database designed for high scalability and high performance. Because items in a table do not need to share the same attributes, it can hold sparse data and scale to very large tables, which is why this option is partially correct. For large historical datasets that are ingested and processed in batches, however, the whitepaper quoted above favors HBase for its high read and write throughput and efficient storage of sparse data.

AWS EMR is a managed cluster platform for running big data frameworks, and HBase is a distributed, column-oriented NoSQL database that runs on it. HBase stores only the cells that actually contain a value, so empty columns in a sparse row consume no storage. It scales horizontally across the cluster nodes and provides high read and write throughput, which makes EMR with HBase suitable for storing and analyzing petabytes of data.
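
To make the point about sparse storage concrete, the following is a minimal sketch in plain Python, not the HBase API. It compares a fixed-schema table, which reserves a slot for every column in every row, with a cell-oriented layout keyed by row key and column, which is roughly how a column-family store such as HBase lays data out. The station names, column names, the "obs" column family, and the readings are all hypothetical values chosen for illustration.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical weather attributes; not every station reports every one.
COLUMNS: List[str] = ["temp_c", "humidity_pct", "wind_kph", "rain_mm", "uv_index"]

# Hypothetical readings keyed by "<station>#<hour>". Each row is sparse.
readings: Dict[str, Dict[str, float]] = {
    "station-001#2024-01-01T00": {"temp_c": 3.5, "humidity_pct": 81.0},
    "station-002#2024-01-01T00": {"wind_kph": 12.0},
    "station-003#2024-01-01T00": {"temp_c": -1.0, "rain_mm": 0.4, "uv_index": 1.0},
}


def to_dense(rows: Dict[str, Dict[str, float]]) -> List[List[Optional[float]]]:
    """Fixed-schema layout: every row has a slot for every column (None if absent)."""
    return [[cols.get(name) for name in COLUMNS] for cols in rows.values()]


def to_sparse(rows: Dict[str, Dict[str, float]]) -> Dict[Tuple[str, str], float]:
    """Cell-oriented layout: one entry per (row key, family:qualifier) that has a value."""
    return {
        (row_key, f"obs:{qualifier}"): value
        for row_key, cols in rows.items()
        for qualifier, value in cols.items()
    }


dense = to_dense(readings)
sparse = to_sparse(readings)

dense_cells = sum(len(row) for row in dense)  # 3 rows x 5 columns
sparse_cells = len(sparse)                    # only the values that exist

print(f"fixed-schema slots: {dense_cells}")
print(f"stored cells:       {sparse_cells}")
```

In this toy data the fixed-schema layout holds 15 slots, 9 of them empty, while the cell-oriented layout holds only the 6 values that were reported. A real HBase cell also carries a timestamp and per-cell key overhead, which this sketch leaves out.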

In summary, AWS EMR with HBase is the MOST ideal data store for storing and analyzing large volumes of sparse weather data in AWS.
