A company needs to process large amounts of data and store it in a suitable data store.
The data consists of all the IP addresses that access the company's website.
Billions of rows will be stored in the data store.
The company has decided to use the AWS EMR service.
The company needs to be able to query the data efficiently based on the IP address.
Which of the following would be an ideal implementation plan for this?
A. B. C. D.
Answer - B.
The best way to handle this much data is to use HBase.
The AWS Blog mentions the following.
HBase offers a number of powerful features including:
Strictly consistent reads and writes.
High write throughput.
Automatic sharding of tables.
Efficient storage of sparse data.
Low-latency data access via in-memory operations.
Direct input and output to Hadoop jobs.
Integration with Apache Hive for SQL-like queries over HBase tables, joins, and JDBC support.
Options A and C would be less efficient than using HBase.
Option D would only add overhead.
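As a rough illustration of why a sorted key-value store such as HBase suits lookups by IP address, the sketch below builds row keys from the IP address and finds all rows for one address with a prefix search. It is a minimal sketch in plain Python, not HBase client code: the key layout (packed IPv4 address followed by a timestamp), the function names `row_key` and `rows_for_ip`, and the sample data are all assumptions made for illustration, and a sorted list stands in for the table.

```python
import bisect
import ipaddress


def row_key(ip: str, ts_ms: int) -> bytes:
    """Hypothetical row key: 4-byte packed IPv4 address + 8-byte big-endian timestamp."""
    return ipaddress.IPv4Address(ip).packed + ts_ms.to_bytes(8, "big")


def rows_for_ip(sorted_keys: list, ip: str) -> list:
    """Return every key that starts with the packed IP, using a binary search
    to find the first match (a stand-in for a prefix scan on a sorted table)."""
    prefix = ipaddress.IPv4Address(ip).packed
    i = bisect.bisect_left(sorted_keys, prefix)
    matches = []
    while i < len(sorted_keys) and sorted_keys[i].startswith(prefix):
        matches.append(sorted_keys[i])
        i += 1
    return matches


# Made-up sample hits: (IP address, timestamp in milliseconds).
hits = [
    ("10.0.0.2", 1000),
    ("9.255.255.255", 1500),
    ("10.0.0.2", 2000),
    ("10.0.0.10", 1200),
]

# The sorted list plays the role of a table ordered by row key.
table = sorted(row_key(ip, ts) for ip, ts in hits)

for key in rows_for_ip(table, "10.0.0.2"):
    ip = ipaddress.IPv4Address(key[:4])
    ts = int.from_bytes(key[4:], "big")
    print(ip, ts)
```

Packing the address into fixed-width bytes keeps keys in numeric order (so 10.0.0.2 sorts before 10.0.0.10, which plain text keys would not), and all rows for one IP address end up adjacent, so a lookup touches only that contiguous range instead of the whole data set.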
For more information on HBase on EMR, see the following URL:
https://aws.amazon.com/blogs/aws/apache-hbase-on-emr/
The requirement is to process large amounts of data with AWS EMR and store it in a data store that can be queried efficiently by IP address.
Option A, which suggests creating a bucket for each IP address, is not scalable and would be impractical to implement, especially with billions of rows.
Option B, which suggests using HBase on EMR, fits the requirement. HBase is optimized for random access, and querying the data by IP address is exactly that kind of access.
Option C, which suggests using S3 as the underlying storage for the EMR cluster with the IP address attached as a prefix, stores the data well enough: S3 handles large datasets and EMR can easily process data stored in S3. However, querying it by IP address would be less efficient than using HBase.
Option D, which suggests posting the data from EMR to Redshift for analysis, adds overhead. Redshift is optimized for analytical queries and is not the best choice for querying data based on IP addresses.
Therefore, option B is the best solution. HBase on EMR can store billions of rows and serve low-latency queries based on the IP address.