BigM, one of the largest financial services companies, runs its entire infrastructure on AWS, with hundreds of EC2 instances across multiple regions supporting its workloads.
BigM is looking for a solution to ingest, augment, and analyze the terabytes of data its network generates daily in the form of VPC flow logs. This would enable BigM to identify performance-improvement opportunities, such as spotting applications that communicate across regions and co-locating them.
The company would also be able to increase uptime by quickly detecting and mitigating application downtime, and to reduce TCO by placing the right compute in the right location.
These logs are captured in real time, standardized, and loaded into Kinesis streams, which are then consumed into S3 using the KCL, enabling interactive querying with Presto on EMR through pre-built external tables and feeding real-time dashboards.
Which of the following components address the following requirements? Select 3 options.
- Monitoring file rotation, check pointing, and retry upon failures of logs
- Ingestion of VPC logs
- Data standardization
- Data loading into S3
- Where the metadata of the external tables of Presto on EMR is stored
A. Kinesis Agent
B. HCatalog
C. Kinesis Connector Library
D. AWS Glue Data Catalog
E. CloudWatch
F. Streams API (PutRecords)

Answer: A, C, and D.
Option A is correct - Kinesis Agent is a stand-alone Java software application that offers an easy way to collect and send data to Kinesis Data Streams.
The agent continuously monitors a set of files and sends new data to your stream.
The agent handles file rotation, check pointing, and retry upon failures.
It delivers all of your data in a reliable, timely, and simple manner.
It also emits Amazon CloudWatch metrics to help you better monitor and troubleshoot the streaming process.
The agent can be configured to monitor multiple file directories and send data to multiple streams.
The agent can pre-process the records parsed from monitored files before sending them to your stream.
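To make this concrete, here is a sketch of an agent configuration; the monitored path, stream name, and field names are hypothetical, and the agent.json content is generated from Python purely for illustration.

```python
import json

# Hypothetical agent.json for the Kinesis Agent: monitor a flow-log
# directory, convert delimited lines to JSON, and send them to a
# placeholder stream. Path, stream name, and field names are assumptions.
agent_config = {
    "cloudwatch.emitMetrics": True,  # agent publishes CloudWatch metrics
    "flows": [
        {
            "filePattern": "/var/log/vpc-flow-logs/*.log",
            "kinesisStream": "vpc-flow-logs-stream",
            "partitionKeyOption": "RANDOM",
            "dataProcessingOptions": [
                {
                    # Pre-processing step: standardize raw lines into JSON
                    "optionName": "CSVTOJSON",
                    "customFieldNames": ["srcaddr", "dstaddr", "srcport",
                                         "dstport", "protocol", "bytes"],
                }
            ],
        }
    ],
}

with open("agent.json", "w") as f:
    json.dump(agent_config, f, indent=2)
```

The dataProcessingOptions block is the hook for the pre-processing mentioned above, which is also where log standardization can happen before records reach the stream.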
https://docs.aws.amazon.com/streams/latest/dev/writing-with-agents.html#sim-writes

Option B is incorrect - HCatalog is a tool that allows you to access Hive metastore tables within Pig, Spark SQL, and/or custom MapReduce applications.
HCatalog has a REST interface and command line client that allows you to create tables or do other operations.
You then write your applications to access the tables using HCatalog libraries.
Presto is a fast SQL query engine designed for interactive analytic queries over large datasets from multiple sources.
Metadata of Presto on EMR is stored in Glue Data Catalog.
https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-hcatalog.html

Option C is correct - The Kinesis Connector Library is a pre-built library that helps you easily integrate Amazon Kinesis Data Streams with other AWS services and third-party tools.
Amazon Kinesis Client Library (KCL) is required for using this library.
The current version of this library provides connectors to Amazon DynamoDB, Amazon Redshift, Amazon S3, and Elasticsearch.
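The connector library itself is Java, so the following is only a minimal Python sketch of the consume-buffer-emit pattern it implements; the stream name, bucket, and object key are hypothetical.

```python
import boto3

kinesis = boto3.client("kinesis", region_name="us-east-1")
s3 = boto3.client("s3")

# Read from the first shard of a hypothetical stream.
shard_id = kinesis.describe_stream(StreamName="vpc-flow-logs-stream")[
    "StreamDescription"]["Shards"][0]["ShardId"]
iterator = kinesis.get_shard_iterator(
    StreamName="vpc-flow-logs-stream",
    ShardId=shard_id,
    ShardIteratorType="TRIM_HORIZON",
)["ShardIterator"]

# Buffer a batch of records (a transform/filter step would go here).
resp = kinesis.get_records(ShardIterator=iterator, Limit=500)
buffer = [record["Data"] for record in resp["Records"]]

# Emit the buffered batch to S3 as a single object.
if buffer:
    s3.put_object(
        Bucket="bigm-flow-logs",          # hypothetical bucket
        Key="flowlogs/batch-0001.log",    # hypothetical key
        Body=b"\n".join(buffer),
    )
```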
https://aws.amazon.com/kinesis/data-streams/resources/

Option D is correct - The AWS Glue Data Catalog can be used as the default Hive metastore for Presto.
We recommend this configuration when you require a persistent metastore or a metastore shared by different clusters, services, applications, or AWS accounts.
AWS Glue is a fully managed extract, transform, and load (ETL) service that makes it simple and cost-effective to categorize your data, clean it, enrich it, and move it reliably between various data stores.
The AWS Glue Data Catalog provides a unified metadata repository across a variety of data sources and data formats, integrating with Amazon EMR as well as Amazon RDS, Amazon Redshift, Redshift Spectrum, Athena, and any application compatible with the Apache Hive metastore.
AWS Glue crawlers can automatically infer schema from source data in Amazon S3 and store the associated metadata in the Data Catalog.
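As a sketch of the wiring, an EMR cluster can be launched with the presto-connector-hive classification described in the EMR release guide so that Presto uses Glue as its metastore; the cluster name, release label, instance types, and roles below are placeholders.

```python
import boto3

emr = boto3.client("emr", region_name="us-east-1")

# The key part is the classification that points Presto's Hive
# connector at the Glue Data Catalog; everything else is placeholder.
emr.run_job_flow(
    Name="bigm-presto-cluster",
    ReleaseLabel="emr-5.30.0",
    Applications=[{"Name": "Presto"}],
    Configurations=[
        {
            "Classification": "presto-connector-hive",
            "Properties": {"hive.metastore.glue.datacatalog.enabled": "true"},
        }
    ],
    Instances={
        "MasterInstanceType": "m5.xlarge",
        "SlaveInstanceType": "m5.xlarge",
        "InstanceCount": 3,
        "KeepJobFlowAliveWhenNoSteps": True,
    },
    JobFlowRole="EMR_EC2_DefaultRole",
    ServiceRole="EMR_DefaultRole",
)
```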
https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-presto-glue.html

Option E is incorrect - CloudWatch does not fit this requirement; the Kinesis Agent serves the purpose.
That said, Amazon CloudWatch is a monitoring and management service built for developers, system operators, site reliability engineers (SREs), and IT managers.
CloudWatch provides you with data and actionable insights to monitor your applications, understand and respond to system-wide performance changes, optimize resource utilization, and get a unified view of operational health.
CloudWatch collects monitoring and operational data in the form of logs, metrics, and events, providing you with a unified view of AWS resources, applications and services that run on AWS, and on-premises servers.
You can use CloudWatch to set high-resolution alarms, visualize logs and metrics side by side, take automated actions, troubleshoot issues, and discover insights to optimize your applications and ensure they are running smoothly.
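To make the distinction concrete: CloudWatch can alarm on stream-level metrics such as IncomingRecords, but it has no concept of log-file rotation or checkpoints. A hypothetical alarm on stalled ingestion might look like this (stream name, account ID, and SNS topic are placeholders).

```python
import boto3

cloudwatch = boto3.client("cloudwatch", region_name="us-east-1")

# Alarm if the stream receives fewer than 1 record in 5 minutes,
# i.e., ingestion has stalled. CloudWatch observes the metric only;
# retries and checkpointing remain the Kinesis Agent's job.
cloudwatch.put_metric_alarm(
    AlarmName="vpc-flow-logs-ingest-stalled",
    Namespace="AWS/Kinesis",
    MetricName="IncomingRecords",
    Dimensions=[{"Name": "StreamName", "Value": "vpc-flow-logs-stream"}],
    Statistic="Sum",
    Period=300,
    EvaluationPeriods=1,
    Threshold=1,
    ComparisonOperator="LessThanThreshold",
    TreatMissingData="breaching",
    AlarmActions=["arn:aws:sns:us-east-1:123456789012:ops-alerts"],
)
```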
https://aws.amazon.com/cloudwatch/

Option F is incorrect - The Streams API cannot be used to monitor, capture, and standardize file changes.
With the Streams API, the PutRecords operation sends multiple records to Kinesis Data Streams in a single request.
By using PutRecords, producers can achieve higher throughput when sending data to their Kinesis data stream.
Each PutRecords request can support up to 500 records.
Each record in the request can be as large as 1 MB, up to a limit of 5 MB for the entire request, including partition keys.
The API also lets producers switch programmatically between submitting a single record and multiple records per HTTP request.
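A minimal producer-side sketch of PutRecords follows; the stream name and record payloads are hypothetical.

```python
import json
import boto3

kinesis = boto3.client("kinesis", region_name="us-east-1")

# Build a full batch of 500 records (the per-request maximum); each
# record carries a payload plus a partition key.
records = [
    {
        "Data": json.dumps({"srcaddr": "10.0.0.5", "dstaddr": "10.0.1.9",
                            "bytes": 4096}).encode("utf-8"),
        "PartitionKey": f"eni-{i % 4}",
    }
    for i in range(500)
]

resp = kinesis.put_records(StreamName="vpc-flow-logs-stream", Records=records)
print("Failed records:", resp["FailedRecordCount"])
```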
https://docs.aws.amazon.com/streams/latest/dev/developing-producers-with-sdk.html

The components that address the given requirements are:
Monitoring file rotation, check pointing, and retry upon failures of logs: Kinesis Agent. The Kinesis Agent continuously monitors a set of log files and handles file rotation, checkpointing, and retry upon failures, delivering the data reliably into the stream. CloudWatch can alarm on the metrics the agent emits, but it is the agent itself, not CloudWatch, that manages rotation, checkpoints, and retries.
Ingestion of VPC logs: Kinesis Agent and Kinesis Connector Library. These two components work together for moving data through Kinesis: the Kinesis Agent ingests the VPC flow logs into Kinesis streams in real time, and the Kinesis Connector Library consumes the stream on the other side. Both handle large amounts of data in real time and provide a scalable ingestion path into AWS.
Data standardization: Kinesis Agent pre-processing. No separate component is named for standardization, but the agent can pre-process the records parsed from monitored files before sending them to the stream (for example, converting delimited log lines to JSON), which is where the flow logs are standardized.
Data loading into S3: Kinesis Connector Library. The Kinesis Connector Library, which requires the KCL, consumes data from Kinesis streams and emits it to S3 (among other destinations). The Streams API, by contrast, is a producer-side interface for writing records into a stream, not for delivering stream data to S3.
Where the metadata of the external tables of Presto on EMR is stored: Glue Data Catalog. Presto is an open-source distributed SQL query engine for interactive querying of large datasets; here it runs on EMR against pre-built external tables. The metadata of those tables is stored in the Glue Data Catalog, a fully managed metadata repository that provides a centralized location for storing and managing metadata and makes it easier to integrate different AWS services.
In summary, the components that address the given requirements are the Kinesis Agent for monitoring file rotation, checkpointing, and retries, for ingesting the VPC flow logs, and for standardizing the data; the Kinesis Connector Library for loading data into S3; and the Glue Data Catalog for storing the metadata of the external tables of Presto on EMR.