How to Automatically Delete Data in S3 after Streaming Logs with Kinesis, Firehose, Lambda, and Redshift


Question

A company wants to stream their log files from their EC2 Instances.

You are using Kinesis streams and Firehose for this process.

The data will be parsed using AWS Lambda and then the resultant data will be stored in AWS Redshift.

After the process completes, the amount of data in S3 increases, and you have had to delete the data manually.

Since this process will be triggered on a continual basis, you need to ensure the right step is taken to delete the data in S3.

How can you accomplish this?

Answers

A. Use an S3 Lifecycle policy on the bucket to delete the data after a set period of time.

B. Use Redshift triggers to delete the data once it has finished loading.

C. Use S3 events to delete the data once it has finished loading.

D. Disable S3 logging.

Explanations

Answer - A.

The AWS Documentation mentions the following.

Kinesis Data Firehose delivers your data to your S3 bucket first and then issues an Amazon Redshift COPY command to load the data into your Amazon Redshift cluster.

Specify an S3 bucket that you own where the streaming data should be delivered.

Create a new S3 bucket or choose an existing bucket that you own.

Kinesis Data Firehose doesn't delete the data from your S3 bucket after loading it to your Amazon Redshift cluster.

You can manage the data in your S3 bucket using a lifecycle configuration.
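
For illustration, a rule like this can be applied with the AWS SDK for Python (boto3); the bucket name, the "firehose/" prefix, and the one-day retention period below are placeholder values you would replace with your own:

    import boto3

    s3 = boto3.client("s3")

    # Expire the intermediate Firehose objects one day after they are created.
    # The bucket name and "firehose/" prefix are placeholders for this example.
    s3.put_bucket_lifecycle_configuration(
        Bucket="my-firehose-staging-bucket",
        LifecycleConfiguration={
            "Rules": [
                {
                    "ID": "expire-firehose-staging-data",
                    "Filter": {"Prefix": "firehose/"},
                    "Status": "Enabled",
                    "Expiration": {"Days": 1},
                }
            ]
        },
    )

Because the rule is scoped to the prefix that Firehose writes to, only the staged intermediate data is removed and anything else in the bucket is left alone.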

Option B is incorrect because triggers don't exist in Redshift.

Option C is incorrect since S3 events would need to invoke a Lambda function and cannot perform the clean-up on their own.

Option D is incorrect since the data build-up is caused by the intermediate data that Kinesis Data Firehose stages in S3, not by S3 logging.

For more information on using various destinations with Firehose, please visit the URL below.

https://docs.aws.amazon.com/firehose/latest/dev/create-destination.html

To ensure that the intermediate data in S3 is automatically deleted after it has been loaded into Redshift, configure a lifecycle rule on the S3 bucket that Kinesis Data Firehose uses as its staging location.

Here are the steps to accomplish this:

  1. Create an S3 bucket that Firehose will use to stage the streaming data.
  2. Configure the EC2 instances to stream their log files into a Kinesis stream.
  3. Use AWS Lambda to parse the log records and transform them into a format that can be loaded into Redshift.
  4. Configure Kinesis Data Firehose to deliver the transformed data to the S3 bucket and issue a COPY command to load it into the Amazon Redshift cluster (see the sketch after this list).
  5. Add a lifecycle configuration rule to the S3 bucket so that the staged objects are deleted automatically after a short retention period.
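
For step 4, the following is a minimal sketch (again using boto3) of creating a Firehose delivery stream that reads from a Kinesis stream and loads into Redshift through an intermediate S3 bucket. Every ARN, the JDBC URL, the table name, and the credentials are placeholder values for illustration only:

    import boto3

    firehose = boto3.client("firehose")

    # All ARNs, the JDBC URL, the table name, and the credentials below are
    # placeholders; substitute the resources from your own account.
    firehose.create_delivery_stream(
        DeliveryStreamName="ec2-log-delivery",
        DeliveryStreamType="KinesisStreamAsSource",
        KinesisStreamSourceConfiguration={
            "KinesisStreamARN": "arn:aws:kinesis:us-east-1:111122223333:stream/ec2-logs",
            "RoleARN": "arn:aws:iam::111122223333:role/firehose-delivery-role",
        },
        RedshiftDestinationConfiguration={
            "RoleARN": "arn:aws:iam::111122223333:role/firehose-delivery-role",
            "ClusterJDBCURL": "jdbc:redshift://example.abc123.us-east-1.redshift.amazonaws.com:5439/logs",
            "CopyCommand": {"DataTableName": "ec2_logs", "CopyOptions": "json 'auto'"},
            "Username": "firehose_user",
            "Password": "replace-with-a-secret",
            # Firehose stages the data in this bucket first and then issues the
            # COPY command; this is the bucket the lifecycle rule should clean up.
            "S3Configuration": {
                "RoleARN": "arn:aws:iam::111122223333:role/firehose-delivery-role",
                "BucketARN": "arn:aws:s3:::my-firehose-staging-bucket",
                "Prefix": "firehose/",
            },
        },
    )

Note that Firehose does not remove the staged objects after the COPY completes, which is why the lifecycle rule from step 5 is still needed.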

Option A: Creating a Lifecycle policy for the S3 bucket automatically deletes objects based on rules such as object age, and it can be scoped with a prefix filter to the location where Firehose stages its intermediate data. Because the clean-up runs on a continual basis without any extra code, this is the right choice for this scenario.

Option B: Using Redshift triggers to delete the data after it has finished loading is not possible. Redshift does not support triggers, and deleting objects from an S3 bucket is outside the scope of the database in any case.

Option C: Using S3 events to delete the data is not the right approach here. An S3 event cannot delete objects on its own; it would have to invoke a Lambda function, adding code and infrastructure that a simple lifecycle rule makes unnecessary.

Option D: Disabling S3 logging does not address the issue. The growth in data comes from the intermediate objects that Firehose stages in S3, not from S3 access logs.