Verify Data Loading in Redshift: Methods and Techniques

Methods to Verify Data Loading into Redshift Tables

Question

A company is planning to load data into a Redshift table from multiple files in an S3 bucket.

They want to verify that the data was loaded correctly.

How can they verify this?

Answers

A. Check the CloudWatch log metrics to check if the data was loaded properly

B. Query the STL_LOAD_COMMITS table

C. Check the CloudTrail log metrics to check if the data was loaded properly

D. Use the ANALYZE command.

Explanations

Answer - B.

The AWS documentation states the following.

#######

Verifying That the Data Was Loaded Correctly.

After the load operation is complete, query the STL_LOAD_COMMITS system table to verify that the expected files were loaded.

You should execute the COPY command and load verification within the same transaction so that if there is a problem with the load you can roll back the entire transaction.

The following query, reproduced with its output after the reference link below, returns entries for loading the tables in the TICKIT database.

#######
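The advice quoted above, to run COPY and the verification in one transaction, can be sketched in Python. This is only an illustration under stated assumptions: `conn` is any DB-API style connection to Redshift with autocommit turned off, `load_and_verify` is a hypothetical helper name, and `expected_files` must be written exactly as Redshift records them in the filename column. `pg_last_copy_id()` is the Redshift function that returns the query ID of the most recent COPY in the current session.

```python
from typing import Iterable, Set

# Files recorded for the most recent COPY in this session.
VERIFY_SQL = (
    "select trim(filename) from stl_load_commits "
    "where query = pg_last_copy_id()"
)


def load_and_verify(conn, copy_sql: str, expected_files: Iterable[str]) -> Set[str]:
    """Run COPY, check the loaded files, then commit or roll back.

    Returns the set of expected files that were not found (empty on success).
    """
    expected = set(expected_files)
    cur = conn.cursor()
    try:
        # Assumption: autocommit is off, so COPY and the check share a transaction.
        cur.execute(copy_sql)
        cur.execute(VERIFY_SQL)
        loaded = {row[0].strip() for row in cur.fetchall()}
        missing = expected - loaded
        if missing:
            conn.rollback()  # at least one expected file is absent: undo the load
        else:
            conn.commit()    # every expected file is present: keep the load
        return missing
    except Exception:
        conn.rollback()      # COPY or the check itself failed
        raise
```

An empty return value means the load was committed; a non-empty one lists the files that were missing when the transaction was rolled back.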

Option A is incorrect because CloudWatch records metrics about the cluster itself, not the data load process.

Option C is incorrect because CloudTrail records API activity for the service, not the data load process.

Option D is incorrect because ANALYZE updates the statistical metadata that the query planner uses to build and choose optimal plans; it does not verify a load.

For more information on verifying whether the data was loaded properly, refer to the URL below.

https://docs.aws.amazon.com/redshift/latest/dg/verifying-that-data-loaded-correctly.html

select query, trim(filename) as filename, curtime, status
from stl_load_commits
where filename like '%tickit%' order by query;

 query |         filename          |          curtime           | status
-------+---------------------------+----------------------------+--------
 22475 | tickit/allusers_pipe.txt  | 2013-02-08 20:58:23.274186 |      1
 22478 | tickit/venue_pipe.txt     | 2013-02-08 20:58:25.070604 |      1
 22480 | tickit/category_pipe.txt  | 2013-02-08 20:58:27.333472 |      1
 22482 | tickit/date2008_pipe.txt  | 2013-02-08 20:58:28.608305 |      1
 22485 | tickit/allevents_pipe.txt | 2013-02-08 20:58:29.99489  |      1
 22487 | tickit/listings_pipe.txt  | 2013-02-08 20:58:37.632939 |      1
 22489 | tickit/sales_tab.txt      | 2013-02-08 20:58:37.632939 |      1
(7 rows)
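Once the file names have been fetched from STL_LOAD_COMMITS, checking them against the files expected from the S3 bucket is a simple comparison. The Python sketch below is one way to do it, not an official method: `missing_files` is a hypothetical helper, and it assumes a file counts as loaded when some recorded name ends with the expected key, which tolerates a bucket prefix on the recorded name but could mis-match keys that are suffixes of one another.

```python
from typing import Iterable, List


def missing_files(expected_keys: Iterable[str], loaded_filenames: Iterable[str]) -> List[str]:
    """Return the expected keys that match no file name from STL_LOAD_COMMITS."""
    # Strip padding in case the query did not trim the filename column.
    loaded = [name.strip() for name in loaded_filenames]
    return [
        key for key in expected_keys
        if not any(name.endswith(key) for name in loaded)
    ]


# Hypothetical values: two files were expected, only one appears in the table.
loaded = ["tickit/allusers_pipe.txt   ", "tickit/venue_pipe.txt      "]
expected = ["tickit/allusers_pipe.txt", "tickit/sales_tab.txt"]
print(missing_files(expected, loaded))  # ['tickit/sales_tab.txt']
```

An empty list means every expected file has an entry in the table.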

The four options compare as follows for verifying that data was loaded correctly into a Redshift table from multiple files in an S3 bucket:

A. Check the CloudWatch log metrics to check if the data was loaded properly: Redshift publishes cluster-level metrics to CloudWatch, such as CPU utilization, database connections, and disk space used. These metrics describe the health and performance of the cluster. They do not show which files a COPY command loaded or whether each file was committed, so CloudWatch cannot confirm that a load succeeded.

B. Query the STL_LOAD_COMMITS table: Redshift records the progress of each data file in the STL_LOAD_COMMITS system table as the file is loaded. Each entry includes the query ID of the COPY command, the file name, the number of lines scanned (lines_scanned), and the time the entry was last updated (curtime). Comparing the file names in this table with the list of files expected from the S3 bucket shows whether every file was loaded. This is the correct answer.

C. Check the CloudTrail log metrics to check if the data was loaded properly: CloudTrail records API calls made to the Redshift service, such as creating or modifying a cluster. It does not record which files a COPY command loaded or whether the rows were committed, so it cannot verify the load.

D. Use the ANALYZE command: ANALYZE updates the table statistics that the query planner uses to choose execution plans. Running it after a large load is good practice, but it does not report which files were loaded or whether the load succeeded.

In conclusion, option B is the correct way to verify that data was loaded correctly into a Redshift table from multiple files in an S3 bucket. Options A and C monitor the cluster and the service API rather than the load itself, and option D only refreshes planner statistics.
