Ordering Historical Data in Redshift for Efficient Queries

Best Sort Key for Table: Timestamp


Question

A company is planning on sending and storing historical data for an application in a Redshift cluster.

The table stores the orders received by the application and has the following columns:

Order ID | Product ID | Order Value | Timestamp

Most queries will look up recently placed orders.

Which of the following columns would be the ideal sort key for the table?

Answers

A. Order ID

B. Product ID

C. Order Value

D. Timestamp


Explanations


Answer - D.

The AWS documentation gives the following best practices for choosing a sort key.

########

Choose the Best Sort Key.

Amazon Redshift stores your data on disk in sorted order according to the sort key.

The Amazon Redshift query optimizer uses sort order when it determines optimal query plans.

If recent data is queried most frequently, specify the timestamp column as the leading column for the sort key.

Queries are more efficient because they can skip entire blocks that fall outside the time range.

If you do frequent range filtering or equality filtering on one column, specify that column as the sort key.

Amazon Redshift can skip reading entire blocks of data for that column.

It can do so because it tracks the minimum and maximum column values stored on each block and can skip blocks that don't apply to the predicate range.
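The block-skipping behavior described above can be illustrated with a small simulation. This is only a sketch, not how Redshift is implemented: the block size of 100 rows, the integer timestamps, and the function names `build_zone_map` and `blocks_to_scan` are all assumptions made for illustration.

```python
import random

BLOCK_SIZE = 100  # rows per simulated block (illustrative; not Redshift's real block size)


def build_zone_map(timestamps, block_size=BLOCK_SIZE):
    """Return one (min, max) pair per block of consecutively stored rows."""
    zone_map = []
    for start in range(0, len(timestamps), block_size):
        block = timestamps[start:start + block_size]
        zone_map.append((min(block), max(block)))
    return zone_map


def blocks_to_scan(zone_map, lo, hi):
    """Count blocks whose [min, max] range overlaps the predicate range [lo, hi]."""
    return sum(1 for bmin, bmax in zone_map if bmax >= lo and bmin <= hi)


# 10,000 hypothetical orders with integer timestamps 0..9999.
timestamps = list(range(10_000))

# Layout A: rows stored in timestamp order (timestamp is the sort key).
sorted_layout = build_zone_map(timestamps)

# Layout B: rows stored in an order unrelated to the timestamp.
rng = random.Random(42)
shuffled = timestamps[:]
rng.shuffle(shuffled)
unsorted_layout = build_zone_map(shuffled)

# "Recent orders": the newest 500 timestamps.
lo, hi = 9_500, 9_999
sorted_scanned = blocks_to_scan(sorted_layout, lo, hi)
unsorted_scanned = blocks_to_scan(unsorted_layout, lo, hi)

print(f"sorted by timestamp: {sorted_scanned} of {len(sorted_layout)} blocks")
print(f"unrelated order:     {unsorted_scanned} of {len(unsorted_layout)} blocks")
```

With rows stored in timestamp order, only the 5 blocks covering the newest 500 timestamps overlap the predicate. When the storage order is unrelated to the timestamp, almost every block's min/max range overlaps the predicate, so very few blocks can be skipped.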

If you frequently join a table, specify the join column as both the sort key and the distribution key.

Doing this enables the query optimizer to choose a sort merge join instead of a slower hash join.

Because the data is already sorted on the join key, the query optimizer can bypass the sort phase of the sort merge join.
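To see why pre-sorted inputs let the join skip its sort phase, here is a minimal sketch of the merge step of a sort merge join. It assumes both inputs are already ordered by the join key; the function name `merge_join` and the sample rows are hypothetical and do not reflect Redshift internals.

```python
def merge_join(left, right):
    """Join two lists of (key, value) pairs that are ALREADY sorted by key.

    Returns (key, left_value, right_value) tuples. No sort step is performed;
    each input is walked once from start to end.
    """
    result = []
    i = j = 0
    while i < len(left) and j < len(right):
        lkey, rkey = left[i][0], right[j][0]
        if lkey < rkey:
            i += 1
        elif lkey > rkey:
            j += 1
        else:
            # Find the run of rows on the right that share this key.
            j_end = j
            while j_end < len(right) and right[j_end][0] == lkey:
                j_end += 1
            # Pair every left row with this key against that run.
            while i < len(left) and left[i][0] == lkey:
                for k in range(j, j_end):
                    result.append((lkey, left[i][1], right[k][1]))
                i += 1
            j = j_end
    return result


# Hypothetical rows, both lists sorted by product ID (the join key).
orders = [(1, "order-a"), (2, "order-b"), (2, "order-c"), (4, "order-d")]
products = [(1, "widget"), (2, "gadget"), (3, "gizmo")]

joined = merge_join(orders, products)
print(joined)
# [(1, 'order-a', 'widget'), (2, 'order-b', 'gadget'), (2, 'order-c', 'gadget')]
```

If either input were not sorted on the join key, a sort would have to run before this merge step, which is the cost the documentation says can be bypassed.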

########

Because recent orders are queried most often, the AWS recommendation applies directly: the timestamp column is the best choice for the sort key.

For more information on sort key best practices, see:

https://docs.aws.amazon.com/redshift/latest/dg/c_best-practices-sort-key.html

For optimizing queries on a Redshift cluster, it is recommended to use sort keys, which are columns that determine the order in which data is physically stored on disk. Choosing the appropriate sort key can significantly improve query performance.

In this scenario, most queries look for recently placed orders, so Timestamp is the recommended sort key. Sorting by Timestamp stores the most recent orders together in adjacent blocks, so a query that filters on a time range can skip the blocks outside that range and scan less data.
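As a rough sketch of what this could look like in practice, the snippet below builds a Redshift `CREATE TABLE` statement with a `SORTKEY` clause and a sample range query. The table name `orders`, the column names and types, the 7-day window, and the helper `build_create_table` are all assumptions; the question does not specify any of them.

```python
def build_create_table(table, columns, sort_key):
    """Return a CREATE TABLE statement with a single-column SORTKEY clause.

    columns is a list of (name, sql_type) pairs; sort_key must name one of them.
    """
    names = [name for name, _ in columns]
    if sort_key not in names:
        raise ValueError(f"sort key {sort_key!r} is not a column of {table}")
    column_defs = ",\n".join(f"    {name} {sql_type}" for name, sql_type in columns)
    return f"CREATE TABLE {table} (\n{column_defs}\n)\nSORTKEY ({sort_key});"


# Hypothetical names and types for the four columns in the question.
ORDER_COLUMNS = [
    ("order_id", "BIGINT"),
    ("product_id", "BIGINT"),
    ("order_value", "DECIMAL(12,2)"),
    ("order_ts", "TIMESTAMP"),
]

ddl = build_create_table("orders", ORDER_COLUMNS, sort_key="order_ts")
print(ddl)

# A range filter on the sort key column, the kind of query the scenario describes.
RECENT_ORDERS_SQL = (
    "SELECT order_id, product_id, order_value, order_ts "
    "FROM orders "
    "WHERE order_ts >= DATEADD(day, -7, GETDATE());"
)
print(RECENT_ORDERS_SQL)
```

The helper only produces SQL text and does not connect to a cluster. Running the statements would require a Redshift connection, which is outside the scope of this sketch.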

On the other hand, choosing Order ID, Product ID, or Order Value as the sort key would not be as efficient for this use case. Sorting by Order ID or Product ID would group data by those columns, which is not useful for queries looking for recent orders. Sorting by Order Value could also be suboptimal because it does not provide any temporal context to the data.

Therefore, the recommended column to use as the sort key for this table would be Timestamp.
