AWS Redshift Distribution Styles for Big Data Analytics | Hymutabs Ltd

Redshift Distribution Styles for ORDER_FCT and SALES_FCT Tables

Question

Hymutabs Ltd (Hymutabs) is a global environmental solutions company running its operations in Asia Pacific, the Middle East, Africa and the Americas.

It maintains more than 10 exploration labs around the world, including a knowledge centre, an "innovative process development centre" in Singapore, a materials and membrane products development centre as well as advanced machining, prototyping and industrial design functions. Hymutabs hosts their existing enterprise infrastructure on AWS and runs multiple applications to address the product life cycle management.

The datasets are available in Aurora, RDS and S3 in file format.

Hymutabs Management team is interested in building analytics around product life cycle and advanced machining, prototyping and other functions. The IT team proposed Redshift to fulfill the EDW and analytics requirements.

They adopt the modeling approaches laid out by Bill Inmon and Ralph Kimball to design the solution efficiently.

The team understands that the data loaded into Redshift will be in the terabytes. It has identified multiple massive dimensions, facts, and summaries of millions of records, and is working on establishing best practices to address the design concerns. There are 6 tables that they are currently working on:

  1. ORDER_FCT is a fact table with billions of rows related to orders.
  2. SALES_FCT is a fact table with billions of rows related to sales transactions. This table is specifically used to generate EOD (End of Day), EOW (End of Week), and EOM (End of Month) reports, and also for sales queries.
  3. CUST_DIM is a dimension table with billions of rows related to customers. It is a Type 2 dimension table.
  4. PART_DIM is a part dimension table with billions of records that defines the materials that were ordered.
  5. DATE_DIM is a dimension table.
  6. SUPPLIER_DIM holds the information about the suppliers Hymutabs works with.

SALES_FCT and DATE_DIM are joined together frequently since EOD sales reports are generated every day.

Please suggest a distribution style for both tables.

Select 1 option.

Answers

Explanations


A. B. C. D. E.

Answer: C.

Option A is incorrect - KEY distribution distributes the rows according to the values in one column.

KEY distribution is a reasonable choice for SALES_FCT, but DATE_DIM has very few rows, and because the two tables are distributed on different columns, a lot of data is copied between nodes when they are joined.

This approach works, but it is not the best design for the solution.

https://docs.aws.amazon.com/redshift/latest/dg/tutorial-tuning-tables-distribution.html

Option B is incorrect - EVEN distribution spreads the rows evenly across the slices in a round-robin fashion, regardless of the values in any particular column.

EVEN distribution is appropriate when a table does not participate in joins.

For a fact table like SALES_FCT, all the nodes participate in every query, even though an EOD report covers only that particular day.

https://docs.aws.amazon.com/redshift/latest/dg/tutorial-tuning-tables-distribution.html

Option C is correct - ALL distribution makes a copy of the entire table on every compute node.

For billion-row tables such as SALES_FCT this is not the right approach, which is why SALES_FCT uses KEY distribution in this option. ALL distribution is the ideal design for the DATE_DIM table, which has a very low number of rows and can be copied to all nodes.

https://docs.aws.amazon.com/redshift/latest/dg/tutorial-tuning-tables-distribution.html

Option D is incorrect - ALL distribution makes a copy of the entire table on every compute node.

For billion-row tables this is not the right approach, so it cannot be used for a massive table like SALES_FCT.

https://docs.aws.amazon.com/redshift/latest/dg/tutorial-tuning-tables-distribution.html
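The storage cost of ALL distribution can be illustrated with simple arithmetic. The sketch below is only an illustration: the node count and row counts are hypothetical values chosen for the example, not figures from the question.

```python
def stored_rows(table_rows, num_nodes, dist_style):
    """Total rows stored cluster-wide for one table under a given style."""
    if dist_style == "ALL":
        return table_rows * num_nodes  # one full copy per compute node
    return table_rows                  # KEY and EVEN store each row once


NUM_NODES = 8                   # hypothetical cluster size
SALES_FCT_ROWS = 2_000_000_000  # hypothetical stand-in for "billions of rows"
DATE_DIM_ROWS = 3_650           # hypothetical: roughly ten years of dates

sales_all = stored_rows(SALES_FCT_ROWS, NUM_NODES, "ALL")
date_all = stored_rows(DATE_DIM_ROWS, NUM_NODES, "ALL")

print(f"SALES_FCT under ALL: {sales_all:,} rows stored")  # 16,000,000,000
print(f"DATE_DIM under ALL:  {date_all:,} rows stored")   # 29,200
```

Under these assumptions, replicating the fact table multiplies its footprint by the number of nodes, while replicating the small dimension costs almost nothing.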

Option E is incorrect - EVEN distribution spreads the rows evenly across the slices in a round-robin fashion, regardless of the values in any particular column.

EVEN distribution is appropriate when a table does not participate in joins.

For a fact table like SALES_FCT, all the nodes participate in every query, even though an EOD report covers only that particular day.

The SALES_FCT table needs to be designed with a well-chosen distribution key in mind.

https://docs.aws.amazon.com/redshift/latest/dg/tutorial-tuning-tables-distribution.html

In this scenario, the Hymutabs Management team is interested in building analytics around product life cycle and advanced machining, prototyping, and other functions. The IT team proposed Redshift to fulfill the EDW and analytics requirements. The team understands that the data loaded into Redshift would be in terabytes and identified multiple massive dimensions, facts, summaries of millions of records, and are working on establishing the best practices to address the design concerns.

There are 6 tables that they are currently working on, including ORDER_FCT (Fact Table), SALES_FCT (Fact Table), CUST_DIM (Dimension table), PART_DIM (Part Dimension table), DATE_DIM (Dimension table), and SUPPLIER_DIM (Dimension table). SALES_FCT and DATE_DIM are joined together frequently since EOD sales reports are generated every day.

In Redshift, selecting the correct distribution style is critical to achieve optimal performance. Distribution style determines how data is distributed across the nodes in the Redshift cluster. It is important to choose the right distribution style to ensure that the data is distributed evenly, allowing queries to execute efficiently across the nodes.

The distribution style can be one of the following:

  1. Even Distribution: Rows are spread evenly across the slices in a round-robin fashion, regardless of the values in any column.
  2. Key Distribution: Rows are distributed based on the values in a particular column, and all rows with the same value in that column are placed on the same slice.
  3. All Distribution: A full copy of the table is stored on every node in the cluster.
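The three styles can be sketched as a toy simulation. This is a minimal illustration, not Redshift's actual implementation: `crc32` stands in for whatever hash function Redshift uses internally, and the slice count, node count, and column names (`sale_id`, `date_key`) are hypothetical.

```python
from collections import Counter
from zlib import crc32

NUM_SLICES = 4  # hypothetical cluster size, for illustration only


def even_distribution(rows, num_slices=NUM_SLICES):
    """Round-robin: row i goes to slice i mod num_slices, ignoring values."""
    return [i % num_slices for i, _ in enumerate(rows)]


def key_distribution(rows, key, num_slices=NUM_SLICES):
    """Hash the distribution column; equal key values always share a slice."""
    return [crc32(str(row[key]).encode()) % num_slices for row in rows]


def all_distribution(rows, num_nodes):
    """Every node keeps its own full copy of the table."""
    return {node: list(rows) for node in range(num_nodes)}


# Hypothetical toy table: 12 sales spread over 3 dates.
rows = [{"sale_id": i, "date_key": 20240101 + (i % 3)} for i in range(12)]

even_slices = even_distribution(rows)
key_slices = key_distribution(rows, "date_key")
copies = all_distribution(rows, num_nodes=2)

print("EVEN rows per slice:", dict(Counter(even_slices)))
print("KEY rows per slice: ", dict(Counter(key_slices)))
print("ALL rows per node:  ", {n: len(c) for n, c in copies.items()})
```

In this sketch EVEN always yields equal slice sizes, KEY keeps rows with the same `date_key` together (so slice sizes depend on the data), and ALL repeats the full table once per node.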

Now, let's consider the options provided in the question and evaluate them based on the details provided.

Option A: Distribute the SALES_FCT with KEY DISTRIBUTION on its own PRIMARY KEY (one of the columns) while DATE_DIM is distributed with KEY DISTRIBUTION on its PRIMARY KEY

This option suggests using KEY DISTRIBUTION for both the SALES_FCT and DATE_DIM tables, each on its own primary key. The question states that SALES_FCT and DATE_DIM are frequently joined together since EOD sales reports are generated every day. Because each table is distributed on its own primary key rather than on the shared join column, matching rows are not guaranteed to sit on the same slice, so the join requires moving data between nodes, which hurts query performance. A poorly chosen distribution key can also cause data skew, where some nodes hold far more data than others.
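The locality problem with Option A can be sketched with a toy model. This is an illustration under stated assumptions: `crc32` stands in for Redshift's internal hash, and the slice count, table sizes, and column names (`sale_id`, `date_key`) are hypothetical.

```python
from zlib import crc32

NUM_SLICES = 4  # hypothetical cluster size


def slice_for(value, num_slices=NUM_SLICES):
    """Map a distribution-key value to a slice (stand-in hash)."""
    return crc32(str(value).encode()) % num_slices


# Hypothetical toy tables: 1000 sales over 5 dates.
sales = [{"sale_id": i, "date_key": 20240101 + (i % 5)} for i in range(1000)]
dates = [{"date_key": 20240101 + d} for d in range(5)]

# DATE_DIM distributed by KEY on its primary key, date_key.
date_slice = {d["date_key"]: slice_for(d["date_key"]) for d in dates}


def rows_needing_movement(fact_dist_column):
    """Count fact rows whose matching DATE_DIM row is on another slice."""
    return sum(
        1
        for s in sales
        if slice_for(s[fact_dist_column]) != date_slice[s["date_key"]]
    )


# Option A: SALES_FCT distributed on its own primary key.
moved_own_pk = rows_needing_movement("sale_id")
# Contrast only: both tables distributed on the join column.
moved_shared_key = rows_needing_movement("date_key")

print("rows to move, fact keyed on sale_id: ", moved_own_pk)
print("rows to move, fact keyed on date_key:", moved_shared_key)
```

In this model the join is fully local only when both tables share the join column as their distribution key. Keying a billion-row fact table on a low-cardinality date column would concentrate rows on a few slices, which is one reason replicating the small dimension is the more attractive way to get a local join.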

Option B: Distribute the SALES_FCT with EVEN DISTRIBUTION on its own PRIMARY KEY (one of the columns) while DATE_DIM is distributed with EVEN DISTRIBUTION on its PRIMARY KEY

This option suggests using EVEN DISTRIBUTION for both the SALES_FCT and DATE_DIM tables. EVEN distribution ignores column values, so no column acts as a distribution key, and rows that join to each other are not placed together. Since SALES_FCT is frequently joined with DATE_DIM, every join would require moving data between nodes.

Option C: Distribute the SALES_FCT with KEY DISTRIBUTION on its own PRIMARY KEY (one of the columns) while DATE_DIM is distributed with ALL DISTRIBUTION on its PRIMARY KEY

This option suggests using KEY DISTRIBUTION for SALES_FCT and ALL DISTRIBUTION for DATE_DIM. Since SALES_FCT is a fact table with billions of rows related to sales transactions, it is better to distribute it with KEY DISTRIBUTION on its own primary key. DATE_DIM, being a small dimension table, can be distributed with ALL DISTRIBUTION, so every node holds a full copy and the join with SALES_FCT needs no data movement.
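As a rough sketch, Option C could translate into table definitions along the following lines. The column names and types are hypothetical, since the question does not give the schemas; only the `DISTSTYLE` and `DISTKEY` clauses are the point. The DDL is held in Python strings so it can be passed to whichever database driver is in use.

```python
# Hypothetical columns; the question does not specify the table schemas.
SALES_FCT_DDL = """
CREATE TABLE sales_fct (
    sale_id   BIGINT NOT NULL,
    date_key  INTEGER NOT NULL,
    amount    DECIMAL(18, 2)
)
DISTSTYLE KEY
DISTKEY (sale_id);
"""

# Small dimension: replicate a full copy to every node.
DATE_DIM_DDL = """
CREATE TABLE date_dim (
    date_key      INTEGER NOT NULL,
    calendar_date DATE NOT NULL
)
DISTSTYLE ALL;
"""

for ddl in (SALES_FCT_DDL, DATE_DIM_DDL):
    print(ddl.strip())
    print()
```

With DATE_DIM replicated, each node can join its local share of SALES_FCT against its own copy of the dimension.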

Option D: Distribute the SALES_FCT with ALL DISTRIBUTION on its own PRIMARY KEY (one of the columns) while DATE_DIM is distributed with EVEN DISTRIBUTION on its PRIMARY KEY

This option suggests using ALL DISTRIBUTION for SALES_FCT and EVEN DISTRIBUTION for DATE_DIM. Since SALES_FCT is a Fact Table with billions of rows related to sales transactions, it is better to distribute it with ALL