AWS Redshift Distribution Design for Hymutabs Ltd

Question

Hymutabs Ltd (Hymutabs) is a global environmental solutions company operating in Asia Pacific, the Middle East, Africa and the Americas.

It maintains more than 10 exploration labs around the world, including a knowledge centre, an "innovative process development centre" in Singapore, a materials and membrane products development centre as well as advanced machining, prototyping and industrial design functions. Hymutabs hosts their existing enterprise infrastructure on AWS and runs multiple applications to address the product life cycle management. The datasets are available in Aurora, RDS and S3 in file format.

The Hymutabs management team is interested in building analytics around the product life cycle and advanced machining, prototyping and other functions. The IT team proposed Redshift to fulfill the EDW and analytics requirements.

They adopt the modeling approaches laid out by Bill Inmon and Ralph Kimball to design the solution efficiently.

The team understands that the data loaded into Redshift will run to terabytes. It has identified multiple massive dimensions, facts, and summaries of millions of records, and is establishing best practices to address the design concerns. It is currently working on 6 tables:

ORDER_FCT is a fact table with billions of rows related to orders.

SALES_FCT is a fact table with billions of rows related to sales transactions. This table is used specifically to generate EOD (End of Day), EOW (End of Week), and EOM (End of Month) reports, as well as sales queries.

CUST_DIM is a dimension table with billions of rows related to customers. It is a Type 2 dimension table.

PART_DIM is a part dimension table with billions of records that defines the materials that were ordered.

DATE_DIM is a dimension table.

SUPPLIER_DIM holds information about the suppliers Hymutabs works with.

One key requirement is that ORDER_FCT and PART_DIM are joined together in most order-related queries.

ORDER_FCT has many other dimensions to support analysis. How would you design the distribution? Select 1 option.

Answers

Explanations

A. B. C. D. E.

Answer: D.

Option A is incorrect - KEY DISTRIBUTION distributes the rows according to the values in one column.

Queries trigger a lot of data redistribution when ORDER_FCT and PART_DIM are not distributed on the same key.

https://docs.aws.amazon.com/redshift/latest/dg/tutorial-tuning-tables-distribution.html
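The effect of mismatched distribution keys can be illustrated with a small simulation. This is only a sketch: the slice count, the row contents, the column names `order_id` and `part_key`, and the use of MD5 as a stand-in for Redshift's internal hash are all assumptions made for illustration.

```python
import hashlib

NUM_SLICES = 4  # hypothetical slice count


def slice_for(value, num_slices=NUM_SLICES):
    """Map a distribution-key value to a slice using a stable hash.

    MD5 is only a stand-in here; Redshift's real hash function is internal.
    """
    digest = hashlib.md5(str(value).encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % num_slices


# Hypothetical rows: PART_DIM(part_key) and ORDER_FCT(order_id, part_key).
parts = [{"part_key": p} for p in range(100)]
orders = [{"order_id": 10_000 + i, "part_key": i % 100} for i in range(1000)]

# PART_DIM is distributed on part_key in both scenarios.
part_slice = {p["part_key"]: slice_for(p["part_key"]) for p in parts}


def colocated_fraction(order_dist_col):
    """Fraction of order rows already on the slice holding their matching part row."""
    local = sum(
        1 for o in orders
        if slice_for(o[order_dist_col]) == part_slice[o["part_key"]]
    )
    return local / len(orders)


same_key = colocated_fraction("part_key")       # ORDER_FCT distributed on the join column
different_key = colocated_fraction("order_id")  # ORDER_FCT distributed on its own key

print(f"same key:      {same_key:.2f}")
print(f"different key: {different_key:.2f}")
```

When ORDER_FCT is distributed on the join column, every order row lands on the slice that holds its part row. When it is distributed on a different column, only a minority of rows happen to be co-located, and the rest would have to be moved at query time.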

Option B is incorrect - ALL distribution places a copy of the entire table on every compute node.

Because these tables hold billions of records, replicating them to every node is not the right design.

https://docs.aws.amazon.com/redshift/latest/dg/tutorial-tuning-tables-distribution.html
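A rough back-of-the-envelope calculation shows why replication is costly at this scale. The row count and node count below are hypothetical, chosen only to match the "billions of rows" description.

```python
def all_distribution_rows(table_rows: int, num_nodes: int) -> int:
    """Row copies stored cluster-wide under DISTSTYLE ALL (one full copy per node)."""
    return table_rows * num_nodes


def single_copy_rows(table_rows: int) -> int:
    """Rows stored when each row lives on exactly one slice (KEY or EVEN)."""
    return table_rows


ROWS = 2_000_000_000  # hypothetical table size
NODES = 8             # hypothetical cluster size

print(all_distribution_rows(ROWS, NODES))  # 16000000000
print(single_copy_rows(ROWS))              # 2000000000
```

Every load or update must also be applied to each copy, so the cost grows with the number of nodes.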

Option C is incorrect - EVEN DISTRIBUTION evenly distributes the rows across the slices in a round-robin fashion, regardless of the values in any particular column.

EVEN distribution is appropriate when a table does not participate in joins.

Because these tables are joined in most queries, it is not the right approach here.

https://docs.aws.amazon.com/redshift/latest/dg/tutorial-tuning-tables-distribution.html
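Round-robin placement can be sketched in a few lines. The slice count and the sample rows are assumptions for illustration; the point is that rows are balanced across slices, but rows sharing a join value end up scattered.

```python
from collections import Counter

NUM_SLICES = 4  # hypothetical slice count


def even_distribute(rows, num_slices=NUM_SLICES):
    """Assign rows to slices round-robin, ignoring all column values."""
    return [(i % num_slices, row) for i, row in enumerate(rows)]


# Hypothetical order rows that all reference the same part.
orders = [{"order_id": i, "part_key": 42} for i in range(8)]

placement = even_distribute(orders)
per_slice = Counter(slice_id for slice_id, _ in placement)
slices_holding_part_42 = {
    slice_id for slice_id, row in placement if row["part_key"] == 42
}

print(dict(per_slice))                 # {0: 2, 1: 2, 2: 2, 3: 2}
print(sorted(slices_holding_part_42))  # [0, 1, 2, 3]
```

The rows are perfectly balanced, yet orders for a single part sit on every slice, so a join to the matching part row cannot be resolved locally.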

Option D is correct - KEY DISTRIBUTION distributes the rows according to the values in one column.

When both tables are distributed on the same key, matching rows are stored together, so the join needs no redistribution.

This is the best design approach.

https://docs.aws.amazon.com/redshift/latest/dg/tutorial-tuning-tables-distribution.html
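As a sketch of what such a design could look like, the helper below builds Redshift `CREATE TABLE` statements that use `DISTSTYLE KEY` with the same `DISTKEY` column on both tables. The column names (`part_key`, `order_id`, `part_name`, `order_amount`) and their types are hypothetical, since the question does not give the table schemas.

```python
def create_table_ddl(table, columns, distkey):
    """Build a CREATE TABLE statement with KEY distribution on `distkey`."""
    col_sql = ",\n    ".join(f"{name} {sql_type}" for name, sql_type in columns)
    return (
        f"CREATE TABLE {table} (\n"
        f"    {col_sql}\n"
        f")\n"
        f"DISTSTYLE KEY\n"
        f"DISTKEY ({distkey});"
    )


# Hypothetical schemas: both tables share the join column part_key.
part_dim_ddl = create_table_ddl(
    "part_dim",
    [("part_key", "BIGINT"), ("part_name", "VARCHAR(100)")],
    distkey="part_key",
)
order_fct_ddl = create_table_ddl(
    "order_fct",
    [("order_id", "BIGINT"), ("part_key", "BIGINT"), ("order_amount", "DECIMAL(12,2)")],
    distkey="part_key",
)

print(part_dim_ddl)
print(order_fct_ddl)
```

Note that ORDER_FCT is distributed on the column it uses to join to PART_DIM, not on its own identifier, so rows with the same `part_key` from both tables hash to the same slice.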

Option E is incorrect - EVEN DISTRIBUTION evenly distributes the rows across the slices in a round-robin fashion, regardless of the values in any particular column.

EVEN distribution is appropriate when a table does not participate in joins.

Because these tables are joined in most queries, it is not the right approach here.

https://docs.aws.amazon.com/redshift/latest/dg/tutorial-tuning-tables-distribution.html

To design the distribution for the ORDER_FCT and PART_DIM tables in Redshift, we need to consider the characteristics of each table and their relationship with each other. The distribution style determines how the data is physically stored across the nodes in the cluster, which can impact query performance.

In this scenario, ORDER_FCT is a Fact table with billions of rows related to orders, and it has many other dimensions to support analysis. PART_DIM is a part dimension table with billions of records that defines the materials that were ordered. It is important to note that ORDER_FCT and PART_DIM are joined together in most of the order-related queries.

Based on this information, the recommended distribution style is KEY distribution for both tables, with both tables distributed on the same key: the column on which they are joined. For PART_DIM this is typically its primary key, and for ORDER_FCT it is the corresponding foreign key that references the part.

Redshift hashes the distribution key to decide which slice stores each row, so rows with the same key value always land on the same slice. When ORDER_FCT and PART_DIM share a distribution key, each order row is co-located with its matching part row, and the join can run locally on every slice without moving data between nodes. Distributing each table on a different key (for example, each on its own primary key) loses this benefit, because matching rows end up on different slices and must be redistributed at query time.

Distributing the tables with ALL distribution would replicate each entire table on every node, which is not suitable for tables with billions of rows: it multiplies the storage required and slows down loads and updates. Distributing the tables with EVEN distribution would spread rows evenly across the slices, but it does not co-locate matching rows, so joins between the two tables would require redistribution.

Therefore, option D is the recommended distribution design for the ORDER_FCT and PART_DIM tables in this scenario.