Amazon BDS-C00 Exam: Big Data Analytics Solution for Hymutabs Ltd

Redshift Distribution Styles

Question

Hymutabs Ltd (Hymutabs) is a global environmental solutions company running its operations in Asia Pacific, the Middle East, Africa and the Americas.

It maintains more than 10 exploration labs around the world, including a knowledge centre, an "innovative process development centre" in Singapore, a materials and membrane products development centre, as well as advanced machining, prototyping and industrial design functions. Hymutabs hosts its existing enterprise infrastructure on AWS and runs multiple applications for product life cycle management.

The datasets are available in Aurora, RDS, and as files in S3.

Hymutabs Management team is interested in building analytics around product life cycle and advanced machining, prototyping and other functions. The IT team proposed Redshift to fulfill the EDW and analytics requirements.

They adopt the modeling approaches laid out by Bill Inmon and Ralph Kimball to design the solution efficiently.

The team understands that the data loaded into Redshift will be in terabytes. They have identified multiple massive dimensions, facts, and summaries of millions of records, and are working on establishing best practices to address the design concerns. There are 6 tables that they are currently working on:

ORDER_FCT is a fact table with billions of rows related to orders.

SALES_FCT is a fact table with billions of rows related to sales transactions. This table is specifically used to generate EOD (End of Day), EOW (End of Week), and EOM (End of Month) reports, and also for sales queries.

CUST_DIM is a dimension table with billions of rows related to customers. It is a Type 2 dimension table.

PART_DIM is a part dimension table with billions of records that defines the materials that were ordered.

DATE_DIM is a dimension table.

SUPPLIER_DIM holds the information about the suppliers Hymutabs works with.

Most of the sales queries involve a subset of the customer dimension.

Please advise on the distribution styles.

Select 1 option.

Answers

Explanations


Answer: A.

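Option A is correct - KEY distribution distributes the rows according to the values in one column.

With both tables distributed on the same key, matching SALES_FCT and CUST_DIM rows are stored on the same slice, so the join between them needs no data movement between nodes.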

https://docs.aws.amazon.com/redshift/latest/dg/tutorial-tuning-tables-distribution.html
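The co-location argument for Option A can be illustrated with a small simulation. This is only a sketch under stated assumptions: the modulo function stands in for Redshift's internal hash, and the slice count, the `cust_id` column and the sample rows are all hypothetical.

```python
NUM_SLICES = 4  # hypothetical slice count


def slice_for(key_value, num_slices=NUM_SLICES):
    """Stand-in for Redshift's internal hash: map a key value to a slice."""
    return key_value % num_slices


# Hypothetical customer IDs; cust_id is an assumed column name.
cust_dim_ids = [1, 2, 3, 4, 5, 6, 7, 8]
# Hypothetical sales rows as (cust_id, amount).
sales_fct = [(1, 10.0), (1, 25.0), (3, 7.5), (5, 99.0), (8, 12.0), (8, 3.0)]

# Place both tables with the same function on the same column.
cust_slice = {cust_id: slice_for(cust_id) for cust_id in cust_dim_ids}
sales_slice = [(slice_for(cust_id), cust_id) for cust_id, _ in sales_fct]

# A sales row joins locally when its customer row sits on the same slice.
local_joins = sum(1 for s, cust_id in sales_slice if cust_slice[cust_id] == s)
print(f"{local_joins} of {len(sales_fct)} sales rows join without data movement")
```

Because both tables go through the same function on the same column, every sales row lands next to its customer row regardless of how many slices there are.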

Option B is incorrect - EVEN DISTRIBUTION evenly distributes the rows across the slices in a round-robin fashion, regardless of the values in any particular column.

EVEN distribution is appropriate when a table does not participate in joins. SALES_FCT and CUST_DIM are joined, so it does not fit here.

https://docs.aws.amazon.com/redshift/latest/dg/tutorial-tuning-tables-distribution.html
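A rough sketch of why round-robin placement does not help joins: rows are assigned to slices in arrival order, so a sales row and its customer row only share a slice by coincidence. The slice count, arrival order and IDs below are hypothetical.

```python
from collections import Counter
from itertools import cycle

NUM_SLICES = 4  # hypothetical slice count


def round_robin(rows, num_slices=NUM_SLICES):
    """Assign rows to slices in arrival order, ignoring column values."""
    slices = cycle(range(num_slices))
    return [(next(slices), row) for row in rows]


# Hypothetical cust_id values in the order the rows were loaded.
sales_cust_ids = [1, 1, 3, 5, 8, 8, 2, 4]
cust_ids = [1, 2, 3, 4, 5, 6, 7, 8]

sales_placed = round_robin(sales_cust_ids)
cust_placed = {cust_id: s for s, cust_id in round_robin(cust_ids)}

# Rows per slice are balanced, which is the strength of EVEN distribution.
per_slice = Counter(s for s, _ in sales_placed)

# Count sales rows that happen to share a slice with their customer row.
local = sum(1 for s, cust_id in sales_placed if cust_placed[cust_id] == s)
print(dict(per_slice), f"{local} of {len(sales_placed)} rows join locally")
```

The slices end up evenly filled, but only some joins are local; the rest would need rows moved between slices at query time.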

Option C is incorrect - ALL distribution makes a copy of the entire table on every compute node.

Because both tables hold billions of rows, this is not the right design approach; ALL distribution is not suitable for large tables.

https://docs.aws.amazon.com/redshift/latest/dg/tutorial-tuning-tables-distribution.html
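The storage cost of ALL distribution is simple arithmetic: one full copy per node. A minimal sketch, with a hypothetical row count and node count:

```python
def stored_rows(table_rows, num_nodes, diststyle):
    """Total rows stored cluster-wide for one table under a distribution style."""
    if diststyle == "ALL":
        return table_rows * num_nodes  # one full copy on every node
    return table_rows  # KEY and EVEN store each row exactly once


rows = 2_000_000_000  # hypothetical size of a billion-row table
nodes = 8             # hypothetical cluster size

for style in ("KEY", "EVEN", "ALL"):
    print(f"{style:4} -> {stored_rows(rows, nodes, style):,} rows stored")
```

With these assumed numbers, ALL distribution stores eight times as many rows as KEY or EVEN, and every load or update has to be applied to each copy.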

Option D is incorrect - KEY distribution distributes the rows according to the values in one column.

With the two tables distributed on different keys, matching rows land on different slices, so every join copies a lot of data between nodes. This is not the right approach.

https://docs.aws.amazon.com/redshift/latest/dg/tutorial-tuning-tables-distribution.html
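To see how different keys break co-location, the sketch below places sales rows by a hypothetical `order_id` while customer rows are placed by `cust_id`. The modulo function is a stand-in for Redshift's internal hash, and all values are made up for illustration.

```python
NUM_SLICES = 4  # hypothetical slice count


def slice_for(key_value, num_slices=NUM_SLICES):
    """Stand-in for Redshift's internal hash: map a key value to a slice."""
    return key_value % num_slices


# Hypothetical sales rows as (order_id, cust_id).
sales_fct = [
    (100, 1), (101, 1), (102, 3), (103, 5),
    (104, 8), (105, 8), (106, 2), (107, 4),
]

# SALES_FCT is placed by order_id, CUST_DIM by cust_id.
rows_to_move = sum(
    1
    for order_id, cust_id in sales_fct
    if slice_for(order_id) != slice_for(cust_id)
)
print(f"{rows_to_move} of {len(sales_fct)} sales rows must move to join")
```

Any row whose two key values map to different slices has to be redistributed or broadcast before the join can run, which is the data copy the explanation above refers to.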

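Option E is incorrect - When DISTSTYLE is not specified, Redshift uses AUTO distribution and assigns a distribution style based on the size of the table.

This leaves co-location of the two tables to Redshift instead of designing for it, so it is not the right approach for this solution.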

https://docs.aws.amazon.com/redshift/latest/dg/c_best-practices-best-dist-key.html

The best distribution style for Redshift tables depends on the specific use case and query patterns. In this scenario, the sales queries involve a subset of the customer dimension. Therefore, it is advisable to choose a distribution style that co-locates the sales and customer data to minimize network data transfer and improve query performance.

The options given are:

A. DISTRIBUTE SALES_FCT and CUST_DIM on SAME KEY with KEY DISTRIBUTION

B. DISTRIBUTE SALES_FCT and CUST_DIM on SAME KEY with EVEN DISTRIBUTION

C. DISTRIBUTE SALES_FCT and CUST_DIM on SAME KEY with ALL DISTRIBUTION

D. DISTRIBUTE SALES_FCT and CUST_DIM on DIFFERENT KEYS with KEY DISTRIBUTION

E. DISTRIBUTE SALES_FCT and CUST_DIM on SAME KEY by not specifying DISTSTYLE.

Option A: DISTRIBUTE SALES_FCT and CUST_DIM on SAME KEY with KEY DISTRIBUTION

This option distributes both the sales fact table and the customer dimension table on the same key column, so that a customer's rows and the matching sales rows are stored on the same slice. It is suitable when the customer dimension table is frequently joined with the sales fact table, which is the case here. One caveat is that CUST_DIM is a Type 2 slowly changing dimension and can hold many historical versions of each customer, so the chosen key column should be checked for skew.
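As a rough illustration of what Option A could look like in DDL, the sketch below builds two CREATE TABLE statements that share a distribution key. The `DISTSTYLE KEY` and `DISTKEY` clauses are Redshift syntax; the column names and types are assumptions, since the question does not give the table definitions.

```python
def create_table_ddl(table, columns, distkey):
    """Build a CREATE TABLE statement that uses KEY distribution."""
    column_sql = ",\n    ".join(f"{name} {sql_type}" for name, sql_type in columns)
    return (
        f"CREATE TABLE {table} (\n    {column_sql}\n)\n"
        f"DISTSTYLE KEY\nDISTKEY ({distkey});"
    )


# Hypothetical columns; only the shared distribution key matters here.
sales_ddl = create_table_ddl(
    "sales_fct",
    [("sale_id", "BIGINT"), ("cust_id", "BIGINT"), ("amount", "DECIMAL(12,2)")],
    distkey="cust_id",
)
cust_ddl = create_table_ddl(
    "cust_dim",
    [("cust_id", "BIGINT"), ("cust_name", "VARCHAR(100)")],
    distkey="cust_id",
)

print(sales_ddl)
print(cust_ddl)
```

A real Type 2 dimension would also carry versioning columns; they are left out to keep the focus on the shared key.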

Option B: DISTRIBUTE SALES_FCT and CUST_DIM on SAME KEY with EVEN DISTRIBUTION

This option spreads the rows of both tables across the slices in round-robin fashion. EVEN distribution ignores column values, so choosing a "same key" has no effect and matching sales and customer rows are not co-located. It is suitable when a table does not participate in joins. It is not a good fit for this scenario, since the two billion-row tables are joined by most of the sales queries.

Option C: DISTRIBUTE SALES_FCT and CUST_DIM on SAME KEY with ALL DISTRIBUTION

This option replicates both the sales and customer data on all the nodes, so that each node has a full copy of the data. It is suitable for small, rarely updated tables that are joined often. It is not suitable for this scenario, since both tables have billions of rows and replication would multiply the storage and load cost by the number of nodes.

Option D: DISTRIBUTE SALES_FCT and CUST_DIM on DIFFERENT KEYS with KEY DISTRIBUTION

This option distributes the sales and customer data on different key columns, so a sales row and its matching customer row generally end up on different slices. Each table may still be well placed for joins to other tables on its own key, but the join between SALES_FCT and CUST_DIM requires rows to be redistributed at query time. It is not a good option for this scenario, since that join is the one most sales queries use.

Option E: DISTRIBUTE SALES_FCT and CUST_DIM on SAME KEY by not specifying DISTSTYLE.

This option lets Redshift choose the distribution style through its default AUTO behaviour, which is based on the table size. It can be acceptable when the join pattern is not known in advance. It is not the best option for this scenario, since the join between the billion-row SALES_FCT and CUST_DIM tables is known up front and co-location should be designed in rather than left to the default.

Based on the above analysis, Option A is the most suitable option for this scenario since it co-locates the sales and customer data on the same key column. However, it is essential to note that the distribution style depends on the specific use case and query patterns. Therefore, it is crucial to test different distribution styles and measure their performance before implementing them in a production environment.
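One simple measurement to include when testing a candidate distribution key is row skew: the ratio between the fullest and the emptiest slice. The sketch below computes it for hypothetical key values, again using modulo as a stand-in for Redshift's hash; Redshift reports a comparable ratio as `skew_rows` in the `SVV_TABLE_INFO` system view.

```python
from collections import Counter


def skew_ratio(key_values, num_slices):
    """Ratio of rows in the fullest slice to rows in the emptiest slice."""
    counts = Counter(value % num_slices for value in key_values)
    per_slice = [counts.get(s, 0) for s in range(num_slices)]
    if min(per_slice) == 0:
        return float("inf")  # at least one slice received no rows
    return max(per_slice) / min(per_slice)


# Hypothetical key values: one uniform column, one dominated by a single value.
uniform_keys = list(range(1000))
skewed_keys = [1] * 900 + list(range(100))

print("uniform:", skew_ratio(uniform_keys, 4))
print("skewed: ", skew_ratio(skewed_keys, 4))
```

A ratio close to 1 means the slices share the work evenly; a large ratio means one slice holds most of the rows and is likely to become the bottleneck for queries on that table.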