Safest Approach for Distributing LOCATION_DIM Table

Question

Hymutabs Ltd (Hymutabs) is a global environmental solutions company operating in Asia Pacific, the Middle East, Africa and the Americas.

It maintains more than 10 exploration labs around the world, including a knowledge centre, an "innovative process development centre" in Singapore, a materials and membrane products development centre as well as advanced machining, prototyping and industrial design functions. Hymutabs hosts their existing enterprise infrastructure on AWS and runs multiple applications to address the product life cycle management.

The datasets are available in Aurora, RDS and S3 in file format.

Hymutabs Management team is interested in building analytics around product life cycle and advanced machining, prototyping and other functions. The IT team proposed Redshift to fulfill the EDW and analytics requirements.

They adopt the modeling approaches laid out by Bill Inmon and Ralph Kimball to design the solution efficiently.

The team understands that the data loaded into Redshift will be in terabytes. It has identified multiple massive dimensions, facts and summaries of millions of records, and is working on establishing best practices to address the design concerns. There are 6 tables that they are currently working on:

ORDER_FCT is a fact table with billions of rows related to orders.

SALES_FCT is a fact table with billions of rows related to sales transactions. This table is specifically used to generate EOD (End of Day), EOW (End of Week) and EOM (End of Month) reports, and also for sales queries.

CUST_DIM is a dimension table with billions of rows related to customers. It is a Type 2 dimension table.

PART_DIM is a part dimension table with billions of records that defines the materials that were ordered.

DATE_DIM is a dimension table.

SUPPLIER_DIM holds information about the suppliers that Hymutabs works with. Hymutabs has a very limited number of suppliers.

LOCATION_DIM is a newly identified table. It has around 2.8 million rows, and its size increases by 4% every month.

The administrator has left the company and is not available for design meetings.

There is an urgent need to deploy LOCATION_DIM.

Because it is a new table with no known workload requirements, the team does not know which approach to take.

What is the safest approach for distribution? Select 1 option.

Answers

Explanations

A. B. C. D.

Answer : D.

Option A is incorrect - KEY distribution distributes the rows according to the values in one column.

This works well when the tables being joined are distributed on the same join key, but no join key can be chosen here because the workload is unknown.

https://docs.aws.amazon.com/redshift/latest/dg/tutorial-tuning-tables-distribution.html

Option B is incorrect - EVEN distribution distributes the rows across the slices in a round-robin fashion, regardless of the values in any particular column.

EVEN distribution is appropriate when a table does not participate in joins, which is not known for LOCATION_DIM.

https://docs.aws.amazon.com/redshift/latest/dg/tutorial-tuning-tables-distribution.html

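Option C is incorrect - ALL distribution places a copy of the entire table on every compute node.

This multiplies the storage required by the number of nodes and makes loading and updating the table slower, so it is appropriate only for small, slowly changing tables.

LOCATION_DIM grows by 4% every month, so fixing it to ALL is not a safe design choice.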

https://docs.aws.amazon.com/redshift/latest/dg/tutorial-tuning-tables-distribution.html

Option D is correct - when the administrator is not available and the workload is unknown, it is safest to specify no distribution style and let Redshift handle the distribution. When no distribution style is specified, Redshift uses AUTO distribution and chooses a style based on the size of the table.

https://docs.aws.amazon.com/redshift/latest/dg/c_best-practices-best-dist-key.html
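To make option D concrete, here is a minimal Python sketch that builds the CREATE TABLE statement with and without a distribution clause. The helper name `create_table_ddl` and the column names are hypothetical, since the question does not give the table's schema; only the DISTSTYLE and DISTKEY clauses are Redshift syntax.

```python
def create_table_ddl(table, columns, diststyle=None, distkey=None):
    """Build a Redshift CREATE TABLE statement as a string (illustrative helper)."""
    cols = ",\n".join(f"    {name} {sqltype}" for name, sqltype in columns)
    ddl = f"CREATE TABLE {table} (\n{cols}\n)"
    if diststyle is not None:
        ddl += f"\nDISTSTYLE {diststyle}"
    if distkey is not None:
        ddl += f"\nDISTKEY ({distkey})"
    return ddl + ";"


# Hypothetical columns: the real LOCATION_DIM schema is not given in the question.
location_columns = [
    ("location_id", "BIGINT"),
    ("city", "VARCHAR(100)"),
    ("country", "VARCHAR(100)"),
]

# Option D: no distribution clause at all, so Redshift falls back to AUTO.
ddl_default = create_table_ddl("location_dim", location_columns)

# Equivalent in effect, but with the choice spelled out.
ddl_auto = create_table_ddl("location_dim", location_columns, diststyle="AUTO")

# Option A would look like this and requires committing to a join column up front.
ddl_key = create_table_ddl(
    "location_dim", location_columns, diststyle="KEY", distkey="location_id"
)

print(ddl_default)
```

The first statement carries no distribution clause, which is the form option D describes.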

When designing a Redshift cluster, the distribution style of tables is an important factor to consider as it affects the performance and scalability of the system. The distribution style determines how the data is distributed across the nodes of the cluster for parallel processing.

There are four distribution styles available in Amazon Redshift: AUTO, EVEN, KEY, and ALL. AUTO is the default when no style is specified.

KEY distribution is used when a table is frequently joined with another table on a specific column. The rows are distributed according to the values in that column, so all rows with the same value are stored on the same slice. When both tables are distributed on the join column, matching rows are collocated and the join can be performed locally without moving data between nodes. However, this approach can result in data skew and hotspots if the distribution key is poorly chosen, and it requires knowing the join patterns in advance.
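The skew risk of KEY distribution can be illustrated with a small simulation. This is only a sketch: it uses CRC32 as a stand-in hash, which is an assumption and not Redshift's actual hashing, and the slice count and sample values are made up for illustration.

```python
import zlib
from collections import Counter


def key_distribute(values, num_slices):
    """Assign each row to a slice by hashing its distribution key value."""
    counts = Counter({s: 0 for s in range(num_slices)})
    for v in values:
        # CRC32 is a stand-in; Redshift's real hash function is not modeled here.
        counts[zlib.crc32(str(v).encode()) % num_slices] += 1
    return counts


NUM_SLICES = 8  # assumed cluster size, for illustration only

# A low-cardinality key (three hypothetical country codes) can fill at most 3 slices.
low_cardinality = ["SG", "AE", "US"] * 1000
low_counts = key_distribute(low_cardinality, NUM_SLICES)

# A high-cardinality key (hypothetical unique ids) has many more values to spread.
high_cardinality = list(range(3000))
high_counts = key_distribute(high_cardinality, NUM_SLICES)

used_low = sum(1 for c in low_counts.values() if c > 0)
used_high = sum(1 for c in high_counts.values() if c > 0)
print("slices used with low-cardinality key:", used_low)
print("slices used with high-cardinality key:", used_high)
```

With only three distinct key values, at least five of the eight slices hold no rows at all, which is the hotspot situation described above.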

EVEN distribution is used when the table is not joined frequently with another table or when there is no clear distribution key. In this case, the rows are distributed across the slices in a round-robin fashion, regardless of their values, and each slice processes a subset of the data. This approach ensures that the workload is evenly distributed and can scale well as the cluster grows, but it may not be optimal for join operations because matching rows are not collocated.
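A round-robin assignment can be sketched in a few lines of Python. The slice counts below are assumptions for illustration; the row count is the LOCATION_DIM figure from the question.

```python
def even_distribute(num_rows, num_slices):
    """Return the row count per slice when rows are dealt out round-robin."""
    base, extra = divmod(num_rows, num_slices)
    # The first `extra` slices receive one more row than the others.
    return [base + (1 if s < extra else 0) for s in range(num_slices)]


LOCATION_DIM_ROWS = 2_800_000  # from the question

counts_8 = even_distribute(LOCATION_DIM_ROWS, 8)  # 8 slices is an assumption
counts_6 = even_distribute(LOCATION_DIM_ROWS, 6)  # 6 slices is an assumption

print(counts_8)
print(counts_6)
```

Whatever the slice count, the fullest and emptiest slices differ by at most one row, so skew cannot occur; the trade-off is that the placement ignores the values being joined on.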

ALL distribution is used for small reference tables that are frequently joined with other tables. In this case, the table is replicated on all nodes of the cluster, so the join operation can be performed locally on each node. This approach is suitable for small tables that do not change frequently but may not be scalable for large tables.
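The cost of replication can be estimated with simple arithmetic. The sketch below projects the LOCATION_DIM row count from the figures in the question (2.8 million rows, 4% monthly growth) and multiplies it by a node count. The node count of 4 is an assumption, since the question does not state the cluster size.

```python
def projected_rows(initial_rows, monthly_growth, months):
    """Compound the row count by a fixed monthly growth rate."""
    return round(initial_rows * (1 + monthly_growth) ** months)


def all_style_row_copies(rows, num_nodes):
    """Total stored row copies under ALL: one full copy of the table per node."""
    return rows * num_nodes


INITIAL_ROWS = 2_800_000  # from the question
MONTHLY_GROWTH = 0.04     # from the question
NUM_NODES = 4             # assumed; the cluster size is not given

for months in (0, 12, 24, 36):
    rows = projected_rows(INITIAL_ROWS, MONTHLY_GROWTH, months)
    copies = all_style_row_copies(rows, NUM_NODES)
    print(f"after {months:2d} months: {rows:,} rows, {copies:,} stored copies")
```

At 4% per month the table reaches roughly 4.5 million rows after a year, and under ALL every one of those rows is stored and loaded once per node.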

Given the information provided in the question, the safest approach for the LOCATION_DIM table is to create it without specifying a distribution style, which is option D. In that case Redshift applies AUTO distribution and chooses the style based on the size of the table: a small table is initially assigned ALL, and as the table grows Redshift may change it to EVEN or KEY. This suits a table that has around 2.8 million rows today and grows by 4% every month.

Explicitly choosing KEY or EVEN is risky because the workload is unknown: no join column has been identified for a distribution key, and it is not known whether the table participates in joins at all. Explicitly choosing ALL fixes a style that becomes more expensive to store and load as the table keeps growing.

Creating the table without a distribution style is a valid option. Every Redshift table has a distribution style, but when none is specified in CREATE TABLE, Redshift uses AUTO.