Amazon Redshift Spectrum Best Practices | Query Performance, Cost, and Security Optimization

Improving Query Performance and Security for Amazon Redshift Spectrum

Question

Parson Fortunes Ltd is an Asia-based department store operator with an extensive network of 131 stores, spanning approximately 4.1 million square meters of retail space across cities in India, China, Vietnam, Indonesia, and Myanmar.

Parson has built a VPC to host its entire enterprise infrastructure in the cloud.

Parson has large data assets, around 20 TB of structured data and 45 TB of unstructured data, and is planning to host its data warehouse on AWS and its unstructured data storage on S3.

Files sent from its on-premises data center are also stored in S3 buckets.

Parson's IT team is well aware of the scalability and performance capabilities of AWS services.

Parson hosts its web applications, databases, and the Redshift-based data warehouse in the VPC. The structured, semi-structured, and unstructured data are stored across various S3 buckets. This data can be joined and queried along with the data in Redshift using Redshift Spectrum.

Which of the following best practices can be applied to improve query performance, overall costs, and security? Select 5 options.

Options

A. Break large files into many smaller files of 64 MB or larger.

B. Group smaller files into a single large file of 512 MB or larger.

C. Store files for a table in the same folder.

D. Store files for a table in different sub-folders.

E. Keep all the files about the same size.

F. Compress data files using gzip, Snappy, or bzip2.

G. Use server-side encryption (SSE-S3) with an AES-256 encryption key managed by Amazon S3, or keys managed by AWS Key Management Service (SSE-KMS).

H. Use S3 client-side encryption.

Answer: A, C, E, F, G

Explanations

Option A is correct. Break large files into many smaller files; AWS recommends using file sizes of 64 MB or larger so that Redshift Spectrum can process them in parallel.

https://docs.aws.amazon.com/redshift/latest/dg/c-spectrum-data-files.html
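
As an illustration, here is a minimal Python sketch of this practice using boto3; the bucket, folder, and file names are hypothetical. It splits a large delimited file into roughly 64 MB parts on row boundaries and uploads every part into the same S3 folder.

```python
import os
import boto3

# Hypothetical bucket and folder names -- adjust to your environment.
BUCKET = "parson-spectrum-data"
PREFIX = "sales/"                   # all files for one table stay in this folder
CHUNK_BYTES = 64 * 1024 * 1024      # target part size of ~64 MB

s3 = boto3.client("s3")

def split_and_upload(local_path: str) -> None:
    """Split a large delimited file into ~64 MB parts and upload each part
    into the table's folder so Spectrum can scan the parts in parallel."""
    part = 0
    with open(local_path, "rb") as src:
        while True:
            chunk = src.read(CHUNK_BYTES)
            if not chunk:
                break
            # Extend to the next newline so no row is split across two files.
            if not chunk.endswith(b"\n"):
                chunk += src.readline()
            key = f"{PREFIX}{os.path.basename(local_path)}.part{part:04d}"
            s3.put_object(Bucket=BUCKET, Key=key, Body=chunk)
            part += 1

split_and_upload("sales_2023.csv")   # hypothetical local file
```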

Option B is incorrect. This contradicts the guidance: break large files into many smaller files of 64 MB or larger rather than grouping smaller files into a single large one, so that the scan workload can be parallelized.

https://docs.aws.amazon.com/redshift/latest/dg/c-spectrum-data-files.html

Option C is correct. Store the files for a table in the same folder; the external table definition points at that folder, so queries pick up all of the table's files.

https://docs.aws.amazon.com/redshift/latest/dg/c-spectrum-data-files.html

Option D is incorrect. The guidance is to store the files for a table in the same folder, not to spread them across different sub-folders.

https://docs.aws.amazon.com/redshift/latest/dg/c-spectrum-data-files.html

Option E is correct. Keep all the files about the same size.

If some files are much larger than others, Redshift Spectrum can't distribute the workload evenly.

https://docs.aws.amazon.com/redshift/latest/dg/c-spectrum-data-files.html
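
A quick way to check this in practice is sketched below, again with a hypothetical bucket and prefix: it lists the objects under a table's prefix and flags any file far larger than the median, since such outliers would leave some of Spectrum's parallel requests with far more data to scan than others.

```python
import statistics
import boto3

BUCKET = "parson-spectrum-data"   # hypothetical bucket
PREFIX = "sales/"                 # hypothetical table folder

s3 = boto3.client("s3")
paginator = s3.get_paginator("list_objects_v2")

# Collect the size of every object under the table's prefix.
sizes = {}
for page in paginator.paginate(Bucket=BUCKET, Prefix=PREFIX):
    for obj in page.get("Contents", []):
        sizes[obj["Key"]] = obj["Size"]

if sizes:
    median = statistics.median(sizes.values())
    # The 4x threshold is an arbitrary illustration, not an AWS-documented limit.
    for key, size in sizes.items():
        if size > 4 * median:
            print(f"skewed: {key} is {size / 2**20:.0f} MiB "
                  f"(median is {median / 2**20:.0f} MiB)")
```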

Option F is correct. To reduce storage space, improve performance, and minimize costs, we strongly recommend compressing your data files.

Redshift Spectrum recognizes file compression types based on the file extension.

Redshift Spectrum supports the following compression types and extensions:

· gzip (.gz)

· Snappy (.snappy)

· bzip2 (.bz2)

https://docs.aws.amazon.com/redshift/latest/dg/c-spectrum-data-files.html
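
For example, the following sketch (hypothetical bucket and file names) gzips a local data file and uploads it with a .gz suffix, which is how Spectrum recognizes the compression type.

```python
import gzip
import os
import shutil
import boto3

BUCKET = "parson-spectrum-data"   # hypothetical bucket
PREFIX = "sales/"                 # hypothetical table folder

s3 = boto3.client("s3")

def compress_and_upload(local_path: str) -> None:
    """gzip a data file, then upload it; the .gz extension tells
    Redshift Spectrum which decompression codec to apply."""
    gz_path = local_path + ".gz"
    with open(local_path, "rb") as src, gzip.open(gz_path, "wb") as dst:
        shutil.copyfileobj(src, dst)
    s3.upload_file(gz_path, BUCKET, PREFIX + os.path.basename(gz_path))

compress_and_upload("orders_2023.csv")   # hypothetical local file
```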

Option G is correct. Redshift Spectrum transparently decrypts data files that are encrypted using the following encryption options:

· Server-side encryption (SSE-S3) using an AES-256 encryption key managed by Amazon S3.

· Server-side encryption with keys managed by AWS Key Management Service (SSE-KMS).

https://docs.aws.amazon.com/redshift/latest/dg/c-spectrum-data-files.html
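
As an illustration, both options can be requested at upload time with boto3; the bucket, object keys, and KMS key alias below are hypothetical.

```python
import boto3

BUCKET = "parson-spectrum-data"   # hypothetical bucket
s3 = boto3.client("s3")

# SSE-S3: Amazon S3 manages the AES-256 key.
with open("orders_2023.csv.gz", "rb") as f:
    s3.put_object(
        Bucket=BUCKET,
        Key="sales/orders_2023.csv.gz",
        Body=f,
        ServerSideEncryption="AES256",
    )

# SSE-KMS: the key is managed by AWS KMS; pass its ID or alias.
with open("returns_2023.csv.gz", "rb") as f:
    s3.put_object(
        Bucket=BUCKET,
        Key="sales/returns_2023.csv.gz",
        Body=f,
        ServerSideEncryption="aws:kms",
        SSEKMSKeyId="alias/parson-spectrum",   # hypothetical key alias
    )
```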

Option H is incorrect. Redshift Spectrum doesn't support Amazon S3 client-side encryption.

https://docs.aws.amazon.com/redshift/latest/dg/c-spectrum-data-files.html

Parson Fortunes Ltd has built a VPC to host its entire enterprise infrastructure in the cloud, and it plans to host its data warehouse on AWS and its unstructured data storage on S3. The company holds a large amount of data, around 20 TB of structured data and 45 TB of unstructured data. Files sent from its on-premises data center are also stored in S3 buckets.

With query performance, overall costs, and security in mind, the options break down as follows:

A. Break large files into many smaller files, of 64 MB or larger: many files of this size let Redshift Spectrum distribute the scan workload evenly across its parallel requests, and smaller objects are faster to read individually.

B. Group smaller files into a single large file, of 512 MB or larger: not recommended. While fewer files mean less metadata for Redshift Spectrum to read, consolidating data into one large file works against parallel scanning; the guidance favors many similarly sized files of 64 MB or larger instead.

C. Store files for a table in the same folder: an external table's LOCATION points at a single S3 folder, so keeping all of a table's files together ensures that queries scan exactly the intended objects (see the sketch after this list).

D. Store files for a table into different sub-folders: not recommended here. Sub-folders only help when they are explicitly registered as partitions of the external table; otherwise, the files for a table should stay in the single folder the table points to.

E. Keep all the files about the same size: if some files are much larger than others, Redshift Spectrum can't distribute the workload evenly. Files of roughly equal size keep each parallel request equally busy.

F. To reduce storage space, improve performance, and minimize costs, compress data files using gzip, Snappy, or bzip2: compression shrinks both the data stored in S3 and the amount of data scanned per query, which lowers storage and Spectrum scan costs while improving query performance.

G. Use server-side encryption (SSE-S3) with an AES-256 encryption key managed by Amazon S3, or keys managed by AWS Key Management Service (SSE-KMS): this protects the data at rest in S3 from unauthorized access, and Redshift Spectrum decrypts both options transparently when it reads the files.

H. Use S3 client-side encryption: with client-side encryption, the data is encrypted before it is sent to S3, but Redshift Spectrum does not support Amazon S3 client-side encryption, so Spectrum could not read files protected this way. This option is therefore not applicable.
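
To connect these practices back to Spectrum itself, here is a minimal sketch that registers an external table over the table folder using the Redshift Data API via boto3. The cluster identifier, database, user, and column definitions are hypothetical, and the external schema "spectrum" is assumed to already exist (created with CREATE EXTERNAL SCHEMA).

```python
import boto3

rsd = boto3.client("redshift-data")

# LOCATION points at the single folder that holds every file for the table,
# per best practice C. Column names and types are illustrative only.
ddl = """
CREATE EXTERNAL TABLE spectrum.sales (
    order_id   BIGINT,
    order_date DATE,
    amount     DECIMAL(12,2)
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE
LOCATION 's3://parson-spectrum-data/sales/'
"""

rsd.execute_statement(
    ClusterIdentifier="parson-dw",   # hypothetical cluster name
    Database="analytics",            # hypothetical database
    DbUser="admin",                  # hypothetical database user
    Sql=ddl,
)
```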

In summary, by following best practices A, C, E, F, and G, Parson Fortunes Ltd can improve query performance, minimize costs, and secure its data stored in S3.