Implementing an Azure Databricks Workspace: Clusters for Data Engineers, Data Scientists, and Jobs

Creating Azure Databricks Clusters for Different Workloads

Question

Note: This question is part of a series of questions that present the same scenario. Each question in the series contains a unique solution that might meet the stated goals. Some question sets might have more than one correct solution, while others might not have a correct solution.

After you answer a question in this section, you will NOT be able to return to it. As a result, these questions will not appear in the review screen.

You plan to create an Azure Databricks workspace that has a tiered structure. The workspace will contain the following three workloads:

-> A workload for data engineers who will use Python and SQL

-> A workload for jobs that will run notebooks that use Python, Scala, and SQL

-> A workload that data scientists will use to perform ad hoc analysis in Scala and R

The enterprise architecture team at your company identifies the following standards for Databricks environments:

-> The data engineers must share a cluster.

-> The job cluster will be managed by using a request process whereby data scientists and data engineers provide packaged notebooks for deployment to the cluster.

-> All the data scientists must be assigned their own cluster that terminates automatically after 120 minutes of inactivity. Currently, there are three data scientists.

You need to create the Databricks clusters for the workloads.

Solution: You create a Standard cluster for each data scientist, a High Concurrency cluster for the data engineers, and a Standard cluster for the jobs.

Does this meet the goal?

Answers

A. Yes

B. No

Explanations

A

A Standard cluster is the right choice for the jobs. The job notebooks use Scala, which a High Concurrency cluster does not support.

Note:

Standard clusters are recommended for a single user. A Standard cluster can run workloads developed in any language: Python, R, Scala, and SQL.

A High Concurrency cluster is a managed cloud resource. Its key benefits are Apache Spark-native fine-grained sharing for maximum resource utilization and minimum query latencies. High Concurrency clusters can run workloads developed in SQL, Python, and R, but not Scala.

https://docs.azuredatabricks.net/clusters/configure.html
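The language constraints in the note above can be checked mechanically. The following is a minimal sketch, not an official tool: it assumes the language support described in the legacy cluster-mode documentation (Standard runs Python, R, Scala, and SQL; High Concurrency runs Python, R, and SQL but not Scala). The names `SUPPORTED_LANGUAGES`, `WORKLOADS`, `unsupported_languages`, and `solution_is_valid` are hypothetical and exist only for this illustration.

```python
# Languages each cluster mode can run.
# Assumption: taken from the legacy cluster-mode documentation.
SUPPORTED_LANGUAGES = {
    "Standard": {"Python", "R", "Scala", "SQL"},
    "High Concurrency": {"Python", "R", "SQL"},  # Scala is not supported
}

# Workloads from the scenario, paired with the cluster mode the solution proposes.
WORKLOADS = {
    "data engineers": ({"Python", "SQL"}, "High Concurrency"),
    "jobs": ({"Python", "Scala", "SQL"}, "Standard"),
    "data scientists": ({"Scala", "R"}, "Standard"),
}


def unsupported_languages(required, mode):
    """Return the required languages that the given cluster mode cannot run."""
    return required - SUPPORTED_LANGUAGES[mode]


def solution_is_valid(workloads):
    """True when every workload's languages are supported by its cluster mode."""
    return all(
        not unsupported_languages(required, mode)
        for required, mode in workloads.values()
    )


if __name__ == "__main__":
    for name, (required, mode) in WORKLOADS.items():
        missing = sorted(unsupported_languages(required, mode))
        print(f"{name}: {mode}, unsupported languages: {missing}")
```

Under these assumptions, moving the jobs workload to a High Concurrency cluster would leave Scala unsupported, which is why the language list of each workload matters when choosing the cluster mode.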

The proposed solution meets the requirements specified in the scenario. The following is a breakdown of the proposed solution and how it meets the specified requirements:

  1. Workload for data engineers who will use Python and SQL: The proposed solution uses a High Concurrency cluster for the data engineers. This satisfies the requirement that the data engineers share a cluster: a High Concurrency cluster is designed for multiple concurrent users, and it supports both Python and SQL.

  2. Workload for jobs that will run notebooks that use Python, Scala, and SQL: The proposed solution uses a Standard cluster for the jobs. The job notebooks use Scala, which a High Concurrency cluster does not support, so a Standard cluster, which can run Python, R, Scala, and SQL, is the appropriate choice. The request process for deploying packaged notebooks governs how work reaches the cluster and does not call for a different cluster type.

  3. Workload that data scientists will use to perform ad hoc analysis in Scala and R: The proposed solution creates a Standard cluster for each data scientist. This satisfies the requirement that every data scientist be assigned their own cluster that terminates automatically after 120 minutes of inactivity. Standard clusters are recommended for a single user and support both Scala and R, so each of the three data scientists gets a dedicated environment. Automatic termination releases a cluster's resources when it is idle.
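The breakdown above can be sketched as a set of cluster definitions. This is an illustrative sketch only: it builds request bodies shaped like those of the Databricks Clusters API and sends nothing. The `autotermination_minutes` field carries the 120-minute requirement. The `spark_conf` keys and the `ResourceClass` tag used for the High Concurrency cluster are assumed from the legacy cluster-mode configuration and should be verified against current documentation. The cluster names, user names, worker count, and the `<runtime-version>` and `<node-type>` placeholders are hypothetical.

```python
import json


def cluster_spec(name, autotermination_minutes=None, high_concurrency=False):
    """Build a cluster definition dict (no API call is made)."""
    spec = {
        "cluster_name": name,
        "spark_version": "<runtime-version>",  # placeholder, fill in for your workspace
        "node_type_id": "<node-type>",         # placeholder, fill in for your workspace
        "num_workers": 2,                       # arbitrary size for illustration
    }
    if autotermination_minutes is not None:
        # Terminate the cluster after this many minutes of inactivity.
        spec["autotermination_minutes"] = autotermination_minutes
    if high_concurrency:
        # Assumption: legacy High Concurrency settings; note that Scala is absent.
        spec["spark_conf"] = {
            "spark.databricks.cluster.profile": "serverless",
            "spark.databricks.repl.allowedLanguages": "sql,python,r",
        }
        spec["custom_tags"] = {"ResourceClass": "Serverless"}
    return spec


# Hypothetical user names for the three data scientists in the scenario.
DATA_SCIENTISTS = ["scientist1", "scientist2", "scientist3"]

# One Standard cluster per data scientist, each with 120-minute auto-termination.
scientist_specs = [
    cluster_spec(f"adhoc-{user}", autotermination_minutes=120)
    for user in DATA_SCIENTISTS
]
# One shared High Concurrency cluster for the data engineers.
engineer_spec = cluster_spec("data-engineering-shared", high_concurrency=True)
# One Standard cluster for the jobs, because the job notebooks use Scala.
jobs_spec = cluster_spec("jobs")

if __name__ == "__main__":
    print(json.dumps(scientist_specs + [engineer_spec, jobs_spec], indent=2))
```

Keeping the definitions as plain dictionaries makes it easy to review them before they are submitted through whatever deployment process the workspace uses.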

In conclusion, the proposed solution meets all the requirements specified in the scenario, and is a suitable solution for creating a tiered structure of Azure Databricks clusters for the specified workloads. Therefore, the correct answer is A. Yes.