Azure Databricks Cluster Configuration for Different Workloads

Create Databricks Clusters for Data Engineers, Jobs, and Data Scientists

Question

Note: This question is part of a series of questions that present the same scenario. Each question in the series contains a unique solution that might meet the stated goals. Some question sets might have more than one correct solution, while others might not have a correct solution.

After you answer a question in this section, you will NOT be able to return to it. As a result, these questions will not appear in the review screen.

You plan to create an Azure Databricks workspace that has a tiered structure. The workspace will contain the following three workloads:

-> A workload for data engineers who will use Python and SQL

-> A workload for jobs that will run notebooks that use Python, Scala, and SQL

-> A workload that data scientists will use to perform ad hoc analysis in Scala and R

The enterprise architecture team at your company identifies the following standards for Databricks environments:

-> The data engineers must share a cluster.

-> The job cluster will be managed by using a request process whereby data scientists and data engineers provide packaged notebooks for deployment to the cluster.

-> All the data scientists must be assigned their own cluster that terminates automatically after 120 minutes of inactivity. Currently, there are three data scientists.

You need to create the Databricks clusters for the workloads.

Solution: You create a Standard cluster for each data scientist, a High Concurrency cluster for the data engineers, and a High Concurrency cluster for the jobs.

Does this meet the goal?

Answers

A. Yes

B. No

Correct answer: A

Explanation

We need a High Concurrency cluster for the data engineers, who must share a cluster, and for the jobs, while each data scientist gets their own Standard cluster.

Note:

Standard clusters are recommended for a single user. Standard clusters can run workloads developed in any language: Python, R, Scala, and SQL.

A High Concurrency cluster is a managed cloud resource. The key benefit of High Concurrency clusters is that they provide Apache Spark-native fine-grained sharing for maximum resource utilization and minimum query latencies.

https://docs.azuredatabricks.net/clusters/configure.html

The proposed solution of creating a Standard cluster for each data scientist, a High Concurrency cluster for the data engineers, and a High Concurrency cluster for the jobs meets the goal stated in the scenario.

The goal is an Azure Databricks workspace with a tiered structure that supports three workloads: data engineers, jobs, and data scientists. The enterprise architecture team's standards, restated briefly: the data engineers must share a cluster; the job cluster is managed through a request process in which data scientists and data engineers submit packaged notebooks for deployment; and each of the three data scientists must have a dedicated cluster that terminates automatically after 120 minutes of inactivity.

Based on these requirements, the proposed solution creates a separate Standard cluster for each of the three data scientists, which satisfies the standard that every data scientist have their own cluster. Standard clusters are recommended for single users and support the Scala and R workloads the data scientists use for ad hoc analysis, and each cluster can be configured to terminate automatically after 120 minutes of inactivity, as sketched below.
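As an illustration only, here is a minimal sketch of provisioning those per-scientist clusters with the Databricks Clusters REST API (POST /api/2.0/clusters/create). The workspace URL, token, user names, node type, and runtime version are placeholder assumptions, not values from the scenario:

    # Sketch only: create one Standard cluster per data scientist via the
    # Databricks Clusters API. URL, token, names, node type, and runtime
    # version are placeholders, not values from the scenario.
    import requests

    WORKSPACE_URL = "https://adb-1234567890123456.7.azuredatabricks.net"  # placeholder
    HEADERS = {"Authorization": "Bearer <personal-access-token>"}          # placeholder

    for scientist in ("ds-user1", "ds-user2", "ds-user3"):  # the three data scientists
        payload = {
            "cluster_name": f"{scientist}-standard",
            "spark_version": "13.3.x-scala2.12",  # example runtime version
            "node_type_id": "Standard_DS3_v2",    # example Azure VM size
            "num_workers": 2,
            # Standard mode: no High Concurrency profile, so Scala and R are available.
            "autotermination_minutes": 120,       # stop after 120 minutes of inactivity
        }
        response = requests.post(
            f"{WORKSPACE_URL}/api/2.0/clusters/create",
            headers=HEADERS,
            json=payload,
        )
        response.raise_for_status()
        print(scientist, response.json()["cluster_id"])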

For the data engineering workload, a High Concurrency cluster is created, which meets the standard that the data engineers share a cluster. High Concurrency clusters are optimized for concurrent use, providing fine-grained resource sharing among multiple simultaneous users, and they cover the engineers' Python and SQL workload; a configuration sketch follows.
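A shared High Concurrency cluster is typically requested through the same endpoint by adding the cluster profile to spark_conf. The keys below follow the legacy High Concurrency configuration; the cluster name, node type, and sizes are assumptions:

    # Sketch only: a shared High Concurrency cluster for the data engineers.
    # Cluster name, node type, sizes, and runtime version are placeholders.
    engineers_cluster = {
        "cluster_name": "data-engineering-shared",
        "spark_version": "13.3.x-scala2.12",             # example runtime version
        "node_type_id": "Standard_DS4_v2",               # example Azure VM size
        "autoscale": {"min_workers": 2, "max_workers": 8},
        "spark_conf": {
            # Historically this profile marked a cluster as High Concurrency.
            "spark.databricks.cluster.profile": "serverless",
            # Limit the shared cluster to the languages this workload needs.
            "spark.databricks.repl.allowedLanguages": "sql,python,r",
        },
        "custom_tags": {"ResourceClass": "Serverless"},
    }
    # POST engineers_cluster to /api/2.0/clusters/create as in the previous sketch.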

Finally, a High Concurrency cluster is created for the jobs workload, which runs notebooks written in Python, Scala, and SQL. This cluster can execute multiple concurrent jobs, and packaged notebooks reach it through the request process described in the standards; a deployment sketch follows.
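To sketch that request-driven deployment, a packaged notebook could be registered as a job targeting the shared jobs cluster through the Jobs API (POST /api/2.1/jobs/create). The job name, notebook path, and cluster ID below are hypothetical:

    # Sketch only: register a packaged notebook as a job on the shared jobs
    # cluster. Job name, notebook path, and cluster ID are hypothetical.
    job_payload = {
        "name": "packaged-notebook-job",
        "tasks": [
            {
                "task_key": "run_notebook",
                "existing_cluster_id": "<jobs-cluster-id>",  # the jobs cluster created above
                "notebook_task": {"notebook_path": "/Deployments/etl_notebook"},
            }
        ],
    }
    # POST job_payload to /api/2.1/jobs/create with the same auth headers.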

Therefore, the proposed solution satisfies the enterprise architecture team's standards for the Databricks environment and meets the goal stated in the scenario. The correct answer is A (Yes).