Cost-Effective ML Pipelines on Azure: Using a Databricks Cluster for the Train Model Step

Designing and Implementing a Data Science Solution on Azure: DP-100 Exam Answer

Question

You are building an ML pipeline using the Azure ML Designer.

Your pipeline consists of the following major modules: Get Data, Split Data, Clean Missing Data, Train Model, Score Model, Evaluate Model.

You want to run your pipeline on an ML Compute Cluster, but for the Train Model step you want to use the massive Databricks Cluster you set up earlier for another project.

You want to solve this task in the most cost-effective way.

Which option should you follow?

Answers

A. Set the ML Compute Cluster as the pipeline's default compute target; Databricks clusters cannot be used in pipelines.

B. Set the Databricks Cluster as the pipeline's default compute target and run the entire pipeline on it.

C. Set the ML Compute Cluster as the pipeline's default compute target, and override the compute target with the Databricks Cluster for the Train Model step only.

D. Create two pipelines: one running the preparatory steps on the Compute Cluster, and another running the training/scoring steps on the Databricks Cluster.

Answer: C.

Option A is incorrect because a default compute target can be set at the pipeline level, and you can still set different targets for individual steps.

Databricks Cluster is a valid option.

Option B is incorrect because, while you can use the Databricks Cluster for the whole pipeline, it is not the most cost-effective way.

You can use different targets for different steps.

Option C is CORRECT because the best way to solve the task is to set the low-cost compute target as the pipeline-level default and change the target only for the steps that require it.

Databricks Cluster is a valid option.

Option D is incorrect because splitting the workflow into two pipelines is unnecessary.

The problem can be solved within a single pipeline by setting the compute target for each step individually.


The best option in this scenario is C, where you set the default compute target for the pipeline to the ML Compute Cluster and set the Databricks Cluster for the training step.

Here's why:

Option A suggests setting the pipeline's default compute target to the ML Compute Cluster and implies that Databricks cannot be used with pipelines. However, this is not entirely true: while a Databricks cluster cannot serve as a pipeline's default compute target, it can be used as the compute target for specific steps in the pipeline.

Option B suggests setting the Databricks Cluster as the default compute target for the pipeline and running the entire pipeline on Databricks. While this is possible, it may not be the most cost-effective solution, as Databricks can be expensive to run. It is also unnecessary to run the entire pipeline on Databricks; only the training step needs to run there.

Option D suggests creating two pipelines: one that runs the preparatory steps on the Compute Cluster, and another that runs the training/scoring tasks on Databricks. While this can work, it is not the most efficient way to solve the problem. It is better to have a single pipeline that handles all the steps in the workflow, including the training step on Databricks.

Therefore, option C is the best choice. By setting the default compute target for the pipeline to the ML Compute Cluster and setting the Databricks Cluster for the training step, you can ensure that the pipeline runs on the most cost-effective compute target while using the Databricks Cluster for the computationally intensive training step. This will help you to optimize your costs while achieving the desired performance.
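In the Designer this per-step override is configured through the UI, but the same pattern can be sketched with the v1 Azure ML Python SDK. The sketch below is illustrative only; the compute names (`cpu-cluster`, `databricks-compute`), script names, and cluster ID are placeholder assumptions, and it requires a configured Azure ML workspace to actually run.

```python
# Sketch (azureml-sdk v1): cheap AmlCompute for preparation, with only
# the training step routed to an attached Databricks cluster.
# All resource names below are placeholders for this example.
from azureml.core import Workspace
from azureml.pipeline.core import Pipeline, PipelineData
from azureml.pipeline.steps import PythonScriptStep, DatabricksStep

ws = Workspace.from_config()

cpu_cluster = ws.compute_targets["cpu-cluster"]        # low-cost AmlCompute
databricks = ws.compute_targets["databricks-compute"]  # attached Databricks

prepped = PipelineData("prepped", datastore=ws.get_default_datastore())

# Preparatory steps (cleaning, splitting) run on the inexpensive cluster.
prep_step = PythonScriptStep(
    name="clean_and_split",
    script_name="prep.py",
    outputs=[prepped],
    compute_target=cpu_cluster,   # the default, low-cost target
    source_directory="scripts",
)

# Only the computationally heavy training step uses Databricks,
# reusing the existing cluster rather than provisioning a new one.
train_step = DatabricksStep(
    name="train_model",
    inputs=[prepped],
    python_script_name="train.py",
    source_directory="scripts",
    compute_target=databricks,            # override for this step only
    existing_cluster_id="<cluster-id>",   # placeholder: stays elided
)

pipeline = Pipeline(workspace=ws, steps=[prep_step, train_step])
```

The key point mirrors option C: each step carries its own `compute_target`, so the expensive Databricks cluster is billed only for the training step while everything else stays on the cheap compute cluster.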