Optimizing Azure ML Pipeline Design: Best Practices


Question

As part of an ML team, your task is to build a workflow that chains several steps of the ML process so that the series of tasks can be automated and reused.

You are going to use the Pipeline feature of the Azure ML stack.

You have planned the main steps of the pipeline, and now you need to decide how to organize the files around your pipeline.

Which two of the following options are true and should be followed as best practices in pipeline design?

Answers

A. Keep all the scripts and dependencies of the pipeline in a single source directory.

B. Keep the files for each step in separate source directories in order to take advantage of data reuse.

C. Keep the scripts and dependencies of the whole pipeline in a single source directory in order to reduce the size of the step snapshots.

D. Keep the scripts and dependencies of each step in separate source directories in order to reduce the size of the step snapshots.

E. Force output regeneration for the steps in a run by setting allow_reuse to False.

Explanations

Answers: B and D.

Option A is incorrect because if all the scripts belonging to the pipeline are kept in a single directory, every change to the content of that directory forces all steps to rerun, i.e. there is no data reuse.

Option B is CORRECT because steps in a pipeline can be configured to reuse results from their previous runs if the step's scripts, dependencies, inputs, etc. are unchanged.

Keeping the files for each step in a separate folder ensures that if the files of one step change, only that step's output data is regenerated, while the outputs of all the other steps remain unchanged and reusable.
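As an illustration, here is a minimal sketch of this layout using the v1 azureml-core SDK. The folder paths, script names, experiment name, and the cpu-cluster compute target are hypothetical; each step points at its own source_directory, so editing one step's folder invalidates only that step's cached results and keeps each step's snapshot small.

```python
from azureml.core import Workspace
from azureml.pipeline.core import Pipeline, PipelineData
from azureml.pipeline.steps import PythonScriptStep

ws = Workspace.from_config()

# Intermediate output passed from the prep step to the training step.
processed_data = PipelineData("processed_data",
                              datastore=ws.get_default_datastore())

# Each step has its own source_directory, so a change to steps/prep
# only invalidates the cached results of the "prepare" step.
prep_step = PythonScriptStep(
    name="prepare",
    script_name="prep.py",            # hypothetical script
    source_directory="steps/prep",    # holds only this step's files
    outputs=[processed_data],
    compute_target="cpu-cluster",     # hypothetical compute target
    allow_reuse=True,                 # default: reuse results if unchanged
)

train_step = PythonScriptStep(
    name="train",
    script_name="train.py",           # hypothetical script
    source_directory="steps/train",
    inputs=[processed_data],
    compute_target="cpu-cluster",
    allow_reuse=True,
)

pipeline = Pipeline(workspace=ws, steps=[prep_step, train_step])
```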

Option C is incorrect because if the scripts and dependencies of the whole pipeline are kept in a single directory, snapshotting the steps takes an unnecessarily large amount of time and storage, since every step's snapshot then includes the files of all the other steps.

Option D is CORRECT because keeping the scripts and dependencies of each step in separate directories helps minimize the resources necessary for snapshotting the steps.

Option E is incorrect because the default setting for allow_reuse is True, which means that the results of a step's previous run are reused as long as the content of its source_directory (and its other inputs) is unchanged.

Since reuse saves time and resources, changing this default behavior is only recommended if there is a specific reason why the results must be regenerated.
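If a step genuinely must rerun every time, for example because it pulls fresh data from an external source, the default can be overridden per step. A minimal sketch, reusing the hypothetical names from the example above:

```python
# Opt out of reuse only for the step that must always rerun.
ingest_step = PythonScriptStep(
    name="ingest",
    script_name="ingest.py",          # hypothetical script
    source_directory="steps/ingest",
    compute_target="cpu-cluster",
    allow_reuse=False,                # force this step to rerun every time
)
```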

Reference:

When designing a pipeline, it's important to think about how to organize the files and dependencies that are required for each step in the pipeline. Here are the best practices for organizing files and dependencies in a pipeline:

  1. Store the scripts and dependencies for each step in a separate source directory in order to take advantage of data reuse. Separating the files by step makes it easy to see which dependencies each step requires, and it lets unchanged steps reuse their previous results: a downstream step can simply reference the data that was generated by an earlier step instead of recomputing it. This saves time and reduces the complexity of your pipeline.

  2. Store the scripts and dependencies of each step in its own source directory in order to reduce the size of the snapshots for the given steps. The snapshot taken for a step contains everything in that step's source directory, so if all the scripts and dependencies of the pipeline live in one shared directory, every step's snapshot includes all of them. Per-step directories keep each snapshot down to only the files that the step actually needs.

Therefore, options B and D are the correct answers for this question.

  3. Finally, note that you can force output regeneration for the steps in a run by setting allow_reuse to False. This is only worthwhile for steps whose results can change even when their scripts and inputs do not, such as a step that ingests data from an external source; for all other steps, reusing cached results saves compute without risking stale outputs. Because the default allow_reuse=True behavior is the recommended one, option E is not a correct answer for this question.
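If you need to regenerate every output for a single run without permanently setting allow_reuse to False on each step, the SDK also lets you request regeneration at submission time. A brief sketch, assuming the pipeline object from the earlier example and a hypothetical experiment name:

```python
from azureml.core import Experiment

# One-off forced rerun: ignore all cached step results for this
# submission only, without changing the steps' allow_reuse settings.
experiment = Experiment(ws, "pipeline-demo")  # hypothetical experiment name
run = experiment.submit(pipeline, regenerate_outputs=True)
```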