As part of an ML team, your task is to build a workflow that chains several steps of the ML process together, automating the series of tasks and enabling reuse.
You are going to use the Pipeline feature of the Azure ML stack.
You have planned the main steps in the pipeline, and now you need to decide how best to organize your files around it.
Which two of the following options are true and should be used as best practice in pipeline design?
A. Store scripts and dependencies of the whole pipeline in a single source directory in order to take advantage of data reuse.
B. Store scripts and dependencies for each step in separate source directories in order to take advantage of data reuse.
C. Store scripts and dependencies of the whole pipeline in a single source directory in order to reduce the size of the snapshots for given steps.
D. Store scripts and dependencies for each step in separate source directories in order to reduce the size of the snapshots for given steps.
E. Force output regeneration for steps in a run by setting allow_reuse to False.

Answers: B and D.
Option A is incorrect because if all the scripts belonging to the pipeline are kept in a single directory, any change to the directory's contents forces all steps to rerun, i.e., no data reuse.
Option B is CORRECT because steps in a pipeline can be configured to reuse results from their previous runs if the step's scripts, dependencies, inputs, etc. are unchanged.
Keeping the files for each step in a separate folder ensures that when the files of one step change, only that step's output data is regenerated, while the outputs of all other steps remain unchanged and reusable.
Option C is incorrect because if the scripts and dependencies of the whole pipeline are kept in a single directory, snapshotting each step takes an unnecessarily large amount of time and storage.
Option D is CORRECT because keeping the scripts and dependencies of each step in separate directories helps minimize the resources necessary for snapshotting the steps.
Option E is incorrect because the default setting of allow_reuse is True, which means the results of a previous step run are reused as long as the content of the source_directory is unchanged.
To save time and resources, changing this default behavior is recommended only when there is a specific reason why the results must be regenerated.
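The reuse mechanic behind options B and E can be illustrated with a small, self-contained Python sketch. This is not the Azure ML SDK; Step and snapshot_hash are toy stand-ins that mimic the behavior described above: a step's cached output is reused only while the hash of its own source directory is unchanged, so keeping each step's files in a separate directory leaves the other steps' caches valid.

```python
import hashlib
import tempfile
from pathlib import Path

def snapshot_hash(source_directory: str) -> str:
    """Hash every file in a step's source directory; any change alters the hash."""
    digest = hashlib.sha256()
    for path in sorted(Path(source_directory).rglob("*")):
        if path.is_file():
            digest.update(path.name.encode())
            digest.update(path.read_bytes())
    return digest.hexdigest()

class Step:
    """Toy pipeline step that reuses cached output while its sources are unchanged."""
    def __init__(self, name, source_directory, allow_reuse=True):
        self.name = name
        self.source_directory = source_directory
        self.allow_reuse = allow_reuse
        self._cached_hash = None
        self._cached_output = None

    def run(self):
        current = snapshot_hash(self.source_directory)
        if self.allow_reuse and current == self._cached_hash:
            return self._cached_output, True   # reused previous result
        self._cached_hash = current
        self._cached_output = f"output-of-{self.name}"
        return self._cached_output, False      # recomputed

# Demo: two steps, each with its own source directory.
prep_dir, train_dir = tempfile.mkdtemp(), tempfile.mkdtemp()
Path(prep_dir, "prep.py").write_text("print('prep')")
Path(train_dir, "train.py").write_text("print('train')")

prep = Step("prep", prep_dir)
train = Step("train", train_dir)
prep.run(); train.run()                        # first run: both recompute

# Modify only the training step's sources.
Path(train_dir, "train.py").write_text("print('train v2')")
_, prep_reused = prep.run()                    # untouched directory -> reused
_, train_reused = train.run()                  # changed directory  -> rerun
print(prep_reused, train_reused)               # True False
```

Had both steps shared one directory, editing train.py would have changed the single shared hash and forced the prep step to rerun as well, which is exactly why option A fails.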
When designing a workflow using the Pipeline feature of Azure ML stack, it is important to consider how to organize files around the pipeline. Here are the explanations of each option provided:
A. Store scripts and dependencies of the whole pipeline in a single source directory in order to take advantage of data reuse: This option is not recommended as it can create dependency issues and can make it difficult to maintain and update the pipeline. Storing all the scripts and dependencies in a single source directory can lead to larger snapshot sizes, making it harder to manage and reuse data.
B. Store scripts and dependencies for each step in separate source directories in order to take advantage of data reuse: This option is a best practice in pipeline design. By storing scripts and dependencies for each step in separate source directories, it makes it easier to maintain and update the pipeline. It also helps to reduce snapshot size and allows for greater flexibility in reusing data.
C. Store scripts and dependencies of the whole pipeline in a single source directory in order to reduce the size of the snapshots for given steps: This option is not recommended as it can lead to dependency issues and can make it difficult to maintain and update the pipeline. Additionally, storing all the scripts and dependencies in a single source directory can lead to larger snapshot sizes, making it harder to manage and reuse data.
D. Store scripts and dependencies for each step in separate source directories in order to reduce the size of snapshot for given steps: This option is a best practice in pipeline design. By storing scripts and dependencies for each step in separate source directories, it helps to reduce snapshot size and allows for greater flexibility in reusing data. This can lead to faster processing times and more efficient use of resources.
E. Force output regeneration for steps in a run by setting the allow_reuse to False: This option may be useful in some cases but it is not a best practice in pipeline design. Setting allow_reuse to False can slow down processing times and increase resource usage. Instead, it is recommended to design the pipeline to reuse data as much as possible, while allowing for flexibility in case any step needs to be regenerated.
In summary, the best practice in organizing files around the pipeline is to store scripts and dependencies for each step in separate source directories. This allows for greater flexibility in reusing data and can lead to faster processing times and more efficient use of resources.
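In the Azure ML Python SDK (v1), this layout maps onto the source_directory and allow_reuse parameters of PythonScriptStep. The following is a configuration sketch, not a runnable script: it assumes an existing workspace config and a provisioned compute target, and the steps/ directory names are illustrative.

```python
# Configuration sketch using Azure ML SDK v1 (azureml-core, azureml-pipeline-*).
# Assumes a config.json for an existing Workspace and a provisioned compute
# target named "cpu-cluster"; directory and step names are illustrative.
from azureml.core import Workspace
from azureml.pipeline.core import Pipeline
from azureml.pipeline.steps import PythonScriptStep

ws = Workspace.from_config()

# Each step gets its own source_directory, so only a changed step's snapshot
# is re-uploaded and only that step reruns.
prep_step = PythonScriptStep(
    name="prepare-data",
    script_name="prep.py",
    source_directory="steps/prep",   # only prep.py and its dependencies live here
    compute_target="cpu-cluster",
    allow_reuse=True,                # default: reuse while sources are unchanged
)
train_step = PythonScriptStep(
    name="train-model",
    script_name="train.py",
    source_directory="steps/train",  # training sources isolated from prep sources
    compute_target="cpu-cluster",
    allow_reuse=True,
)

pipeline = Pipeline(workspace=ws, steps=[prep_step, train_step])
```

With this layout, editing steps/train/train.py invalidates only the training step's snapshot; the prepare-data step's previous output is reused on the next submitted run.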