Implementing a Data Science Solution on Azure: Efficiently Updating ML Models

Question

You have created an ML pipeline that accepts daily transaction data and makes predictions for the forthcoming period.

Experience shows that patterns in the data change slightly over time, so you decide to retrain the model weekly. This means that every week you have to feed the pipeline new training data.

You want to implement a solution that fulfils this requirement with the least possible effort while keeping the inference pipeline in operation.

How can you achieve this goal?

Answers

Explanations

A. B. C. D.

Answer: A.

Option A is CORRECT because inference pipelines can be configured to accept parameters, which gives them dynamic behavior.

A typical use is to feed the pipeline different sets of data by passing the data source as a parameter.

In Azure ML, the PipelineParameter object is designed for this purpose.

Option B is incorrect because inference pipelines can in fact be configured to accept parameters, for example to run with different sets of data.

In Azure ML, PipelineParameter is the way to achieve this.

PipelineData is a valid Azure ML object, but it is not a run-time parameter: it is designed to pass intermediate data between pipeline steps.

Option C is incorrect because retraining offline and republishing the pipeline every week takes far more effort than configuring and publishing the pipeline once and then driving it with parameters.

Option D is incorrect because, while PipelineParameter is the way to set the dataset for a pipeline dynamically, it must be defined at pipeline level, not at step level.

The best answer for this scenario is A: in the pipeline definition, create a PipelineParameter; publish the modified pipeline; pass the actual parameter value in the REST call as a JSON document when new training data is available.

Explanation:

To fulfill the requirement of re-training the model weekly, the solution should be able to handle new training data in an automated and efficient manner without disrupting the inference pipeline. The suggested solution requires creating a PipelineParameter in the pipeline definition, which is a parameterized value that can be passed to the pipeline at runtime. This parameter will allow the pipeline to accept new training data without modifying the pipeline definition or redeploying the pipeline.
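A minimal sketch of such a pipeline definition, assuming the Azure ML Python SDK v1 (azureml-core and azureml-pipeline). The parameter name training_data_path, the script train.py, the ./src directory, and the step name are hypothetical and are not taken from the question; the location of the training data is passed as a plain string here to keep the example small.

```python
def build_training_pipeline(workspace, compute_target, default_data_path):
    """Return a pipeline whose training data location is a runtime parameter."""
    # Imported inside the function so this module still loads where the
    # Azure ML SDK v1 is not installed.
    from azureml.pipeline.core import Pipeline, PipelineParameter
    from azureml.pipeline.steps import PythonScriptStep

    # Defined at pipeline level: the value can be overridden on every run.
    # "training_data_path" is a hypothetical name chosen for this sketch.
    data_path_param = PipelineParameter(
        name="training_data_path",
        default_value=default_data_path,
    )

    # The step receives the parameter as an ordinary command-line argument.
    # train.py and ./src are hypothetical placeholders.
    train_step = PythonScriptStep(
        name="train_model",
        script_name="train.py",
        source_directory="./src",
        arguments=["--data-path", data_path_param],
        compute_target=compute_target,
        allow_reuse=False,  # retrain even when the step definition is unchanged
    )

    return Pipeline(workspace=workspace, steps=[train_step])
```

Calling publish(name=..., description=...) on the returned pipeline yields a published pipeline with a REST endpoint, so the definition only has to be published once.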

Once the pipeline parameter is defined in the pipeline, it can be published and made available for use. Then, whenever new training data becomes available, the actual parameter value can be passed to the pipeline in the REST call as a JSON document. This can be done programmatically, so that the process can be automated and requires minimal human intervention.
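The weekly REST call can be sketched as follows, using only the Python standard library. The endpoint URL, the token, the experiment name, and the parameter name training_data_path are placeholders assumed for illustration; the body follows the ExperimentName / ParameterAssignments shape used for published Azure ML pipelines. The request is only built here, not sent.

```python
import json
import urllib.request


def build_retraining_request(endpoint_url, auth_header, experiment_name, parameters):
    """Build (but do not send) the POST request that triggers a published pipeline."""
    payload = {
        "ExperimentName": experiment_name,
        # Maps each PipelineParameter name to the value for this run.
        "ParameterAssignments": parameters,
    }
    headers = dict(auth_header)
    headers["Content-Type"] = "application/json"
    return urllib.request.Request(
        endpoint_url,
        data=json.dumps(payload).encode("utf-8"),
        headers=headers,
        method="POST",
    )


# All values below are hypothetical placeholders.
request = build_retraining_request(
    endpoint_url="https://example.invalid/pipelines/placeholder-id",
    auth_header={"Authorization": "Bearer <token>"},
    experiment_name="weekly-retraining",
    parameters={"training_data_path": "transactions/2024-week-01"},
)

# urllib.request.urlopen(request) would submit the run; it is left out so the
# sketch can be executed without network access or credentials.
```

A scheduler can run this once a week with the new data location, which is what keeps the manual effort low.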

This approach provides a flexible and scalable solution, as the pipeline can be used for inference without modification, while new training data can be easily integrated by passing the parameter value in the REST call. This also allows the model to adapt to changing patterns in the data without requiring significant manual intervention or redeployment.

Option B is not correct because creating a PipelineData would not be appropriate in this scenario. PipelineData represents intermediate data that one step produces and another step of the same pipeline consumes, so it is not suitable for passing new training data into the pipeline from outside.

Option C is not the best solution because redeploying the modified inference pipeline every week would require significant manual effort. It would also risk interrupting the inference pipeline, which the requirement says must stay in operation.

Option D is not the best solution because, as noted above, the PipelineParameter has to be defined at pipeline level rather than inside a PythonScriptStep. Otherwise the pipeline would need to be modified each time new training data becomes available, which is neither scalable nor practical.