You are working for a large pharmaceutical company, and their research laboratory generates a huge amount of test results daily.
As a data scientist, you are tasked with building an ML solution that ingests the batches of data and generates predictions that the research engineers can use further.
You want to build a machine learning pipeline deployed as a web service, which can be fed with the daily batches of data.
Which of the following should you consider as best practice while building your pipeline?
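A. The pipeline steps must use the same compute target.
B. Scripts for each pipeline step should be kept in separate folders.
C. Pipeline steps should be configured not to allow reuse.
D. When publishing the pipeline, interactive authentication is the recommended way of authentication.
Answer: B.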
Option A is incorrect because the compute target to be used can be configured for each pipeline step, although it is a common practice to set it at the pipeline level.
Option B is CORRECT because keeping the scripts of each pipeline step in a separate source directory helps save resources: if anything changes in a particular step's folder, only the affected step is re-run.
Although nothing prevents you from keeping all your code in a single folder, this is not recommended, because a change to any file there affects every step that uses the folder.
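The effect of separate folders can be illustrated with a small standard-library model. This is only a sketch of the idea, not how Azure ML is implemented: the `fingerprint` and `steps_to_rerun` helpers, the folder names, and the script contents are all hypothetical. It assumes a step is considered changed whenever any file in its source directory changes.

```python
import hashlib
import tempfile
from pathlib import Path


def fingerprint(source_dir: Path) -> str:
    """Hash every file path and its bytes under source_dir, in sorted order."""
    digest = hashlib.sha256()
    for path in sorted(p for p in source_dir.rglob("*") if p.is_file()):
        digest.update(path.relative_to(source_dir).as_posix().encode())
        digest.update(path.read_bytes())
    return digest.hexdigest()


def steps_to_rerun(step_dirs: dict, previous: dict) -> list:
    """Return the steps whose folder fingerprint differs from the previous run."""
    return [name for name, folder in step_dirs.items()
            if fingerprint(folder) != previous.get(name)]


def demo() -> tuple:
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)

        # Layout 1: one source folder per step.
        prep, train = root / "prep", root / "train"
        for folder, script in ((prep, "prep.py"), (train, "train.py")):
            folder.mkdir()
            (folder / script).write_text("print('v1')\n")
        separate = {"prep": prep, "train": train}
        before = {name: fingerprint(folder) for name, folder in separate.items()}
        (train / "train.py").write_text("print('v2')\n")  # edit the train step only
        separate_result = steps_to_rerun(separate, before)

        # Layout 2: both steps share a single folder.
        shared = root / "shared"
        shared.mkdir()
        (shared / "prep.py").write_text("print('v1')\n")
        (shared / "train.py").write_text("print('v1')\n")
        shared_steps = {"prep": shared, "train": shared}
        before = {name: fingerprint(folder) for name, folder in shared_steps.items()}
        (shared / "train.py").write_text("print('v2')\n")  # same edit as above
        shared_result = steps_to_rerun(shared_steps, before)

    return separate_result, shared_result


separate_result, shared_result = demo()
print(separate_result)  # ['train']
print(shared_result)    # ['prep', 'train']
```

With one folder per step, the edit to `train.py` invalidates only the train step. With a shared folder, the same edit invalidates both steps.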
Option C is incorrect because allowing steps to reuse the results of previous runs can radically reduce the resource needs and execution time of the pipeline.
Allow reuse whenever it is applicable.
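A minimal sketch of how these points might look with the Azure ML Python SDK (v1) follows. It assumes the `azureml-pipeline` packages are installed wherever `build_pipeline` is called. The `StepSpec` class, the step names, script names, folder paths, and the `cpu`/`gpu` keys are hypothetical and only illustrate one folder per step, a compute target chosen per step, and reuse left enabled.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StepSpec:
    """Hypothetical description of one pipeline step."""
    name: str
    script_name: str
    source_directory: str     # one folder per step
    compute_key: str          # key into a dict of compute targets
    allow_reuse: bool = True  # matches the PythonScriptStep default


# Hypothetical steps: each has its own folder and may use its own compute.
STEP_SPECS = [
    StepSpec("prepare_data", "prep.py", "steps/prep", "cpu"),
    StepSpec("score_batch", "score.py", "steps/score", "gpu"),
]


def build_pipeline(workspace, compute_targets, specs=STEP_SPECS):
    """Turn the specs into an Azure ML pipeline.

    The SDK is imported lazily so the specs above can be inspected
    without the azureml packages installed.
    """
    from azureml.pipeline.core import Pipeline
    from azureml.pipeline.steps import PythonScriptStep

    steps = [
        PythonScriptStep(
            name=spec.name,
            script_name=spec.script_name,
            source_directory=spec.source_directory,
            compute_target=compute_targets[spec.compute_key],
            allow_reuse=spec.allow_reuse,
        )
        for spec in specs
    ]
    return Pipeline(workspace=workspace, steps=steps)
```

Here `workspace` would be an `azureml.core.Workspace` and `compute_targets` a dictionary of compute targets already provisioned in it. Both are left to the caller.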
Option D is incorrect because interactive authentication (InteractiveLoginAuthentication) requires a user to sign in and should be used only for development and testing purposes.
For production scenarios, use ServicePrincipalAuthentication.
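A hedged sketch of non-interactive authentication follows. The environment variable names (`AML_TENANT_ID`, `AML_SP_ID`, `AML_SP_SECRET`) and the helper functions are assumptions made for illustration. Only `ServicePrincipalAuthentication` and its `get_authentication_header` method come from the Azure ML SDK (v1), which must be installed for `get_auth_header` to work.

```python
import os

# Hypothetical environment variable names, mapped to the constructor
# arguments of ServicePrincipalAuthentication.
ENV_VARS = {
    "tenant_id": "AML_TENANT_ID",
    "service_principal_id": "AML_SP_ID",
    "service_principal_password": "AML_SP_SECRET",
}


def load_sp_credentials(env=os.environ) -> dict:
    """Read the service principal settings, failing early if any is missing."""
    missing = [var for var in ENV_VARS.values() if not env.get(var)]
    if missing:
        raise RuntimeError("Missing settings: " + ", ".join(missing))
    return {arg: env[var] for arg, var in ENV_VARS.items()}


def get_auth_header(env=os.environ) -> dict:
    """Build the HTTP header for calling a published pipeline's REST endpoint."""
    # Imported lazily so the credential loading can be tested without the SDK.
    from azureml.core.authentication import ServicePrincipalAuthentication

    auth = ServicePrincipalAuthentication(**load_sp_credentials(env))
    return auth.get_authentication_header()
```

Keeping the secret in an environment variable, or in a secret store that populates it, means no credentials are hard-coded and no user has to be present when the daily batch is submitted.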
[Diagram: Working with pipelines]
As a data scientist tasked with building an ML solution for a pharmaceutical company's research laboratory, you want to ensure that the pipeline you build is efficient, scalable, and easy to maintain. Each option is assessed below against that goal:
A. The pipeline steps must use the same compute target: This is not a requirement. The compute target can be configured for each pipeline step, so different steps can run on different infrastructure when their resource needs differ.
B. Scripts for each pipeline step should be kept in separate folders: This is the best practice. Keeping the scripts for each step in a separate folder keeps the pipeline organized and easy to maintain, and makes it clear which scripts belong to which step, which simplifies debugging and troubleshooting. It also saves resources, because a change in one step's folder causes only that step to be re-run.
C. Pipeline steps should be configured not to allow reuse: This is not recommended. Disabling reuse forces every step to run again on each pipeline run, even when nothing relevant to it has changed. Allowing reuse of the results of previous runs can radically reduce the resource needs and execution time of the pipeline.
D. When publishing the pipeline, interactive authentication is the recommended way of authentication: This is not recommended. Interactive authentication prompts a user for credentials at runtime, which suits development and testing but not an automated service that receives daily batches. For production scenarios, use service principal authentication.
In summary, when building a machine learning pipeline deployed as a web service for a pharmaceutical company's research laboratory, keep the scripts for each pipeline step in separate folders (option B). In addition, choose the compute target per step where that helps, allow steps to reuse results whenever applicable, and use service principal authentication rather than interactive authentication in production.