You have created an ML pipeline of eight steps using the Python SDK.
While tuning the scripts of steps 3 and 6, you submit the pipeline for execution several times.
The scripts and data definitions of the other steps have not changed, yet you notice that every step reruns each time, resulting in long execution times.
You suspect that the pipeline is not reusing the output of steps, and you decide to set allow_reuse to True for each step.
Is this likely to solve the problem?
A. Yes
B. No
Answer: B.
Option A is incorrect because the allow_reuse parameter of pipeline steps is set to True by default.
Setting it explicitly to True in your code changes nothing, so it won't solve the problem.
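Option B is CORRECT because reuse is already enabled, so the reruns must have another cause. If you want to ensure that steps rerun only when their underlying data definitions or scripts change, put the scripts of each step in a separate folder.
Otherwise, when several steps share one source folder, a change to any script in that folder can make every one of those steps look modified, and you might experience unnecessary reruns.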
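The effect of a shared folder can be illustrated with a minimal sketch. It assumes that reuse is decided by fingerprinting the contents of each step's source folder; `folder_fingerprint` and `steps_to_rerun` are hypothetical helpers written for this illustration, not SDK calls, and the SDK's actual change detection may differ in detail.

```python
import hashlib
import tempfile
from pathlib import Path


def folder_fingerprint(folder: Path) -> str:
    """Hash the relative path and bytes of every file under `folder` (hypothetical helper)."""
    digest = hashlib.sha256()
    for path in sorted(folder.rglob("*")):
        if path.is_file():
            digest.update(path.relative_to(folder).as_posix().encode())
            digest.update(path.read_bytes())
    return digest.hexdigest()


def steps_to_rerun(step_dirs: dict, edit_file: Path) -> list:
    """Return the steps whose folder fingerprint changes after `edit_file` is edited."""
    before = {name: folder_fingerprint(d) for name, d in step_dirs.items()}
    edit_file.write_text(edit_file.read_text() + "# tuned\n")
    after = {name: folder_fingerprint(d) for name, d in step_dirs.items()}
    return [name for name in step_dirs if before[name] != after[name]]


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)

    # Layout 1: all eight scripts live in one shared folder.
    shared = root / "shared"
    shared.mkdir()
    for i in range(1, 9):
        (shared / f"step{i}.py").write_text(f"print('step {i}')\n")
    shared_dirs = {f"step{i}": shared for i in range(1, 9)}
    shared_reruns = steps_to_rerun(shared_dirs, shared / "step3.py")

    # Layout 2: each step has its own folder.
    separate_dirs = {}
    for i in range(1, 9):
        step_dir = root / "separate" / f"step{i}"
        step_dir.mkdir(parents=True)
        (step_dir / "run.py").write_text(f"print('step {i}')\n")
        separate_dirs[f"step{i}"] = step_dir
    separate_reruns = steps_to_rerun(separate_dirs, separate_dirs["step3"] / "run.py")

print(len(shared_reruns))  # number of steps invalidated in the shared layout
print(separate_reruns)     # steps invalidated in the per-step layout
```

Under this assumption, editing the script of step 3 invalidates all eight steps in the shared layout, but only step 3 when each step has its own folder.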
Setting allow_reuse to True for a step in an ML pipeline means that if the step has already executed successfully and its script, inputs, and parameters are unchanged, the pipeline reuses the output of that step rather than rerunning it.
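In the given scenario, the scripts and data definitions of steps 1, 2, 4, 5, 7, and 8 did not change while steps 3 and 6 were being tuned, yet those steps reran anyway. Because reuse is already allowed by default, the reruns are unlikely to be caused by the allow_reuse setting. A more likely cause is that the steps share a source folder, so every edit to the scripts of steps 3 and 6 makes the other steps look changed as well. This results in longer execution times, especially if these steps take a significant amount of time to execute.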
With allow_reuse set to True, which is the default, the pipeline checks whether the output of each step has already been generated from unchanged scripts and inputs, and if so, it reuses that output rather than rerunning the step. This can save a lot of execution time, especially if the steps being reused are computationally expensive.
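Therefore, the correct answer is B. No: setting allow_reuse to True for each step will probably not solve the problem of all steps rerunning each time, because True is already the default value. Placing the scripts of each step in a separate folder is the more likely fix for the long execution times.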