You manage a team of data scientists who use a cloud-based backend system to submit training jobs.
This system has become very difficult to administer, and you want to use a managed service instead.
The data scientists you work with use many different frameworks, including Keras, PyTorch, Theano, Scikit-learn, and custom libraries.
What should you do?
Because the team uses many different frameworks, including Keras, PyTorch, Theano, Scikit-learn, and custom libraries, the managed service you choose must be flexible enough to accept training jobs written in any of them. Option A, using the AI Platform custom containers feature, is the best choice.
Option A: Use the AI Platform custom containers feature to receive training jobs using any framework.
AI Platform provides a custom container feature that allows users to specify their own container image to use for training jobs. This means that the data scientists on the team can create their own custom container images that include the specific libraries and frameworks they need to use for their jobs. These custom containers can be stored in a container registry and referenced when submitting training jobs to AI Platform. This approach offers a high degree of flexibility, as it allows the team to use any framework they need.
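To make this concrete, below is a minimal sketch of submitting such a job with the Google API Python client, assuming a container image has already been built and pushed to Container Registry. The project ID, image URI, and job ID are placeholders for illustration, not values taken from the question.

```python
# Minimal sketch: submit an AI Platform training job that runs a custom
# container image. Project ID, image URI, and job ID are placeholders.
from googleapiclient import discovery

PROJECT_ID = "my-project"                                  # hypothetical project
IMAGE_URI = f"gcr.io/{PROJECT_ID}/sklearn-trainer:latest"  # hypothetical image

ml = discovery.build("ml", "v1")  # AI Platform Training API client

job_body = {
    "jobId": "sklearn_training_001",  # must be unique within the project
    "trainingInput": {
        "scaleTier": "BASIC",
        "region": "us-central1",
        "masterConfig": {
            # The custom container bundles whatever framework the job needs
            # (Keras, PyTorch, Theano, Scikit-learn, custom libraries, ...).
            "imageUri": IMAGE_URI,
        },
    },
}

request = ml.projects().jobs().create(
    parent=f"projects/{PROJECT_ID}", body=job_body
)
response = request.execute()  # submits the job to AI Platform Training
print(response)
```

Because the framework lives entirely inside the image, the same submission path works regardless of which library a data scientist uses; only the image URI changes from job to job.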
Option B: Configure Kubeflow to run on Google Kubernetes Engine and receive training jobs through TFJob.
Kubeflow is an open-source platform that provides a set of tools for building and deploying machine learning workflows on Kubernetes. It is a powerful way to orchestrate machine learning workloads, but the TFJob operator named in this option is designed specifically for TensorFlow training. If the team worked primarily in TensorFlow, this option might be worth considering; since the team uses many different frameworks, it would be limiting. Running Kubeflow on Google Kubernetes Engine also means you still administer the cluster and the Kubeflow deployment, which conflicts with the goal of moving to a managed service.
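For comparison, a TFJob is a TensorFlow-specific Kubernetes custom resource. The sketch below, which assumes a running Kubeflow installation on a GKE cluster you already have credentials for and uses the official Kubernetes Python client, illustrates why: the resource kind, the tfReplicaSpecs field, and the expected container name are all TensorFlow-oriented. The namespace, job name, and image URI are placeholders.

```python
# Minimal sketch: create a TFJob on a Kubeflow installation running on GKE.
# Assumes kubeconfig credentials for the cluster are already set up;
# namespace, job name, and image URI are placeholders.
from kubernetes import client, config

config.load_kube_config()  # load local kubeconfig for the GKE cluster

tfjob = {
    "apiVersion": "kubeflow.org/v1",
    "kind": "TFJob",                     # TensorFlow-specific resource kind
    "metadata": {"name": "keras-train", "namespace": "default"},
    "spec": {
        "tfReplicaSpecs": {              # replica roles follow TF's distribution model
            "Worker": {
                "replicas": 1,
                "restartPolicy": "OnFailure",
                "template": {
                    "spec": {
                        "containers": [{
                            "name": "tensorflow",  # TFJob's expected container name
                            "image": "gcr.io/my-project/tf-trainer:latest",
                        }]
                    }
                },
            }
        }
    },
}

client.CustomObjectsApi().create_namespaced_custom_object(
    group="kubeflow.org", version="v1",
    namespace="default", plural="tfjobs", body=tfjob,
)
```

Other frameworks would need different operators (for example, PyTorchJob for PyTorch), so a single TFJob-based entry point cannot serve the whole team.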
Option C: Create a library of VM images on Compute Engine and publish these images on a centralized repository.
While Compute Engine provides a great deal of flexibility and control over virtual machines, it is not a managed training service. This option would require building and maintaining a separate VM image for each framework the team uses, which is time-consuming and difficult to manage, and keeping those images up to date with the latest version of each framework would be an ongoing burden. In other words, it recreates the administrative overhead you are trying to eliminate.
Option D: Set up Slurm workload manager to receive jobs that can be scheduled to run on your cloud infrastructure.
Slurm is a workload manager commonly used in high-performance computing environments to schedule compute-intensive batch jobs. While it can be used to run machine learning workloads, it is not a managed service: you would still have to install, configure, and operate the scheduler on your cloud infrastructure yourself, which does not reduce the administrative burden. It also offers nothing specific to help support the many different frameworks the team uses.