Improving Serving Latency for an ML Model on AI Platform with GKE

Question

You developed an ML model with AI Platform, and you want to move it to production.

You serve a few thousand queries per second and are experiencing latency issues.

Incoming requests are served by a load balancer that distributes them across multiple Kubeflow CPU-only pods running on Google Kubernetes Engine (GKE).

Your goal is to improve the serving latency without changing the underlying infrastructure.

What should you do?

Answers

A. Significantly increase the max_batch_size TensorFlow Serving parameter.

B. Switch to the tensorflow-model-server-universal version of TensorFlow Serving.

C. Significantly increase the max_enqueued_batches TensorFlow Serving parameter.

D. Recompile TensorFlow Serving using the source to support CPU-specific optimizations. Instruct GKE to choose an appropriate baseline minimum CPU platform for serving nodes.

Correct answer: D

Explanation

To improve serving latency without changing the underlying infrastructure, the place to look is how TensorFlow Serving itself is built and configured. Here is each answer choice and why it does or does not help:

A. Significantly increase the max_batch_size TensorFlow Serving parameter: The max_batch_size parameter sets the maximum number of requests that TensorFlow Serving groups into a single batch before running inference. Larger batches improve hardware utilization and throughput, but each request must wait for its batch to fill (or for the batch timeout to expire) and then for a larger batch to be processed, so per-request latency on CPU-only pods typically gets worse rather than better. Since the goal here is lower latency rather than higher throughput, this is not the right lever; see the sketch after this paragraph for where the parameter lives.
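For context, batching is configured through a parameters file passed to the model server. A minimal, hedged sketch follows; the file name and values are illustrative, and the fields correspond to TensorFlow Serving's batching parameters:

    # batching.config, passed to the server via:
    #   tensorflow_model_server --enable_batching \
    #       --batching_parameters_file=/path/to/batching.config
    max_batch_size { value: 64 }          # max requests grouped into one batch (illustrative)
    batch_timeout_micros { value: 1000 }  # how long to wait for a batch to fill
    num_batch_threads { value: 4 }        # typically at most the available CPU cores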

B. Switch to the tensorflow-model-server-universal version of TensorFlow Serving: The tensorflow-model-server-universal build is compiled with only basic optimizations and without platform-specific instruction sets, so that it runs on older or unusual hardware. It is the compatibility build, not a faster one; on the modern CPUs backing the GKE nodes it will perform the same as or worse than the standard build. Switching to it would not reduce latency in this scenario.
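For reference, the two builds ship as separate packages. A hedged sketch of installing them (package names follow the TensorFlow Serving documentation; repository setup is omitted):

    # Standard build: compiled with platform-specific optimizations where available
    apt-get install tensorflow-model-server

    # Universal build: basic optimizations only, intended for older CPUs that lack
    # instruction sets such as AVX; a compatibility fallback, not a speedup
    apt-get install tensorflow-model-server-universal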

C. Significantly increase the max_enqueued_batches TensorFlow Serving parameter: The max_enqueued_batches parameter bounds how many batches may wait in the scheduling queue before TensorFlow Serving starts rejecting new requests. Raising it lets the server absorb traffic bursts without turning requests away, but every batch sitting in a deeper queue waits longer before it is processed, so tail latency tends to increase. It is a throughput and availability knob, not a latency fix.
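This parameter lives in the same batching parameters file sketched above; an illustrative entry:

    # Illustrative entry in the same batching_parameters_file.
    # A deeper queue avoids rejected requests during bursts, but queued work
    # still has to wait its turn, so this alone does not lower latency.
    max_enqueued_batches { value: 100 }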

D. Recompile TensorFlow Serving using the source to support CPU-specific optimizations. Instruct GKE to choose an appropriate baseline minimum CPU platform for serving nodes: The prebuilt TensorFlow Serving binaries are compiled conservatively so they run on a wide range of CPUs. Rebuilding from source with CPU-specific instruction sets (for example AVX2 and FMA) lets each inference use the hardware more efficiently, and setting a baseline minimum CPU platform for the GKE serving nodes guarantees that every node actually supports those instructions. Together these changes make each request faster on the existing CPU-only pods, which directly reduces serving latency without changing the underlying infrastructure.
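A hedged sketch of what this can look like; the build flags, pool name, and cluster name are illustrative and must be matched to the CPU platform the node pool is pinned to:

    # Rebuild the model server with CPU-specific instruction sets enabled
    # (illustrative flags; choose ones supported by the pinned CPU platform)
    bazel build -c opt --copt=-mavx2 --copt=-mfma \
        tensorflow_serving/model_servers:tensorflow_model_server

    # Pin the serving node pool to a baseline minimum CPU platform so every
    # node is guaranteed to support the instruction sets compiled in above
    gcloud container node-pools create serving-pool \
        --cluster=my-cluster \
        --min-cpu-platform="Intel Skylake"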

Therefore, the best option for improving serving latency without changing the underlying infrastructure is option D, which involves recompiling TensorFlow Serving with CPU-specific optimizations and instructing GKE to choose an appropriate baseline minimum CPU platform for serving nodes.