TensorFlow GPU prediction concurrency in Python

manu1 · June 12, 2023, 10:48am

I’m running a localhost API in Python. The /predict endpoint takes one base64-encoded image, which is then processed and passed to a SavedModel that was loaded at startup. I’m predicting on my GPU using CUDA. The problem is that the model only predicts one image at a time (due to the GIL, I think), but I’d like to predict multiple images concurrently when several requests come in at the same time. At the moment my GPU utilization is only 1-2%. Is there any way to accomplish this in Python, similar to the TensorFlow Serving API, which sadly doesn’t work properly with GPU on Windows?

This is my code:

from fastapi import FastAPI
import uvicorn
from pydantic import BaseModel
import tensorflow as tf
import numpy as np
import io
from PIL import Image
import base64
import time

model_path = './saved_model'
model = tf.saved_model.load(model_path)

class PredictionRequest(BaseModel):
    image: str

app = FastAPI()

@app.post("/predict")
async def predict_objects(request: PredictionRequest):
    
    # Decode the base64 payload into an image and add a batch dimension.
    image_data = base64.b64decode(request.image)
    image = Image.open(io.BytesIO(image_data))
    image_np = np.array(image)
    input_tensor = tf.convert_to_tensor(image_np)
    input_tensor = input_tensor[tf.newaxis, ...]
    # Time the model call and print the elapsed milliseconds.
    ts = time.perf_counter()
    detections = model(input_tensor)
    ts2 = time.perf_counter()
    print(int((ts2 - ts) * 1000))
    detections = detections['detection_boxes'][0].numpy()
    data = []
    
    # processing return data

    return data

if __name__ == "__main__":
    uvicorn.run(app, host="127.0.0.1", port=8000)
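
For illustration, the kind of concurrency asked about here could be sketched as request micro-batching: requests that arrive close together are collected and sent to the model as one batch, and the blocking model call runs in a worker thread so the event loop stays free. This is only a rough sketch under assumptions, not a tested solution for the code above. `MicroBatcher`, `fake_model`, `MAX_BATCH` and `MAX_WAIT_S` are hypothetical names, and `fake_model` stands in for the real `model(input_tensor)` call. With a real model the collected images would have to be stacked into one tensor, which assumes they all share the same shape.

```python
import asyncio

MAX_BATCH = 8       # hypothetical cap on how many requests go into one model call
MAX_WAIT_S = 0.01   # hypothetical time to wait for more requests to arrive


def fake_model(batch):
    """Stand-in for the real model call; returns one result per input."""
    return [x * 2 for x in batch]


class MicroBatcher:
    """Collects concurrent requests and runs them through the model together."""

    def __init__(self, predict_fn, max_batch=MAX_BATCH, max_wait_s=MAX_WAIT_S):
        self.predict_fn = predict_fn
        self.max_batch = max_batch
        self.max_wait_s = max_wait_s
        self.queue = asyncio.Queue()
        self.batch_sizes = []  # kept only so the demo can inspect batching

    async def submit(self, item):
        """Called once per request; resolves when that item's result is ready."""
        fut = asyncio.get_running_loop().create_future()
        await self.queue.put((item, fut))
        return await fut

    async def run(self):
        """Background worker: build a batch, predict, hand results back."""
        loop = asyncio.get_running_loop()
        while True:
            item, fut = await self.queue.get()
            items, futs = [item], [fut]
            deadline = loop.time() + self.max_wait_s
            # Keep collecting until the batch is full or the wait time is up.
            while len(items) < self.max_batch:
                timeout = deadline - loop.time()
                if timeout <= 0:
                    break
                try:
                    item, fut = await asyncio.wait_for(self.queue.get(), timeout)
                except asyncio.TimeoutError:
                    break
                items.append(item)
                futs.append(fut)
            # Run the blocking call in a thread so other requests can queue up.
            results = await loop.run_in_executor(None, self.predict_fn, items)
            self.batch_sizes.append(len(items))
            for f, r in zip(futs, results):
                f.set_result(r)


async def demo():
    batcher = MicroBatcher(fake_model)
    worker = asyncio.create_task(batcher.run())
    # Five "requests" submitted concurrently.
    results = await asyncio.gather(*(batcher.submit(i) for i in range(5)))
    worker.cancel()
    return results, batcher.batch_sizes


if __name__ == "__main__":
    print(asyncio.run(demo()))
```

In a FastAPI app, the worker task would presumably be started once at startup and the endpoint would `await batcher.submit(...)` instead of calling the model directly. The sketch leaves out error handling: if the model call raises, the waiting requests are never answered.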