How can I make TFLite handle multiple API calls simultaneously?

Hi all,

I am working on cloud cost optimization on AWS and am currently exploring deployment of TFLite models on AWS t4g instances, which use AWS's custom ARM chip, Graviton2. They offer roughly 2x the performance at half the cost of t3 instances. I have created a basic deployment container for InceptionNet-v3 pretrained on ImageNet (no fine-tuning) using FastAPI.

The InceptionNet-v3 TFLite model is quantized to float16. The deployment code is shared below.

import tflite_runtime.interpreter as tflite
import numpy as np
from PIL import Image
import copy

def load_model(model_path):
    '''
    Used to load the model as a global variable
    '''
    model_interpreter = tflite.Interpreter(model_path)   
    return model_interpreter

def infer(image, model_interpreter):
    '''
    Runs inference on the input image
    '''
    model_interpreter.allocate_tensors()
    input_details = model_interpreter.get_input_details()[0]
    output_details = model_interpreter.get_output_details()[0]
    # tensor() returns a callable that gives a numpy view into the
    # interpreter's internal buffer
    input_tensor = model_interpreter.tensor(input_details["index"])
    output_tensor = model_interpreter.tensor(output_details["index"])

    # preprocess: resize to the model's input resolution, scale to [0, 1]
    image = image.resize(input_details["shape"][1:-1])
    image = np.asarray(image, dtype=np.float32)
    image = np.expand_dims(image, 0)
    image = image / 255

    input_tensor()[:] = image
    model_interpreter.invoke()
    # deep-copy the output so no reference to internal data is kept
    results = copy.deepcopy(output_tensor())
    
    return results
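
For context, this is roughly how the interpreter and infer function are wired into the FastAPI app (the endpoint path, model filename, and app layout are simplified placeholders):

from fastapi import FastAPI, UploadFile
from PIL import Image
import io

app = FastAPI()

# interpreter is loaded once at import time and shared by every request
model_interpreter = load_model("inception_v3_float16.tflite")

@app.post("/predict")
async def predict(file: UploadFile):
    image = Image.open(io.BytesIO(await file.read())).convert("RGB")
    results = infer(image, model_interpreter)
    return {"predictions": results.tolist()}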

When I run my load-testing program (written with Locust, a Python load-testing package), I don't see any errors until the user count exceeds 3. After that, I start seeing the runtime error below:

RuntimeError: There is at least 1 reference to internal data
      in the interpreter in the form of a numpy array or slice. Be sure to
      only hold the function returned from tensor() if you are using raw
      data access.

I don't see the error on every request; it appears randomly.
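
For reference, the load test is essentially the following (the endpoint path and sample image are placeholders):

from locust import HttpUser, task, between

class InferenceUser(HttpUser):
    wait_time = between(0.5, 1.0)

    @task
    def predict(self):
        # each simulated user repeatedly POSTs a sample image to the endpoint
        with open("sample.jpg", "rb") as f:
            self.client.post("/predict", files={"file": f})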

When I moved the model-loading lines inside the infer function, I stopped seeing this error. However, that loads a new model on every request, which is fine while the model is small but will eventually contribute a significant amount to the inference time.
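
In other words, something like this:

def infer(image, model_path):
    # a fresh interpreter per request avoids the shared-state error,
    # but pays the model-loading cost on every call
    model_interpreter = tflite.Interpreter(model_path)
    model_interpreter.allocate_tensors()
    # ... same preprocessing, invoke() and output copy as above ...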

Is there a way to solve this intermittent error while keeping the TFLite model as a global variable?

Thanks.

EDIT:
I tried out the signature runner API. I didn't get the error, but for some reason my Docker container kept exiting without any error message. I suspect it was because of a lack of memory.
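
What I tried looks roughly like this (the input/output tensor names depend on how the model was converted, so they are placeholders here, and preprocess() stands in for the same resize/scale steps as before):

model_interpreter = tflite.Interpreter("inception_v3_float16.tflite")
# the runner allocates tensors and manages input/output access internally
runner = model_interpreter.get_signature_runner()

def infer(image, runner):
    image = preprocess(image)          # placeholder for the resize/scale steps above
    outputs = runner(input_1=image)    # 'input_1' is a placeholder input name
    return outputs["output_0"]         # placeholder output name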

Does the signature runner create multiple interpreter instances in different threads while the current interpreter is running, thus using up all the available RAM?