Request to Tensorflow Serving Server got timeout error for models using grpc python API

Synopsis

10 models for computer vision was deployed to tensorflow serving server (TSS) running on Ubuntu 22.04.
TSS installed as system service on the dedicated server having MSI RTX3060 12 Gb on board. System configuration and TSS service file below.

Problem

Request send via the tensorflow-serving grpc api randomly got status code DEADLINE_EXCEEDED or UNAVAILABLE sometimes on the first request but more often after some number (1…4) of successful requests or after some period of inactivity ( 1 hour or more ).
No OMM or service dump happened. GPU memory occupation is near 6 Gb. Service logs seems to have no problem indication, no warnings also (debug level 3).
Some experiments and results in detail see below.

System

  • [OS] Ubuntu 22.04.1 LTS
  • [CPU] 11th Gen Intel(R) Core™ i5-11400F @ 2.60GHz
  • [GPU] MSI GeForce RTX 3060 12G
  • [RAM] 16Gb
  • [SSD] NVME 1Tb
  • [Tensorflow] Version 2.9.1
  • [CUDA] Version 11.7
  • [CUDNN] Version 8.6.0.163-1+cuda11.8
  • [TensorRT] Not used while building tensorflow-serving

TSS service file

[Unit]
Description=Tensorflow Serving Service
After=network-online.target

[Service]
User=root
Environment="PATH=/usr/local/bin:/usr/bin:/usr/sbin:/usr/local/sbin:/usr/local/cuda/bin:/usr/bin/model_servers"
Environment="LD_LIBRARY_PATH=/usr/local/cuda-11.7/lib64"
Environment="CUDA_VISIBLE_DEVICES=0"
Environment="TF_FORCE_GPU_ALLOW_GROWTH=true"
ExecStart=/usr/bin/model_servers/tensorflow_model_server --port=8504 --model_config_file=/mnt/data/models/export/frman.conf

[Install]
WantedBy=multi-user.target

Hints

  1. Tensorflow serving service intializes after network on host is available․
  2. Service configured to allocate gpu memory when it needed (environment variable - TF_FORCE_GPU_ALLOW_GROWTH=true).

Hypotesis - actions

  1. The problem is lost packets on the network
    • Requests to TSS was monitored with wireshark from client side and with GNOME system monitor in server side. NO problem detected.
    • Timeout value for single request on client side of tensorflow_serving.apis.prediction_service_pb2_grpc.PredictionServiceStub object increased .

      stub.Predict(request, timeout * len(images))
    • The grpс channel is being checked for readiness before data transmission to begin
    • Interceptors for grpc requests added. Procedure repeat request with exponential backoff but nevertheless it returned status code DEADLINE_EXCEEDED or UNAVAILABLE randomly.
		options_grpc = [
			('grpc.max_send_message_length', 100 * 1024 * 1024),
			('grpc.max_receive_message_length', 100 * 1024 * 1024),
			('grpc.default_compression_algorithm', grpc.Compression.Gzip),
			('grpc.grpc.default_compression_level', CompressionLevel.high)
		]
		interceptors = (
			RetryOnRpcErrorClientInterceptor(
				max_attempts=5,
				sleeping_policy=ExponentialBackoff(init_backoff_ms=1000, max_backoff_ms=32000, multiplier=2),
				status_for_retry=(grpc.StatusCode.DEADLINE_EXCEEDED, grpc.StatusCode.UNKNOWN, grpc.StatusCode.UNAVAILABLE),
			),
		)
		channel = grpc.insecure_channel(f"{host}:{port}", options=options_grpc)
		grpc.channel_ready_future(channel).result(timeout=5)
		channel = grpc.intercept_channel(channel, *interceptors)
  1. The problem is in the models themselves
    Was noticed that problems mostly arise with models with U2net architecture. U2net uses some custom operations and was assumed that the problem arises from the fact that at the first request time outed because these operations for loading ops are too long.
    That was found on the TSS service log file. To resolve this we try:

    • Add a warm-up that kind of models at the service startup, so all custom network operations loaded into memory before inference.
    • Custom operations in U2Net was eliminated by converting the models to onnx format and then to tensorflow saved model format. So, after that there is no need for the TSS to load network custom ops at startup. Model warm up added also.
  2. The problem with the lack of memory - There was noticed another alarming message in TSS service log:

tensorflow_model_server[62029]: 2022-12-20 14:10:13.900599: W external/org_tensorflow/tensorflow/tsl/framework/bfc_allocator.cc:360] Garbage collection: deallocate free memory regions (i.e., allocations) so that we can re-allocate a larger region to avoid OOM due to memory fragmentation. If you see this message frequently, you are running near the threshold of the available device memory and re-allocation may incur great performance overhead. You may try smaller batch sizes to observe the performance impact. Set TF_ENABLE_GPU_GARBAGE_COLLECTION=false if you'd like to disable this feature.

It looks line not enough memory for inference on GPU. To solve this we try to limit image batch size to 1(one) image for TSS and to set environment variable - TF_FORCE_GPU_ALLOW_GROWTH=true.
Memory consumption not increased after that. But random timeout (DEADLINE_EXCEEDED) error not dissapied.

Conclusion - Issue NOT Resolved

Thus, the problem is still in place especially when TSS inference segmentaion models (like U2Net).
Not Root of the problem was found.
Difficult to reproduce error as of its random type.

What else should be worth to check or configure to resolve the issue?