TensorFlow Serving latency spikes

Hi there,

I’ve been running into seemingly random latency spikes when using TensorFlow Serving. I have seen the issue with TensorFlow Decision Forests (TFDF) gradient boosted tree models, TFDF random forest models, and TensorFlow deep learning models.

I reproduced the issue using the classic penguin classification dataset. Most of my predictions complete in under 5 ms. However, I randomly get spikes above 50 ms, which makes me worried about putting these models into production.

I created a repo here that reproduces the issue along with instructions in the README.

The training code is from the TensorFlow Decision Forests penguin classification tutorial here. Here is a snippet of the training code:

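import tensorflow_decision_forests as tfdf
from wurlitzer import sys_pipes

# train_ds_pd, test_ds_pd and label are defined earlier in the tutorial.
train_ds = tfdf.keras.pd_dataframe_to_tf_dataset(train_ds_pd, label=label)
test_ds = tfdf.keras.pd_dataframe_to_tf_dataset(test_ds_pd, label=label)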

# Specify the model.
model_1 = tfdf.keras.RandomForestModel()

# Optionally, add evaluation metrics.
model_1.compile(
    metrics=["accuracy"])

# Train the model.
# "sys_pipes" is optional. It enables the display of the training logs.
with sys_pipes():
  model_1.fit(x=train_ds)

And here is a snippet of the request code calling TF serving:

import json
import time

import requests

data = {
    "instances": [
        {
            "bill_length_mm": 20.0,
            "body_mass_g": 100.0,
            "bill_depth_mm": 100.0,
            "flipper_length_mm": 100.0,
            "island": "Mallorca",
            "sex": "male",
            "year": 2021,
        }
    ]
}


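def make_request(data):
    """Send one prediction request and return its wall time in milliseconds."""
    # perf_counter is a monotonic, high-resolution clock, better suited
    # to timing short intervals than time.time
    start_time = time.perf_counter()

    # Send the request to the fake_model_id_gbt model
    requests.post(
        "http://localhost:8501/v1/models/fake_model_id_gbt:predict",
        # The body is JSON, so label it as such
        headers={"Content-Type": "application/json"},
        data=json.dumps(data),
    )

    # Track end time and convert the elapsed seconds to milliseconds
    end_time = time.perf_counter()
    pred_time = (end_time - start_time) * 1000
    # print("prediction time was {} ms".format(pred_time))
    return pred_time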
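
To quantify the spikes rather than eyeball them, the timings returned by make_request can be summarized with percentiles. This is only a minimal sketch: the summarize_latencies helper is hypothetical, and the 50 ms spike threshold is an assumption that simply mirrors the >50 ms spikes described above.

```python
import statistics


def summarize_latencies(latencies_ms, spike_threshold_ms=50.0):
    """Summarize per-request latencies given in milliseconds."""
    ordered = sorted(latencies_ms)
    # quantiles(n=100) returns the 99 cut points p1..p99
    cuts = statistics.quantiles(ordered, n=100, method="inclusive")
    return {
        "count": len(ordered),
        "p50_ms": cuts[49],
        "p99_ms": cuts[98],
        "max_ms": ordered[-1],
        # Number of requests slower than the (assumed) spike threshold
        "spikes": sum(1 for t in ordered if t > spike_threshold_ms),
    }
```

For example, summarize_latencies([make_request(data) for _ in range(10000)]) would report the median, the 99th percentile, the worst case, and the number of spikes for 10,000 requests.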

I am running on an M1 MacBook Air, using TFDF v0.2.3 and this Docker image.

Any ideas as to what could be causing the spikes?

Thank you,
Shayan


Hello Shayan,

Thanks for the very detailed bug report. It is always great when we can replicate bugs :).

If I understand the setup of this benchmark correctly, it is:

  • Running a tf_serving binary in a Docker container.
  • Sending HTTP requests from a Python script outside the container to the tf_serving HTTP server.

The question now is to figure out where the bottleneck is.

You mentioned observing this slowdown with both TF-DF RF and GBT models as well as deep learning models. Can you confirm? If so, we can likely exclude TF-DF as the source of the bottleneck.

This document (Performance Guide | TFX | TensorFlow) discusses the performance of TF Serving. There might be some interesting ideas to try.

There might be other benchmarks to try. For example this one: GitHub - dwyatte/tensorflow-serving-benchmark: TensorFlow Serving benchmark.

As a test, can you also send multiple examples in each request? If the wall time does not grow linearly with the number of examples, the serving overhead is more expensive than running the model. The size of the spikes could be interesting here too.
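
A rough sketch of such a test is below. It assumes the same endpoint and example as the original request snippet; build_payload, time_request and sweep are hypothetical helper names, and the batch sizes and repeat count are arbitrary choices. It uses only the standard library to stay self-contained.

```python
import json
import time
import urllib.request

# Endpoint and example taken from the original request snippet.
URL = "http://localhost:8501/v1/models/fake_model_id_gbt:predict"
EXAMPLE = {
    "bill_length_mm": 20.0,
    "body_mass_g": 100.0,
    "bill_depth_mm": 100.0,
    "flipper_length_mm": 100.0,
    "island": "Mallorca",
    "sex": "male",
    "year": 2021,
}


def build_payload(num_examples):
    """Build a predict payload containing num_examples copies of EXAMPLE."""
    return {"instances": [dict(EXAMPLE) for _ in range(num_examples)]}


def time_request(payload, url=URL):
    """Send one request and return its wall time in milliseconds."""
    body = json.dumps(payload).encode("utf-8")
    request = urllib.request.Request(
        url, data=body, headers={"Content-Type": "application/json"}
    )
    start = time.perf_counter()
    with urllib.request.urlopen(request) as response:
        response.read()
    return (time.perf_counter() - start) * 1000


def sweep(batch_sizes=(1, 10, 100, 1000), repeats=50):
    """Print the median and worst wall time for each batch size."""
    for n in batch_sizes:
        payload = build_payload(n)
        times = sorted(time_request(payload) for _ in range(repeats))
        median = times[len(times) // 2]
        print(f"{n:>5} examples: median {median:.2f} ms, max {times[-1]:.2f} ms")
```

Calling sweep() against a running server prints one line per batch size. If the median barely changes between 1 and 100 examples, the per-request overhead dominates the model's own running time.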

In your plot, the mean prediction time is ~6 ms (with spikes up to ~70 ms). Those values are slow for a DF model. For reference, a GBT should easily run in less than 1 µs/example on this dataset. A Random Forest is generally slower, but it should probably not be worse than 10 µs/example. In other words, the model itself probably accounts for less than 0.2% of the ~6 ms wall time (10 µs out of 6,000 µs), and closer to 0.02% for the GBT.
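
The share of wall time spent in the model follows directly from these figures. Here is a quick sketch of the arithmetic, assuming the ~6 ms mean request time and the 1 µs and 10 µs per-example estimates quoted above (the helper name is made up for illustration):

```python
def model_share_percent(model_us_per_example, wall_ms_per_request):
    """Percentage of one single-example request spent inside the model."""
    wall_us = wall_ms_per_request * 1000.0
    return 100.0 * model_us_per_example / wall_us


for name, model_us in (("GBT", 1.0), ("Random Forest", 10.0)):
    share = model_share_percent(model_us, 6.0)
    print(f"{name}: {share:.3f}% of a 6 ms request")
# GBT: 0.017% of a 6 ms request
# Random Forest: 0.167% of a 6 ms request
```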

In addition, your benchmark always sends the same example. CPUs are relatively good at predicting repeated patterns, so the model should run even faster in this setting.
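
One way to avoid always sending the same example is to randomize the feature values for each request. This is a sketch only: random_example is a hypothetical helper, and the numeric ranges are rough, illustrative values for the Palmer penguins dataset rather than exact bounds.

```python
import random

# Categorical values that appear in the Palmer penguins dataset.
ISLANDS = ["Biscoe", "Dream", "Torgersen"]
SEXES = ["male", "female"]
YEARS = [2007, 2008, 2009]


def random_example(rng):
    """Return one penguin-like example with randomized feature values.

    The numeric ranges are approximate and only meant for benchmarking.
    """
    return {
        "bill_length_mm": rng.uniform(32.0, 60.0),
        "body_mass_g": rng.uniform(2700.0, 6300.0),
        "bill_depth_mm": rng.uniform(13.0, 21.5),
        "flipper_length_mm": rng.uniform(172.0, 231.0),
        "island": rng.choice(ISLANDS),
        "sex": rng.choice(SEXES),
        "year": rng.choice(YEARS),
    }


rng = random.Random(42)  # fixed seed so a benchmark run is repeatable
data = {"instances": [random_example(rng)]}
```

Building a fresh data payload like this before each request keeps the request format of the original snippet while varying the inputs.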

It is expected that the setup (HTTP request, TF Serving, TensorFlow) adds some latency. If you care about it, you probably want to run the inference of this model with the C++ API directly (yggdrasil-decision-forests/user_manual.md at main · google/yggdrasil-decision-forests · GitHub).

As an example, I ran a benchmark on those models:

3.1. gbt

This is not a valid model. The “gradient_boosted_trees_header.pb” file is missing.

3.2. rf

wget -q https://storage.googleapis.com/download.tensorflow.org/data/palmer_penguins/penguins.csv

sed -i 's/species/__LABEL/g' penguins.csv
sed -i 's/Adelie/1/g' penguins.csv
sed -i 's/Gentoo/2/g' penguins.csv
sed -i 's/Chinstrap/3/g' penguins.csv

CLI= ... # Path to a https://github.com/google/yggdrasil-decision-forests release.
MODEL=fake_model_id_rf/weights/assets/

${CLI}/show_model --model=${MODEL}
${CLI}/benchmark_inference --dataset=csv:penguins.csv --model=${MODEL}

Output:

[INFO benchmark_inference.cc:268] Running the slow generic engine
batch_size : 100  num_runs : 20
time/example(us)  time/batch(us)  method
----------------------------------------
           5.585           558.5  RandomForestGeneric [virtual interface]
          12.367          1236.7  Generic slow engine

So, the Random Forest model runs at 5.58µs/example.


Dear Mathieu,

Thank you for the in-depth response and sorry for the late reply. We were trying out a few ideas.

  1. Your understanding of the benchmark is correct.
  2. I found the spikes occurred both with deep learning models and tfdf, but the spikes occurred more often for tfdf (approx 4X per 10000 requests for tfdf vs 1X per 10000 requests for deep learning models).
  3. I do like the idea of using the C++ API directly, but we serve our models in Go, which would make things a bit tricky.

Ultimately, we found that one of the main contributors was running in a local Docker container. The spikes decreased in size and frequency in a GCP container. We did, however, find that for very large tfdf models (specifically gbt models), the spikes still occur in a GCP container.

Thank you again!