Decoding TFLite custom object detector output from a model trained with MediaPipe (MobileNetV2)

Hi all, I trained a custom object detector with MediaPipe. The export to TFLite worked fine, but when I run predictions I get two output tensors, one presumably for bounding boxes and one for scores. For a single image, the output details are obtained as follows:

output_details = interpreter.get_output_details()

which gives:

[{'name': 'StatefulPartitionedCall:0',
  'index': 425,
  'shape': array([    1, 12276,     4], dtype=int32),
  'shape_signature': array([    1, 12276,     4], dtype=int32),
  'dtype': numpy.float32,
  'quantization': (0.0, 0),
  'quantization_parameters': {'scales': array([], dtype=float32),
   'zero_points': array([], dtype=int32),
   'quantized_dimension': 0},
  'sparsity_parameters': {}},
 {'name': 'StatefulPartitionedCall:1',
  'index': 423,
  'shape': array([    1, 12276,     4], dtype=int32),
  'shape_signature': array([    1, 12276,     4], dtype=int32),
  'dtype': numpy.float32,
  'quantization': (0.0, 0),
  'quantization_parameters': {'scales': array([], dtype=float32),
   'zero_points': array([], dtype=int32),
   'quantized_dimension': 0},
  'sparsity_parameters': {}}]
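
For completeness, this is how I set up the interpreter beforehand (the model path is just a placeholder for my exported file):

import numpy as np
import tensorflow as tf

# Load the exported TFLite model and allocate its tensors.
interpreter = tf.lite.Interpreter(model_path="model.tflite")  # placeholder path
interpreter.allocate_tensors()

input_details = interpreter.get_input_details()
output_details = interpreter.get_output_details()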

code for prediction:

# Run inference on one preprocessed image and read both output tensors.
interpreter.set_tensor(input_details[0]['index'], input_data)
interpreter.invoke()
boxes  = interpreter.get_tensor(output_details[0]['index'])   # shape (1, 12276, 4)
scores = interpreter.get_tensor(output_details[1]['index'])   # shape (1, 12276, 4)

boxes:

[[[ 0.01087701 -0.27369365 -0.53198564 -0.8404835 ]
  [ 0.05485853  0.02915781 -1.390534   -1.670182  ]
  [-0.12034623  0.00819616 -0.9961058  -0.395994  ]
  ...
  [-0.3435838  -0.35941318 -1.0712042  -0.43489447]
  [-0.4016505  -0.03572614 -0.67902136 -0.7194235 ]
  [-0.47916242  0.01016152  0.13207799 -0.7979872 ]]]

scores:

[[[0.005811   0.00431303 0.00324296 0.01789892]
  [0.00658012 0.01305784 0.00548336 0.01855727]
  [0.01610166 0.00838473 0.01678689 0.01819396]
  ...
  [0.00505611 0.02350343 0.01970816 0.00919266]
  [0.00427777 0.01386124 0.00888682 0.01396356]
  [0.00742702 0.00696907 0.00702236 0.00696763]]]

My question is: how do you decode this output into a format that can be used for inference and visualization? I don't understand the structure of the box values (some are negative), nor how the scores relate to the classes. In my case I'm trying to detect objects belonging to one of 3 classes (+ 1 for background). I've seen examples where TFLite object detectors also return the class labels and the number of detections as two additional outputs, but here I only get two outputs in total. I trained the detector on a MobileNetV2 backbone following a MediaPipe example (as recommended by Google developers, since TFLite Model Maker seems to have issues and is being migrated to MediaPipe).
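
My current best guess is that these are raw SSD-style box regressions that have to be decoded against the model's anchor boxes and then filtered with non-maximum suppression, which would explain the negative values. Below is a minimal sketch of what I imagine the decoding looks like; the anchors array and the scale factors are assumptions on my part, since I haven't found them documented for this export:

import numpy as np

def decode_boxes(raw_boxes, anchors, scales=(10.0, 10.0, 5.0, 5.0)):
    # raw_boxes: (N, 4) as (dy, dx, dh, dw) offsets; anchors: (N, 4) as
    # (y_center, x_center, height, width) normalized to [0, 1].
    # The anchor layout and scale factors are my assumptions, not documented facts.
    y = raw_boxes[:, 0] / scales[0] * anchors[:, 2] + anchors[:, 0]
    x = raw_boxes[:, 1] / scales[1] * anchors[:, 3] + anchors[:, 1]
    h = np.exp(raw_boxes[:, 2] / scales[2]) * anchors[:, 2]
    w = np.exp(raw_boxes[:, 3] / scales[3]) * anchors[:, 3]
    # Convert center/size form to corner coordinates (ymin, xmin, ymax, xmax).
    return np.stack([y - h / 2, x - w / 2, y + h / 2, x + w / 2], axis=-1)

# Hypothetical usage: best class per anchor, then a score threshold (NMS would
# follow). The 'anchors' array (12276, 4) is exactly the part I don't know how
# to obtain for this model.
decoded = decode_boxes(boxes[0], anchors)
class_ids = np.argmax(scores[0], axis=-1)      # does column 0 mean background?
confidences = np.max(scores[0], axis=-1)
keep = confidences > 0.3                       # arbitrary score threshold

Is this the right idea, and if so, how do I obtain the anchors for this model?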

Has anyone used this model before and decoded the outputs to make predictions with the exported TFLite model? Is there a way to include more information when exporting to TFLite, so that the two additional outputs mentioned above are returned directly in decoded form?
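
Related to that second question: my understanding is that MediaPipe's Tasks API is supposed to consume the exported model and do this decoding internally (assuming the export embeds the required metadata), roughly like the sketch below, but I haven't confirmed that this works with my export:

import mediapipe as mp
from mediapipe.tasks import python
from mediapipe.tasks.python import vision

# Build a detector from the exported model (path and threshold are placeholders).
base_options = python.BaseOptions(model_asset_path="model.tflite")
options = vision.ObjectDetectorOptions(base_options=base_options,
                                       score_threshold=0.3)
detector = vision.ObjectDetector.create_from_options(options)

# Run detection on a test image; results should come back already decoded.
image = mp.Image.create_from_file("test.jpg")
result = detector.detect(image)
for detection in result.detections:
    print(detection.bounding_box, detection.categories)

If that is the intended workflow, does it mean decoding the raw tensors by hand isn't expected at all?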

Thanks,
any help would be much appreciated.
Best regards,
Carlos