I’ve been trying to speed up model initialization for pose-detection using BlazePose with the tfjs runtime and tfjs-backend-webgl. I have tried setting WEBGL_USE_SHAPES_UNIFORMS to true, but I’m finding that the segmentation mask returns incorrect data when using this setting. In particular, the returned mask is all zeros when using shapes uniforms in combination with WEBGL_PACK. If I set WEBGL_PACK_DEPTHWISECONV to false, or WEBGL_PACK to false, then values are written to the mask, but they are corrupted. When I disable WEBGL_USE_SHAPES_UNIFORMS, I get a correct segmentation mask. I have been testing using Chrome and Safari on a 2019 MacBook Pro. I’m using version 2.1.0 of pose-detection and 4.2.0 of TensorFlow.js.
Is this a known issue, and are there any workarounds?
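For reference, here is a minimal sketch of how the flag combinations described above might be applied. `FlagEnv`, `FLAG_CONFIGS`, and `applyFlags` are hypothetical names used only for illustration; the idea is that you would pass `tf.env()` as the `env` argument, assuming its `set(name, value)` method matches this shape.

```typescript
// Minimal shape of the environment object (assumption: only set() is needed here).
interface FlagEnv {
  set(name: string, value: boolean): void;
}

// The flag combinations from the report, keyed by the mask result that was observed.
const FLAG_CONFIGS: Record<string, Record<string, boolean>> = {
  allZeros: { WEBGL_USE_SHAPES_UNIFORMS: true, WEBGL_PACK: true },
  corrupted: { WEBGL_USE_SHAPES_UNIFORMS: true, WEBGL_PACK: false },
  correct: { WEBGL_USE_SHAPES_UNIFORMS: false, WEBGL_PACK: true },
};

// Applies each flag in insertion order and returns a log of what was set.
function applyFlags(env: FlagEnv, flags: Record<string, boolean>): string[] {
  const applied: string[] = [];
  for (const name of Object.keys(flags)) {
    env.set(name, flags[name]);
    applied.push(name + "=" + flags[name]);
  }
  return applied;
}
```

In a real page this would presumably be called as `applyFlags(tf.env(), FLAG_CONFIGS.allZeros)` before the detector is created, so that the flags are in place when the shaders are compiled.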
I’ve discovered some more about this issue. It seems to be related to the MatMul operator when the output has 3 dimensions.
In particular, making this (obviously unusable) modification to the MatMulPackedProgram class seems to generate the correct output: this.enableShapeUniforms = useShapeUniforms(this.outputShape.length) && this.outputShape.length != 3;
Not sure if this helps?
@Lin_Chen FYI / if you have any thoughts on this one given you are our team’s WebGL expert
Thanks @Ben_Cole for the report.
Hi @Ben_Cole, I haven’t seen such a problem. I tested our BlazePoseDetector model in the TensorFlow.js Model Benchmark with ‘use shapes uniform’ checked. Do you mind sharing your code to reproduce the error?
By the way, you could also try the parallel compile feature to accelerate your first run, as described in Model:body-segmentation browser freezes for ~7-9 seconds in initial run · Issue #7026 · tensorflow/tfjs · GitHub
Hi @Lin_Chen, let me see if I can pull together a simple example of the issue. (It might take me a little time to do this.) I’m curious if the Benchmark test also tests the mask output (I’m setting smoothSegmentation: true, enableSegmentation: true) and I’m wondering if image resolution has any impact? (I’m using 576x1024 images for input.)
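For what it’s worth, this is roughly how I’d check the mask output between flag settings: count the non-zero values in a mask, and count how many values differ from a reference mask produced with shapes uniforms disabled. This is only a sketch. `maskStats` and `countMismatches` are hypothetical helpers, and how the raw mask values are read back from the detector (for example as the `data` array of an `ImageData`) is left out.

```typescript
interface MaskStats {
  length: number;   // total number of values in the mask buffer
  nonZero: number;  // how many of them are not zero
}

// An all-zeros mask shows up as nonZero === 0.
function maskStats(mask: ArrayLike<number>): MaskStats {
  let nonZero = 0;
  for (let i = 0; i < mask.length; i++) {
    if (mask[i] !== 0) nonZero++;
  }
  return { length: mask.length, nonZero };
}

// Counts values that differ from the reference by more than `tolerance`.
function countMismatches(
  mask: ArrayLike<number>,
  reference: ArrayLike<number>,
  tolerance: number = 0,
): number {
  if (mask.length !== reference.length) {
    throw new Error("mask sizes differ");
  }
  let mismatches = 0;
  for (let i = 0; i < mask.length; i++) {
    if (Math.abs(mask[i] - reference[i]) > tolerance) mismatches++;
  }
  return mismatches;
}
```

A small tolerance is probably sensible when comparing across backends or flag settings, since minor floating-point differences are expected even when the result is otherwise correct.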
I did previously try some experiments setting ENGINE_COMPILE_ONLY, but it didn’t seem like all nodes got compiled on the warm up. It’s difficult to know if there are early outs on some paths if no skeleton is detected, for instance. If you have some example code of how to apply this setting for the pose detector in particular, that would be really helpful.
I think you are using the pose-detection API, which involves two models, BlazePoseDetector and BlazePoseLandmark, so I just tested the two models in the tool. The correctness tests pass, but they use zeros as model inputs, so these tests may not capture correctness issues.
If you didn’t set/change modelType, it would be the same model. By default, the pose-detection API is using the two models: ‘TensorFlow Hub’ and ‘TensorFlow Hub’.
It would be great to have your code to reproduce the problem and then I could help debug and check if we could apply the parallel compilation to it.
Thank you. I think it will be a few days before I’ll have time to build out a standalone test case. However, I have determined that it is one (or both) of two matmuls with the same shape that is causing the issue. The code gives the correct output if I use this code in MatMulPackedProgram:
this.enableShapeUniforms = useShapeUniforms(this.outputShape.length) && !(this.outputShape == 8 && this.outputShape == 65536);
I will also look more at parallel compilation.
I’ve played around a bit with parallel compilation, but it doesn’t seem to be especially helpful in this case. If I enable the ENGINE_COMPILE_ONLY flag and predict the models directly, this compilation process is certainly quick. However, the first call to detector.estimatePoses after doing this compilation is only slightly quicker. Notably, if I don’t set ENGINE_COMPILE_ONLY, but predict the models directly before calling detector.estimatePoses, then estimatePoses does run more quickly. However, in all these cases the total combined time is roughly the same as just calling estimatePoses the first time.
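The warm-up I tried looks roughly like the sketch below. It is written against small hypothetical interfaces (`CompileEnv`, `CompileBackend`) rather than tfjs directly; my assumption is that `tf.env()` and the WebGL backend returned by `tf.backend()` provide `set`, `checkCompileCompletionAsync`, and `getUniformLocations` in this form. `runModelsOnce` stands in for calling predict on the underlying models with a dummy input.

```typescript
// Assumed shape of tf.env() for this purpose.
interface CompileEnv {
  set(name: string, value: boolean): void;
}

// Assumed shape of the WebGL backend for this purpose.
interface CompileBackend {
  checkCompileCompletionAsync(): Promise<unknown>;
  getUniformLocations(): void;
}

// Runs the models once in compile-only mode, waits for the shaders to finish
// compiling, and returns the elapsed wall-clock time in milliseconds.
async function warmUpWithParallelCompile(
  env: CompileEnv,
  backend: CompileBackend,
  runModelsOnce: () => void,
): Promise<number> {
  const start = Date.now();
  env.set("ENGINE_COMPILE_ONLY", true);
  try {
    // Only queues shader compilation; outputs from this pass are not valid results.
    runModelsOnce();
  } finally {
    // Always restore normal execution, even if the warm-up pass throws.
    env.set("ENGINE_COMPILE_ONLY", false);
  }
  await backend.checkCompileCompletionAsync();
  backend.getUniformLocations();
  return Date.now() - start;
}
```

Timing this call and the first estimatePoses call separately is how I arrived at the comparison above. Any tensors returned by the compile-only pass would also need to be disposed, which this sketch leaves to `runModelsOnce`.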
Sorry for the late reply!
You are right: there was a performance bug in parallel compilation when you tried it. The bug is fixed now (see Parallel shader compilation is broken · Issue #7577 · tensorflow/tfjs · GitHub), so you should see performance improvements if you use the latest TFJS version.