CPU load settings tfjs-node

Hi! Can I put more load on the CPU? By default I see only ~25% utilization on a multicore machine with "@tensorflow/tfjs-node": "^3.12.0".

Maybe @Matthew_Soulanille knows the answer to this one?

Hi @gotostereo. Are you observing 25% of your threads in use at 100%, or 25% usage on all of your threads?

I initially thought this might be caused by a disk bottleneck in your code, such as when reading training data. However, I just tried our mnist-node example, which stores all of its training data in memory, and I saw only 50% CPU usage on each thread (on a 16-thread machine). This might be a bug in tfjs-node, or perhaps there’s an inefficiency in the fitLoop function (which trains the model) when it’s using the node backend.

I also tried the same example with tfjs-node-gpu and saw far less speedup than expected (only ~1.5×) and only about 25% GPU utilization. I saw two node processes: one at 20% CPU, which I think is feeding the GPU, and the main one at 120%. This suggests there’s a bottleneck in how we distribute work to the GPU (or to threads, in your case), but I’ll have to look into it more to be sure.

Looking at performance profiles, most of the time is spent in NodeJSKernelBackend.executeSingleOutput and NodeJSKernelBackend.getInputTensorIds. My guess is that these functions add too much overhead, and that’s what’s starving the other threads. I’ll see if anyone else on the team has thoughts on this.

Windows 10 x64, latest Node.js, Intel 3770K; all threads sit at ~25%. It would be great if this could be controlled.

Hi @gotostereo.

Unfortunately, I think this performance is a result of how tfjs-layers works. tfjs-layers runs each layer sequentially, and this happens on a single thread even though the ops within a layer may be distributed across threads. Until we have a way to distribute this work better, performance will be bottlenecked by the main thread. Some of the processing in Layers also happens in JavaScript, so you might try running the node profiler to see if a particular layer is causing the slowdown.

This is less of a problem for inference, where using a graph model can distribute the work more evenly, but tfjs does not yet support training graph models.