Custom conjugate gradient minimizer only uses two threads

Hi everyone, this is my first post here.

I implemented a conjugate gradient minimizer in tensorflow. The coefficient matrix is much wider than tall, so the goal is to solve the system without generating the full matrix (i.e. I use K and K.T separately, I don’t actually compute K*K.T). I am not using sparse matrices

It works, but seems to only take advantage of two threads. This is true both on MacOS and Linux.

This function is loaded from a module, it carries out one CG step . SIGMA is the unknown vector :

def CGSTEP(K, KT, SIGMA, alpha, pk, pkm1, gkm1, gkm2, rk, rkm1, RHS, resid) :
	alpha.assign(tf.norm(gkm1)**2 / tf.norm(tf.matmul(K,pk))**2)
	SIGMA.assign( tf.add(SIGMA, alpha*pk ) )
	rk.assign(rkm1 - alpha*tf.matmul(K,pk))
	resid.assign(RHS - tf.matmul(K,SIGMA))

I use the function like this :

r0 = tf.Variable(RHS - tf.matmul(K,SIGMA))
resid = tf.Variable(r0)
g0 = tf.matmul(KT,r0)
gkm1 = tf.Variable(g0)
gkm2 = tf.Variable(g0)
rk = tf.Variable(r0)
rkm1 = tf.Variable(r0)
pk = tf.Variable(g0)
pkm1 = tf.Variable(g0)
alpha = tf.Variable(tf.norm(gkm1)**2 / tf.norm(tf.matmul(K,pk))**2)

KMAX = 100
k = 1 

while k < KMAX :
	if k > 1 :
		pk.assign(gkm1 - (- tf.norm(gkm1)**2 / tf.norm(gkm2)**2)*pkm1)
	CGSTEP(K, KT, SIGMA, alpha, pk, pkm1, gkm1, gkm2, rk, rkm1, RHS, resid)	
	k += 1 

Up front I set


where the number of threads is either 10 (on my iMac) or 32 (on Linux).

Maybe I have a bottleneck, but the CPU usage is exactly 200 % on both platforms, feels like I am doing something else wrong.

I am following this development :

Thanks in advance,