Hi everyone, this is my first post here.

I implemented a conjugate gradient minimizer in TensorFlow. The coefficient matrix K is much wider than tall, so the goal is to solve the system without ever forming the full matrix: I apply K and K.T separately rather than computing the product K*K.T explicitly. I am not using sparse matrices.
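
For context, the matrix-free trick is just associativity: `K.T @ (K @ x)` equals `(K.T @ K) @ x`, but the first form never allocates the large square matrix. A minimal NumPy sketch with generic names (not the actual problem data):

```
import numpy as np

m, n = 50, 2000                      # K is much wider than tall
rng = np.random.default_rng(0)
K = rng.standard_normal((m, n))
x = rng.standard_normal((n, 1))

# Two skinny matvecs: only the m-by-n storage for K is ever needed.
y = K.T @ (K @ x)

# Equivalent but wasteful: (K.T @ K) @ x materializes an n-by-n matrix.
```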

It works, but it only seems to use two threads. This is true on both macOS and Linux.

This function is loaded from a module and carries out one CG step. SIGMA is the unknown vector:

```
@tf.function
def CGSTEP(K, KT, SIGMA, alpha, pk, pkm1, gkm1, gkm2, rk, rkm1, RHS, resid):
    Kpk = tf.matmul(K, pk)                            # reuse K @ p_k
    alpha.assign(tf.norm(gkm1)**2 / tf.norm(Kpk)**2)  # step length
    SIGMA.assign(SIGMA + alpha * pk)                  # update the iterate
    rk.assign(rkm1 - alpha * Kpk)                     # update the residual
    gkm2.assign(gkm1)                                 # shift gradient history
    gkm1.assign(tf.matmul(KT, rk))                    # new gradient K^T r_k
    pkm1.assign(pk)
    rkm1.assign(rk)
    resid.assign(RHS - tf.matmul(K, SIGMA))           # true residual, for monitoring
    return
```

I use the function like this:

```
r0 = tf.Variable(RHS - tf.matmul(K, SIGMA))
resid = tf.Variable(r0)
g0 = tf.matmul(KT, r0)
gkm1 = tf.Variable(g0)
gkm2 = tf.Variable(g0)
rk = tf.Variable(r0)
rkm1 = tf.Variable(r0)
pk = tf.Variable(g0)
pkm1 = tf.Variable(g0)
alpha = tf.Variable(tf.norm(gkm1)**2 / tf.norm(tf.matmul(K, pk))**2)

KMAX = 100
k = 1
while k < KMAX:
    if k > 1:
        # Fletcher-Reeves update: p_k = g_{k-1} + (||g_{k-1}||^2 / ||g_{k-2}||^2) p_{k-1}
        pk.assign(gkm1 + (tf.norm(gkm1)**2 / tf.norm(gkm2)**2) * pkm1)
    CGSTEP(K, KT, SIGMA, alpha, pk, pkm1, gkm1, gkm2, rk, rkm1, RHS, resid)
    k += 1
```
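
As a side note, the loop above runs a fixed KMAX with no stopping test. The same recursion with a residual-based tolerance looks like this in plain NumPy; the function name and tolerance are my own choices, so this is a sketch of the CGNR idea rather than the exact TensorFlow code above:

```
import numpy as np

def cg_normal_equations(K, RHS, tol=1e-10, kmax=100):
    """CG on the normal equations K.T @ K @ sigma = K.T @ RHS, matrix-free:
    K appears only through products with K and K.T."""
    sigma = np.zeros((K.shape[1], 1))
    r = RHS - K @ sigma                  # data-space residual
    g = K.T @ r                          # gradient of the normal equations
    p = g.copy()                         # first search direction
    gg = np.vdot(g, g)                   # ||g||^2, reused across iterations
    for _ in range(kmax):
        Kp = K @ p
        alpha = gg / np.vdot(Kp, Kp)     # step length
        sigma = sigma + alpha * p
        r = r - alpha * Kp
        g = K.T @ r
        gg_new = np.vdot(g, g)
        if np.sqrt(gg_new) < tol:        # converged: stop early
            break
        p = g + (gg_new / gg) * p        # Fletcher-Reeves direction update
        gg = gg_new
    return sigma
```

With K wide and the iterate started at zero, this converges toward the minimum-norm solution of K @ sigma = RHS.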

Up front I set

```
tf.config.threading.set_inter_op_parallelism_threads(int(os.environ["NUMEXPR_NUM_THREADS"]))
tf.config.threading.set_intra_op_parallelism_threads(int(os.environ["NUMEXPR_NUM_THREADS"]))
```

where the number of threads is either 10 (on my iMac) or 32 (on Linux).
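
One thing worth double-checking on the threading side: in TF 2.x the `tf.config.threading` setters only take effect if they run before TensorFlow executes any op or creates any variable; called later they raise or are ignored, depending on version. A way to sidestep the ordering issue is to set the thread counts in the environment before Python even imports TensorFlow. I believe recent TensorFlow reads these variables at startup, but verify against your version; the script name below is just a placeholder:

```
export TF_NUM_INTEROP_THREADS=32
export TF_NUM_INTRAOP_THREADS=32
# OMP_NUM_THREADS can also matter on MKL/oneDNN builds of TensorFlow
export OMP_NUM_THREADS=32
python your_cg_script.py
```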

Maybe I have a bottleneck, but the CPU usage sits at exactly 200 % on both platforms, so it feels like I am doing something else wrong.

I am following this development: https://sites.stat.washington.edu/wxs/Stat538-w03/conjugate-gradients.pdf

Thanks in advance,

Randy