I am running the following code to test embedding_lookup.
```python
# comman…d:
# python3 -m pdb embtest.py --features=1000 --nnz=30 --batch=128
#
# error:
# *** tensorflow.python.framework.errors_impl.ResourceExhaustedError:
# Ran out of memory in memory space vmem. It should not be possible to run out of vmem - please file a bug against XLA.
#
import tensorflow as tf
import numpy as np
import sys
import os
import time
def measure(params, sp_ids, steps, thr):
res = tf.nn.embedding_lookup([params[0:thr],params[thr:]], sp_ids, None, name="TEST1")
print("Finished test")
return res
if __name__ == "__main__":
import sys
import argparse
parser = argparse.ArgumentParser(
description="Measure the performance of tensorflow embeddingbag using tf.nn.embedding" )
parser.add_argument("--features", type=int, default=10)
parser.add_argument("--em", type=int, default=2)
parser.add_argument("--nnz", type=int, default=2)
parser.add_argument("--batch", type=int, default=4)
parser.add_argument("--steps", type=int, default=1)
parser.add_argument("--warmups", type=int, default=0)
args = parser.parse_args()
features = args.features
em = args.em
nnz = args.nnz
batch = args.batch
steps = args.steps
warmups = args.warmups
sp_ids = np.random.randint(0, features, (batch * nnz,))
res = tf.zeros([batch, em])
resolver = tf.distribute.cluster_resolver.TPUClusterResolver(tpu="grpc://"+os.environ["TPU_IP"])
tf.config.experimental_connect_to_cluster(resolver)
tf.tpu.experimental.initialize_tpu_system(resolver)
print(" ")
tpus = tf.config.list_logical_devices('TPU')
print("There are {} tpu logical devices".format(len(tpus)))
print(tpus[0])
with tf.device('TPU:0'):
params = tf.random.uniform([features, em])
res = measure(params, sp_ids, tf.constant(steps), features//2)
print(res)
```
But got the following error:
```bash
hongzhang@shan-tf1:~$ python embtest.py --features=1000 --nnz=30 --batch=128
Eager execution : True
2020-10-05 08:23:42.244623: I tensorflow/core/platform/cpu_feature_guard.cc:142] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN)to use the following CPU instructions in performance-critical operations: AVX2 FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2020-10-05 08:23:42.250601: I tensorflow/core/platform/profile_utils/cpu_utils.cc:104] CPU Frequency: 2300000000 Hz
2020-10-05 08:23:42.251595: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x4c1dde0 initialized for platform Host (this does not guarantee that XLA will be used). Devices:
2020-10-05 08:23:42.251631: I tensorflow/compiler/xla/service/service.cc:176] StreamExecutor device (0): Host, Default Version
2020-10-05 08:23:42.263068: I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:301] Initialize GrpcChannelCache for job worker -> {0 -> 10.178.175.58:8470}
2020-10-05 08:23:42.263113: I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:301] Initialize GrpcChannelCache for job localhost -> {0 -> localhost:38651}
2020-10-05 08:23:42.279709: I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:301] Initialize GrpcChannelCache for job worker -> {0 -> 10.178.175.58:8470}
2020-10-05 08:23:42.279743: I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:301] Initialize GrpcChannelCache for job localhost -> {0 -> localhost:38651}
2020-10-05 08:23:42.280176: I tensorflow/core/distributed_runtime/rpc/grpc_server_lib.cc:405] Started server with target: grpc://localhost:38651
There are 8 tpu logical devices
LogicalDevice(name='/job:worker/replica:0/task:0/device:TPU:7', device_type='TPU')
Traceback (most recent call last):
File "embtest.py", line 84, in <module>
t1 = measure(params, sp_ids, tf.constant(steps), features//2)
File "embtest.py", line 15, in measure
res = tf.nn.embedding_lookup([params[0:thr],params[thr:]], sp_ids, None, name="TEST1")
File "/usr/local/lib/python3.7/dist-packages/tensorflow/python/util/dispatch.py", line 201, in wrapper
return target(*args, **kwargs)
File "/usr/local/lib/python3.7/dist-packages/tensorflow/python/ops/embedding_ops.py", line 394, in embedding_lookup_v2
return embedding_lookup(params, ids, "div", name, max_norm=max_norm)
File "/usr/local/lib/python3.7/dist-packages/tensorflow/python/util/dispatch.py", line 201, in wrapper
return target(*args, **kwargs)
File "/usr/local/lib/python3.7/dist-packages/tensorflow/python/ops/embedding_ops.py", line 328, in embedding_lookup
transform_fn=None)
File "/usr/local/lib/python3.7/dist-packages/tensorflow/python/ops/embedding_ops.py", line 246, in _embedding_lookup_and_transform
ret.set_shape(ids.get_shape().concatenate(element_shape_s))
File "/usr/local/lib/python3.7/dist-packages/tensorflow/python/framework/ops.py", line 1206, in set_shape
if not self.shape.is_compatible_with(shape):
File "/usr/local/lib/python3.7/dist-packages/tensorflow/python/framework/ops.py", line 1167, in shape
self._tensor_shape = tensor_shape.TensorShape(self._shape_tuple())
tensorflow.python.framework.errors_impl.ResourceExhaustedError: Ran out of memory in memory space vmem. It should not be possible to run out of vmem - please file a bug against XLA.
Largest program allocations in vmem:
XLA label: register allocator spill slots
Allocation type: scoped
XLA label: %concatenate.724 = f32[3840,2]{0,1:T(2,128)} concatenate(f32[1,2]{0,1:T(2,128)}, f32[3,2]{0,1:T(2,128)}, f32[5,2]{0,1:T(2,128)}, f32[1,2]{0,1:T(2,128)}, ...(+2400)), dimensions={0}
Allocation type: scoped
XLA label: %concatenate.724 = f32[3840,2]{0,1:T(2,128)} concatenate(f32[1,2]{0,1:T(2,128)}, f32[3,2]{0,1:T(2,128)}, f32[5,2]{0,1:T(2,128)}, f32[1,2]{0,1:T(2,128)}, ...(+2400)), dimensions={0}
Allocation type: scoped
XLA label: %concatenate.724 = f32[3840,2]{0,1:T(2,128)} concatenate(f32[1,2]{0,1:T(2,128)}, f32[3,2]{0,1:T(2,128)}, f32[5,2]{0,1:T(2,128)}, f32[1,2]{0,1:T(2,128)}, ...(+2400)), dimensions={0}
Allocation type: scoped
XLA label: %concatenate.724 = f32[3840,2]{0,1:T(2,128)} concatenate(f32[1,2]{0,1:T(2,128)}, f32[3,2]{0,1:T(2,128)}, f32[5,2]{0,1:T(2,128)}, f32[1,2]{0,1:T(2,128)}, ...(+2400)), dimensions={0}
Allocation type: scoped
2020-10-05 08:23:59.826142: W tensorflow/core/distributed_runtime/eager/remote_tensor_handle_data.cc:76] Unable to destroy remote tensor handles. If you are running a tf.function, it usually indicates some op in the graph gets an error: Ran out of memory in memory space vmem. It should not be possible to run out of vmem - please file a bug against XLA.
Largest program allocations in vmem:
XLA label: register allocator spill slots
Allocation type: scoped
XLA label: %concatenate.724 = f32[3840,2]{0,1:T(2,128)} concatenate(f32[1,2]{0,1:T(2,128)}, f32[3,2]{0,1:T(2,128)}, f32[5,2]{0,1:T(2,128)}, f32[1,2]{0,1:T(2,128)}, ...(+2400)), dimensions={0}
Allocation type: scoped
XLA label: %concatenate.724 = f32[3840,2]{0,1:T(2,128)} concatenate(f32[1,2]{0,1:T(2,128)}, f32[3,2]{0,1:T(2,128)}, f32[5,2]{0,1:T(2,128)}, f32[1,2]{0,1:T(2,128)}, ...(+2400)), dimensions={0}
Allocation type: scoped
XLA label: %concatenate.724 = f32[3840,2]{0,1:T(2,128)} concatenate(f32[1,2]{0,1:T(2,128)}, f32[3,2]{0,1:T(2,128)}, f32[5,2]{0,1:T(2,128)}, f32[1,2]{0,1:T(2,128)}, ...(+2400)), dimensions={0}
Allocation type: scoped
XLA label: %concatenate.724 = f32[3840,2]{0,1:T(2,128)} concatenate(f32[1,2]{0,1:T(2,128)}, f32[3,2]{0,1:T(2,128)}, f32[5,2]{0,1:T(2,128)}, f32[1,2]{0,1:T(2,128)}, ...(+2400)), dimensions={0}
Allocation type: scoped
```
**System information**
- OS Platform and Distribution (e.g., Linux Ubuntu 16.04):
os: Linux
os kernel version: #1 SMP Debian 4.19.146-1 (2020-09-17)
os release version: 4.19.0-11-cloud-amd64
os platform: Linux-4.19.0-11-cloud-amd64-x86_64-with-debian-10.6
linux distribution: ('debian', '10.6', '')
linux os distribution: ('debian', '10.6', '')
mac version: ('', ('', '', ''), '')
uname: uname_result(system='Linux', node='shan-tf1', release='4.19.0-11-cloud-amd64', version='#1 SMP Debian 4.19.146-1 (2020-09-17)', machine='x86_64', processor='')
architecture: ('64bit', 'ELF')
machine: x86_64
- TensorFlow installed from (source or binary):
- TensorFlow version (use command below):
tf.version.VERSION = 2.3.0-dev20200620
tf.version.GIT_VERSION = v1.12.1-34769-gfd2d4cdb70
tf.version.COMPILER_VERSION = 7.3.1 20180303
- Python version:
- Bazel version (if compiling from source):
- GCC/Compiler version (if compiling from source):
- CUDA/cuDNN version:
- GPU model and memory: