Can you bit-pack and then unpack binary inputs?

Hi,

My machine has 128GB RAM. I have a (large) CSV file with binary inputs (0/1) and floating point labels that I want to load completely into RAM, because loading and parsing the CSV file in batches is slow. Even with the inputs encoded as bytes, storing them would take more than 128GB of RAM. Is there a way to keep these binary inputs bit-packed in RAM (which would reduce the size by a factor of 8) and unpack them for training? In other words, I do not want to train on the packed integers themselves.

Thanks,
GW

ChatGPT 3.5 provides the following answer:

Certainly! Bit-packing is a technique to efficiently store binary data by packing multiple binary values into a single byte. This can be useful when dealing with large datasets of binary inputs, as it can reduce memory usage. In this example, I’ll show you how to bit-pack binary inputs and then unpack them in batches using TensorFlow for training.

Let’s assume you have a dataset of binary inputs represented as arrays of 0s and 1s. Here’s how you can perform bit-packing and unpacking using TensorFlow:

import numpy as np
import tensorflow as tf

# Generate some random binary data

num_samples = 1000
input_size = 32
binary_data = np.random.randint(0, 2, size=(num_samples, input_size), dtype=np.uint8)

# Function to bit-pack binary data

def bit_pack(data):
    packed_data = []
    current_byte = 0
    bit_count = 0

    for bit in data:
        current_byte |= (bit << bit_count)
        bit_count += 1

        if bit_count == 8:
            packed_data.append(current_byte)
            current_byte = 0
            bit_count = 0

    if bit_count > 0:
        packed_data.append(current_byte)

    return np.array(packed_data, dtype=np.uint8)

# Function to unpack bit-packed data

def bit_unpack(packed_data, original_size):
    unpacked_data = []

    for byte in packed_data:
        for _ in range(8):
            unpacked_data.append(byte & 1)
            byte >>= 1

    return np.array(unpacked_data[:original_size], dtype=np.uint8)

# Bit-pack the binary data

packed_binary_data = np.array([bit_pack(sample) for sample in binary_data])

# Create a TensorFlow dataset

batch_size = 32
dataset = tf.data.Dataset.from_tensor_slices((packed_binary_data, binary_data))
dataset = dataset.batch(batch_size)

# Example usage in a training loop

for packed_batch, original_batch in dataset:
    unpacked_batch = np.array([bit_unpack(packed_sample, input_size) for packed_sample in packed_batch.numpy()])

    # Now you can use the unpacked_batch for training
    # ...
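
For comparison, NumPy already ships vectorized versions of these two helpers, so the Python loops are not strictly needed. The following is only a sketch: it assumes the same random 1000 x 32 test data as above and uses `bitorder="little"` so that the first bit of each group lands in the least significant position, like the loop-based bit_pack does.

```python
import numpy as np

# Same shape as the example above: 1000 samples of 32 binary inputs.
rng = np.random.default_rng(0)
binary_data = rng.integers(0, 2, size=(1000, 32), dtype=np.uint8)

# Pack every group of 8 bits along each row into one byte (LSB first).
packed = np.packbits(binary_data, axis=1, bitorder="little")

# Unpack again; count trims any padding bits added during packing.
restored = np.unpackbits(packed, axis=1, count=binary_data.shape[1], bitorder="little")

# The packed array needs one eighth of the memory of the byte-per-bit array.
print(binary_data.nbytes, packed.nbytes)
```

The round trip is lossless, so the packed array can be kept in RAM and the unpacked view rebuilt per batch.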

So I am trying the ChatGPT suggestion as follows:

def bit_pack(*data):
    packed_data = []
    current_byte = 0
    bit_count = 0

    for bit in data:
        current_byte |= (bit << bit_count)
        bit_count += 1

        if bit_count == 8:
            packed_data.append(current_byte)
            current_byte = 0
            bit_count = 0

    if bit_count > 0:
        packed_data.append(current_byte)

    return np.array(packed_data, dtype=np.uint8)

dataset = tf.data.experimental.CsvDataset(
    csv, column_types, header=True
)

dataset = dataset.map(bit_pack)

But then I get the error:

File “”, line 8, in bit_pack *
current_byte |= (bit << bit_count)

TypeError: unsupported operand type(s) for <<: 'Tensor' and 'int'

How do I convert the Tensor to an int? Calling bit.numpy() gives “AttributeError: ‘Tensor’ object has no attribute ‘numpy’”.

Thanks,
GW
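
A note on the AttributeError: functions passed to `Dataset.map` are traced into a graph, so the tensors they receive are symbolic and have no `.numpy()`. One possible workaround, sketched below, is to wrap the NumPy code in `tf.py_function`, which runs it eagerly. The helper names `_pack_eager` and `bit_pack_py` and the tiny in-memory dataset are made up for illustration; the sketch also assumes the last column is the float label, as in the CSV described above.

```python
import numpy as np
import tensorflow as tf

def _pack_eager(bits):
    # Runs eagerly inside tf.py_function, so bits.numpy() is available here.
    return np.packbits(bits.numpy().astype(np.uint8), bitorder="little")

def bit_pack_py(*columns):
    # Assumption: the last column is the label, all others are 0/1 inputs.
    bits = tf.stack(columns[:-1])
    packed = tf.py_function(_pack_eager, inp=[bits], Tout=tf.uint8)
    # py_function loses static shape information, so restore it by hand.
    packed.set_shape([(len(columns) - 1 + 7) // 8])
    return packed, columns[-1]

# Tiny in-memory stand-in for the CsvDataset rows (hypothetical data).
rows = np.array([[1, 0, 1, 1, 0, 0, 0, 0, 1, 1],
                 [0, 0, 0, 0, 0, 0, 0, 1, 0, 0]], dtype=np.int32)
labels = np.array([0.5, 0.25], dtype=np.float32)
columns = tuple(rows[:, i] for i in range(rows.shape[1])) + (labels,)

demo = tf.data.Dataset.from_tensor_slices(columns).map(bit_pack_py)
packed_rows = [p.numpy().tolist() for p, _ in demo]
```

Because `tf.py_function` calls back into Python for every element, it is likely to be slower than a version built only from TensorFlow ops.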

I am making some progress. If I change the bit_pack function to:

def bit_pack(data):
    packed_data = []
    current_byte = 0
    bit_count = 0

    for bit in data[:-1]:
        current_byte |= (bit.numpy() << bit_count)
        bit_count += 1

        if bit_count == 8:
            packed_data.append(current_byte)
            current_byte = 0
            bit_count = 0

    if bit_count > 0:
        packed_data.append(current_byte)

    return np.array(packed_data, dtype=np.uint8)

and loop over the dataset as follows:

for element in dataset:
    print(bit_pack(element))

I get the desired output:

[ 0 0 48 216 47 16 0 0 0 0 0 0 52 74 164 0 0 0
0 0 0 0 0 0]
[ 0 0 64 246 255 31 0 0 0 0 0 128 255 255 7 0 0 0
0 0 0 0 0 0]

But how do I now apply the bit_pack function to the dataset using map?

Thanks,
GW

By gradually modifying the map example from the tf.data.Dataset documentation I have managed to narrow down the issue to:

The following map function does work:

def bit_pack_arg2(*data):
    sum = 0;
    for bit in data[:-1]:
        sum = sum + bit;
    return(sum)

dataset = tf.data.experimental.CsvDataset(
    csv, column_types, header=True
)
result = dataset.map(bit_pack_arg2)
list(result.as_numpy_iterator())

It returns:

[21, 40, 33, 15, 38, 20, 16, 24, 12]

But the bit_pack_arg function

def bit_pack_arg(*data):
    packed_data = []
    current_byte = 0
    bit_count = 0

    for bit in data[:-1]:
        current_byte |= (bit << bit_count)
        bit_count += 1

        if bit_count == 8:
            packed_data.append(current_byte)
            current_byte = 0
            bit_count = 0

    if bit_count > 0:
        packed_data.append(current_byte)

    return np.array(packed_data, dtype=np.uint8)

Gives the error:

TypeError: unsupported operand type(s) for <<: ‘Tensor’ and ‘int’

The error is raised on the bit << bit_count line. Why does sum = sum + bit work, while current_byte |= (bit << bit_count) does not?

Regards,
GW
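
A likely explanation, going by the error message: the Tensor class overloads operators such as + and |, but in the TensorFlow version used here it apparently does not define <<, so Python finds no implementation for 'Tensor' << 'int'. The minimal sketch below shows two equivalent ways to shift a 0/1 tensor without the operator. The variable names are only illustrative.

```python
import tensorflow as tf

bit = tf.constant(1, dtype=tf.int32)
bit_count = 3

# Option 1: the explicit TensorFlow op for a left shift.
shifted = tf.bitwise.left_shift(bit, tf.constant(bit_count, dtype=tf.int32))

# Option 2: multiply by a power of two computed in plain Python.
weighted = bit * (1 << bit_count)

print(int(shifted), int(weighted))
```

Both give the same value for non-negative integers, so either can replace bit << bit_count inside a map function.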

So after a lot of trial-and-error the following bit_pack_arg function starts to work:

def bit_pack_arg(*data):
    packed_data = []
    current_byte = 0
    bit_count = 0

    for bit in data[:-1]:
        #current_byte |= (bit << bit_count)
        current_byte |= tf.bitwise.left_shift(bit, bit_count)
        bit_count += 1

        if bit_count == 8:
            packed_data.append(current_byte)
            current_byte = 0
            bit_count = 0

    if bit_count > 0:
        packed_data.append(current_byte)

    #return np.array(packed_data, dtype=np.uint8)
    return(packed_data)

After replacing the ‘<<’ operator with tf.bitwise.left_shift I got the error:

NotImplementedError: Cannot convert a symbolic tf.Tensor (or_7/BitwiseOr:0) to a numpy array. This error may indicate that you’re trying to pass a Tensor to a NumPy call, which is not supported.

I then commented out the numpy.array call, and when bit_pack_arg is mapped over the dataset I now get the output:

[(0, 0, 48, 216, 47, 16, 0, 0, 0, 0, 0, 0, 52, 74, 164, 0, 0, 0, 0, 0, 0, 0, 0, 0),

Close, but it is not yet a nested array of numpy.arrays. My guess is that the dataset passes symbolic tensors to the map function, and that the ‘+’ operator works on symbolic tensors while ‘<<’ and numpy.array do not. So, are these bugs, or are there rules for what you can use within tf.data.Dataset map functions? The np.array call converted the packed bytes to np.uint8. How can I convert packed_data to uint8 without the np.array call?

Thanks,
GW
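
On the uint8 question: one option is to stay inside TensorFlow and use tf.stack plus tf.cast instead of np.array. The sketch below also replaces the per-bit loop with a weighted sum over groups of 8 bits. The helper name `bit_pack_tf` and the small in-memory dataset are hypothetical; it assumes int32 feature columns followed by one float label column, as in the CsvDataset above.

```python
import numpy as np
import tensorflow as tf

def bit_pack_tf(*columns):
    # Assumption: the last column is the float label, the rest are 0/1 ints.
    label = columns[-1]
    bits = tf.stack(columns[:-1])                    # shape [n_bits]
    pad = (-(len(columns) - 1)) % 8                  # zero-pad to a multiple of 8
    bits = tf.reshape(tf.pad(bits, [[0, pad]]), [-1, 8])
    weights = tf.constant([1, 2, 4, 8, 16, 32, 64, 128], dtype=bits.dtype)
    packed = tf.reduce_sum(bits * weights, axis=-1)  # LSB first, like the loop
    return tf.cast(packed, tf.uint8), label

# Tiny in-memory stand-in for the CsvDataset rows (hypothetical data).
rows = np.array([[1, 0, 1, 1, 0, 0, 0, 0, 1, 1],
                 [0, 0, 0, 0, 0, 0, 0, 1, 0, 0]], dtype=np.int32)
labels = np.array([0.5, 0.25], dtype=np.float32)
columns = tuple(rows[:, i] for i in range(rows.shape[1])) + (labels,)

packed_ds = tf.data.Dataset.from_tensor_slices(columns).map(bit_pack_tf)
packed_rows = np.stack([p for p, _ in packed_ds.as_numpy_iterator()])
```

Each dataset element is then a single uint8 tensor per row plus its label, rather than a tuple of scalars.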

Oh, the load CSV functions only let you use int32.

Have you tried converting to Parquet and then using TFIO (tfio.experimental.IODataset in TensorFlow I/O)?

Thank you, I will give that a try. Meanwhile ChatGPT 3.5 has answered my ‘<<’ question:

I apologize for any confusion earlier. You are correct; the << operator does not work on symbolic tensors within TensorFlow datasets. The << operator is intended for use with regular Python integers and cannot be directly applied to symbolic tensors. If you want to left-shift elements within a TensorFlow dataset, you would need to use TensorFlow operations that work on tensors.

Regards,
GW

Eureka after mixing Google, ChatGPT, Trial And Error like an alchemist.
The following code reads a CSV file with the header ‘PACK0, PACK1, …, PACK23, RESULT’. Each row holds 192 binary (0/1) inputs that are encoded as 24 8-bit ints (so 1,1,1,1,1,1,1,1 is encoded as 255), plus one float output. The code unpacks the inputs in batches before training:

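import pandas as pd
import tensorflow as tf
from sklearn.model_selection import train_test_split  # assumed source of train_test_split
from tensorflow.keras import layers

def unpackbits_tf(features, labels):
    # One mask per bit position, most significant bit first
    mask = tf.constant([128, 64, 32, 16, 8, 4, 2, 1], dtype=features.dtype)
    # Add a trailing axis so every packed byte is compared against all 8 masks
    expanded_features = tf.expand_dims(features, -1)
    unpacked = tf.cast(tf.bitwise.bitwise_and(expanded_features, mask) > 0, tf.int32)
    # Flatten [batch, 24, 8] back to [batch, 192]
    return tf.reshape(unpacked, [-1, features.shape[1] * 8]), labels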

features = pd.read_csv('demo.csv', dtype = 'uint8', converters = {'RESULT': float})
labels = features.pop('RESULT')

x_train, x_test, y_train, y_test = train_test_split(features.to_numpy(), labels.to_numpy(), test_size=0.2)

BATCH_SIZE=128
with tf.device("CPU"):
    train = tf.data.Dataset.from_tensor_slices((x_train, y_train)).batch(4 * BATCH_SIZE)
    validate = tf.data.Dataset.from_tensor_slices((x_test, y_test)).batch(BATCH_SIZE)

train = train.map(unpackbits_tf)
validate = validate.map(unpackbits_tf)

model = tf.keras.Sequential([layers.Dense(192,activation="relu"),
                               layers.Dense(16,activation="relu"),
                               layers.Dense(16,activation="relu"),
                               layers.Dense(1,activation="sigmoid")])

model.compile(optimizer=tf.keras.optimizers.Adam(),loss = tf.keras.losses.MeanSquaredError())

callback = tf.keras.callbacks.EarlyStopping(monitor='val_loss', mode='auto', patience=10)

history = model.fit(train,
                      epochs=1000,
                      callbacks=[callback],
                      validation_data=validate)

Regards,
GW
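
One detail that may be worth checking: the mask [128, 64, …, 1] in unpackbits_tf treats the first bit of each group as the most significant one, whereas the loop-based bit_pack earlier in the thread put the first bit in the least significant position. The MSB-first convention matches the default bit order of np.packbits, so the packed columns could be produced as in the sketch below. The random `raw_bits` array is made-up test data, and the NumPy unpacking is only a reference that mirrors the mask logic.

```python
import numpy as np

# Hypothetical raw data: 4 rows of 192 binary inputs.
rng = np.random.default_rng(0)
raw_bits = rng.integers(0, 2, size=(4, 192), dtype=np.uint8)

# Default bit order is MSB first, so 1,1,1,1,1,1,1,1 becomes 255
# and 1,0,0,0,0,0,0,0 becomes 128.
packed = np.packbits(raw_bits, axis=1)               # shape (4, 24)

# NumPy reference for the mask-based unpacking used in unpackbits_tf.
mask = np.array([128, 64, 32, 16, 8, 4, 2, 1], dtype=np.uint8)
unpacked = ((packed[:, :, None] & mask) > 0).astype(np.uint8)
unpacked = unpacked.reshape(len(packed), -1)          # shape (4, 192)
```

If the CSV was instead written with an LSB-first packer, reversing the mask to [1, 2, …, 128] would restore the original bit order.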