Image processing in TFX preprocessing_fn

tim_broadhurst · May 10, 2022, 5:05am

Hi,

After doing the Coursera MLOps course, I’m experimenting with TFX using the MNIST dataset.

The dataset is downloaded as NumPy arrays, and I have encoded it as TFRecord files. To do this, each array is serialized and encoded as bytes (see below). It all works fine, and I get a lovely schema and some statistics.

But then I get to the preprocessing_fn and cannot for the life of me figure out how to process the images. I need to use tft or tf.io functions, but I can’t find anything that turns the bytes back into arrays (so that I can, for example, normalise the values by dividing by 255).

I have found an example of converting a single record back in the tf.train.Example section, but this doesn’t seem to work on the input the preprocessing_fn gets, despite me trying to map it.

Any clues!?

Setting up the record files

def _bytes_feature(value):
    """Returns a bytes_list from a string / byte."""
    if isinstance(value, type(tf.constant(0))):  # if value is a tensor
        value = value.numpy() # get value of tensor
    return tf.train.Feature(bytes_list=tf.train.BytesList(value=[value]))

def _int64_feature(value):
    """Returns an int64_list from a bool / enum / int / uint."""
    return tf.train.Feature(int64_list=tf.train.Int64List(value=[value]))



def serialize_array(array):
    array = tf.io.serialize_tensor(array)
    return array

def image_label_to_tf_train(image, label):

    image_shape = np.shape(image)
    #define the dictionary -- the structure -- of our single example
    data = {
        'height': _int64_feature(image_shape[0]),
        'width': _int64_feature(image_shape[1]),
        'raw_image' : _bytes_feature(serialize_array(image)),
        'label' : _int64_feature(label)
    }
    #create an Example, wrapping the single features
    return tf.train.Example(features=tf.train.Features(feature=data))


def write_images_to_tfr_short(images, labels, filename:str="images", folder = ""):
    if folder:
        os.makedirs(folder, exist_ok=True)  # portable replacement for the notebook-only !mkdir
    filename = os.path.join(folder, filename + ".tfrecords")
    writer = tf.io.TFRecordWriter(filename) #create a writer that'll store our data to disk
    count = 0

    for index in range(len(images)):

        #get the data we want to write
        current_image = images[index]
        current_label = labels[index]

        out = image_label_to_tf_train(image=current_image, label=current_label)
        writer.write(out.SerializeToString())
        count += 1

    writer.close()
    print(f"Wrote {count} elements to TFRecord")
    return count

write_images_to_tfr_short(train_x, train_y, filename= "training_image_record", folder = train_folder)
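One way to confirm that records written like this can be recovered is to read one back outside of TFX. The following is only a minimal sketch under stated assumptions: the random image, the label value, the temporary file path, and the `feature_description` dictionary are stand-ins made up for illustration, and only the `raw_image` and `label` keys are taken from the code above.

```python
import os
import tempfile

import numpy as np
import tensorflow as tf

# Stand-in data (assumption): in the post this would be one MNIST image and its label.
image = np.random.default_rng(0).integers(0, 256, size=(28, 28), dtype=np.uint8)
label = 7

# Build one Example the same way as above: the image is serialized to bytes.
example = tf.train.Example(features=tf.train.Features(feature={
    "raw_image": tf.train.Feature(bytes_list=tf.train.BytesList(
        value=[tf.io.serialize_tensor(image).numpy()])),
    "label": tf.train.Feature(int64_list=tf.train.Int64List(value=[label])),
}))

# Write it to a temporary TFRecord file (hypothetical path).
path = os.path.join(tempfile.mkdtemp(), "check.tfrecords")
with tf.io.TFRecordWriter(path) as writer:
    writer.write(example.SerializeToString())

# Read the record back and undo both layers of encoding.
feature_description = {
    "raw_image": tf.io.FixedLenFeature([], tf.string),
    "label": tf.io.FixedLenFeature([], tf.int64),
}
raw_record = next(iter(tf.data.TFRecordDataset(path)))
parsed = tf.io.parse_single_example(raw_record, feature_description)
decoded = tf.io.parse_tensor(parsed["raw_image"], out_type=tf.uint8)
print(decoded.shape, int(parsed["label"]))
```

Note that the feature key used when parsing has to match the key used when writing (`raw_image` here).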

My current preprocessing function, which doesn’t work (note the labels output fine!):

%%writefile {_mnist_transform_module}

import numpy as np
import tensorflow as tf
import os
from tfx import v1 as tfx
from tfx import proto
from tfx.proto import example_gen_pb2
from tfx.components import example_gen
from tfrecord_lite import decode_example
import mnist_constants
from tfrecord_lite import tf_record_iterator
_LABEL_KEY = mnist_constants.LABEL_KEY
_IMAGE_KEY = mnist_constants.IMAGE_KEY


# Define the transformations
def preprocessing_fn(inputs):
    """tf.transform's callback function for preprocessing inputs.
    Args:
        inputs: map from feature keys to raw not-yet-transformed features.
    Returns:
        Map from string feature key to transformed feature operations.
    """
    image_feature_description = {
    'image_raw': tf.io.FixedLenFeature([], tf.string),
}

    # Initialize outputs dictionary
    outputs = {}
    
    
    raw_image_dataset = inputs[_IMAGE_KEY]
    
    
    def _parse_image_function(example_proto):
  # Parse the input tf.train.Example proto using the dictionary above.
        return tf.io.parse_single_example(example_proto, image_feature_description)
    
    parsed_image_dataset = tf.map_fn(_parse_image_function, raw_image_dataset)
    outputs[_IMAGE_KEY] = parsed_image_dataset
  
    
    
    outputs[_LABEL_KEY] = tf.cast(inputs[_LABEL_KEY], tf.int64)



    return outputs

tim_broadhurst · May 11, 2022, 11:47pm

So I think I have made progress on this: I figured out that some functions don’t handle batches and so need to be mapped over each element, and I read up more about data types.

The following at least does not error out:

raw_image_dataset = inputs[_IMAGE_KEY]

raw_image_dataset = tf.map_fn(fn=lambda x: tf.io.decode_image(x[0]), elems=raw_image_dataset, dtype=tf.uint8)

So it seems to be doing what is required, but I cannot get it to output, because when I then do

outputs[_IMAGE_KEY] = raw_image_dataset

and call the transform function

transform = tfx.components.Transform(
    examples=example_gen.outputs['examples'],
    schema=schema_gen.outputs['schema'],
    module_file=os.path.abspath(_mnist_transform_module))
context.run(transform, enable_cache=False)

I get an error

ValueError: Feature raw_image (Tensor("Identity_1:0", dtype=uint8)) had invalid dtype <dtype: 'uint8'> for feature spec

I assume the outputs are not matching the schema, but I’m not sure why this is occurring, as I have seen similar done in other examples.
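The dtype complaint is consistent with tf.Transform only accepting outputs whose types map onto tf.Example features, which are tf.float32, tf.int64, and tf.string, so a uint8 tensor has to be cast before it is returned. A minimal sketch in plain TensorFlow, assuming a float32 output scaled to [0, 1] is wanted; `to_transform_dtype` is a hypothetical helper name, not part of TFX:

```python
import tensorflow as tf


def to_transform_dtype(image_batch):
    """Cast uint8 pixels to float32 and scale them to [0, 1].

    Hypothetical helper: float32 is chosen here because it is one of the
    dtypes a tf.Example feature can hold; int64 would also be accepted.
    """
    return tf.cast(image_batch, tf.float32) / 255.0


pixels = tf.constant([[0, 128, 255]], dtype=tf.uint8)
scaled = to_transform_dtype(pixels)
print(scaled.dtype)
```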

markdaoust · May 12, 2022, 11:20pm

tim_broadhurst:

def serialize_array(array):
    array = tf.io.serialize_tensor(array)
    return array

def image_label_to_tf_train(image, label):

    image_shape = np.shape(image)
    #define the dictionary -- the structure -- of our single example
    data = {
        'height': _int64_feature(image_shape[0]),
        'width': _int64_feature(image_shape[1]),
        'raw_image' : _bytes_feature(serialize_array(image)),

https://www.tensorflow.org/api_docs/python/tf/io/serialize_tensor

You converted the tensor to bytes using serialize_tensor. parse_tensor is the inverse of serialize_tensor:

https://www.tensorflow.org/api_docs/python/tf/io/parse_tensor
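As a small illustration of that round trip, here is a sketch on a single array. The random 28×28 uint8 image is a stand-in for an MNIST digit; the `out_type` passed to `parse_tensor` has to match the dtype of the tensor that was serialized.

```python
import numpy as np
import tensorflow as tf

# Stand-in for one MNIST image (assumption: uint8, 28x28).
image = np.random.default_rng(0).integers(0, 256, size=(28, 28), dtype=np.uint8)

serialized = tf.io.serialize_tensor(image)  # scalar tf.string tensor
restored = tf.io.parse_tensor(serialized, out_type=tf.uint8)  # back to a uint8 tensor

print(serialized.dtype, restored.shape)
```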

tim_broadhurst · May 12, 2022, 11:41pm

Thanks so much!

Now using this, but getting errors about feature ranks:

raw_image_dataset = tf.map_fn(fn = lambda x : tf.io.parse_tensor(x[0], tf.int64, name=None), elems = raw_image_dataset, fn_output_signature = tf.int64, infer_shape = True)

outputs["Feature1"] = raw_image_dataset

ValueError: Feature Feature1 (Tensor("Identity:0", dtype=int64)) had invalid shape <unknown> for FixedLenFeature: must have rank at least 1
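The "invalid shape <unknown>" part of that message fits how `tf.io.parse_tensor` behaves when traced into a graph: the shape of the decoded tensor is stored inside the serialized bytes, so it is not known statically unless it is declared. A minimal sketch of that effect, assuming 28×28 uint8 images; the `decode` function and the `static_shapes` dictionary are illustrative names only:

```python
import tensorflow as tf

static_shapes = {}  # filled in while the function is being traced


@tf.function(input_signature=[tf.TensorSpec([], tf.string)])
def decode(serialized):
    parsed = tf.io.parse_tensor(serialized, out_type=tf.uint8)
    static_shapes["before"] = parsed.shape  # unknown rank at trace time
    parsed = tf.ensure_shape(parsed, (28, 28))  # declare the expected shape
    static_shapes["after"] = parsed.shape  # now statically (28, 28)
    return parsed


result = decode(tf.io.serialize_tensor(tf.zeros((28, 28), dtype=tf.uint8)))
print(static_shapes["before"], static_shapes["after"])
```

Passing a fully specified `tf.TensorSpec` as `fn_output_signature` to `tf.map_fn` is another way to give the result a known static shape.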

tim_broadhurst · May 13, 2022, 2:10am

Further to this I can do

raw_image_dataset = tf.map_fn(fn=lambda x: tf.io.parse_tensor(x[0], tf.uint8, name=None), elems=raw_image_dataset, fn_output_signature=tf.TensorSpec((1,), dtype=tf.uint8, name=None), infer_shape=False)
raw_image_dataset = tf.cast(raw_image_dataset, tf.int64)

But I get this error:

Tried to set a tensor with incompatible shape at a list index. Item element shape: [28,28] list shape: [1]
	 [[{{node map/while/TensorArrayV2Write/TensorListSetItem}}]] [Op:__inference_wrapped_finalized_224609]".
          Batch instances: pyarrow.RecordBatch
raw_image: large_list<item: large_binary>
  child 0, item: large_binary
label: large_list<item: int64>
  child 0, item: int64,
          Fetching the values for the following Tensor keys: {'Feature1', 'label'}. [while running 'Transform[TransformIndex0]/Transform']

This shows the decoding itself is fine (28×28 is the original image size), but I’m coming a cropper on the specifics of the transform function somehow.

tim_broadhurst · May 16, 2022, 8:55am

To anyone struggling with something similar, this code works:

raw_image_dataset = tf.map_fn(fn=lambda x: tf.io.parse_tensor(x[0], tf.uint8, name=None), elems=raw_image_dataset, fn_output_signature=tf.TensorSpec((28, 28), dtype=tf.uint8, name=None), infer_shape=True)
raw_image_dataset = tf.cast(raw_image_dataset, tf.int64)
outputs[_IMAGE_KEY] = raw_image_dataset
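Since the original goal was to normalise the pixel values, the same decoding step could be extended along the following lines. This is a sketch in plain TensorFlow that runs outside a TFX pipeline, not a tested `preprocessing_fn`: `decode_image_batch` is a hypothetical helper name, the `[batch, 1]` input shape is an assumption based on the `x[0]` indexing above, and casting to float32 instead of int64 is a choice made here so that dividing by 255 gives values in [0, 1].

```python
import numpy as np
import tensorflow as tf


def decode_image_batch(serialized_batch, height=28, width=28):
    """Decode a [batch, 1] string tensor of serialized uint8 images.

    Returns a float32 tensor of shape [batch, height, width] scaled to [0, 1].
    """
    images = tf.map_fn(
        fn=lambda x: tf.io.parse_tensor(x[0], out_type=tf.uint8),
        elems=serialized_batch,
        # Declaring the per-element shape gives the result a static shape.
        fn_output_signature=tf.TensorSpec((height, width), dtype=tf.uint8),
    )
    return tf.cast(images, tf.float32) / 255.0


# Simulate the batch a preprocessing function might receive (assumption).
raw = np.random.default_rng(0).integers(0, 256, size=(3, 28, 28), dtype=np.uint8)
serialized = tf.reshape(
    tf.stack([tf.io.serialize_tensor(img) for img in raw]), (-1, 1))

decoded = decode_image_batch(serialized)
print(decoded.shape, decoded.dtype)
```

Inside a `preprocessing_fn`, the equivalent would presumably be `outputs[_IMAGE_KEY] = decode_image_batch(inputs[_IMAGE_KEY])`, though that has not been run against a real pipeline here.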