TensorFlow Decision Forests: How to encode a Tensor of categorical numeric data

Matt_Miller · February 17, 2022, 7:08pm

Hi All,

I am using TensorFlow Decision Forests (TF-DF) and want to understand how to preprocess a tensor using Keras so that TF-DF correctly handles numeric categorical data.

In production, the model will receive a 2D tf.float32 Tensor of features, some of which are categorical and some numeric, so all preprocessing will need to be done within the model. The format of the input data cannot be changed. The Random Forest is part of a larger tf.keras.models.Model, with later models consuming the output of this first model.

I believe I need to set the features argument when initialising the tfdf.keras.RandomForestModel to tell the model about the categorical features. However, all of the examples assume the model is being passed a dictionary of tensors, with the categorical features being identified by dict key. Is there a way of telling the model about the categorical features by Tensor index or something like that?

I’ve found out that TF-DF will not accept a tf.float32 type Tensor as categorical data, so the first step I would assume is to split out and cast the categorical features as tf.int32.

For a toy example, how would I preprocess a simple tensor e.g.

[[1.1,4.1,1],
 [3.1,5.1,2],
 [2.1,6.1,3],
 [3.1,7.1,4]]

to work with TF-DF, where the 3rd feature (1,2,3,4) should be treated as categorical?

Thanks in advance!

lgusm · February 21, 2022, 12:10pm

@Mathieu can help here

Mathieu · February 21, 2022, 2:20pm

Hi Matt,

Your description of the situation is correct.

As you noted, you can specify the semantics (numerical, categorical) of the input features using the features argument. However, this only works if the input features are presented as a dictionary.

A solution is to separate the numerical and categorical features before feeding them into the model. You will end up with a model that consumes dictionaries. If you need your model to consume a feature matrix, you can then group the “separation logic” and the “dictionary model” into a new supermodel using the Keras functional API.
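As a rough sketch of the “separation logic” half of such a supermodel, the snippet below wraps the split in a small Keras layer and builds a functional model around it. The SplitFeatures class name and the column layout (two numerical columns followed by one categorical column) are assumptions for illustration, and the dictionary model itself is left out, so this only shows that a functional model can take a matrix and hand a dictionary to whatever comes next.

```python
import tensorflow as tf


class SplitFeatures(tf.keras.layers.Layer):
    """Hypothetical layer that turns a feature matrix into a dict of features."""

    def call(self, inputs):
        return {
            # Assumed layout: the first two columns are numerical.
            "numerical_features": inputs[:, :2],
            # Assumed layout: the last column is categorical, cast to integers.
            "categorical_features": tf.cast(inputs[:, 2:], tf.int32),
        }


# The model input is a plain float32 feature matrix with three columns.
matrix_input = tf.keras.Input(shape=(3,), dtype="float32")
feature_dict = SplitFeatures()(matrix_input)

# In a full supermodel, the dictionary-consuming model would be called on
# feature_dict here. This sketch stops at the dictionary.
separation_model = tf.keras.Model(inputs=matrix_input, outputs=feature_dict)

example = tf.constant([[1.1, 4.1, 1.0], [3.1, 5.1, 2.0]])
result = separation_model(example)
for name, tensor in result.items():
    print(name, tensor.dtype.name)
```

The layer is deliberately free of trainable state, so it can sit in front of any model that expects the two dictionary keys.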

Alternatively, an equivalent but simpler solution is to use the preprocessing argument available in all the TF-DF models and inject the separation logic inside the model.

Here is an example:

import tensorflow as tf
import tensorflow_decision_forests as tfdf

features = [[1.1,4.1,1],
            [3.1,5.1,2],
            [2.1,6.1,3],
            [3.1,7.1,4]]

labels = [0,1,0,1]

# A matrix training dataset.
tf_dataset = tf.data.Dataset.from_tensor_slices((features,labels)).batch(2)

def preprocessing(features):
  """Splits the feature matrix into a dictionary of features."""

  # The first two columns are numerical.
  numerical_features = features[:,:2]
  # The last column is categorical.
  categorical_features = features[:,2:]
  return {"numerical_features" : numerical_features,
          "categorical_features" : tf.cast(categorical_features,tf.int32)}

# Specify the semantic of the features.
features = [
  tfdf.keras.FeatureUsage(name="numerical_features", semantic=tfdf.keras.FeatureSemantic.NUMERICAL),
  tfdf.keras.FeatureUsage(name="categorical_features", semantic=tfdf.keras.FeatureSemantic.CATEGORICAL),
]

model = tfdf.keras.GradientBoostedTreesModel(
    verbose=2,
    preprocessing=preprocessing,
    features=features)
model.fit(tf_dataset)

The following part of the training logs describes the dataset:

Training dataset:
Number of records: 4
Number of columns: 4

Number of columns by type:
	CATEGORICAL: 2 (50%)
	NUMERICAL: 2 (50%)

Columns:

CATEGORICAL: 2 (50%)
	0: "categorical_features" CATEGORICAL integerized vocab-size:6 no-ood-item
	3: "__LABEL" CATEGORICAL integerized vocab-size:3 no-ood-item

NUMERICAL: 2 (50%)
	1: "numerical_features.0" NUMERICAL mean:2.35 min:1.1 max:3.1 sd:0.829156
	2: "numerical_features.1" NUMERICAL mean:5.6 min:4.1 max:7.1 sd:1.11803

Terminology:
	nas: Number of non-available (i.e. missing) values.
	ood: Out of dictionary.
	manually-defined: Attribute which type is manually defined by the user i.e. the type was not automatically inferred.
	tokenized: The attribute value is obtained through tokenization.
	has-dict: The attribute is attached to a string dictionary e.g. a categorical attribute stored as a string.
	vocab-size: Number of unique values.

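You can see that the 2 columns of "numerical_features" are treated as NUMERICAL and that "categorical_features" is treated as CATEGORICAL. The other CATEGORICAL column in the log, "__LABEL", is the label.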
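To see why the log lists "numerical_features.0" and "numerical_features.1", it can help to call the same split outside of the model and look at what each dictionary entry contains. This is only an inspection sketch on a two-row batch of the toy data; it does not involve TF-DF itself.

```python
import tensorflow as tf


def preprocessing(features):
    """Same split as above: two numerical columns, one categorical column."""
    return {
        "numerical_features": features[:, :2],
        "categorical_features": tf.cast(features[:, 2:], tf.int32),
    }


batch = tf.constant([[1.1, 4.1, 1], [3.1, 5.1, 2]], dtype=tf.float32)
split = preprocessing(batch)

for name, tensor in split.items():
    print(name, tensor.shape, tensor.dtype.name)
# numerical_features (2, 2) float32
# categorical_features (2, 1) int32
```

The "numerical_features" entry is a two-column tensor, which matches the two NUMERICAL columns reported in the log.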

I hope this helps.
M.

Matt_Miller · February 25, 2022, 11:42am

Hi @Mathieu,

Thanks for the response, that was helpful. I had to implement it as part of a Keras functional model because I want to run 2 random forests in series: the first one makes the prediction, and the second one bias-corrects that prediction. Here is a notebook where I prototyped my idea; I figured it might be helpful for others wanting to use Keras and TF-DF with a dict: Google Colab

In particular, I was unsure whether keras functional models were compatible with dict inputs. It turns out they are, with some caveats. See: Named dictionary inputs and outputs for tf.keras.Model · Issue #34114 · tensorflow/tensorflow · GitHub

While the model consumes and produces a single Tensor, dictionaries are used in the intermediate stages. This IMO is especially important for Decision Forests where users might want to understand feature importances in their models, which is made a lot more convenient with a named dict for each input feature.
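For readers who want one named entry per input feature, a minimal sketch of the idea could look like the following. The feature names and the FEATURE_SPEC list are made up for illustration (they are not taken from the notebook), and the column order is assumed to match the toy example earlier in the thread.

```python
import tensorflow as tf

# Hypothetical feature names, one per column of the input matrix, each with
# the dtype it should have in the dictionary.
FEATURE_SPEC = [
    ("temperature", tf.float32),
    ("pressure", tf.float32),
    ("station_id", tf.int32),  # numeric categorical column, cast to integers
]


def to_named_features(matrix):
    """Splits a 2D feature matrix into a dict with one entry per column."""
    return {
        name: tf.cast(matrix[:, index], dtype)
        for index, (name, dtype) in enumerate(FEATURE_SPEC)
    }


matrix = tf.constant([[1.1, 4.1, 1], [3.1, 5.1, 2]], dtype=tf.float32)
named = to_named_features(matrix)
for name, tensor in named.items():
    print(name, tensor.dtype.name, tensor.numpy().tolist())
```

Each key can then be given its own tfdf.keras.FeatureUsage entry, as in the example earlier in the thread, so that each feature appears under its own name.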