Predict with tfdf

Cristelle_Barillon · June 9, 2021, 9:44am

#help_request #tfdf
I can’t seem to make predictions with a tfdf model. I’m working in Colab and I get an “assertion error” whenever I call something like model(Xtest) or model.predict(Xtest), even though everything seems to run smoothly up to that point.
Any idea what is going wrong?

Thanks,
Cristelle.

lgusm · June 9, 2021, 11:21am

Hi Cristelle,

Can you post the code you used?
Using the beginner colab, after the cell with model_1.save(…) (in the section Prepare this model for TensorFlow Serving) I’ve tried this:

for r in test_ds.take(1):
  print(model_1(r[0]))

And I got the expected predictions.

Cristelle_Barillon · June 9, 2021, 4:28pm

Indeed, it works this way! Thanks a lot!
I guess I was feeding in the wrong data types…
Here’s my code. There are probably a lot of awkward things in it… If anyone has the time to comment on it, that would be very helpful. I was trying to use the iris dataset:
import tensorflow_decision_forests as tfdf
import pandas as pd
from sklearn.datasets import load_iris

data2 = load_iris(as_frame=True)
df = pd.DataFrame(data = data2.data)
allcol = df.columns
i = 0
for cn in allcol:
  df = df.rename(columns={cn: str(i)})
  i += 1

df_class = pd.DataFrame(data = data2.target)
complete_df = pd.concat([df, df_class], axis=1)
complete_df = complete_df.sample(frac=1)
train_df = complete_df[:100]
test_df = complete_df[100:]
train_ds = tfdf.keras.pd_dataframe_to_tf_dataset(train_df, label="target")
test_ds = tfdf.keras.pd_dataframe_to_tf_dataset(test_df, label="target")

model = tfdf.keras.RandomForestModel()
model.fit(train_ds)
Xtest = test_df.iloc[:,:4]
ytest = test_df.iloc[:,4]
y_hats = model.predict(Xtest)

Thanks a lot
Cristelle

Mathieu · June 9, 2021, 5:28pm

Great to hear that it works.

Thanks for the code snippet. Some of the error messages are certainly not explicit enough. We will improve that in the next TF-DF release.

Regarding your example, the issue is that Xtest is a Pandas dataframe, while predict expects a TensorFlow dataset, a NumPy array or a Tensor (or one of the more exotic formats such as DatasetCreator).

Your code could be rewritten as follows:

import tensorflow_decision_forests as tfdf
import pandas as pd
from sklearn.datasets import load_iris

iris_frame = load_iris(as_frame=True)
iris_dataframe = pd.DataFrame(data = iris_frame.data)
iris_dataframe["species"] = iris_frame.target

# Replace the spaces by "_" in the feature names.
iris_dataframe = iris_dataframe.rename(columns=lambda x: x.replace(" ","_"))

# Shuffle the dataset
iris_dataframe = iris_dataframe.sample(frac=1)

# Train/Test split.
train_dataframe = iris_dataframe[:100]
test_dataframe = iris_dataframe[100:]

# Convert the Pandas dataframes to TensorFlow datasets.
train_dataset = tfdf.keras.pd_dataframe_to_tf_dataset(train_dataframe, label="species")
test_dataset = tfdf.keras.pd_dataframe_to_tf_dataset(test_dataframe, label="species")

# Train the model.
model = tfdf.keras.RandomForestModel()
model.fit(train_dataset)

# Generate the predictions.
model.predict(test_dataset)

Keras also directly supports consuming NumPy arrays. This option is less powerful than Pandas dataframes, but in your case it leads to more compact code:

import tensorflow_decision_forests as tfdf
import numpy as np
from sklearn.datasets import load_iris

iris_frame = load_iris()
features = iris_frame.data
labels = iris_frame.target

# Shuffle the examples.
permutations = np.random.permutation(features.shape[0])
features = features[permutations]
labels = labels[permutations]

train_features = features[:100]
train_labels = labels[:100]

test_features = features[100:]
test_labels = labels[100:]

model = tfdf.keras.RandomForestModel()
model.fit(x=train_features, y=train_labels)

model.predict(test_features)
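
If you need class labels rather than scores, the sketch below shows one way to get them. It assumes that, for this three-class problem, predict returns one row per example with one probability per class (please check the shape on your side). The probabilities and test_labels arrays here are made up for illustration and do not come from a real model.

```python
import numpy as np

# Hypothetical stand-in for the output of model.predict(test_features):
# one row per example, one column per class (assumed layout).
probabilities = np.array([
    [0.90, 0.07, 0.03],
    [0.10, 0.25, 0.65],
    [0.20, 0.70, 0.10],
])

# Hypothetical true labels for the same three examples.
test_labels = np.array([0, 2, 2])

# Pick the most probable class for each example.
predicted_classes = np.argmax(probabilities, axis=1)

# Fraction of examples where the predicted class matches the label.
accuracy = np.mean(predicted_classes == test_labels)

print(predicted_classes)
print(accuracy)
```

With real predictions, replace the hand-written array with the value returned by predict.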

Cheers,

Edit: Shuffle the examples before the train/test split.

Cristelle_Barillon · June 9, 2021, 5:33pm

Excellent! Thanks a lot Mathieu, that really helps.
Best
Cristelle

Mathieu · June 9, 2021, 5:49pm

Happy to help.

The Iris examples returned by sklearn are grouped by class. Therefore, splitting at index 100 ([:100] vs. [100:]) without shuffling would give a poor train/test evaluation: the test set would contain a single class, and the training set would never see it. Shuffling the examples before the split solves the issue. The example above was edited accordingly.
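
To illustrate the point, here is a minimal sketch using only NumPy. It assumes the label layout of the sklearn Iris data (50 examples each of classes 0, 1 and 2, in that order) and rebuilds that label vector by hand instead of loading the dataset. The seed value is arbitrary.

```python
import numpy as np

# Stand-in for load_iris().target: 150 labels grouped by class (assumed layout).
labels = np.repeat([0, 1, 2], 50)

# Unshuffled split: count the classes on each side of index 100.
unshuffled_train_counts = np.bincount(labels[:100], minlength=3)
unshuffled_test_classes = np.unique(labels[100:])

# Shuffled split: permute the examples first, then split at the same index.
rng = np.random.default_rng(seed=0)
shuffled = labels[rng.permutation(labels.shape[0])]
shuffled_train_counts = np.bincount(shuffled[:100], minlength=3)
shuffled_test_classes = np.unique(shuffled[100:])

print("unshuffled train counts per class:", unshuffled_train_counts)
print("unshuffled test classes:", unshuffled_test_classes)
print("shuffled train counts per class:", shuffled_train_counts)
print("shuffled test classes:", shuffled_test_classes)
```

Without shuffling, the training part holds only classes 0 and 1 and the test part holds only class 2, so the evaluation says nothing useful about the model.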