CNN combined with a TF-DF random forest model

I have time series data and tabular data. I have developed a hybrid model that feeds the time series data into a CNN and the tabular data into a TF-DF random forest model; the outputs of both are passed to a fully connected (FC) layer for prediction. I would like to know the feature importances of the tabular data. When I try the following line to get the feature importances:
inspector = rf_model_layer.make_inspector()
it fails with: TypeError: object of type 'NoneType' has no len()
I have attached the model summary:

Model: "model_1"
__________________________________________________________________________________________________
 Layer (type)                   Output Shape                 Param #   Connected to
==================================================================================================
 input_3 (InputLayer)           [(None, 12, 3000, 1)]        0         []
 conv2d_2 (Conv2D)              (None, 1, 2876, 16)          24016     ['input_3[0][0]']
 conv2d_3 (Conv2D)              (None, 1, 2837, 32)          20512     ['conv2d_2[0][0]']
 input_4 (InputLayer)           [(None, 60)]                 0         []
 flatten_1 (Flatten)            (None, 90784)                0         ['conv2d_3[0][0]']
 random_forest_model_1          (None, 1)                    1         ['input_4[0][0]']
 (RandomForestModel)
 concatenate_1 (Concatenate)    (None, 90785)                0         ['flatten_1[0][0]',
                                                                        'random_forest_model_1[0][0]']
 dense_2 (Dense)                (None, 32)                   2905152   ['concatenate_1[0][0]']
 dense_3 (Dense)                (None, 1)                    33        ['dense_2[0][0]']
==================================================================================================
Total params: 2949714 (11.25 MB)
Trainable params: 2949713 (11.25 MB)
Non-trainable params: 1 (1.00 Byte)


Any suggestions would be appreciated.
Thank you

When you construct the model, are you specifying that it should compute out-of-bag (OOB) variable importances?

model = tfdf.keras.RandomForestModel(compute_oob_variable_importances=True)

More here.
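For reference, here is a minimal standalone sketch of reading those importances after training (df and the "label" column are assumptions for illustration):

import tensorflow_decision_forests as tfdf

# Assumed: df is a pandas DataFrame with a "label" column.
train_ds = tfdf.keras.pd_dataframe_to_tf_dataset(df, label="label")
model = tfdf.keras.RandomForestModel(compute_oob_variable_importances=True)
model.fit(train_ds)

inspector = model.make_inspector()
print(inspector.variable_importances())  # e.g. mean decrease in accuracy, per feature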

Thank you very much. I did not specify the variable-importance option; let me try that. I also have a question: does the combined model below train both the CNN and the RF model when I use tuner search on it? Here's the sample model I built.

import tensorflow as tf
import tensorflow_decision_forests as tfdf
import keras_tuner
from keras_tuner import HyperModel, RandomSearch
from tensorflow.keras.layers import Input, Dense, concatenate
from tensorflow.keras.models import Model
from tensorflow.keras.callbacks import EarlyStopping, ModelCheckpoint

# fixed_hyperparameters, mae_error, saveFN, and the data arrays
# (dataTrain, dataTrainRF, labelsTrain, dataVal, dataValRF, labelsVal)
# are defined elsewhere.

class CombinedModel(HyperModel):
    def __init__(self, cnn_input_shape, rf_input_shape):
        self.cnn_input_shape = cnn_input_shape
        self.rf_input_shape = rf_input_shape

    def build(self, hp):
        # CNN part (the parameters are fixed here to test the model)
        cnn_input = Input(shape=self.cnn_input_shape)
        cnn_output = tf.keras.layers.Conv2D(filters=16, kernel_size=(12, 125), activation='relu')(cnn_input)
        cnn_output = tf.keras.layers.Conv2D(filters=32, kernel_size=(1, 40), activation='relu')(cnn_output)
        cnn_output = tf.keras.layers.Flatten()(cnn_output)

        # RF part
        rf_input = Input(shape=self.rf_input_shape)
        rf_output = tfdf.keras.RandomForestModel(
            num_trees=fixed_hyperparameters['rf_num_trees'],
            max_depth=fixed_hyperparameters['rf_max_depth'],
            min_examples=fixed_hyperparameters['min_examples']
        )(rf_input)

        # Combine CNN and RF outputs
        combined_layer = concatenate([cnn_output, rf_output])

        # Fully connected layer
        fc_activation = hp.Choice('fc_activation', values=['relu', 'sigmoid'])
        fc_layer = Dense(32, activation=fc_activation)(combined_layer)

        # Output layer
        output_layer = Dense(1, activation='relu')(fc_layer)

        model = Model(inputs=[cnn_input, rf_input], outputs=output_layer)
        optimizer = tf.keras.optimizers.Adam(learning_rate=fixed_hyperparameters['learning_rate'])
        model.compile(optimizer=optimizer, loss='mse',
                      metrics=[tf.keras.metrics.RootMeanSquaredError(), mae_error])
        return model

if __name__ == '__main__':
    cnn_input_shape = (12, 301, 1)
    rf_input_shape = (60,)
    combined_model = CombinedModel(cnn_input_shape, rf_input_shape)

    # ... load the data for the CNN and RF models ...

    # Set up the tuner
    tuner_bo = RandomSearch(
        combined_model,
        objective=keras_tuner.Objective("val_loss", direction="min"),
        max_trials=50,
        seed=16,
        executions_per_trial=1,
        overwrite=False,
        project_name="Hybrid_model")

    es = EarlyStopping(monitor='val_loss', mode='min', verbose=1, patience=20)
    mc = ModelCheckpoint(saveFN, monitor='val_loss', mode='min', verbose=1, save_best_only=True)

    tuner_bo.search([dataTrain, dataTrainRF], labelsTrain,
                    validation_data=([dataVal, dataValRF], labelsVal),
                    callbacks=[es, mc])

In the above code, the random forest model is one of the layers in the combined model. Does it get trained when I call tuner_bo.search? Or do I have to extract the best hyperparameters and call best_model.fit() to make sure both the CNN and RF layers in the combined model are trained, as shown below?

# Get the best hyperparameters from the tuner
best_hyperparameters = tuner_bo.get_best_hyperparameters()[0]

# Build the model with the best hyperparameters
best_model = combined_model.build(best_hyperparameters)

# Train the model
best_model.fit(train_ds, epochs=num_epochs, validation_data=valid_ds)

# Evaluate the model on the test dataset (loss, RMSE, MAE per the compile step)
test_loss, test_rmse, test_mae = best_model.evaluate(test_ds)

I also get the following warning when I use best_model.fit():

WARNING:absl:The model was called directly (i.e. using model(data) instead of using model.predict(data)) before being trained. The model will only return zeros until trained. The output shape might change after training Tensor("inputs:0", shape=(None, 60), dtype=float32).

Any suggestions would be appreciated.
Thank you very much.

Yes, when you invoke search on the tuner, it should cause the combined model (actually a set of candidate models) to be trained over multiple trials as it determines the best combination of hyperparameters. Since the RF model is part of the combined model, it should be trained alongside the CNN model.

You should be able to get the best model (which was already trained in the tuning process):

best_model = tuner_bo.get_best_models(num_models=1)[0]

You may use the best model as is, but, as you have done, you can also retrain it; the examples suggest retraining on the entire dataset (training and validation data combined) using the best hyperparameters.

The tutorial gives as an example:

hypermodel = MyHyperModel()
best_hp = tuner.get_best_hyperparameters()[0]
model = hypermodel.build(best_hp)
hypermodel.fit(best_hp, model, x_all, y_all, epochs=1)

So you might modify your code to match that example and see if it makes a difference.
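For example, adapted to the combined model above, that pattern might look roughly like this (a sketch reusing the variable names from your code; num_epochs is assumed):

best_hp = tuner_bo.get_best_hyperparameters()[0]
hypermodel = CombinedModel(cnn_input_shape=(12, 301, 1), rf_input_shape=(60,))
best_model = hypermodel.build(best_hp)

# Retrain with the best hyperparameters (optionally on training + validation combined).
best_model.fit([dataTrain, dataTrainRF], labelsTrain,
               validation_data=([dataVal, dataValRF], labelsVal),
               epochs=num_epochs)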

Thank you @rcauvin. Let me try the approach you suggested and update here if it works. Regarding my first question, I still get the TypeError: object of type 'NoneType' has no len() even after I specify compute_oob_variable_importances=True. Here is how I access the random forest model layer from the combined model:

rf_model_layer = model.layers[5]  # assuming the random forest model is at index 5 of the combined model
inspector = rf_model_layer.make_inspector()

It works fine when I use the TF-DF RF model alone (without the CNN model) on the tabular data; I get the feature importances. Since the TF-DF random forest model is one of the layers of the Keras model, can I not use make_inspector() on that layer directly to get the feature importances? Or is the random forest model not trained, which is why I can't access make_inspector()?

I apologize for any inconvenience.
Thank you very much.

In

rf_model_layer = model.layers[5]  # assuming the random forest model is at index 5 of the combined model

How are you getting model?

This is how I get the model:

cnn_input = tf.keras.Input(shape=(12, 301, 1))
rf_input = tf.keras.Input(shape=(60,))
cnn_output = tf.keras.layers.Conv2D(filters=16, kernel_size=(12, 125), activation='relu')(cnn_input)
cnn_output = tf.keras.layers.Conv2D(filters=32, kernel_size=(1, 40), activation='relu')(cnn_output)
cnn_output = tf.keras.layers.Flatten()(cnn_output)

rf_output = tfdf.keras.RandomForestModel(
    num_trees=fixed_hyperparameters['rf_num_trees'],
    max_depth=fixed_hyperparameters['rf_max_depth'],
    min_examples=fixed_hyperparameters['min_examples'],
    compute_oob_variable_importances=True
)(rf_input)

combined_output = tf.keras.layers.concatenate([cnn_output, rf_output])
fc_output = tf.keras.layers.Dense(32, activation='relu')(combined_output)
output = tf.keras.layers.Dense(1, activation='relu')(fc_output)

model = tf.keras.Model(inputs=[cnn_input, rf_input], outputs=output)

optimizer = tf.keras.optimizers.Adam(learning_rate=fixed_hyperparameters['learning_rate'])
model.compile(optimizer=optimizer, loss='mse')


trained_model = model.fit([dataTrain, dataTrainRF], labelsTrain,
                          validation_data=([dataVal, dataValRF], labelsVal), epochs=5)

rf_model_layer = model.layers[5]  # assuming the random forest model is at index 5
inspector = rf_model_layer.make_inspector()

This is the complete error:

Traceback (most recent call last):

  Cell In[184], line 2
    inspector = rf_model_layer.make_inspector()

  File ~/anaconda3/lib/python3.11/site-packages/tensorflow_decision_forests/keras/core_inference.py:411 in make_inspector
    path = self.yggdrasil_model_path_tensor().numpy().decode("utf-8")

  File ~/anaconda3/lib/python3.11/site-packages/tensorflow/python/util/traceback_utils.py:153 in error_handler
    raise e.with_traceback(filtered_tb) from None

  File /tmp/__autograph_generated_file0diouhww.py:38 in tf__yggdrasil_model_path_tensor
    ag__.if_stmt(ag__.ld(multitask_model_index) >= ag__.converted_call(ag__.ld(len), (ag__.ld(self)._models,), None, fscope), if_body, else_body, get_state, set_state, (), 0)

TypeError: in user code:

    File "/home/hybrid/anaconda3/lib/python3.11/site-packages/tensorflow_decision_forests/keras/core_inference.py", line 436, in yggdrasil_model_path_tensor  *
        if multitask_model_index >= len(self._models):

    TypeError: object of type 'NoneType' has no len()

Here are the layers present in the model.

layers = model.layers
print(layers)

Output
[<keras.src.engine.input_layer.InputLayer object at 0x7f00587cf390>, <keras.src.layers.convolutional.conv2d.Conv2D object at 0x7f005840fed0>, <keras.src.layers.convolutional.conv2d.Conv2D object at 0x7f0058763b50>, <keras.src.engine.input_layer.InputLayer object at 0x7f00587095d0>, <keras.src.layers.reshaping.flatten.Flatten object at 0x7f00586242d0>, <tensorflow_decision_forests.keras.RandomForestModel object at 0x7f0058463710>, <keras.src.layers.merging.concatenate.Concatenate object at 0x7f00584604d0>, <keras.src.layers.core.dense.Dense object at 0x7f005845fe10>, <keras.src.layers.core.dense.Dense object at 0x7f00587b8910>]
Thank you

What is the output of print(trained_model.layers)?

And have you tried retrieving rf_model_layer from trained_model instead of model?

I get this error when I try to print trained_model.layers:

print(trained_model.layers)
Traceback (most recent call last):

  Cell In[214], line 1
    print(trained_model.layers)

AttributeError: 'History' object has no attribute 'layers'

Thank you very much for your prompt response

Sorry, I forgot that model.fit returns a History object.
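In other words (a quick illustration with your variable names):

history = model.fit([dataTrain, dataTrainRF], labelsTrain,
                    validation_data=([dataVal, dataValRF], labelsVal), epochs=5)
print(history.history.keys())  # the History object holds per-epoch metrics
print(model.layers)            # the layers still live on the model itself
rf_model_layer = model.layers[5]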

This tutorial shows how to stitch together neural networks and decision forest models, but it trains the decision forest models separately before combining them into a single ensemble model.
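Applied to your case, that would mean fitting the TF-DF model on the tabular data first and only then using it, along the lines of this sketch (names follow your earlier posts):

# Train the random forest separately on the tabular data.
rf_model = tfdf.keras.RandomForestModel(compute_oob_variable_importances=True)
rf_model.fit(x=dataTrainRF, y=labelsTrain)

# It is now a trained model, so the inspector works:
inspector = rf_model.make_inspector()
print(inspector.variable_importances())

# And it can still be called as a layer when stitching the ensemble together:
rf_input = tf.keras.Input(shape=(60,))
rf_output = rf_model(rf_input)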

Thank you very much for the reference link.

  1. What is happening in the following lines of code from the reference link? The combined model (NN and DF, ensemble_nn_and_df) has not been trained before; only the two DF models are trained, separately. How is their training reflected in ensemble_nn_and_df?
Let's train the two Decision Forest components (one after another).
%%time
train_dataset_with_preprocessing = train_dataset.map(lambda x,y: (preprocessor(x), y))
test_dataset_with_preprocessing = test_dataset.map(lambda x,y: (preprocessor(x), y))

model_3.fit(train_dataset_with_preprocessing)
model_4.fit(train_dataset_with_preprocessing)

mean_nn_and_df = tf.reduce_mean(
    tf.stack([m1_pred, m2_pred, m3_pred, m4_pred], axis=0), axis=0)
ensemble_nn_and_df = tf_keras.models.Model(raw_features, mean_nn_and_df)
ensemble_nn_and_df.compile(
    loss=tf_keras.losses.BinaryCrossentropy(), metrics=["accuracy"])
evaluation_nn_and_df = ensemble_nn_and_df.evaluate(
    test_dataset, return_dict=True)
  2. I have NumPy arrays for the CNN model and tabular data (with categorical variables) for the RF model. When I use tfdf.keras.pd_dataframe_to_tf_dataset for the tabular data (as it contains categorical variables) and tf.data.Dataset.from_tensor_slices for the CNN input, the two are not compatible. Do you have any suggestions on how to make the two inputs compatible with the two models? Or, if we train the models separately as in the reference link, would the compatibility issue not arise?

Any help is appreciated.
Thank you very much.

It looks like the tutorial creates and stitches the models together before compiling or training them. Then it compiles and trains the neural network and the decision forests separately, after which ensemble_nn_and_df is compiled and evaluated.

I think you have a few options for dealing with the input dataset and using it to train the different models:

  1. Train the models separately as in the tutorial (but with preprocessed input for the CNN model and tabular data input for the decision forest model).
  2. Add some feature preprocessing layers so that the CNN model receives preprocessed input while the decision forest model receives the tabular data.
  3. Use preprocessed input for both models (a minimal sketch of this option follows).
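For option 3, for instance, you could encode the categorical columns to numeric first and then zip the two input streams into a single tf.data.Dataset (a sketch; rf_numeric is an assumed, already-encoded array):

# Each element becomes ((cnn_x, rf_x), y), which Keras accepts for a two-input model.
cnn_ds = tf.data.Dataset.from_tensor_slices(dataTrain)    # CNN arrays
rf_ds = tf.data.Dataset.from_tensor_slices(rf_numeric)    # encoded tabular data
label_ds = tf.data.Dataset.from_tensor_slices(labelsTrain)
train_ds = tf.data.Dataset.zip(((cnn_ds, rf_ds), label_ds)).batch(32)
model.fit(train_ds, epochs=5)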