Numeric and categorical feature columns

Bealtaine · December 22, 2022, 3:43pm

Hi,
I’m starting to take my first steps with TensorFlow and trying to understand the division of columns into numeric and categorical in this example:

Could you explain to me why the columns ‘n_siblings_spouses’ and ‘parch’ aren’t numeric?

(This is the relevant part of the code in the example:)

# Note: tf (TensorFlow) and dftrain (the training DataFrame) are defined earlier in the example.
CATEGORICAL_COLUMNS = ['sex', 'n_siblings_spouses', 'parch', 'class', 'deck',
                       'embark_town', 'alone']
NUMERIC_COLUMNS = ['age', 'fare']

feature_columns = []
for feature_name in CATEGORICAL_COLUMNS:
  # Use every distinct value seen in the training data as the vocabulary.
  vocabulary = dftrain[feature_name].unique()
  feature_columns.append(tf.feature_column.categorical_column_with_vocabulary_list(feature_name, vocabulary))

for feature_name in NUMERIC_COLUMNS:
  # Continuous values are passed through as floats.
  feature_columns.append(tf.feature_column.numeric_column(feature_name, dtype=tf.float32))

nateburley · December 23, 2022, 2:43am

Hey Bealtaine,

Those columns are treated as categorical because, although the values are numbers, they represent discrete counts of people rather than points on a continuum. For a simple practical example, say that there were some missing values in one of those columns, and you wanted to fill them with something. If you used the mean value, you might end up with something nonsensical, like 2.3 parents and children (parch)!
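To make the imputation point concrete, here is a minimal sketch using only Python's standard library. The `parch_values` list is made-up illustrative data, not taken from the Titanic dataset, and filling with the mode is just one possible alternative to the mean.

```python
from statistics import mean, mode

# Hypothetical 'parch' values (made up for illustration); None marks a missing entry.
parch_values = [0, 0, 1, 2, 0, None, 5, 0, 1, None]

# Keep only the entries that were actually observed.
observed = [v for v in parch_values if v is not None]

mean_fill = mean(observed)  # average of the observed counts
mode_fill = mode(observed)  # most frequent observed count

# Fill the gaps both ways to compare the results.
filled_with_mean = [mean_fill if v is None else v for v in parch_values]
filled_with_mode = [mode_fill if v is None else v for v in parch_values]

print(mean_fill)  # 1.125, which is not a possible number of people
print(mode_fill)  # 0, which is a value that actually occurs in the data
```

The mean-filled list now contains a fractional head count, while the mode-filled list only contains values that already appear in the column.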

As a general rule, if you are dealing with discrete entities, then you treat them as categorical values (even if the categories are numbers). But if it’s valid for the numbers themselves to fall on a continuum (such as with values like price or age), then they would be considered numeric.
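As a rough sketch of what that rule means in practice, the snippet below gives a count column a vocabulary and a one-hot encoding, and passes a continuous column through as a float. The rows are made up for illustration, `one_hot` is a hypothetical helper, and this is only an analogy for the idea, not a description of how TensorFlow handles feature columns internally.

```python
# Hypothetical rows (made up for illustration), using column names from the example.
rows = [
    {"parch": 0, "fare": 7.25},
    {"parch": 2, "fare": 71.28},
    {"parch": 0, "fare": 8.05},
    {"parch": 1, "fare": 53.10},
]

# Categorical treatment: collect the distinct values seen in the data,
# similar in spirit to calling .unique() on the column.
parch_vocabulary = sorted({row["parch"] for row in rows})


def one_hot(value, vocabulary):
    """Return a list with 1 at the position of `value` in `vocabulary`, 0 elsewhere."""
    return [1 if value == entry else 0 for entry in vocabulary]


# Each row becomes a one-hot 'parch' indicator followed by the raw 'fare' value.
encoded = [
    one_hot(row["parch"], parch_vocabulary) + [float(row["fare"])]
    for row in rows
]

for features in encoded:
    print(features)
```

With this encoding, a model sees each `parch` value as its own category, whereas `fare` keeps its position on a continuous scale.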

Hope this helps!

Bealtaine · January 2, 2023, 7:41pm

Thank you very much. I see it now!