Data pre-processing questions

Hi all,
Newbie here.
I have done some TF training, but I usually find that the examples do not quite match the data I work with, so I am wondering if anyone can provide high-level guidance on pre-processing.

I am working with a data set in a CSV file that represents events and includes integers, dates, durations and categories.
I am trying to build a simple classification model: based on my inputs (locations, times, durations), I want to predict whether the target is an event of category A or B.
I understand that categories need to be converted to numerical values, so I am using one-hot encoding for that.
Now, am I correct in understanding that all the inputs of my model need to be converted to numerical values, including dates (in 'DD/MM/YYYY' format) and durations (in '00:00:00' format)?

Also, do I need to scale the entire training set once I have done all the encoding?

Many thanks in advance,

Lars

Consider using preprocessing layers to do the encoding, as described here. You can even include those preprocessing layers in your model, so that the encoding will happen seamlessly during both training and inference.

For durations in hh:mm:ss format, you’ll probably want to convert them to seconds and then use a normalization layer to shift and scale the numbers to be centered around zero.
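
As a rough sketch of the duration step, assuming the column really is zero-padded hh:mm:ss text: the snippet below converts each value to seconds and then standardizes it by hand. The sample values and the helper names (`duration_to_seconds`, `fit_standardizer`) are made up for illustration. The Keras `Normalization` layer applies the same kind of shift and scale once it has been adapted to your training data.

```python
import statistics


def duration_to_seconds(text):
    """Convert an 'hh:mm:ss' string to a whole number of seconds."""
    hours, minutes, seconds = (int(part) for part in text.split(":"))
    return hours * 3600 + minutes * 60 + seconds


def fit_standardizer(values):
    """Learn mean and standard deviation, return a function that applies them."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values) or 1.0  # avoid dividing by zero for constant columns
    return lambda value: (value - mean) / std


# Hypothetical training durations, not taken from the original data set.
durations = ["00:05:00", "01:30:00", "00:00:45"]
seconds = [duration_to_seconds(d) for d in durations]

standardize = fit_standardizer(seconds)
scaled = [standardize(s) for s in seconds]
```

The mean and standard deviation should be learned from the training set only and then reused unchanged for validation and inference data.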

For dates, you need to decide whether the actual dates matter, or if what matters is the seasonality. For example, the day of week, day of year, or month of year might matter more than the full date. If so, consider converting dates to these seasonal values (e.g. 0 to 6 for day of week) and using a cyclical integer layer such as the one I implemented here.
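
To illustrate the seasonality idea, here is a minimal sketch that parses a DD/MM/YYYY string and maps day of week and month onto a circle with sine and cosine. This is a common way to encode cyclical values and is not necessarily how the layer mentioned above works; the function names and the sample date are assumptions for the example.

```python
import math
from datetime import datetime


def cyclical(value, period):
    """Map an integer in [0, period) to a point on the unit circle."""
    angle = 2.0 * math.pi * value / period
    return math.sin(angle), math.cos(angle)


def date_features(text):
    """Turn a 'DD/MM/YYYY' string into cyclical day-of-week and month features."""
    date = datetime.strptime(text, "%d/%m/%Y")
    day_of_week = date.weekday()  # Monday = 0 ... Sunday = 6
    month = date.month - 1        # January = 0 ... December = 11
    return {
        "day_of_week": cyclical(day_of_week, 7),
        "month": cyclical(month, 12),
    }


# Hypothetical sample date.
features = date_features("25/12/2023")
```

The point of the sine/cosine pair is that the last value of a cycle ends up next to the first one, so Sunday (6) is as close to Monday (0) as Monday is to Tuesday (1), which a plain integer would not express.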


Hi @rcauvin, thanks a lot, that’s helpful advice.

Regarding dates, you are right that I am probably looking more for the impact of seasonality than the actual date itself. I will look into that and also into preprocessing layers.

Thanks
