Multiple features with a mix of oneHot and floats in the same training

Hi all,

I’m new to TensorFlow and I mostly use the Node.js API.
I want to make sure I implement my use case correctly, so please confirm I’m doing it right:

  • in my data, I have a float value that will be normalized to [0, 1]
  • I also have another float value that I can normalize to [0, 1] as well, on a distinct scale
  • I then have 3 categorical values that I’m able to oneHot encode separately (e.g.: cat1 [0, 0, 1] … cat2 [0, 0, 0, 1] … cat3 [0, 1] …)
  • and I also have a multi-class label, oneHot encoded [0, 0, 0, 1] …

I believe I can say I have 5 features + 1 label encoded along 4 classes.
Hence the shape of the features is not the same as the shape of the label classes, and I really don’t know how to write the Node.js code so that it handles all the features together with the label.
So far, the only model I can train is one that keeps just 1 float feature + the oneHot label; I’m not able to mix multiple oneHot and float features in the same training.
I can’t find an online tutorial to help me understand how to do that correctly. Any idea?

Thanks a lot.

Welcome to the forum @Math and thanks for being part of the Web ML / Web AI community with TensorFlow.js node!

It seems like you are off to a great start by normalizing your input features before training. This prevents huge scale differences between those input features.

For your one-hot encodings, I just want to check you are doing them correctly. Essentially, if you have 3 categories (possible classes) for your label, your one-hot vectors would look like this:

[1, 0, 0] = Cat 1
[0, 1, 0] = Cat 2
[0, 0, 1] = Cat 3

In your example the length changes for some reason. I cover one-hot encoding in more detail in my course here:

If you had 4 classification classes to one-hot encode, it would be the same but with length 4 instead, with the 1 changing position to represent each different class.
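A minimal plain-JavaScript sketch of that idea (the `oneHot` helper name here is just illustrative, not a TensorFlow.js API):

```javascript
// Build a one-hot vector of the given length with a 1 at classIndex.
function oneHot(classIndex, numClasses) {
  const v = new Array(numClasses).fill(0);
  v[classIndex] = 1;
  return v;
}

// With 4 classes the vector has length 4, and the 1 moves
// position to represent each different class.
console.log(oneHot(0, 4)); // [ 1, 0, 0, 0 ] = class 1
console.log(oneHot(2, 4)); // [ 0, 0, 1, 0 ] = class 3
```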

Please check the video linked above to see how I got one-hot encodings working for a multiclass classification problem. You can then write similar code for Node.js with TensorFlow.js; even though mine targets the front end browser environment, it is essentially the same JavaScript on both the front end and Node for something like this.

Once you have seen the video above, watch the following video to see how I code this up with a simple multilayer perceptron:

While an MLP will not be great (more likely you will use a CNN), it shows the code to correctly format your inputs / one-hot encodings.

You can then use my CNN video to convert your code to work with a Convolutional Neural Net for much better image classification performance:

And finally, you can see this video to use a more powerful CNN, built on a popular base model called MobileNet, to achieve really superior performance:

Hope that helps!

Yes, it does help. I’ll have a video session in my spare time this Friday. :slight_smile:
In the meantime, I confirm my oneHot looks correct as per your clarification. My initial message was a little weird (the part about "I then have 3 categorical values that I’m able to oneHot encode each of them separately (e.g.: cat1 [0, 0, 1] … cat2 [0, 0, 0, 1] … cat3 [0, 1] …)"), but if I reformulate it, I believe this is what you said :slight_smile:
I should have said:

  • I have 3 categorical values
    ** the first has 3 classes (each row for that value is oneHot encoded as one of: [0, 0, 1], [0, 1, 0], or [1, 0, 0])
    ** the second has 4 classes ([1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 1, 0], or [0, 0, 0, 1])
    ** the third also has 3 classes ([0, 0, 1], [0, 1, 0], or [1, 0, 0]), but it refers to a distinct oneHot encoding from the first categorical value above

I will get back with additional questions once I’ve reviewed the videos, because I might still have an implementation issue (my guess: probably the shape of the tensors in my latest code). I still need to learn before asking more detailed questions.

BTW: thanks a lot for spending time answering my post.

No worries, let me know how you get on after watching the videos / taking the course at Machine Learning for Web Developers (Web ML) - YouTube

I have been watching several videos from the series above. They are really crystal clear. Unfortunately, I couldn’t find (yet) the fix for my issue.
I’m working on a simplified data use case.

In each of my 15 rows of data, I have:

  • a normalized float value
  • a oneHot value [x, x, x]
  • a oneHot label [x, x, x, x]

I’m really stuck on transforming my data into the correct shape for a training dataset, or maybe the model itself is not correctly set up!

I currently get the following error:

ValueError: Input 0 is incompatible with layer dense_Dense1: expected min_ndim=2, found ndim=1.

const tf = require("@tensorflow/tfjs-node");

let data = [
	{ value: 95.1, categorical_value_as_string: "cat 1", label: "string label" },
	{ value: 5.1, categorical_value_as_string: "cat 2", label: "string label" },
	{ value: 0.0, categorical_value_as_string: "cat 3", label: "string label" },
	{ value: 94.1, categorical_value_as_string: "cat 1", label: "string label" },
	{ value: 100.0, categorical_value_as_string: "cat 2", label: "string label" },
	{ value: 12.2, categorical_value_as_string: "cat 3", label: "string label" },
	{ value: 21.5, categorical_value_as_string: "cat 3", label: "string label" },
	{ value: 18.8, categorical_value_as_string: "cat 2", label: "string label" },
	{ value: 82.6, categorical_value_as_string: "cat 1", label: "string label" },
	{ value: 23.2, categorical_value_as_string: "cat 2", label: "string label" },
	{ value: 46.2, categorical_value_as_string: "cat 3", label: "string label" },
	{ value: 46.9, categorical_value_as_string: "cat 1", label: "string label" },
	{ value: 71.5, categorical_value_as_string: "cat 1", label: "string label" },
	{ value: 55.1, categorical_value_as_string: "cat 2", label: "string label" },
	{ value: 38.2, categorical_value_as_string: "cat 3", label: "string label" }
];
let min = 0.0;
let max = 100.0;
let batchSize = 50;
let epochs = 100;
let splitIdx = 10; // take the 10 first for training

let features = ["value", "categorical_value_as_string"]; // because I need to be flexible on the features list
let categoricalFeats = ["cat 1", "cat 2", "cat 3"];		 // 3 classes in that feature ; it will be oneHotEncoded
let labelsClasses = ["class 1", "class 2", "class 3", "class 4"]; // actual label for ML ; it will be oneHotEncoded

const buildModel = async function() {
	return await new Promise((resolve) => {
		const model = tf.sequential();
		model.add(tf.layers.dense({
			inputShape: [features.length], // 2
			units: 1,
			activation: "relu"
		}));
		model.add(tf.layers.dense({
			units: labelsClasses.length, // 4
			activation: "sigmoid"
		}));
		model.compile({
			optimizer: tf.train.adam(0.001),
			loss: "meanSquaredError",
			metrics: ['accuracy']
		});
		console.debug("Model.weights:");
		model.weights.forEach(w => {
			console.debug(" ", w.name, w.shape);
			/*
			Model.weights:
			dense_Dense1/kernel [ 2, 1 ]
			dense_Dense1/bias [ 1 ]
			dense_Dense2/kernel [ 1, 4 ]
			dense_Dense2/bias [ 4 ]
			*/
		});
		resolve(model);
	});
};

const oneHotEncode = (classIndex, classes) => {
	return tf.oneHot(classIndex, classes.length).dataSync();
};

const x = data.map((r, index) => {
	return features.map((f) => {
		if (f === "value") {
			return parseFloat((r.value - min)/(max - min), 10); // normalize
		} else if (f === "categorical_value_as_string") {
			return oneHotEncode(categoricalFeats.indexOf(r.categorical_value_as_string), categoricalFeats);
		}
	});
});

const y = data.map((r, index) => {
	return oneHotEncode(labelsClasses.indexOf(r.label), labelsClasses);
});

const featureTensor = tf.tensor(x);
const labelTensor = tf.tensor(y);

x.map((f, index) => {
	console.log(`x ${index}`, f);
});
			
const ds = tf.data
	.zip({ xs: tf.data.array(featureTensor), ys: tf.data.array(labelTensor) })
	.shuffle(data.length);

buildModel().then((model) => {
	console.log("MODEL BUILT");
	model.fitDataset( (ds.take(splitIdx).batch(batchSize)), {epochs: epochs})
	.then((trained) => {
		console.log("TRAINED");
		return trained;
	});
});

Can you give me some hints to debug this?


Ah, thanks for the code example, that makes it much easier to understand what you are doing.

So, for others searching who may find this useful: what you essentially have is a continuous value (a floating point number) as one input feature, plus a one-hot encoded feature representing a class of something, both of which are inputs you want to use to train the network.

I believe what you need to do in this case is take your float value (e.g. let’s pretend one value in your training data was 485.948948), ensure it is normalized to prevent it looking huge next to the one-hot encoded values, which are just zeros and ones, and then concatenate it with the one-hot encoding.

Eg if your one-hot encoding of some class was [1, 0, 0] and the normalized value of your float was 0.56848, you would end up with a 1D tensor of the form [1, 0, 0, 0.56848], i.e. a 1D tensor of size 4, to use as your input to the network to train on.
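A tiny sketch of that concatenation in plain JavaScript (`buildRow` is a made-up helper name, purely for illustration):

```javascript
// Append the normalized float after the one-hot vector so each
// training example becomes one flat 1D array of numbers.
function buildRow(oneHotVector, normalizedValue) {
  return [...oneHotVector, normalizedValue];
}

const row = buildRow([1, 0, 0], 0.56848);
console.log(row); // [ 1, 0, 0, 0.56848 ], a flat array of length 4
```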

I cover normalization in JavaScript here in case you (or others) need to check that out:

Be sure to record your min/max values so you can normalize data for predictions later too, once the model is trained. You will need to normalize at prediction time in exactly the same way as for training, using the same min/max found in the training data set.
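As a rough sketch, min/max normalization in plain JavaScript might look like this (`makeNormalizer` is an illustrative name, not part of TensorFlow.js):

```javascript
// Min-max normalization. Keep the min and max from the *training*
// data so prediction-time inputs are scaled on the exact same scale.
function makeNormalizer(trainingValues) {
  const min = Math.min(...trainingValues);
  const max = Math.max(...trainingValues);
  return (v) => (v - min) / (max - min);
}

const normalize = makeNormalizer([0, 50, 100]); // min = 0, max = 100
console.log(normalize(50)); // 0.5
console.log(normalize(75)); // 0.75, same scale reused at prediction time
```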

Thank you again Jason.

you would end up with a 1D tensor

I’m not 100% sure whether this sentence solved my issue or not… but I got something usable for my proof of concept.

I wanted to share what I “succeeded” at; at least there is no error in the output. I have no idea if the training is correct, and the data is a fully random sample.

Maybe you’d like to see it in action? :

This script takes 2 parameters, categoricalFeats and continuousFeats, as comma-separated values to define the list of features to be injected into the model.

Again, I just wanted to make sure I’m able to manage both categorical (oneHot) and continuous (float only) values in the same dataset for a training.
Any feedback or fix is welcome. :slight_smile:

FYI, the TensorFlow.js API has a one-hot function:

You can use that instead of implementing it yourself; it is most likely faster and more performant too.

Basically your training data should be of the form:

[normalizedValue, normalizedValue2, oneHotfeature1, oneHotfeature2, oneHotfeature3]

So the final form would look something like this:

[0.9, 0.8, 1,0,0, 0,1, 1,0,0 ]

I put spaces showing each part, but obviously this is one big 1D array of numbers.

The 2 normalized values come first, then the one-hot encoding of feature 1 (which I think has 3 possible values), then the binary feature that has 2 possible values, and then feature 3, which also has 3 possible values.

That would be 1 training data input. So you would have a 2D array with all the training data, each row formatted like the above.
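The layout above can be sketched in plain JavaScript like this (the class lists and feature values here are invented purely for illustration):

```javascript
// One-hot helper: a 1 at classIndex, zeros elsewhere.
function oneHot(classIndex, numClasses) {
  const v = new Array(numClasses).fill(0);
  v[classIndex] = 1;
  return v;
}

// Hypothetical class lists for the three categorical features.
const feature1Classes = ["a", "b", "c"];   // 3 possible values
const feature2Classes = ["yes", "no"];     // 2 possible values (binary)
const feature3Classes = ["x", "y", "z"];   // 3 possible values

// Continuous values first, then each one-hot, flattened into one row.
function buildRow(norm1, norm2, f1, f2, f3) {
  return [
    norm1,
    norm2,
    ...oneHot(feature1Classes.indexOf(f1), feature1Classes.length),
    ...oneHot(feature2Classes.indexOf(f2), feature2Classes.length),
    ...oneHot(feature3Classes.indexOf(f3), feature3Classes.length)
  ];
}

const row = buildRow(0.9, 0.8, "a", "no", "x");
console.log(row); // [ 0.9, 0.8, 1, 0, 0, 0, 1, 1, 0, 0 ], length 10
```

Stacking one such row per training example then gives the 2D array described above.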