Why does the one-hot-encoding give worse accuracy in this case?

I have two directories, train_data_npy and valid_data_npy where there are 3013 and 1506 *.npy files, respectively.

Each *.npy file has 11 float columns: the first eight are features, and the last three are a one-hot-encoded label for three classes (characters).

----------------------------------------------------------------------
f1      f2      f3      f4   f5   f6   f7   f8          ---classes---
----------------------------------------------------------------------
0.0     0.0     0.0     1.0  1.0  1.0  1.0  1.0         0.0  0.0  1.0
6.559   9.22    0.0     1.0  1.0  1.0  1.0  1.0         0.0  0.0  1.0
5.512   6.891   10.589  0.0  0.0  0.0  0.0  1.0         0.0  0.0  1.0
7.082   8.71    7.227   0.0  0.0  0.0  0.0  0.0         0.0  0.0  1.0
6.352   9.883   12.492  0.0  0.0  0.0  0.0  0.0         0.0  0.0  1.0
6.711   10.422  13.44   0.0  0.0  0.0  0.0  1.0         0.0  0.0  1.0
7.12    9.283   12.723  0.0  0.0  0.0  0.0  0.0         0.0  0.0  1.0
6.408   9.277   12.542  0.0  0.0  0.0  0.0  0.0         0.0  0.0  1.0
6.608   9.686   12.793  0.0  0.0  0.0  0.0  0.0         0.0  0.0  1.0
6.723   8.602   12.168  0.0  0.0  0.0  0.0  0.0         0.0  0.0  1.0
... ... ... ... ...
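
A quick way to confirm this layout for one file (a minimal sketch; the file name is illustrative):

import numpy as np

data = np.load("train_data_npy/example.npy")  # illustrative file name
x, y = data[:, :8], data[:, 8:]               # 8 feature columns, 3 one-hot label columns

assert data.shape[1] == 11
assert np.all(y.sum(axis=1) == 1.0)           # exactly one hot class per row
print(x.shape, y.shape, np.argmax(y, axis=1)[:5])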

Given the format of the data, I have written two scripts.

cnn_autokeras_by_chunk_with_ohe.py uses the OHE labels as they are, and cnn_autokeras_by_chunk_without_ohe.py converts the OHE labels into integer class indices.

The first one achieves an accuracy of 0.40, and the second one achieves an accuracy of 0.97.

Why does the one-hot-encoding give worse accuracy in this case?


Both scripts load the *.npy files in chunks so that memory is not exhausted while AutoKeras searches for the best model.


# File: cnn_autokeras_by_chunk_with_ohe.py
import numpy as np
import tensorflow as tf
import autokeras as ak
import os

# Update these values to match your actual data
N_FEATURES = 8
N_CLASSES = 3  # Number of classes
BATCH_SIZE = 100

def get_data_generator(folder_path, batch_size, n_features, n_classes):
    """Get a generator returning batches of data from .npy files in the specified folder.
    The shape of the features is (batch_size, n_features).
    The shape of the labels is (batch_size, n_classes).
    """
    def data_generator():
        # Sort for a deterministic file order across runs.
        npy_files = sorted(f for f in os.listdir(folder_path) if f.endswith('.npy'))

        for npy_file in npy_files:
            data = np.load(os.path.join(folder_path, npy_file))
            x = data[:, :n_features]
            y = data[:, n_features:]

            for i in range(0, len(x), batch_size):
                yield x[i:i+batch_size], y[i:i+batch_size]

    return data_generator

train_data_folder = '/home/my_user_name/original_data/train_data_npy'
validation_data_folder = '/home/my_user_name/original_data/valid_data_npy'

train_dataset = tf.data.Dataset.from_generator(
    get_data_generator(train_data_folder, BATCH_SIZE, N_FEATURES, N_CLASSES),
    output_signature=(
        tf.TensorSpec(shape=(None, N_FEATURES), dtype=tf.float32),
        tf.TensorSpec(shape=(None, N_CLASSES), dtype=tf.float32)  # Labels are now 2D with one-hot encoding
    )
)

validation_dataset = tf.data.Dataset.from_generator(
    get_data_generator(validation_data_folder, BATCH_SIZE, N_FEATURES, N_CLASSES),
    output_signature=(
        tf.TensorSpec(shape=(None, N_FEATURES), dtype=tf.float32),
        tf.TensorSpec(shape=(None, N_CLASSES), dtype=tf.float32)  # Labels are now 2D with one-hot encoding
    )
)

# Initialize the structured data classifier.
clf = ak.StructuredDataClassifier(max_trials=10) # Set max_trials to any value you desire.

# Feed the tensorflow Dataset to the classifier.
clf.fit(train_dataset, epochs=100)

# Get and print the best hyperparameters found by the search
best_hps = clf.tuner.get_best_hyperparameters()[0]
print(best_hps.values)

# Export the best model
model = clf.export_model()

# Save the model in tf format
model.save("heca_v2_model_with_ohe", save_format='tf')  # Note the lack of .h5 extension

# Evaluate the best model with testing data.
print(clf.evaluate(validation_dataset))

The second script is identical except that it converts the one-hot labels back to integer class indices:

# File: cnn_autokeras_by_chunk_without_ohe.py
import numpy as np
import tensorflow as tf
import os
import autokeras as ak

N_FEATURES = 8
N_CLASSES = 3  # Number of classes
BATCH_SIZE = 100

def get_data_generator(folder_path, batch_size, n_features):
    """Get a generator returning batches of data from .npy files in the specified folder.

    The shape of the features is (batch_size, n_features).
    """
    def data_generator():
        # Sort for a deterministic file order across runs.
        npy_files = sorted(f for f in os.listdir(folder_path) if f.endswith('.npy'))

        for npy_file in npy_files:
            data = np.load(os.path.join(folder_path, npy_file))
            x = data[:, :n_features]
            y = data[:, n_features:]
            # Convert one-hot-encoded labels back to integer class indices;
            # cast to int32 to match the int32 TensorSpec declared below.
            y = np.argmax(y, axis=1).astype(np.int32)

            for i in range(0, len(x), batch_size):
                yield x[i:i+batch_size], y[i:i+batch_size]

    return data_generator

train_data_folder = '/home/my_user_name/original_data/train_data_npy'
validation_data_folder = '/home/my_user_name/original_data/valid_data_npy'

train_dataset = tf.data.Dataset.from_generator(
    get_data_generator(train_data_folder, BATCH_SIZE, N_FEATURES),
    output_signature=(
        tf.TensorSpec(shape=(None, N_FEATURES), dtype=tf.float32),
        tf.TensorSpec(shape=(None,), dtype=tf.int32)  # Labels are now 1D integers
    )
)

validation_dataset = tf.data.Dataset.from_generator(
    get_data_generator(validation_data_folder, BATCH_SIZE, N_FEATURES),
    output_signature=(
        tf.TensorSpec(shape=(None, N_FEATURES), dtype=tf.float32),
        tf.TensorSpec(shape=(None,), dtype=tf.int32)  # Labels are now 1D integers
    )
)

# Initialize the structured data classifier.
clf = ak.StructuredDataClassifier(max_trials=10) # Set max_trials to any value you desire.

# Feed the tensorflow Dataset to the classifier.
clf.fit(train_dataset, epochs=100)

# Get and print the best hyperparameters found by the search
best_hps = clf.tuner.get_best_hyperparameters()[0]
print(best_hps.values)

# Export the best model
model = clf.export_model()

# Save the model in tf format
model.save("heca_v2_model_without_ohe", save_format='tf')  # Note the lack of .h5 extension

# Evaluate the best model with testing data.
print(clf.evaluate(validation_dataset))
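
As a side note, both exported models contain AutoKeras custom layers, so reloading them later requires passing ak.CUSTOM_OBJECTS (this follows the pattern in the AutoKeras export docs):

import numpy as np
import autokeras as ak
from tensorflow.keras.models import load_model

# AutoKeras models include custom layers, so supply ak.CUSTOM_OBJECTS on reload.
loaded_model = load_model("heca_v2_model_without_ohe", custom_objects=ak.CUSTOM_OBJECTS)

# Hypothetical smoke test with a random batch of 8 features.
sample = np.random.rand(4, 8).astype(np.float32)
print(loaded_model.predict(sample))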

Possible reasons for the difference in accuracy:

  1. AutoKeras may find different model architectures and hyperparameters more suitable for integer labels than for OHE labels; how the label information is presented to the search can influence what it learns.

  2. Converting from OHE to integers changes the representation of the label information, and a neural network can learn differently from one-hot vectors than from single integers.

  3. The choice of loss function also affects training: categorical cross-entropy expects one-hot labels, while sparse categorical cross-entropy expects integer labels, and a loss inferred to mismatch the label format can hurt accuracy (see the sketch after this list for pinning the loss explicitly).

Try adjusting hyperparameters such as the learning rate, the number of layers, or the number of nodes per layer, or increase max_trials beyond 10 to explore a broader range of architectures.
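
If loss inference is the culprit, you can pin the loss explicitly instead of letting AutoKeras infer it from the labels. A minimal sketch, assuming the same setup as the scripts above (loss, metrics, and num_classes are standard StructuredDataClassifier arguments):

import autokeras as ak

# For one-hot labels of shape (batch, 3): categorical cross-entropy.
clf_ohe = ak.StructuredDataClassifier(
    num_classes=3,
    loss="categorical_crossentropy",
    metrics=["accuracy"],
    max_trials=10,
)

# For integer labels of shape (batch,): sparse categorical cross-entropy.
clf_int = ak.StructuredDataClassifier(
    num_classes=3,
    loss="sparse_categorical_crossentropy",
    metrics=["accuracy"],
    max_trials=10,
)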
