Improve basic code

Hello,
I was hoping I could receive suggestions about improving this Keras model, implemented in R and using the Framingham Kaggle dataset. It's a simple binary classification example.

The issues are that the training loss is 0 and accuracy is only 63%. Is there something I have got essentially wrong in the code, and if not, what can be done to improve accuracy?

I understand this case may be better suited to logistic regression, but I'm interested in getting a feel for why a Keras ANN is not the right choice. Constructive comments are welcome.

The dataset is available on Kaggle and my code is below.

Regards

library(keras)
library(dplyr)

# Load the Kaggle CSV (dataset page:
# https://www.kaggle.com/datasets/aasheesh200/framingham-heart-study-dataset/data);
# the filename below assumes the default download name, adjust to your local path
heart <- read.csv("framingham.csv")

# Remove rows with NA values
heart <- na.omit(heart)

# Separate numeric and categorical variables
columns_to_scale <- c("age", "cigsPerDay", "totChol", "sysBP", "diaBP", 
                      "BMI", "heartRate", "glucose")

columns_to_numeric <- c("male", "currentSmoker", "BPMeds",
                        "prevalentStroke", "prevalentHyp", "diabetes")

# Calculate mean and standard deviation
mean_values <- apply(heart[columns_to_scale], 2, mean, na.rm = TRUE)
sd_values <- apply(heart[columns_to_scale], 2, sd, na.rm = TRUE)

# Standardize the specified vectors
heart[columns_to_scale] <- scale(heart[columns_to_scale], center = mean_values, scale = sd_values)
heart[columns_to_numeric] <- lapply(heart[columns_to_numeric], as.numeric)

# Separate dependent and independent variables
target <- heart$TenYearCHD
features <- heart %>% select(-TenYearCHD)

# One-hot encoding for categorical variables (left commented out: the indicator
# columns above are already 0/1)
# features[columns_to_numeric] <- lapply(features[columns_to_numeric], as.factor)

# Create train and test data
set.seed(123)
split_index <- sample(1:nrow(heart), 0.8 * nrow(heart))

# Build the feature matrix once, then split it into train and test sets
feature_matrix <- cbind(as.matrix(features[columns_to_scale]),
                        as.matrix(features[columns_to_numeric]))

train_data <- feature_matrix[split_index, ]
train_target <- target[split_index]

test_data <- feature_matrix[-split_index, ]
test_target <- target[-split_index]

# Compute inverse-frequency class weights; as.list() keeps the "0"/"1" names
# that Keras expects in the class_weight argument
class_weights <- 1 / table(train_target)
class_weights_list <- as.list(class_weights)

# Create neural network model
model <- keras_model_sequential() %>%
  layer_dense(units = 64, activation = "relu", input_shape = ncol(train_data)) %>%
  layer_dense(units = 32, activation = "relu") %>%
  layer_dense(units = 1, activation = "sigmoid", name = "output_layer")

# Compile the model
model %>% compile(
  optimizer = optimizer_adam(learning_rate = 0.1),  # "lr" is deprecated; note 0.1 is very high for Adam (default is 0.001)
  loss = "binary_crossentropy",
  metrics = "accuracy"
)

# Train the model
history <- model %>% fit(
  train_data,
  train_target,
  epochs = 50,
  batch_size = 32,
  validation_split = 0.2,
  class_weight = class_weights_list
)

evaluation <- model %>% evaluate(test_data, test_target)
print(evaluation)

predictions <- model %>% predict(test_data)
predicted_classes <- ifelse(predictions > 0.5, 1, 0)

library(caret)
confusion_matrix <- confusionMatrix(as.factor(predicted_classes), as.factor(test_target),
                                    positive = "1")  # report sensitivity/recall for CHD = 1
print(confusion_matrix)

To improve your Keras model for binary classification using the Framingham dataset in R, consider the following steps:

  1. Data Preprocessing:
  • Ensure thorough preprocessing: check that all features are correctly scaled and whether you need to engineer or drop any features.
  • Address class imbalance: use techniques like oversampling, undersampling, or class weights if your target variable is imbalanced (a resampling sketch follows this list).
  2. Model Architecture:
  • Adjust your network: experiment with different numbers of layers, neurons, and activation functions (like ReLU). Add dropout layers to prevent overfitting (sketched below).
  3. Training Process:
  • Optimize the loss function and optimizer: use binary_crossentropy for loss and try different optimizers like Adam.
  • Adjust the learning rate and implement early stopping (sketched below).
  4. Evaluation and Validation:
  • Use additional metrics like precision, recall, F1-score, and ROC-AUC for a comprehensive performance evaluation (sketched below).
  • Validate your model using a separate validation set or cross-validation.
  5. Hyperparameter Tuning:
  • Conduct hyperparameter tuning with grid search or random search to find the best model parameters (a minimal grid-search loop is sketched below).
  6. Model Suitability:
  • Neural networks excel in complex, large-scale problems. If a neural network doesn't outperform simpler models like logistic regression in your case, it might indicate that the problem doesn't require such complexity (a glm() baseline is sketched below).
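
To make the class-imbalance point concrete, here is a minimal random-oversampling sketch, an alternative to the class weights already in your script; it reuses your train_data and train_target objects:

# Duplicate minority-class (CHD = 1) rows at random until the classes balance
set.seed(123)
pos_idx <- which(train_target == 1)
n_extra <- sum(train_target == 0) - length(pos_idx)
extra   <- sample(pos_idx, n_extra, replace = TRUE)

train_data_bal   <- rbind(train_data, train_data[extra, ])
train_target_bal <- c(train_target, train_target[extra])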
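
For dropout, a sketch of your architecture with dropout layers added; the rates are arbitrary starting points, not tuned values:

model <- keras_model_sequential() %>%
  layer_dense(units = 64, activation = "relu", input_shape = ncol(train_data)) %>%
  layer_dropout(rate = 0.3) %>%  # randomly zero 30% of activations during training
  layer_dense(units = 32, activation = "relu") %>%
  layer_dropout(rate = 0.2) %>%
  layer_dense(units = 1, activation = "sigmoid", name = "output_layer")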
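
For the learning rate and early stopping, a sketch that drops Adam back to its default rate of 0.001 (the 0.1 in your script is very high and could explain the flat training) and adds callback_early_stopping() so training halts once validation loss stops improving:

model %>% compile(
  optimizer = optimizer_adam(learning_rate = 0.001),  # Keras default, far below 0.1
  loss = "binary_crossentropy",
  metrics = "accuracy"
)

history <- model %>% fit(
  train_data, train_target,
  epochs = 100, batch_size = 32,
  validation_split = 0.2,
  class_weight = class_weights_list,
  callbacks = list(
    callback_early_stopping(monitor = "val_loss", patience = 10,
                            restore_best_weights = TRUE)
  )
)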
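
For the extra metrics, one option is the pROC package on the predicted probabilities; this sketch reuses the predictions and predicted_classes objects from your script:

library(pROC)

roc_obj <- roc(response = test_target, predictor = as.numeric(predictions))
auc(roc_obj)  # area under the ROC curve

# Precision and recall at the 0.5 threshold
tp <- sum(predicted_classes == 1 & test_target == 1)
fp <- sum(predicted_classes == 1 & test_target == 0)
fn <- sum(predicted_classes == 0 & test_target == 1)
tp / (tp + fp)  # precision
tp / (tp + fn)  # recall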
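
For the grid search, a minimal manual loop is enough to start (the tfruns package is another option); the grid values here are illustrative choices, not recommendations:

grid <- expand.grid(units1 = c(16, 32, 64),
                    lr = c(0.01, 0.001),
                    dropout = c(0, 0.3))

results <- lapply(seq_len(nrow(grid)), function(i) {
  g <- grid[i, ]
  m <- keras_model_sequential() %>%
    layer_dense(units = g$units1, activation = "relu",
                input_shape = ncol(train_data)) %>%
    layer_dropout(rate = g$dropout) %>%
    layer_dense(units = 1, activation = "sigmoid")
  m %>% compile(optimizer = optimizer_adam(learning_rate = g$lr),
                loss = "binary_crossentropy", metrics = "accuracy")
  h <- m %>% fit(train_data, train_target, epochs = 30, batch_size = 32,
                 validation_split = 0.2, verbose = 0)
  data.frame(g, val_acc = tail(h$metrics$val_accuracy, 1))
})
do.call(rbind, results)  # one row per configuration; pick the best val_acc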
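
Finally, on model suitability, a logistic regression baseline with base R's glm() on the same split gives the ANN a concrete number to beat:

train_df <- data.frame(train_data, TenYearCHD = train_target)
logit <- glm(TenYearCHD ~ ., data = train_df, family = binomial)

glm_prob  <- predict(logit, newdata = data.frame(test_data), type = "response")
glm_class <- ifelse(glm_prob > 0.5, 1, 0)
mean(glm_class == test_target)  # test-set accuracy of the baseline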

Remember, machine learning is an iterative process, and gradual changes with evaluations are key to understanding and improving your model’s performance.

Hello @Tim_Wolfe, many thanks for the suggestions. I'll try to code the grid search as a next step to look at the options.