Lost when trying to get good time series prediction results (regression problem) even after trying many things

marcocintra · March 25, 2024, 5:17pm

After a long time testing, I still can’t get good results when using TensorFlow to predict time series data (a regression problem). I don’t know whether the problem is the data (too little and/or low quality), the model, or both, although I have tried several models and tested various combinations of hyperparameters. I’m really tired and I don’t know what else to do. What would you do in my place? Where could I find someone who can help me with this problem? Thanks in advance for any help you can give me.

Ajay_Krishna · March 26, 2024, 4:01am

Time series data is always tricky. There can be many factors for poor performance.

Data Quality and Quantity: How good is the data? Check for missing values, outliers, and inconsistencies in the time-related fields. If you have a timestamp column, check for duplicate timestamps, since each timestamp should be unique (if there are duplicates, try aggregating them, e.g. by taking the mode). If you have little data, try techniques to augment it, but make sure the augmentation does not distort the distribution of the target.
Model Architecture: Results depend on the model you use, the hyperparameters, and the regularization techniques. If you have little data, use simpler models.
Choosing appropriate evaluation metrics: As it is a regression task, use MSE, MAE and RMSE as your evaluation metrics. A confusion matrix (to assess accuracy, precision, F1 score, sensitivity and specificity) only applies if you bin the target into classes, since those metrics are defined for classification.
Use cross validation to detect overfitting (it depends on the data and use case; it will not always help, but it helps you judge how well the model generalizes).
Use Ensemble methods: I prefer XGBoost and random forest. Random forest has a handy feature importance output, so you can see which features affect the model performance the most and the least. Try removing the 5 least important features to improve the results. Try to avoid redundant features that carry the same information. Even if you have few features, don’t hesitate to remove the least useful ones, as they are often the main culprits of poor performance.
Splitting the dataset: Do not split the dataset randomly by percentage. As it is time series data, order the data by time and split it at points in time, so that validation and test data come after the training data. This avoids a lot of errors and gives a more realistic picture of how well the model generalizes.
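
A minimal sketch of the time-based split from the last point, assuming each sample has a timestamp stored in a NumPy array. The function name `split_by_time` and the cutoff dates are hypothetical and only for illustration; they are not from the original post.

```python
import numpy as np

def split_by_time(timestamps, val_start, test_start):
    """Return boolean masks for train/val/test based on two cutoff times.

    Everything before val_start is training data, everything from
    val_start up to (but not including) test_start is validation data,
    and everything from test_start onward is test data.
    """
    timestamps = np.asarray(timestamps, dtype="datetime64[s]")
    val_start = np.datetime64(val_start)
    test_start = np.datetime64(test_start)
    train_mask = timestamps < val_start
    val_mask = (timestamps >= val_start) & (timestamps < test_start)
    test_mask = timestamps >= test_start
    return train_mask, val_mask, test_mask

# Toy data: 10 daily timestamps (illustrative only).
ts = np.arange("2024-01-01", "2024-01-11", dtype="datetime64[D]")
train, val, test = split_by_time(ts, "2024-01-07", "2024-01-09")
```

The masks can then index the feature and target arrays (for example `X[train]`, `Y[train]`), so no sample from the future ends up in the training set.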

I have worked on a lot of time series data and would love to help you out with more details.

Igor_Lessio · March 29, 2024, 8:45am

Start by telling us which model you use, what lags and targets, and how many timesteps.
A df.info() would probably also help, to see whether the feature engineering was done well or poorly.

marcocintra · April 20, 2024, 11:17pm

Hello, thank you very much for your help. I couldn’t respond earlier, I’m sorry. My data are time series of maps of an ionosphere physical variable (TEC); there are 969,612 maps in total. I am using a sliding window with a stride of 1, with 97 4x4 maps (which characterize 1 day) as input, to predict 97 maps as well. Initial shape of X and Y before splitting between X_train, Y_train, X_val, Y_val and X_test, Y_test: (969419, 194, 4, 4). I’m testing ConvLSTM, but I’m not having success. I’ve been focusing more on this model in general: ConvLSTM/2. Run_ConvLSTM_Save_model.py at main · jsh4887/ConvLSTM · GitHub. Thanks!

marcocintra · April 20, 2024, 11:20pm

Hello, thank you very much for your help. I couldn’t respond earlier, I’m sorry. Just now I answered Ajay’s comment (https://discuss.tensorflow.org/t/lost-when-trying-to-get-good-time-series-prediction-results-regression-problem-even-after-trying-many-things/23582/6?u=marcocintra) which answers most of what you want to know, including the type of model I’m using. I wouldn’t be able to execute df.info() because I’m using numpy arrays, but I already gave some information about my data in the mentioned comment and I can also say that there are no null or negative values.
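
As a rough stand-in for df.info() on NumPy arrays, a sketch like the following could report the same kind of information (shape, dtype, NaN and negative counts, value range). The helper name `summarize_array` and the demo array are hypothetical, not part of the original code.

```python
import numpy as np

def summarize_array(arr):
    """Return basic descriptive statistics for a numeric NumPy array."""
    arr = np.asarray(arr)
    return {
        "shape": arr.shape,
        "dtype": str(arr.dtype),
        "n_nan": int(np.isnan(arr).sum()),       # count of missing values
        "n_negative": int((arr < 0).sum()),      # count of negative values
        "min": float(np.nanmin(arr)),
        "max": float(np.nanmax(arr)),
        "mean": float(np.nanmean(arr)),
        "std": float(np.nanstd(arr)),
    }

# Demo on two synthetic 4x4 "maps" (illustrative data only).
demo = np.arange(32, dtype=np.float32).reshape(2, 4, 4)
summary = summarize_array(demo)
for key, value in summary.items():
    print(f"{key}: {value}")
```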

Igor_Lessio · April 21, 2024, 1:44am

What is the target variable? The whole map?

marcocintra · April 21, 2024, 1:50am

Yes, but the maps for the whole day, that is, 97 maps.

Igor_Lessio · April 21, 2024, 2:19am

So you want the model to predict a sequence of 97 maps. Then the LSTM has to output a sequence. Also, you already prepared the NumPy array, but did you do any feature engineering before that?
What is a map? A 2D array?

Igor_Lessio · April 21, 2024, 2:21am

Only 1 lag? I think you need more. I use a minimum of 28 on my TS predictions, and I should probably go to 4x that.

marcocintra · April 21, 2024, 2:21am

I believe I haven’t done feature engineering. I know little about the subject; can you explain to me what I should do, please? Yes, each map is a 2D 4x4 array.

marcocintra · April 21, 2024, 2:22am

Maybe we’re not talking about the same thing. When I said I use 1 stride, I was talking about the sliding window. With a window size of 97*2 maps, each window is shifted by 1 map relative to the previous one.
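
To make the windowing concrete, here is a minimal sketch of a stride-1 sliding window of 97*2 maps, split into 97 input maps and 97 target maps. It assumes the maps are stacked in time order in an array of shape (N, 4, 4); the function name `make_windows` and the toy data are hypothetical. With N = 969,612 and a window of 194, the number of windows is N - 194 + 1 = 969,419, which matches the shape given earlier in the thread.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def make_windows(maps, n_in=97, n_out=97):
    """Build stride-1 windows and split each into inputs and targets.

    maps: array of shape (N, H, W), ordered by time.
    Returns X of shape (N - n_in - n_out + 1, n_in, H, W)
    and Y of shape (N - n_in - n_out + 1, n_out, H, W).
    """
    maps = np.asarray(maps)
    window = n_in + n_out
    # sliding_window_view appends the window axis last: (n_windows, H, W, window)
    views = sliding_window_view(maps, window_shape=window, axis=0)
    # Move the window axis next to the sample axis: (n_windows, window, H, W)
    views = np.moveaxis(views, -1, 1)
    return views[:, :n_in], views[:, n_in:]

# Toy data: 300 synthetic 4x4 maps instead of the real 969,612.
toy_maps = np.arange(300 * 16, dtype=np.float32).reshape(300, 4, 4)
X, Y = make_windows(toy_maps)
print(X.shape, Y.shape)
```

The result is a set of read-only views into the original array, so no data is copied until the windows are materialized. If the full (969419, 194, 4, 4) array were stored as float32 (an assumption, the dtype is not stated in the thread), it would take roughly 12 GB.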