TF-DF on SageMaker: printing the loss for each iteration

ahad · June 13, 2023, 3:54pm

Hi,
I'm using tfdf.keras.GradientBoostedTreesModel with verbose=2. I added several metrics with model.compile(metrics=['binary_crossentropy', 'mse', 'AUC', 'accuracy']) and trained with num_trees=500.

I would like to see the training loss versus the validation loss for each iteration (each tree, in my case).

I have two problems:
1- In the logs I can't see the loss after each tree is added; I only see a log line after several trees have been built.
2- I don't see all the metrics I added (I saw here that this is not supported).

2023-06-13T13:57:10.553+03:00 [1,mpirank:0,algo-1]:[INFO gradient_boosted_trees.cc:[1,mpirank:0,algo-1]:1047] 600000 examples used for training and 200000 examples used for validation[1,mpirank:0,algo-1]:

2023-06-13T13:57:11.553+03:00 [1,mpirank:0,algo-1]:[INFO gradient_boosted_trees.cc:1430] #011num-trees:1 train-loss:0.262175 train-accuracy:0.965992 valid-loss:0.266117 valid-accuracy:0.965540

2023-06-13T13:57:11.553+03:00 [1,mpirank:0,algo-1]:[INFO gradient_boosted_trees.cc:1432] #011num-trees:2 train-loss:0.255583 train-accuracy:0.966120 valid-loss:0.262195 valid-accuracy:0.965565

2023-06-13T13:57:41.563+03:00 [1,mpirank:0,algo-1]:[INFO gradient_boosted_trees.cc:1432] #011num-trees:68 train-loss:0.176204 train-accuracy:0.971425 valid-loss:0.263119 valid-accuracy:0.966120

2023-06-13T13:58:12.573+03:00 [1,mpirank:0,algo-1]:[INFO gradient_boosted_trees.cc:1432] #011num-trees:127 train-loss:0.133686 train-accuracy:0.976682 valid-loss:0.274393 valid-accuracy:0.965830

2023-06-13T13:58:42.583+03:00 [1,mpirank:0,algo-1]:[INFO gradient_boosted_trees.cc:1432] #011num-trees:186 train-loss:0.100317 train-accuracy:0.982568 valid-loss:0.286893 valid-accuracy:0.965585

2023-06-13T13:59:12.595+03:00 [1,mpirank:0,algo-1]:[INFO gradient_boosted_trees.cc:1432] #011num-trees:242 train-loss:0.076956 train-accuracy:0.987472 valid-loss:0.300240 valid-accuracy:0.965450

2023-06-13T13:59:42.604+03:00 [1,mpirank:0,algo-1]:[INFO gradient_boosted_trees.cc:1432] #011num-trees:293 train-loss:0.061192 train-accuracy:0.991143 valid-loss:0.311722 valid-accuracy:0.965425

2023-06-13T14:00:12.613+03:00 [1,mpirank:0,algo-1]:[INFO gradient_boosted_trees.cc:1432] #011num-trees:347 train-loss:0.048823 train-accuracy:0.994160 valid-loss:0.324167 valid-accuracy:0.965365

2023-06-13T14:00:43.624+03:00 [1,mpirank:0,algo-1]:[INFO gradient_boosted_trees.cc:1432] #011num-trees:408 train-loss:0.038996 train-accuracy:0.996265 valid-loss:0.337411 valid-accuracy:0.965270

2023-06-13T14:01:13.634+03:00 [1,mpirank:0,algo-1]:[INFO gradient_boosted_trees.cc:1432] #011num-trees:470 train-loss:0.031601 train-accuracy:0.997723 valid-loss:0.350617 valid-accuracy:0.965165

2023-06-13T14:01:28.639+03:00 [1,mpirank:0,algo-1]:[INFO gradient_boosted_trees.cc:1430] #011num-trees:500 train-loss:0.028782 train-accuracy:0.998177 valid-loss:0.357087 valid-accuracy:0.965185

2023-06-13T14:01:28.639+03:00 [1,mpirank:0,algo-1]:[INFO gradient_boosted_trees.cc:264] Final model num-trees:10 valid-loss:0.357087 valid-accuracy:0.965185

2023-06-13T14:01:28.639+03:00 [1,mpirank:0,algo-1]:[INFO kernel.cc:957] Export model in log directory: /opt/adva/checkpoints/tfdf with prefix 8aef82d8e4434fb8

I tried to use the inspector, model.make_inspector().training_logs(), but I ran into a problem retrieving the model. Please find the log below:

[1,mpirank:0,algo-1]:The model at /opt/adva/checkpoints/tfdf/v0/model/ contains multiple YDF models. Please specify the prefix of the intended model. Available prefixes: [‘72a694e546584792’, ‘3d9f0b2ad7984337’, ‘a39992b7f3aa4541’, ‘bdf53386edd64b97’, ‘b729c3f939b14592’, ‘8aef82d8e4434fb8’, ‘7b0eeab2e4e840a3’, ‘cadda57c48874500’, ‘8aeda86d45964712’, ‘0ecb05d742d24dce’, ‘44e569ec76d742ea’, ‘325fb31fa3b54dd1’, ‘3307f0d3ab014532’, ‘77cf5988e9e64ce8’, ‘7f2654b72f7c4b6a’, ‘d4013745082f4abe’, ‘18e7f5be5f8840c8’]

What are these multiple YDF models? The directory is new, so how come there are several models in it?

I noticed that after fitting, the model is exported without my saving it (the log: Export model in log directory: /opt/adva/checkpoints/tfdf/v0 with prefix cadda57c48874500). Is there a way to determine the log prefix of the model and pass it to the inspector?

Any idea how I can solve my problem and see logs for each tree?

Thanks

Laxma_Reddy_Patlolla · June 13, 2023, 10:46pm

Hi @ahad ,

After carefully reviewing your post, I would like to summarize the key points and provide some suggestions to address your challenges.

To see the loss after each tree is added, you can enable more verbose logging during training. You can set the verbose parameter to a higher value, such as 3. This should provide more detailed logs during training.
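
If the job logs still skip trees (the timestamps in your excerpt are about 30 seconds apart, which suggests the trainer reports on a timer rather than per tree), you can at least pull the train and validation loss out of whatever lines are printed. Below is a minimal sketch that parses the progress lines from your post with the standard library only. The helper name parse_gbt_log is hypothetical, and the regular expressions assume the log format stays exactly as shown in your excerpt.

```python
import re

# Progress lines look like:
#   ... #011num-trees:68 train-loss:0.176204 ... valid-loss:0.263119 ...
# No word boundary before "num-trees": it directly follows "#011" in these logs.
TREES_RE = re.compile(r"num-trees:(\d+)")
METRIC_RE = re.compile(r"\b(train|valid)-([a-z_]+):([0-9]+(?:\.[0-9]+)?)")


def parse_gbt_log(lines):
    """Return one dict per progress line, e.g. {'num_trees': 1, 'train_loss': ...}."""
    rows = []
    for line in lines:
        trees = TREES_RE.search(line)
        # Skip unrelated lines and the "Final model" summary line.
        if trees is None or "Final model" in line:
            continue
        row = {"num_trees": int(trees.group(1))}
        for split, name, value in METRIC_RE.findall(line):
            row[f"{split}_{name}"] = float(value)
        rows.append(row)
    return rows


# Lines copied from the log excerpt in the question.
sample = [
    "2023-06-13T13:57:11.553+03:00 [1,mpirank:0,algo-1]:[INFO gradient_boosted_trees.cc:1430] #011num-trees:1 train-loss:0.262175 train-accuracy:0.965992 valid-loss:0.266117 valid-accuracy:0.965540",
    "2023-06-13T13:57:11.553+03:00 [1,mpirank:0,algo-1]:[INFO gradient_boosted_trees.cc:1432] #011num-trees:2 train-loss:0.255583 train-accuracy:0.966120 valid-loss:0.262195 valid-accuracy:0.965565",
    "2023-06-13T13:57:41.563+03:00 [1,mpirank:0,algo-1]:[INFO gradient_boosted_trees.cc:1432] #011num-trees:68 train-loss:0.176204 train-accuracy:0.971425 valid-loss:0.263119 valid-accuracy:0.966120",
    "2023-06-13T14:01:28.639+03:00 [1,mpirank:0,algo-1]:[INFO gradient_boosted_trees.cc:264] Final model num-trees:10 valid-loss:0.357087 valid-accuracy:0.965185",
]

rows = parse_gbt_log(sample)
for row in rows:
    print(f"{row['num_trees']:>4}  train={row['train_loss']:.6f}  valid={row['valid_loss']:.6f}")
```

This only recovers the trees that were actually logged, so it does not give a true per-tree curve, but it is enough to plot train loss against validation loss from the CloudWatch output.
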
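You can try specifying the desired model prefix when creating the inspector, rather than when calling training_logs(). Use the prefix reported in the "Export model in log directory" line for your run, for example:

logs = tfdf.inspector.make_inspector('/opt/adva/checkpoints/tfdf/v0/model/', file_prefix='72a694e546584792').training_logs()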
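
To find the prefix for a given run without guessing, one option is to read it from the "Export model in log directory" line that appears in your training log. Here is a minimal sketch of that idea. The helper names find_exported_model and load_training_logs are hypothetical, and I am assuming that tfdf.inspector.make_inspector accepts a file_prefix argument in your installed version, so please check that against your TF-DF release.

```python
import re

# Matches e.g. "Export model in log directory: /some/dir with prefix 8aef82d8e4434fb8"
EXPORT_RE = re.compile(r"Export model in log directory: (\S+) with prefix ([0-9a-f]+)")


def find_exported_model(log_text):
    """Return (log_directory, prefix) from the last 'Export model' line, or None."""
    matches = EXPORT_RE.findall(log_text)
    return matches[-1] if matches else None


def load_training_logs(model_dir, prefix):
    """Open the inspector for one model in a directory that holds several.

    Assumption: make_inspector takes file_prefix; verify for your TF-DF version.
    """
    # Imported lazily so the parsing helper above works without TF-DF installed.
    import tensorflow_decision_forests as tfdf

    inspector = tfdf.inspector.make_inspector(model_dir, file_prefix=prefix)
    return inspector.training_logs()


# Line copied from the log excerpt in the question.
log_text = (
    "2023-06-13T14:01:28.639+03:00 [1,mpirank:0,algo-1]:[INFO kernel.cc:957] "
    "Export model in log directory: /opt/adva/checkpoints/tfdf with prefix 8aef82d8e4434fb8"
)

directory, prefix = find_exported_model(log_text)
print(directory, prefix)
```

Note that the directory printed in that log line is the log directory, while the error message you posted refers to /opt/adva/checkpoints/tfdf/v0/model/, so the path to pass as model_dir is probably the one from the error message. As far as I know, the entries returned by training_logs() carry the number of trees and an evaluation on the validation data, so the training loss may not be available through this route.
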

I hope this helps!

Thanks.