Avoid data leakage from train to test in a TensorFlow dataset / Splitting a TensorFlow dataset into train and test without data leakage

mksakeesh · August 9, 2022, 1:06am

I am using the code below to read CSV files into TensorFlow datasets:

import tensorflow as tf

ratings_ds = tf.data.experimental.make_csv_dataset(
    "./train_recom_transformed.csv",
    batch_size=5,
    select_columns=['user_id', 'song_id', 'listen_count', 'ratings','title','release','artist_name','year','count'],
    header=True,
    num_epochs=1,
    ignore_errors=False,)
songs_ds = tf.data.experimental.make_csv_dataset(
    "./songs_details.csv",
    batch_size=128,
    select_columns=['song_id','title','release','artist_name','year'],
    num_epochs=1,
    ignore_errors=True,)


ratings = ratings_ds.unbatch().map(lambda x: {
    "song_id": x["song_id"],
    "user_id": x["user_id"],
    "ratings": x["ratings"],
    "release":x["release"],
    "artist_name":x["artist_name"],
    "title":x["title"],
    "year":x["year"],
    "listencount":x["listen_count"],
    "count":x["count"],
})
songs = songs_ds.unbatch().map(lambda x: {
    "song_id":x["song_id"],
    "release":x["release"],
    "artist_name":x["artist_name"],
    "title":x["title"],
    "year":x["year"],
})

train = ratings.take(12000)
test = ratings.skip(12000).take(4000)

In this code, how can I ensure that the same user ID does not appear in both the train and test datasets? How can I avoid data leakage from train to test?

I tried sorting the CSV file, but the sort order is lost when the file is read into a TensorFlow dataset.

rcauvin · August 13, 2022, 9:35pm

Are you preparing to train and test a recommender model?

lgusm · August 15, 2022, 9:21pm

Off the top of my head,

I’d preprocess this data (in pandas) and save separate files for train and test, split by whatever group you want; in your case, by user_id.

This makes it easier to keep experimenting with the model later, since the data is already properly split into files.
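To make the suggestion above concrete, here is a minimal sketch of a split by user_id in pandas. The helper name split_by_user, the 25% test fraction, the toy rows, and the output file names are all assumptions for illustration; only the user_id column and the input path come from the question.

```python
import random

import pandas as pd


def split_by_user(df, test_fraction=0.25, seed=42):
    """Split rows so each user_id lands entirely in train or entirely in test."""
    # Sort first so the shuffle is reproducible for a given seed.
    users = sorted(df["user_id"].unique().tolist())
    random.Random(seed).shuffle(users)
    n_test = max(1, round(len(users) * test_fraction))
    test_users = users[:n_test]
    in_test = df["user_id"].isin(test_users)
    return df[~in_test], df[in_test]


# Toy stand-in for the ratings file (made-up values, not real data).
toy = pd.DataFrame({
    "user_id": ["u1", "u1", "u2", "u3", "u3", "u4"],
    "song_id": ["s1", "s2", "s1", "s3", "s4", "s2"],
    "ratings": [5, 3, 4, 2, 5, 1],
})
train_df, test_df = split_by_user(toy)

# With the real file, something like the following (output names are hypothetical):
# df = pd.read_csv("./train_recom_transformed.csv")
# train_df, test_df = split_by_user(df)
# train_df.to_csv("./train_split.csv", index=False)
# test_df.to_csv("./test_split.csv", index=False)
```

Each file can then be loaded with its own make_csv_dataset call. Note that make_csv_dataset shuffles rows by default (shuffle=True), which is probably why the sorted order was lost; passing shuffle=False keeps the file order.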

mksakeesh · August 18, 2022, 1:28am

Yes, I am trying to build a recommendation model.

rcauvin · August 18, 2022, 4:20pm

I’m not quite sure why you don’t want the same user ID to appear in the train and test datasets. The ratings dataset represents ratings on user-item pairs. Thus the same user may appear multiple times in the dataset, rating different items. Typically, you want the model to learn the user’s item preferences in the train dataset and predict whether the same user will like a different item in the test dataset.
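As a rough illustration of that kind of split, the sketch below holds out a few ratings per user for test and keeps the rest for training, so every test user has also been seen in train. This is only one possible reading of the idea (a plain random split over rows is a simpler alternative); the helper name holdout_per_user, the n_holdout parameter, and the toy rows are assumptions, not something from the thread.

```python
import pandas as pd


def holdout_per_user(df, n_holdout=1, seed=42):
    """Hold out n_holdout randomly chosen rows per user for test.

    Users with n_holdout or fewer rows stay entirely in train, so every
    user in test also appears in train.
    """
    shuffled = df.sample(frac=1.0, random_state=seed)
    # Position of each row within its user's shuffled group: 0, 1, 2, ...
    rank = shuffled.groupby("user_id").cumcount()
    # Number of rows each user has, repeated on every row of that user.
    group_size = shuffled.groupby("user_id")["user_id"].transform("size")
    in_test = (rank < n_holdout) & (group_size > n_holdout)
    return shuffled[~in_test].sort_index(), shuffled[in_test].sort_index()


# Toy stand-in for the ratings data (made-up values, not real data).
toy_ratings = pd.DataFrame({
    "user_id": ["u1", "u1", "u2", "u3", "u3", "u4"],
    "song_id": ["s1", "s2", "s1", "s3", "s4", "s2"],
    "ratings": [5, 3, 4, 2, 5, 1],
})
train_part, test_part = holdout_per_user(toy_ratings)
```

Here the same user appears on both sides by design; what stays separate is the individual user-song rating, which is the thing the model is asked to predict.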