Dataset Count and Group By

I have a dataset defined as follows:

import tensorflow as tf

user_ids = ["user1", "user2", "user3", "user3", "user1", "user2", "user1", "user2", "user3", "user1"]
item_ids = ["item1", "item2", "item1", "item1", "item2", "item3", "item2", "item2", "item1", "item1"]
ratings = [0, 1, 1, 1, 0, 0, 0, 1, 0, 1]
ds = tf.data.Dataset.from_tensor_slices({"user_id": user_ids, "item_id": item_ids, "rating": ratings})

I want to compute a dataset result_ds containing each item ID and the number of unique users who rated the item a '1'. The solution should be able to handle tens of millions of records, not just this small example.

The result should look something like:

for element in result_ds:
  print(element)

{'item_id': <tf.Tensor: shape=(), dtype=string, numpy=b'item1'>, 'positive_count': <tf.Tensor: shape=(), dtype=int64, numpy=2>}
{'item_id': <tf.Tensor: shape=(), dtype=string, numpy=b'item2'>, 'positive_count': <tf.Tensor: shape=(), dtype=int64, numpy=1>}
{'item_id': <tf.Tensor: shape=(), dtype=string, numpy=b'item3'>, 'positive_count': <tf.Tensor: shape=(), dtype=int64, numpy=0>}

I tried using ds.group_by_window but couldn’t get it to work.

Hi @rcauvin, I tried to achieve the desired output with an alternative approach. Please find the workaround below:

I have created a dataset using

user_ids = ['user1', 'user2', 'user3', 'user3', 'user1', 'user2', 'user1', 'user2', 'user3', 'user1']
item_ids = ['item1','item2', 'item1', 'item1', 'item2', 'item3', 'item2', 'item2', 'item1', 'item1']
ratings = [0, 1, 1, 1, 0, 0, 0, 1, 0, 1]
ds = tf.data.Dataset.from_tensor_slices({'user_id': user_ids, 'item_id': item_ids, 'rating': ratings})

After creating the dataset, I extracted all the positive ratings using

positive_ds = ds.filter(lambda x: x["rating"] == 1)

Next, I collected the item IDs and ratings into two separate lists:

binary_item_list, rating = [], []
for element in positive_ds.as_numpy_iterator():
  binary_item_list.append(element['item_id'])
  rating.append(element['rating'])

#output
binary_item_list [b'item2', b'item1', b'item1', b'item2', b'item1']
rating [1, 1, 1, 1, 1]

Then I computed each item's total positive count using

from collections import defaultdict

result_dict = defaultdict(int)
item_ratings = []

for item_a, item_b in zip(binary_item_list, rating):
    result_dict[item_a] += item_b

for item, count in result_dict.items():
    item_ratings.append((item.decode('utf-8'), count))

#output
item_ratings [('item2', 2), ('item1', 3)]
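As an aside, the zip/defaultdict bookkeeping can be collapsed into a single pass with `collections.Counter`, since each positive rating contributes exactly one count. A small sketch using the `binary_item_list` values shown above:

```python
from collections import Counter

# The positive item IDs collected earlier; each entry is one positive rating.
binary_item_list = [b'item2', b'item1', b'item1', b'item2', b'item1']

# Counting occurrences gives the positive total per item directly.
counts = Counter(binary_item_list)
item_ratings = [(item.decode('utf-8'), count) for item, count in counts.items()]
# item_ratings -> [('item2', 2), ('item1', 3)]
```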

Finally, I split those pairs back out and created a dataset using

items, ratings = [], []
for item, rating in item_ratings:
     items.append(item)
     ratings.append(rating)
ds = tf.data.Dataset.from_tensor_slices({'item_id': items, 'positive_rating': ratings})

The elements in the dataset look like

{'item_id': <tf.Tensor: shape=(), dtype=string, numpy=b'item2'>, 'positive_rating': <tf.Tensor: shape=(), dtype=int32, numpy=2>}
{'item_id': <tf.Tensor: shape=(), dtype=string, numpy=b'item1'>, 'positive_rating': <tf.Tensor: shape=(), dtype=int32, numpy=3>}

Please refer to this gist for a working code example. Thank you!


Thank you, Kiran. Your solution is elegant and works well on small, in-memory datasets. I want a solution that scales to tens of millions of records that may not all be in memory at once, so I was hoping for a solution that leverages graph execution. The output should also include items that received no positive ratings.

I was able to figure out a solution using group_by_window, though I’m not sure it’s using batching in the optimal way:

user_ids = tf.constant(["user1", "user2", "user3", "user3", "user1", "user2", "user1", "user2", "user3", "user1"])
item_ids = tf.constant(["item1", "item2", "item1", "item1", "item2", "item3", "item2", "item2", "item1", "item1"])
ratings = tf.constant([0, 1, 1, 1, 0, 0, 0, 1, 0, 1])

ds = tf.data.Dataset.from_tensor_slices({"user_id": user_ids, "item_id": item_ids, "rating": ratings})

# tf.data.Dataset.group_by_window uses integer keys, but our item IDs are strings.
# So we define a function that maps an item ID string to its index in an array of unique item IDs.
item_lookup = lambda item_id: tf.squeeze(tf.where(tf.equal(tf.unique(item_ids).y, item_id)))

batch_size = ds.cardinality() # Batches of users and ratings will never exceed the cardinality of the dataset.

# Function to compute the popularity of an item. Item popularity is the
# ratio of the number of users who rated the item positively to the
# number of users who had an opportunity to rate it.
def reduce_fn(
  key,
  window):

  batched_window = window.map(lambda e: (e["item_id"], e["user_id"], e["rating"])).batch(batch_size)
  counted_window = batched_window.map(lambda item_id, user_id, rating:
    {
      "item_id": item_id[0], # All item IDs in the window are the same, so pick the first one.
      "positive_user_count": tf.size(tf.unique(tf.boolean_mask(user_id, tf.cast(rating, tf.bool))).y), # Users who rated the item positively.
      "user_count": tf.size(tf.unique(user_id).y), # Users who had the opportunity to rate the item.
      "popularity": tf.size(tf.unique(tf.boolean_mask(user_id, tf.cast(rating, tf.bool))).y) / tf.size(tf.unique(user_id).y) # Ratio of the two.
    }
  )
    
  return counted_window

item_count = len(tf.unique(item_ids).y)
print(f"Computing the popularity of {item_count} items using window size {batch_size}.", end = " ")
grouped_ds = ds.group_by_window(
  key_func = lambda elem: item_lookup(elem["item_id"]),
  reduce_func = reduce_fn,
  window_size = batch_size
)
print("Done.")

print('Results:')
for element in grouped_ds:
  print(element)

Output:

Computing the popularity of 3 items using window size 10. Done.
Results:
{'item_id': <tf.Tensor: shape=(), dtype=string, numpy=b'item1'>, 'positive_user_count': <tf.Tensor: shape=(), dtype=int32, numpy=2>, 'user_count': <tf.Tensor: shape=(), dtype=int32, numpy=2>, 'popularity': <tf.Tensor: shape=(), dtype=float64, numpy=1.0>}
{'item_id': <tf.Tensor: shape=(), dtype=string, numpy=b'item2'>, 'positive_user_count': <tf.Tensor: shape=(), dtype=int32, numpy=1>, 'user_count': <tf.Tensor: shape=(), dtype=int32, numpy=2>, 'popularity': <tf.Tensor: shape=(), dtype=float64, numpy=0.5>}
{'item_id': <tf.Tensor: shape=(), dtype=string, numpy=b'item3'>, 'positive_user_count': <tf.Tensor: shape=(), dtype=int32, numpy=0>, 'user_count': <tf.Tensor: shape=(), dtype=int32, numpy=1>, 'popularity': <tf.Tensor: shape=(), dtype=float64, numpy=0.0>}