I have a dataset defined as follows:
import tensorflow as tf

user_ids = ["user1", "user2", "user3", "user3", "user1", "user2", "user1", "user2", "user3", "user1"]
item_ids = ["item1", "item2", "item1", "item1", "item2", "item3", "item2", "item2", "item1", "item1"]
ratings = [0, 1, 1, 1, 0, 0, 0, 1, 0, 1]
ds = tf.data.Dataset.from_tensor_slices({"user_id": user_ids, "item_id": item_ids, "rating": ratings})
I want to compute a dataset result_ds that contains each item ID together with the number of unique users who rated that item 1. The solution needs to handle tens of millions of records, not just this small example.
The result should look something like:
for element in result_ds:
    print(element)
{'item_id': <tf.Tensor: shape=(), dtype=string, numpy=b'item1'>, 'positive_count': <tf.Tensor: shape=(), dtype=int64, numpy=2>}
{'item_id': <tf.Tensor: shape=(), dtype=string, numpy=b'item2'>, 'positive_count': <tf.Tensor: shape=(), dtype=int64, numpy=2>}
{'item_id': <tf.Tensor: shape=(), dtype=string, numpy=b'item3'>, 'positive_count': <tf.Tensor: shape=(), dtype=int64, numpy=1>}
I tried using ds.group_by_window but couldn’t get it to work.
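To make the intended semantics concrete, here is a minimal plain-Python sketch of the counting I have in mind. It is not a tf.data solution and it holds everything in memory, so it would not scale to tens of millions of records. The function name positive_user_counts is made up for illustration, and the sketch assumes that items with no rating of 1 are simply left out of the result.

```python
from collections import defaultdict


def positive_user_counts(user_ids, item_ids, ratings):
    """Count distinct users per item among rows whose rating is 1.

    Hypothetical helper for illustration only; items that never
    received a rating of 1 are omitted (an assumption).
    """
    users_by_item = defaultdict(set)
    for user, item, rating in zip(user_ids, item_ids, ratings):
        if rating == 1:
            # A set keeps each user once, so repeated (user, item) rows
            # do not inflate the count.
            users_by_item[item].add(user)
    return {item: len(users) for item, users in users_by_item.items()}


# Tiny made-up input, separate from the dataset above.
counts = positive_user_counts(
    ["a", "a", "b", "c"],
    ["x", "x", "x", "y"],
    [1, 1, 1, 0],
)
print(counts)  # prints {'x': 2}
```

What I am after is the tf.data equivalent of these steps: keep only rows with rating 1, deduplicate (item, user) pairs, then count users per item.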