Generating Embeddings from Nested Sequences

Data: lists of ordered sequences, some of which contain nested sequences (as in the example below).

I have input sequences that contain not just single elements, but also lists.

E.g.: i1 = [2, 5, 10, [1, 7, 9, 20], 11, 32]

If it were just a flat sequence like [2, 4, 6, 7] with no nesting, I would pass it directly to the embedding layer. But in my case, that's not possible.
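For a flat, already-padded sequence, that direct approach looks like this (a minimal sketch; the vocabulary size of 13 and embedding size of 50 are just illustrative):

```python
import tensorflow as tf

# Flat case: one padded sequence of integer ids, shape (1, 4).
emb = tf.keras.layers.Embedding(input_dim=13, output_dim=50, mask_zero=True)
flat = tf.constant([[2, 4, 6, 7]])
print(emb(flat).shape)  # (1, 4, 50)
```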

The elements in my sequences are ordered by their date/time of occurrence.

So, for each ID, I have a sequence of ordered events.
Sometimes, multiple events occur on the same day, which leads to nested lists.

For example, consider the sequence [A, B, [D, C, I, K], M]

This means event A occurred on day 1, event B on day 2, events [D, C, I, K] on day 3, and event M on day 4.
So, given the sequence of events for each unique ID, my goal is to predict the next event (or set of events) with an LSTM model.

So far, I have converted the events (represented as text) into integer tokens and then obtained their count-vector/one-hot representations.

But I'm having trouble getting embeddings from such an input representation.

Embedding layers in TF/Keras accept only integer tokens as input, not one-hot vectors.
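(For a flat sequence I can work around this by recovering the integer ids from the one-hot vectors with argmax — a sketch below, with made-up shapes — but that still doesn't address the nesting.)

```python
import tensorflow as tf

# Recover integer ids from one-hot vectors, then embed as usual.
one_hot = tf.one_hot([[1, 2, 6, 0]], depth=13)  # (1, 4, 13)
ids = tf.argmax(one_hot, axis=-1)               # (1, 4) integer ids
emb = tf.keras.layers.Embedding(input_dim=13, output_dim=50)
print(emb(ids).shape)                           # (1, 4, 50)
```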

So could someone please tell me how to get embeddings for such an input representation?

Could someone please provide a simple working example for sample sequences like these?

  1. [A, B, [D, C, E], L]
  2. [[S, T, B], M]
  3. [M, N, [L, U]]
  4. [A, B, L]

where A, B, C, etc. are events represented as text. Let's say I want to represent each event with an embedding vector of size 50 or 100, with padding length = 4. In that case, my input shape should be (None, 4), and the output of the embedding layer should be (None, 4, 50) or (None, 4, 100) depending on the vector size (None = batch size).
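To make those shapes concrete, here is a quick check for the flat case (again a sketch, with an assumed vocabulary size of 13):

```python
import tensorflow as tf

inputs = tf.keras.Input(shape=(4,), dtype="int32")             # (None, 4)
x = tf.keras.layers.Embedding(input_dim=13, output_dim=50)(inputs)
model = tf.keras.Model(inputs, x)
model.summary()  # embedding output shape: (None, 4, 50)
```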

With integer tokens:

A-1, B-2, C-3, D-4, E-5, L-6, M-7, N-8, P-9, S-10, T-11, U-12

The padded sequences would look like this:

1. [1, 2, [4, 3, 5], 6]
2. [[10, 11, 2], 7, 0, 0]
3. [7, 8, [6, 12], 0]
4. [1, 2, 6, 0]

Now, could someone please help me get outputs from the embedding layer with shape (batch_size, seq_len, dim_len)?

Or are there better ways to represent my LSTM input, which contains nested sequences?
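For reference, one workaround I have been experimenting with (I'm not sure it is the right approach): pad the inner lists to a fixed length as well, so every time step becomes a fixed-size "bag" of event ids, then mean-pool the embeddings within each bag. A sketch using the sample sequences above (the inner length of 3, vocabulary size of 13, and embedding size of 50 are just from my example):

```python
import numpy as np
import tensorflow as tf

# The four padded sequences, with each time step also padded to 3 event
# ids (0 = padding). Shape: (batch=4, seq_len=4, max_events_per_day=3).
data = np.array([
    [[1, 0, 0], [2, 0, 0], [4, 3, 5], [6, 0, 0]],    # [A, B, [D, C, E], L]
    [[10, 11, 2], [7, 0, 0], [0, 0, 0], [0, 0, 0]],  # [[S, T, B], M]
    [[7, 0, 0], [8, 0, 0], [6, 12, 0], [0, 0, 0]],   # [M, N, [L, U]]
    [[1, 0, 0], [2, 0, 0], [6, 0, 0], [0, 0, 0]],    # [A, B, L]
])

emb = tf.keras.layers.Embedding(input_dim=13, output_dim=50)
e = emb(data)                                              # (4, 4, 3, 50)

# Mask-aware mean over the inner axis, so padding ids don't dilute days
# that have fewer than 3 events.
mask = tf.cast(data != 0, tf.float32)[..., tf.newaxis]     # (4, 4, 3, 1)
summed = tf.reduce_sum(e * mask, axis=2)                   # (4, 4, 50)
counts = tf.maximum(tf.reduce_sum(mask, axis=2), 1.0)      # avoid 0-division
pooled = summed / counts                                   # (4, 4, 50)

print(pooled.shape)  # (4, 4, 50) == (batch_size, seq_len, dim_len)
# `pooled` can then be fed to an LSTM, e.g. tf.keras.layers.LSTM(64)(pooled).
```

Is this a sensible way to do it, or would something like tf.RaggedTensor be preferable for the variable number of events per day?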

Hi @Bharathi_A

Welcome to the TensorFlow Forum!

Could you please tell us whether this issue is resolved or still persists? If it persists, please share reproducible code so we can replicate and understand the issue better. Thank you.