Generating Embeddings from Nested Sequences

Data: lists of ordered sequences, some of which may contain nested sequences (as in the example below).

I have input sequences that contain not just single elements, but also lists.

E.g.: i1 = [2, 5, 10, [1, 7, 9, 20], 11, 32]

If it were just a flat sequence like [2, 4, 6, 7], with no nesting, I would pass it directly to the embedding layer. But in my case, that's not possible.
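For the flat case, here is a minimal sketch of what I mean (the vocabulary size and embedding dimension are placeholders):

```python
import numpy as np
import tensorflow as tf

# A flat sequence of integer tokens can be passed straight to the layer
seq = np.array([[2, 4, 6, 7]])  # shape (1, 4): one sequence of four tokens
emb = tf.keras.layers.Embedding(input_dim=20, output_dim=8)  # placeholder sizes
out = emb(seq)
print(out.shape)  # (1, 4, 8): one embedding vector per token
```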

The elements in my sequences are ordered by their date/time of occurrence.

So, for each ID, I have a sequence of ordered events.
Sometimes, multiple events occur on the same day, which leads to nested lists.

For example, consider the sequence [A, B, [D, C, I, K], M]

This means event A occurred on day 1, event B on day 2, events [D, C, I, K] on day 3, and event M on day 4.
So, given the ordered sequence of events for each unique ID, my goal is to predict the next event (or set of events) with an LSTM model.

So far, I have converted these events, represented as text, into integer tokens, and subsequently obtained their count-vector/one-hot representations.

But I'm having trouble getting embeddings from such an input representation.

Embedding layers in TF/Keras accept only integer tokens as input, not one-hot vectors.
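A small sketch of what goes wrong if a one-hot row is fed in anyway; as far as I can tell, the layer just casts each 0/1 entry to a token id, so an extra axis appears and the lookups are meaningless (the layer sizes are placeholders):

```python
import numpy as np
import tensorflow as tf

emb = tf.keras.layers.Embedding(input_dim=13, output_dim=4)  # 12 events + padding 0

# Integer tokens: this is what the layer expects
tokens = np.array([[1, 2, 6, 0]])      # "[A, B, L]" padded, shape (1, 4)
print(emb(tokens).shape)               # (1, 4, 4)

# One-hot vectors: each 0/1 entry is treated as its own token id
one_hot = tf.one_hot(tokens, depth=13)  # shape (1, 4, 13)
print(emb(one_hot).shape)               # (1, 4, 13, 4) -- not what I want
```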

So, could someone please tell me how to get embeddings for such an input representation?

Could someone please provide a simple working example for sample sequences like these?

  1. [A, B, [D, C, E], L]
  2. [[S,T,B], M]
  3. [M, N, [L,U]]
  4. [A, B, L]

Where A, B, C, … are events represented by text. Let's say I want to represent each event by an embedding vector of size 50 or 100, with a padding length of 4. In that case, my input shape should be (None, 4), and the output of the embedding layer should be (None, 4, 50) or (None, 4, 100), depending on the vector size (None = batch size).
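That shape contract is easy to check for the flat case (a sketch; the vocabulary size of 13 is an assumption, my 12 events plus the padding token 0):

```python
import tensorflow as tf

inp = tf.keras.Input(shape=(4,), dtype="int32")  # (None, 4): padded token ids
x = tf.keras.layers.Embedding(input_dim=13, output_dim=50, mask_zero=True)(inp)
print(x.shape)  # (None, 4, 50)
```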

With integer tokens:

A-1, B-2, C-3, D-4, E-5, L-6, M-7, N-8, P-9, S-10, T-11, U-12

The padded sequences would look like this:

1. [1, 2, [4, 3, 5], 6]
2. [[10, 11, 2], 7, 0, 0]
3. [7, 8, [6, 12], 0]
4. [1, 2, 6, 0]

Now, could someone please help me get outputs from the embedding layer of shape (batch_size, seq_len, embedding_dim)?

Or are there better ways to represent LSTM input that contains nested sequences?
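For reference, here is one direction I have been sketching (I'm not sure it is the right one): pad each day to a fixed inner width and mean-pool the event embeddings within each day, which yields exactly the (batch_size, seq_len, dim) shape above. The inner width of 3 and the vocabulary size of 13 are assumptions for this example:

```python
import numpy as np
import tensorflow as tf

# Nested padded sequences: outer length 4 (days), inner width 3 (events per day),
# 0 = padding, using the integer tokens from above.
x = np.array([
    [[1, 0, 0], [2, 0, 0], [4, 3, 5], [6, 0, 0]],    # [A, B, [D, C, E], L]
    [[10, 11, 2], [7, 0, 0], [0, 0, 0], [0, 0, 0]],  # [[S, T, B], M]
    [[7, 0, 0], [8, 0, 0], [6, 12, 0], [0, 0, 0]],   # [M, N, [L, U]]
    [[1, 0, 0], [2, 0, 0], [6, 0, 0], [0, 0, 0]],    # [A, B, L]
])                                                   # shape (4, 4, 3)

emb = tf.keras.layers.Embedding(input_dim=13, output_dim=50)
e = emb(x)                                           # (4, 4, 3, 50)

# Average the event embeddings within each day, ignoring the 0 padding
mask = tf.cast(x != 0, tf.float32)[..., None]        # (4, 4, 3, 1)
day_emb = tf.reduce_sum(e * mask, axis=2) / tf.maximum(
    tf.reduce_sum(mask, axis=2), 1.0)                # (4, 4, 50)
print(day_emb.shape)  # (4, 4, 50) = (batch_size, seq_len, dim)
```

The per-day vectors could then feed an LSTM directly; days that are pure padding come out as all-zero vectors.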