Create a custom reward function in RL

Hi, I am working on an RL model in TensorFlow, specifically a pointer network that outputs a sequence of indices. When training the model, I want to build a custom reward function where each element of the output sequence is passed through a separate function individually. For example, if the output is [1,2,3,4], I want to pass 1, 2, 3, and 4 individually to a function, say F, that gives a reward value for each of them. However, I get the error:

Cannot convert a symbolic Tensor (strided_slice_1:0) to a numpy array. This error may indicate that you’re trying to pass a Tensor to a NumPy call, which is not supported

I am not able to convert the output into a NumPy array that I can pass to the custom function. I have seen that this can be done directly in PyTorch, but I tried everything I could find on Stack Overflow and elsewhere and could not figure out how to do it in TensorFlow. Let me know if someone can help with this. Some code:

Here I am getting the sequence of indices for a batch:

    for step in range(1, self.max_length):             # sample from POINTER
        query = tf.nn.relu(tf.matmul(query1, W_1) + tf.matmul(query2, W_2) + tf.matmul(query3, W_3))
        logits = pointer(encoded_ref=encoded_ref, query=query, mask=self.mask_, W_ref=W_ref, W_q=W_q, v=v, C=self.C, temperature=self.temperature)
        prob = distr.Categorical(logits)               # logits = masked_scores
        idx = prob.sample()

        idx_list.append(idx)                           # tour index
        log_probs.append(prob.log_prob(idx))           # log prob
        entropies.append(prob.entropy())               # entropies
        self.mask_ = self.mask_ + tf.one_hot(idx, self.max_length) # mask
        
        idx_ = tf.stack([tf.range(self.batch_size,dtype=tf.int32), idx],1) # idx with batch   
        query3 = query2
        query2 = query1
        query1 = tf.gather_nd(actor_encoding, idx_)    # update trajectory (state)
        
    idx_list.append(idx_list[0])                       # return to start
    self.tour      = tf.stack(idx_list, axis=1)        # permutations

I want to pass this tour (which has size batch size x input dimension x dimension) to the reward function and get back reward values of size [batch].
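To make the goal concrete, here is a minimal sketch of one common way to call NumPy/Python code on a symbolic tensor: wrapping the function with `tf.numpy_function`. It assumes TensorFlow 2.x (older 1.x releases expose `tf.py_func` for the same purpose). `reward_per_index` is a hypothetical stand-in for F, and `tour_reward_np`, `tour_reward`, and `compute` are illustrative names that are not part of the code above. The wrapped function is not differentiable, so this only fits setups such as policy gradients where the reward is treated as a constant.

```python
import numpy as np
import tensorflow as tf


def reward_per_index(index):
    """Hypothetical stand-in for F: maps one index to one scalar reward."""
    return float(index) ** 2


def tour_reward_np(tour):
    """Plain NumPy/Python code; tour is an int array of shape [batch, tour_length]."""
    rewards = np.zeros(tour.shape[0], dtype=np.float32)
    for b, sequence in enumerate(tour):
        # Each index is passed to F individually, then summed per sequence.
        rewards[b] = sum(reward_per_index(i) for i in sequence)
    return rewards


def tour_reward(tour):
    """Wraps the NumPy function as a graph op returning a [batch] float32 tensor."""
    reward = tf.numpy_function(tour_reward_np, [tour], tf.float32)
    reward.set_shape([tour.shape[0]])  # numpy_function drops static shape info
    return tf.stop_gradient(reward)    # reward is treated as a constant


@tf.function  # graph mode, where calling NumPy on the tensor directly would fail
def compute(tour):
    return tour_reward(tour)


example_tour = tf.constant([[1, 2, 3, 4], [0, 2, 1, 0]], dtype=tf.int32)
example_reward = compute(example_tour)
```

With the squared-index stand-in for F, the two example tours get rewards 30 and 5. In the model above, the same idea would be applied as something like `tour_reward(self.tour)`, assuming the real F can run on NumPy values.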

Thank you! Any pointer or help is highly appreciated 🙂
