I am not sure if this is the right place for my question, but I did not really know where or whom to ask. Maybe there are some reinforcement learning experts in the forum who can suggest some resources.
I am currently experimenting with PPO in different environments. I am interested in learning policies that fulfill a certain goal while keeping a specific quantity low. Here's an example: using PPO on a cartpole environment to learn a swing-up of the pole while simultaneously keeping the angular velocity of the pole low. The standard approach is to include a penalty on the pole velocity in the reward function. However, I observed that penalizing the velocity from the beginning reduces sample efficiency significantly and hinders learning good policies. For this reason, I tried using only a small penalty on the pole velocity until PPO converges to a decent policy, and then applying a refinement step in which I penalize the torque much more to get good performance and low velocities. This seems to work better. I observed similar behaviour in other environments with a similar setup.
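To make the two-phase schedule concrete, it can be sketched roughly as below. This is only an illustration under assumptions: the names `penalty_weight`, `shaped_reward`, `switch_step`, `small_weight`, and `large_weight` are hypothetical, the weight values are placeholders, and a quadratic penalty is just one possible choice. `penalized_value` stands for whatever quantity is being kept low in the current phase (e.g. the pole's angular velocity or the torque).

```python
def penalty_weight(step, switch_step, small_weight=0.01, large_weight=1.0):
    """Return the penalty coefficient for the current training step.

    Before `switch_step` (assumed to be roughly where PPO has reached a
    decent policy) the penalty is small; afterwards, in the refinement
    phase, it is much larger. All values here are placeholders.
    """
    return small_weight if step < switch_step else large_weight


def shaped_reward(task_reward, penalized_value, step, switch_step,
                  small_weight=0.01, large_weight=1.0):
    """Task reward minus a quadratic penalty on the quantity to keep low."""
    weight = penalty_weight(step, switch_step, small_weight, large_weight)
    return task_reward - weight * penalized_value ** 2


if __name__ == "__main__":
    # Early in training the penalty barely changes the reward ...
    print(shaped_reward(1.0, 2.0, step=0, switch_step=100))
    # ... while in the refinement phase it dominates.
    print(shaped_reward(1.0, 2.0, step=200, switch_step=100))
```

In an actual experiment, a function like this would be called inside the environment's reward computation (or a reward wrapper), with `step` being the global training step counter.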
I want to find a (formal) reason for this behaviour, i.e. why penalizing velocities from the beginning hinders learning. Does anybody have literature tips on stochastic optimization or RL that could be useful? Or some resources on the topology of high-dimensional spaces? Or even an idea for an explanation of this behaviour?
Thanks for any help/tips.
All the best