Debugging a new RL manipulation environment

Eugene Frizza · March 3, 2026

I remember reading somewhere that a good way to validate a new simulation environment is to train a simple RL policy in it. If it can’t learn simple behaviour, you’re probably not going to get other policies (behavioural cloning, VLA) or more complex tasks working well either. Because mjlab had a reference implementation for the YAM arm, I decided to port it over to the XArm Lite6 that I’ve been trying to use, and have had issues with, in previous posts.

First implementation

The initial port of the YAM environment to my Lite6 environment was really easy, with Claude doing the grunt work. Unfortunately, on first run it produced no motion at all. This was easy to investigate with a random policy (--agent random in the mjlab play script), which also produced no visible arm motion. The issue was simply that the action scaling was too small (action = scale * raw_action_from_actor + offset). The scaling on the YAM was calculated from each actuator’s effort limit and stiffness. Because the actions are joint positions, I opted to scale by the joint ranges instead: actor output in [-1, 1] maps to [joint_min, joint_max] for each joint. Now we were getting visible output - time to learn!
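The joint-range scaling is simple enough to sketch. The joint limits below are made-up placeholders for illustration; the real values come from the Lite6 model:

```python
import numpy as np

# Hypothetical joint limits in radians (illustrative only; the real
# values are read from the Lite6 model's joint ranges).
joint_min = np.array([-2.0, -1.5, -1.5])
joint_max = np.array([ 2.0,  1.5,  1.5])

def scale_action(raw_action: np.ndarray) -> np.ndarray:
    """Map actor output in [-1, 1] to absolute joint position targets.

    scale = half the joint range, offset = the range midpoint, so -1
    maps to joint_min and +1 maps to joint_max.
    """
    offset = (joint_max + joint_min) / 2.0
    scale = (joint_max - joint_min) / 2.0
    return scale * np.clip(raw_action, -1.0, 1.0) + offset
```

With symmetric limits the offset is zero, so a zero action commands the middle of each joint's range.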

Training

Because RL is difficult to debug, it is common practice to start with the simplest possible problem and make sure the policy can learn it before moving to anything harder. This meant forgoing vision and simply providing the actor with the relative distance from the end effector to the cube. If it can’t learn from this, there’s no way it can learn from pixels. It’s also much quicker to train and iterate on, since it avoids rendering a camera in simulation and makes the actor network far smaller.
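The state-based observation can be sketched as below. The exact layout is illustrative; the real mjlab environment assembles its observation terms differently:

```python
import numpy as np

def make_observation(ee_pos, cube_pos, joint_pos, joint_vel):
    """Minimal pixel-free observation: the relative end-effector-to-cube
    vector plus proprioception (joint positions and velocities).

    Shapes here are illustrative; the key point is that the cube's
    location is given directly rather than inferred from a camera.
    """
    return np.concatenate([cube_pos - ee_pos, joint_pos, joint_vel])
```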

During the first training run, the policy was unable to complete the task: picking up the cube. As is commonly the case in RL environments, there were no obvious clues as to what was going wrong or how to pinpoint the issues. The observed behaviour:

  • Noisy/jittery actions
  • End effector would come close to the cube but never lift it

There were a few straightforward bugs that don’t bear much explanation:

  • The cube was too large for the tiny grippers on the model (> make the cube smaller)
  • The gripper wasn’t part of the action space (no wonder it didn’t pick anything up) - the downsides of trusting an LLM
  • Joint end range penalties were being applied to the gripper - these grippers needed to be able to fully open and close without being penalised.
  • I reduced the action scaling by a factor of 8 to reduce jitter - it didn’t remove it but it was a better starting point for further debugging.

However, it still wasn’t learning the task. As I changed parameters, I was also getting some training runs that crashed due to NaN/Inf values appearing in the simulation state and corrupting the action sampling.

MuJoCo comes with NaN-guard functionality, which dumps the past 100 timesteps leading up to a NaN/Inf value so it can be diagnosed. I could see that the crashes occurred when the robot was self-intersecting. I had previously made the gripper contact constraints very stiff on this model, as I always had issues with excessive penetration between the grippers and other objects. I decided to stop self-intersection altogether by making it a termination condition.
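A self-collision termination check boils down to asking whether any active contact is between two geoms that both belong to the robot. A minimal sketch, with the contact list passed in as plain id pairs (mjlab expresses this as a termination term, and the real check reads MuJoCo's contact array):

```python
def self_collision(contact_pairs, robot_geom_ids):
    """Return True if any active contact pairs two robot geoms.

    contact_pairs: iterable of (geom1_id, geom2_id) tuples, e.g. built
    from MuJoCo's data.contact. robot_geom_ids: set of geom ids that
    belong to the arm. Contacts between the arm and other objects
    (cube, table) do not count.
    """
    return any(g1 in robot_geom_ids and g2 in robot_geom_ids
               for g1, g2 in contact_pairs)
```

Episodes terminating on this condition also stop the policy from ever being rewarded while self-intersecting.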

This helped a lot - and clearly stopped the policy from learning self-intersecting behaviour - but the NaN blowups still occurred occasionally. Further dumps showed the gripper squeezing the block against the robot base, and the block then shooting out at high velocity under large contact forces. This was another sign to turn down the contact stiffness I had cranked up in earlier experiments with this environment, when I saw a lot of penetration at the grippers.

Contact dynamics

Digging in further, I could see these penetration issues persisted, and I was unable to get good enough contacts and friction for the gripper to pick up the cube at all, even when operating the model manually (without an RL agent). This told me something was wrong in the model.

I made a couple of discoveries that helped:

  • Simplifying the gripper geometry to a box primitive, instead of an STL mesh, made the contacts behave far better. See this video:
STL-model gripper on the left - notice how far the grippers penetrate the cube, and how different the friction forces are. Grippers made from box primitive on the right. Both have the same contact solver and friction parameters.
  • Changing the MuJoCo friction cone to elliptic and increasing impratio to >10. The default pyramidal friction cone approximates the true cone in favour of computational efficiency - a reasonable tradeoff for locomotion tasks, but not for manipulation, where contact forces are critical. impratio sets the ratio of frictional to normal constraint impedance - increasing it above 1 makes the friction constraints harder and reduces slip, and values of 10-100 are commonly used. You can see over the course of the video that it doesn’t entirely prevent slip, but slows it down enough not to be an issue. More details in the MuJoCo docs here.
Elliptical friction cone and impratio set to 30
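In MJCF both of these are one-line changes on the model's `option` element, something like:

```xml
<mujoco>
  <!-- Elliptic friction cone instead of the default pyramidal one, and a
       high impratio to stiffen friction relative to normal contact forces. -->
  <option cone="elliptic" impratio="30"/>
</mujoco>
```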

I could then reduce the stiffness of the contacts by setting the solref and solimp parameters back to their defaults, avoiding the explosive contact forces and NaN blowups. The blowups were now happening very infrequently, and I could keep training - but the task still wasn’t being solved.
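For reference, "back to default" means MuJoCo's stock solver parameters. The geom name and size below are placeholders; only the solref/solimp values are MuJoCo's actual defaults:

```xml
<!-- Default contact solver parameters, restored on the gripper geoms after
     the previously stiffened values caused explosive contact forces.
     (Geom name and size are illustrative, not from the real model.) -->
<geom name="gripper_pad" type="box" size="0.01 0.02 0.005"
      solref="0.02 1" solimp="0.9 0.95 0.001 0.5 2"/>
```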

Action jitter

The next crucial piece of the puzzle was reducing the action jitter from the actor. An arm vibrating while trying to pick up the cube seemed unlikely to produce the stable frictional contact required for lifting.

Jittery actions

The jitter was not coming from any joint in particular - every joint had it. My first thought was to penalise it in the rewards. The relevant existing reward terms were:

  • action_rate_l2 - which penalises change in action
  • joint_vel_hinge - which penalises joint velocity over a threshold (0.5 rad/s by default)

This would be a tradeoff between allowing the robot to be dynamic and reducing its motion. I figured I may as well reduce the joint_vel_hinge threshold to 0, so that these small jitters were penalised:

Reducing joint_vel_hinge threshold to 0 so all velocities are penalised helped a bit - but still not good enough.

This definitely helped but did not eliminate the jitter. It is also a squared loss term, so all these small jitters well under 1 rad/s incur a quadratically small penalty - not well suited to the problem I was trying to solve.
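A sketch of the squared-hinge velocity penalty as described above (mjlab's actual implementation may differ in details like per-joint weighting), which also shows why sub-threshold jitter is cheap:

```python
import numpy as np

def joint_vel_hinge_penalty(joint_vel, threshold=0.5):
    """Squared hinge penalty on joint velocities (rad/s): only the speed
    above `threshold` is penalised, and quadratically, so small
    velocities contribute very little even with threshold=0.
    """
    excess = np.maximum(np.abs(joint_vel) - threshold, 0.0)
    return float(np.sum(excess ** 2))
```

With the default threshold, a jitter of 0.3 rad/s costs nothing; with threshold=0 it costs only 0.09 per joint, which is why this term alone could not kill the jitter.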

The other option was to reduce the PID gains of the model’s actuators. Tight gains are useful when doing motion planning, so the end effector reaches the commanded position quickly and accurately. When doing reinforcement learning, however, the policy can learn to compensate for loose gains by “overcorrecting” the commanded position, and loose gains bring additional benefits:

  • there’s built-in compliance as the low-level PID loop is not stiff
  • it helps with exploration

Some wisdom from Kyle Morgenstein on this.
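In MJCF this is just lowering the gains on the position actuators. The joint names and gain values below are illustrative, not the Lite6's real numbers:

```xml
<!-- Position actuators with deliberately loose gains (values illustrative).
     A stiff motion-planning setup might use a much higher kp; lowering it
     adds compliance and lets the policy overcorrect its targets. -->
<actuator>
  <position joint="joint1" kp="40" kv="4"/>
  <position joint="joint2" kp="40" kv="4"/>
</actuator>
```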

With this I finally got a policy that learned!

The final policy learned - success!

Conclusion

I spent a lot of time messing around with reward shaping, mostly to try and get the robot to pick up the cube instead of just hovering close to it. I added an extra reward term for closing the grippers when near the cube, changed the standard deviation on the reward terms to encourage the gripper to come closer to the cube and to lift higher, played around with the weights, etc. After solving the underlying environment issues, all of these turned out to be red herrings, and the policy learned best with the default rewards. That said, having specific rewards to explicitly force behaviour, like closing the grippers, was a useful tool for investigating why it would not learn that behaviour, and led to the more fundamental discoveries. If I hadn’t started from a reference implementation for this task I probably would have spent all my time reward shaping, so I’m grateful to the mjlab team for this!
