What broke when I tried LLMs on LunarLander

TLDR: I took the CartPole setup from the last post and moved one notch up to LunarLander. The easy recipe broke. API models were still too slow, raw local LLMs could not fly the lander once their actions changed the next state they had to deal with, and LoRA SFT never got off the ground. The thing that finally worked was a more classical imitation-learning stack: full SFT, better data, and DAgger. The ending is a little awkward: the model solved LunarLander, but stopped being useful as a language model.


GitHub repo: code to reproduce the main results.

In the CartPole post, the arc was tidy: API inference was too slow, local inference was fast enough, a pretrained Gemma 3 1B got surprisingly close, and a small LoRA SFT run made it reliable.

That was encouraging, but also suspiciously easy. CartPole is a two-action task with a nearly linear rule hiding inside it, so the next question was obvious: what breaks if I turn the difficulty up a little? To find out, I moved to LunarLander-v3.

LunarLander is still not a serious robotics benchmark. It is a toy Gymnasium environment. But it adds just enough extra structure to stop the CartPole trick from being free. The state grows from 4 values to 8, the action space grows from 2 choices to 4, the reward is shaped rather than being mostly a count of steps survived, fuel costs matter, "landed" and "solved" are not the same thing, and there is a native no-op action.

One caveat up front: if your only goal is to solve LunarLander, do not use an LLM. Use PPO, the built-in heuristic, or a tiny neural net. I used an LLM here for the same reason I used one on CartPole: it is the wrong tool in a controlled way, which makes it useful for learning what breaks.

Real-time setup

I kept the real-time framing from CartPole.

LunarLander's Box2D simulator advances at FPS = 50. That means one physics step is 1 / 50 seconds, or 20 ms. To make latency matter, I matched wall-clock time to that physics clock: every 20 ms, the environment steps once and needs an action.

If the model has not answered by the next tick, I still have to do something. I tested two choices: "no-op", which applies action 0 (do nothing) while waiting, and "hold-last", which keeps applying the previous action until a new one arrives.

This mattered more than I expected. With a good policy, hold-last is forgiving because thrust continues between model calls; with a bad policy, hold-last is awful because it sustains the bad action. If the model says "fire main engine" when it should not, hold-last keeps burning fuel.
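To make the two wait rules concrete, here is a minimal sketch of which action gets applied on each 20 ms tick. The function name applied_actions and the arrivals dictionary are illustrative assumptions, not the harness from the repo. Action 0 is LunarLander's do-nothing action and action 2 fires the main engine.

```python
NOOP = 0  # LunarLander's "do nothing" action


def applied_actions(arrivals, n_ticks, mode="noop"):
    """Return the action applied at each 20 ms tick.

    arrivals maps a tick index to the action that arrived in time for it.
    mode "noop" applies action 0 on ticks with no fresh answer;
    mode "hold" repeats the most recent answer instead.
    """
    if mode not in ("noop", "hold"):
        raise ValueError(f"unknown mode: {mode}")
    applied = []
    last = NOOP  # nothing has arrived yet, so start from no-op
    for tick in range(n_ticks):
        if tick in arrivals:
            last = arrivals[tick]
            applied.append(last)
        elif mode == "hold":
            applied.append(last)
        else:
            applied.append(NOOP)
    return applied


# A model that answers "fire main engine" (action 2) once every 5 ticks (100 ms).
arrivals = {0: 2, 5: 2}
print(applied_actions(arrivals, 8, "noop"))  # [2, 0, 0, 0, 0, 2, 0, 0]
print(applied_actions(arrivals, 8, "hold"))  # [2, 2, 2, 2, 2, 2, 2, 2]
```

The same two answers produce two very different burns: two ticks of thrust under no-op, eight under hold-last. That is good if action 2 was right and expensive if it was wrong.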

API inference failed immediately

I started with Groq llama-3.3-70b, because that was the fastest API setup I had used in the CartPole experiments.

It crashed the lander in about 1 to 1.6 seconds. Mean reward was about -129.

The model technically produced a few actions, but they barely mattered. Under no-op semantics, each action fired for one 20 ms tick, then the lander went back to doing nothing until the next response arrived. Visually, the lander mostly just fell.

So the API lesson got even simpler than it was in CartPole: for 50 Hz control, API latency is dead on arrival.

The latency cliff

I built the same kind of latency curve as before. Take Gymnasium's LunarLander heuristic, treat it as the good policy, and add artificial delay.

The two wait semantics produced different cliffs. Under no-op, performance fell apart between 20 ms and 40 ms. Under hold-last, the heuristic stayed good until roughly 100 ms.

That gave me a useful way to interpret the rest of the experiments. If a model had a good policy, hold-last bought it a lot of latency budget; if the model's actions were bad, latency was not the main problem anymore.
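One way to inject artificial delay is to wrap the policy so that each decision lands a fixed number of ticks after the observation it was computed from. The sketch below is an assumption about how such a wrapper could look, not the code behind the curve: the class name DelayedPolicy, the "one call in flight at a time" model, and the echo policy are all hypothetical.

```python
import math

TICK_MS = 1000 / 50  # one physics step at FPS = 50


class DelayedPolicy:
    """Wrap a policy so that each decision takes `delay_ms` to come back.

    Assumed model: one call in flight at a time, computed from the observation
    seen when the call started. Ticks without a fresh answer get a filler action.
    """

    def __init__(self, policy, delay_ms, hold_last=False, noop=0):
        self.policy = policy
        self.ticks_per_call = max(1, math.ceil(delay_ms / TICK_MS))
        self.hold_last = hold_last
        self.noop = noop
        self.remaining = 0      # ticks until the in-flight answer lands
        self.in_flight = noop   # answer currently being "computed"
        self.last = noop        # most recent answer that has landed

    def act(self, obs):
        if self.remaining == 0:              # idle: start a new call
            self.in_flight = self.policy(obs)
            self.remaining = self.ticks_per_call
        self.remaining -= 1
        if self.remaining == 0:              # the answer lands on this tick
            self.last = self.in_flight
            return self.last
        return self.last if self.hold_last else self.noop


def echo(obs):
    """Dummy policy that returns its observation, so stale answers are visible."""
    return obs


def run(policy, observations):
    return [policy.act(o) for o in observations]


obs = range(1, 7)
print(run(DelayedPolicy(echo, 20), obs))                  # [1, 2, 3, 4, 5, 6]
print(run(DelayedPolicy(echo, 40), obs))                  # [0, 1, 0, 3, 0, 5]
print(run(DelayedPolicy(echo, 40, hold_last=True), obs))  # [0, 1, 1, 3, 3, 5]
```

At 20 ms every tick gets a fresh action. At 40 ms half the ticks are filler, which is where the no-op and hold-last rules start to diverge.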

The boring baselines worked

Before I spent too long on LLMs, I checked what normal methods could do.

Gymnasium's reference heuristic landed 20/20 episodes, solved 17/20, and averaged about +262 reward.
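A summary like this can be computed from per-episode returns alone. The sketch below assumes that "landed" is judged on the episode's total reward being positive and "solved" on the total being at least +200; if the landing check was actually made on the final step's reward, the first count would need that value instead. The function name and the sample returns are made up for illustration.

```python
def summarize(returns, solved_at=200.0):
    """Count landed and solved episodes from a list of total episode rewards."""
    n = len(returns)
    return {
        "episodes": n,
        "landed": sum(r > 0 for r in returns),           # assumed: total reward > 0
        "solved": sum(r >= solved_at for r in returns),  # total reward >= +200
        "mean_reward": sum(returns) / n,
    }


# Made-up returns, not results from the post.
print(summarize([250.0, 120.0, -80.0, 200.0]))
# {'episodes': 4, 'landed': 3, 'solved': 2, 'mean_reward': 122.5}
```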

"Landed" means the episode ended with positive reward. "Solved" means total reward was at least +200.

Then I trained a tiny MLP with behavior cloning, the simplest kind of imitation learning. Imitation learning means learning from expert demonstrations instead of discovering behavior from reward alone. It comes out of robotics, control, and RL work on apprenticeship learning: if you already have a good pilot, driver, robot operator, or hand-coded heuristic, you can copy it.

Behavior cloning is the direct version: collect (state, action) examples from the expert, then train a model with supervised learning to predict the expert's action.
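Here is a minimal, dependency-free sketch of that pipeline. To keep it short, the learner is a nearest-centroid classifier standing in for a real model such as the MLP, and the expert is a toy rule on a two-value state. Every name here (collect_demos, fit_centroids, make_policy, toy_expert) is hypothetical.

```python
def collect_demos(expert, states):
    """Label each state with the expert's action: the (state, action) dataset."""
    return [(s, expert(s)) for s in states]


def fit_centroids(demos):
    """Supervised 'training': average the states seen for each action."""
    sums, counts = {}, {}
    for state, action in demos:
        acc = sums.setdefault(action, [0.0] * len(state))
        for i, x in enumerate(state):
            acc[i] += x
        counts[action] = counts.get(action, 0) + 1
    return {a: [x / counts[a] for x in acc] for a, acc in sums.items()}


def make_policy(centroids):
    """The cloned policy: pick the action whose centroid is closest."""
    def policy(state):
        def dist2(center):
            return sum((x - y) ** 2 for x, y in zip(state, center))
        return min(centroids, key=lambda a: dist2(centroids[a]))
    return policy


def toy_expert(state):
    """Toy rule (an assumption): action 1 if the first value is negative, else 0."""
    return 1 if state[0] < 0 else 0


states = [(x / 10, 0.0) for x in range(-10, 11)]
demos = collect_demos(toy_expert, states)
student = make_policy(fit_centroids(demos))

agreement = sum(student(s) == a for s, a in demos) / len(demos)
print(agreement)  # 1.0
```

The student never sees reward. It only sees what the expert did in the states the expert visited, which is also where the method's weaknesses come from.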

This is closely related to distillation. In LLM-land, distillation usually means training a student model to imitate a teacher model. In control, behavior cloning means training a policy to imitate an expert policy. The details can differ, but the basic move is the same: use a teacher's behavior as supervised training signal.

The MLP result was funny in the way these projects often are: an 18k-parameter network trained on 500 heuristic rollouts got 20/20 solved with mean reward around +271.

So yes, obviously, the tiny neural net is the right tool if the task is "solve LunarLander."

The LLM question was different: how far does the CartPole recipe stretch before we need new techniques?

Pretrained local models did not fly

I tried a bunch of small local models.

Some were fast enough. Some looked decent on static diagnostic states. None worked when they had to fly the lander step after step.

That is the key difference. In the real simulator loop, the model observes a state, chooses an action, the simulator advances, and then the model has to act from the state it just helped create. Control people call this closed-loop control.

That is very different from labeling a frozen list of states, and the distinction ended up being central. A model could answer a few isolated states plausibly and still fly terribly; in the actual environment, the models got stuck in simple ruts: always no-op, always main engine, always one side thruster, or overcorrect until crash.

Static state classification was not enough. The model had to produce a stable trajectory. It did not.

LoRA SFT did not transfer

After CartPole, my default plan was simple: generate heuristic-labeled examples, train Gemma 3 1B with LoRA, and ask it to output one digit.

That did not work.

I tried natural heuristic data, bigger LoRA rank, balanced action classes, and noisy rollouts. All the LoRA runs failed. Some were worse than doing nothing, because the model learned to fire engines at the wrong time and rack up fuel costs while drifting away.

This was the first real crack in the CartPole story. CartPole was simple enough that a small adapter could steer the model into the right policy. LunarLander was not, and the prior baked into the frozen base weights seemed to fight the control policy too much.
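For readers who have not seen the CartPole post, the "output one digit" setup amounts to turning each (state, action) pair into a short prompt and a one-character completion. The exact prompt used in these runs is not given here, so the format below is purely an assumption, as are the names to_sft_example and parse_action. The field names follow LunarLander's 8-value observation: position, velocity, angle, angular velocity, and two leg-contact flags.

```python
OBS_NAMES = ("x", "y", "vx", "vy", "angle", "angular_vel", "left_leg", "right_leg")


def to_sft_example(obs, action):
    """Turn one heuristic-labeled step into a prompt/completion pair."""
    if len(obs) != len(OBS_NAMES):
        raise ValueError("LunarLander observations have 8 values")
    if action not in (0, 1, 2, 3):
        raise ValueError("LunarLander has 4 discrete actions")
    fields = " ".join(f"{name}={value:+.3f}" for name, value in zip(OBS_NAMES, obs))
    return {"prompt": f"State: {fields}\nAction:", "completion": str(action)}


def parse_action(text, fallback=0):
    """Read the first valid action digit from model output, else fall back to no-op."""
    for ch in text:
        if ch in "0123":
            return int(ch)
    return fallback


example = to_sft_example([0.0, 1.4, 0.1, -0.5, 0.02, 0.0, 0.0, 0.0], 2)
print(example["prompt"])
print(example["completion"])  # 2
```

Falling back to action 0 on unparseable output is a design choice in this sketch. It keeps the control loop running, but it also means a model that emits junk looks like an always-no-op policy.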

Full SFT got the first landing

The first LLM landing happened when I dropped LoRA and trained the whole model.

Full SFT means all model weights are trainable, not just small adapter matrices. It gives the model much more freedom. It also makes catastrophic forgetting much more likely.

The first full-SFT run still was not good, but it landed once, which was enough signal to keep going.

Then I combined full SFT with the data fixes. I balanced the action classes so the side-thruster actions were no longer drowned out, and I added noisy heuristic rollouts so the model saw some off-trajectory states.
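Both data fixes are small in code. The sketch below shows one plausible version of each and is not the exact recipe from these runs: balancing is done by downsampling every action to the size of the rarest one (oversampling would be an equally reasonable choice), and "noisy rollouts" are modeled as occasionally executing a random action while still recording the heuristic's action as the label. The function names and the eps parameter are assumptions.

```python
import random
from collections import defaultdict


def balance_by_action(demos, rng):
    """Downsample so every action has as many examples as the rarest one."""
    by_action = defaultdict(list)
    for state, action in demos:
        by_action[action].append((state, action))
    n = min(len(examples) for examples in by_action.values())
    balanced = []
    for examples in by_action.values():
        balanced.extend(rng.sample(examples, n))
    rng.shuffle(balanced)
    return balanced


def noisy_action(expert_action, n_actions, eps, rng):
    """Action to execute during data collection.

    With probability eps a random action is executed, which pushes the lander
    off the expert's usual path. The stored label stays the expert's action.
    """
    if rng.random() < eps:
        return rng.randrange(n_actions)
    return expert_action


rng = random.Random(0)
labels = [0] * 6 + [1] * 2 + [2] * 3 + [3] * 2
demos = [((float(i),), a) for i, a in enumerate(labels)]
balanced = balance_by_action(demos, rng)
print(len(balanced))  # 8
```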

That model, v6, finally looked like it was flying. It landed 12/20, solved 10/20, and reached mean reward around +105.

But it had one very clear remaining failure mode: it hovered.

Several episodes ran all the way to the 1000-step cap while the model kept firing engines and wasting fuel. It had learned how to avoid crashing immediately, but it had not always learned when to stop trying so hard and just descend. That smelled like classic behavior-cloning distribution shift.

DAgger fixed the hover problem

DAgger stands for Dataset Aggregation.

The idea is simple:

  1. Run the student policy.
  2. Record the states the student actually visits.
  3. Ask the expert what it would do in those states.
  4. Add those examples to the dataset.
  5. Retrain.

Plain behavior cloning trains on expert states. But the student does not visit exactly those states. Once it makes a mistake, it can drift into parts of the state space where it has no training signal. DAgger gives it labels in the places it actually gets into trouble.
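The loop can be sketched in a few lines. The version below uses a toy 1-D world and a lookup-table "student" so it runs without any dependencies; the environment, the expert rule, and every function name are hypothetical stand-ins for LunarLander, the heuristic, and the fine-tuned model. For simplicity the distribution shift here comes from a start state the demos never covered rather than from accumulated student mistakes.

```python
def dagger_round(student, expert, reset, step, n_episodes, max_steps):
    """Roll out the student and label every visited state with the expert's action."""
    new_examples = []
    for _ in range(n_episodes):
        state = reset()
        for _ in range(max_steps):
            action = student(state)                      # the student drives
            new_examples.append((state, expert(state)))  # the expert only labels
            state, done = step(state, action)
            if done:
                break
    return new_examples


def dagger(train, expert, reset, step, dataset, rounds, n_episodes=1, max_steps=4):
    """Train, collect corrections on the student's own states, aggregate, retrain."""
    student = train(dataset)
    for _ in range(rounds):
        dataset = dataset + dagger_round(student, expert, reset, step,
                                         n_episodes, max_steps)
        student = train(dataset)
    return student, dataset


# Toy world: walk to position 0. Actions: 0 = stay, 1 = step left, 2 = step right.
def expert(x):
    return 1 if x > 0 else 2 if x < 0 else 0


def step(x, action):
    x += {0: 0, 1: -1, 2: 1}[action]
    return x, x == 0


def reset():
    return -2  # a start state the original demos never covered


def train(dataset):
    table = dict(dataset)             # "training" = memorize state -> label
    return lambda x: table.get(x, 0)  # unseen state: do nothing


demos = [(3, 1), (2, 1), (1, 1)]      # expert walking in from x = 3
print(train(demos)(-2))               # 0: the plain clone just sits there
student, data = dagger(train, expert, reset, step, demos, rounds=2)
print(student(-2), student(-1), len(data))  # 2 2 11
```

Each round only fixes the states the current student reaches, which is why the toy needs two rounds: the first teaches it to leave -2, and only then does it visit -1 and get a label there.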

For LunarLander, the DAgger data was almost comically on the nose.

I rolled out v6, compared its actions to the heuristic's actions, and kept the disagreements. In about 56% of the visited states, v6's action differed from the heuristic's, and in about 51.6% of those disagreements the heuristic's action was 0, the no-op.

In plain English, the model was firing engines and the heuristic was saying "stop firing." That was the hover bug.
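The two percentages are simple counts over (student action, expert action) pairs. A minimal sketch, with a hypothetical function name and made-up records rather than the real rollout data:

```python
from collections import Counter


def correction_stats(records):
    """records: one (student_action, expert_action) pair per visited state."""
    corrections = [expert for student, expert in records if student != expert]
    n = len(corrections)
    share = {a: c / n for a, c in Counter(corrections).items()} if n else {}
    return {
        "disagreement_rate": n / len(records),
        "correction_share": share,  # which action the expert wanted instead
    }


# Made-up example: the student fires the main engine (2) where the expert coasts (0).
records = [(2, 0), (2, 0), (2, 2), (1, 3), (0, 0)]
stats = correction_stats(records)
print(stats["disagreement_rate"])    # 0.6
print(stats["correction_share"][0])  # 0.666...
```

A correction share dominated by action 0 is the numerical signature of the hover bug: most of the time the fix is "do less".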

After retraining on those corrections, v7 landed 18/20, solved 17/20, averaged +223.7, and had zero hover timeouts.

Then I ran a larger 50-seed eval. On that run, v7 landed 50/50, solved 49/50, and averaged +269.4.

I tried one more DAgger round, but it regressed slightly. So the best model was v7, not the newest one.

The awkward ending

At this point, the LLM solved LunarLander, but it was not really an LLM anymore.

I asked the final full-SFT model ordinary questions: "What is 2 + 2?", "Name a color.", "Tell me a joke.", "What is the capital of France?" It answered "0" each time.

So yes, full SFT caused catastrophic forgetting. The model became a 1B-parameter LunarLander policy.

I do not want to overstate that result. There are obvious ways I might have reduced the forgetting: mix general instruction data back into the fine-tuning set, keep a KL penalty toward the base model, train a larger or more carefully targeted LoRA adapter, or spend more time tuning the full-SFT recipe. I skipped that on purpose. The goal was to see what solved the environment, not to preserve a useful chat model at the same time.

Meanwhile, the 18k-parameter MLP solved the same task, which is the honest ending here. The LLM solved the environment, but the artifact is absurdly inefficient if all you care about is LunarLander; the value was not the controller, but seeing where the easy SFT story broke.

What changed from CartPole

CartPole made the recipe look like this: generate labels, run SFT, solve the task.

LunarLander showed the machinery underneath.

SFT is behavior cloning, and behavior cloning has old, well-known failure modes: class imbalance, distribution shift, compounding errors, teacher quirks, and overfitting to expert trajectories.

The fixes were also old: balance the data, add noisy/off-trajectory states, use DAgger, and train enough of the model to actually change the policy.

The main lesson for me was not "LLMs are good LunarLander controllers." They are not the right tool for that.

The lesson was that the CartPole result hid the hard parts. LunarLander forced me to use the imitation-learning toolbox for real.

Where I am stopping

LunarLander did its job: it was the calibration point where the easy CartPole story stopped being easy.

The final result is that API models are still too slow for 50 Hz control, raw pretrained local LLMs are not stable controllers when their actions feed back into the next state, and LoRA SFT did not transfer from CartPole. Full SFT plus better imitation-learning data did work, DAgger was the key final fix, and the final LLM policy solved the task while becoming a giant single-purpose neural net.

That last point matters.

Frontier labs are not mostly training LLMs to land spacecraft at 50 Hz. They are training models to solve math, write code, use computers, call tools, and complete long-horizon tasks with verifiable outcomes.

So the next step is to move away from real-time control and toward verifier-rewarded text environments.

The natural next project is GSM8K: grade-school math problems, answer extraction, verifier-based scoring, first with SFT and then with RL.

CartPole taught me that SFT can be easy, while LunarLander taught me where SFT starts to break. The next question is what happens when there is no cheap teacher policy at all, only a verifier.