# Experiment 005: 4,136 PPO Games and 150-Game Evaluation

**Date:** 2026-05-16
**Commit:** `088f25d`
**Primary checkpoint:** `models/ppo_sts.pt` (local artifact, intentionally gitignored)
**Seed file:** `seeds/eval_200.txt`
**Source artifacts:** `logs/training_stats.csv`, `logs/eval_stats.csv`, `logs/bc_stats.csv`, `logs/bc_train_stats.csv`, `logs/fight_stats.csv`, `logs/training_plot.png`
**Public snapshot:** [docs/experiments/index.json](index.json)

## Purpose

Publish the next PPO checkpoint snapshot after 4,136 completed rollout games and compare it against the heuristic and BC baselines using the latest 150-game evaluation logs.

## Run Configuration

| Field | Value |
|---|---:|
| Rollout workers observed | 5 |
| Completed PPO games in `training_stats.csv` | 4,136 |
| PPO update rows | 515 |
| Total trainer update transitions | 644,393 |
| Batch size | 8 rollout files per PPO update |
| PPO epochs | 4 |
| Initial learning rate | 0.00003 |
| Latest auto-tuned learning rate | 1.7462298e-05 |
| Latest entropy coefficient | 0.00074944 |
| Latest normalized entropy | 0.241497 |
| Latest BC anchor coefficient | 0.01 |
| Clip range | 0.15 |
| Target KL | 0.03 |
| Max rollout lag | 4 updates |
| Latest stale / legacy / skipped rollouts | 6 / 0 / 0 |
| Cumulative stale / legacy / skipped rollouts | 6 / 0 / 0 |

## Training Outcome Snapshot

| Metric | Full run | First 500 | Through 2,496 | New games after 2,496 | Last 1,500 | Last 500 | Last 100 |
|---|---:|---:|---:|---:|---:|---:|---:|
| Games | 4,136 | 500 | 2,496 | 1,640 | 1,500 | 500 | 100 |
| Average final floor | 14.19 | 12.99 | 13.80 | 14.78 | 14.77 | 14.56 | 14.31 |
| Median final floor | 16 | 14 | 16 | 16 | 16 | 16 | 15 |
| Best final floor | 50 | 31 | 50 | 37 | 37 | 33 | 33 |
| Average shaped reward | 1.79 | -1.13 | 1.10 | 2.86 | 2.84 | 2.18 | 2.45 |
| Act 2 reach rate | 14.9% | 7.8% | 13.4% | 17.3% | 17.3% | 15.6% | 17.0% |
| Floor 20+ rate | 13.5% | 6.6% | 12.3% | 15.4% | 15.2% | 13.4% | 15.0% |
| Win rate | 0.0% | 0.0% | 0.0% | 0.0% | 0.0% | 0.0% | 0.0% |
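
The window columns above can be reproduced from a chronological list of final floors. The following is a minimal sketch, not the project's actual reporting script: the function names `summarize` and `window_table` are hypothetical, and it assumes the final floor of each game has already been read from `logs/training_stats.csv` into a plain list (the CSV column layout is not shown in this report).

```python
from statistics import mean, median

def summarize(floors):
    """Summarize final floors for one window of games."""
    n = len(floors)
    return {
        "games": n,
        "avg_floor": round(mean(floors), 2),
        "median_floor": median(floors),
        "best_floor": max(floors),
        # Share of games that ended on floor 20 or later, in percent.
        "floor20_rate": round(100 * sum(f >= 20 for f in floors) / n, 1),
    }

def window_table(floors, first=500, tails=(1500, 500, 100)):
    """Build one summary per window; floors must be in game order."""
    table = {
        "Full run": summarize(floors),
        f"First {first}": summarize(floors[:first]),
    }
    for size in tails:
        table[f"Last {size}"] = summarize(floors[-size:])
    return table
```

Calling `window_table(floors)` on the full run would yield one entry per column; the Act 2 reach rate and shaped reward are omitted here because they depend on fields this report does not define.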

Latest trainer row:

| Metric | Value |
|---|---:|
| Approximate KL | 0.00362303 |
| Clip fraction | 0.076596 |
| Normalized entropy | 0.241497 |
| Explained variance | 0.600711 |
| Mean chosen action probability | 0.788474 |
| Auto-tune action | `middle:bc_slow_down` |
| Early-stop rows | 5 |

## 150-Game Evaluation

Evaluation used the latest 150-game logs for the heuristic baseline, BC checkpoint, and current PPO checkpoint.

| Policy | Model | Games | Avg floor | Best floor | Avg reward | Win rate | Act 2 reach | Floor 20+ | Elite W/L | Boss W/L |
|---|---|---:|---:|---:|---:|---:|---:|---:|---:|---:|
| Heuristic | `heuristic` | 150 | 15.78 | 33 | 8.44 | 0.0% | 26.0% | 23.3% | 176/215 | 39/100 |
| BC checkpoint | `models/ppo_sts_bc.pt` | 150 | 12.81 | 39 | -0.55 | 0.0% | 12.0% | 12.0% | 121/179 | 19/62 |
| PPO checkpoint | `models/ppo_sts.pt` | 150 | 14.70 | 33 | 2.37 | 0.0% | 18.7% | 18.0% | 116/155 | 28/92 |

Seed audit:

| Run | Games | Unique seeds | Matches first 150 seeds exactly |
|---|---:|---:|---:|
| `heuristic_150_20260515_115841` | 150 | 149 | no |
| `bc_150_20260515_172740` | 150 | 149 | no |
| `ppo_current_150_20260515_172740` | 150 | 150 | yes |

The PPO eval exactly matches the first 150 seeds in `seeds/eval_200.txt`. The heuristic and BC logs each contain one duplicated seed from an interrupted/resumed eval sequence, so the evaluation table reports the latest full-log result rather than a perfectly paired 150-seed comparison. On the 148 seeds common to all three logs, the average floors are heuristic 15.78, BC 12.83, and PPO 14.62.
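
The audit columns can be expressed in a few lines. This is a sketch under assumptions: `audit_seeds` and `common_seeds` are hypothetical helper names, and each run's seeds are assumed to be available as a list in the order the games were played.

```python
def audit_seeds(run_seeds, reference_seeds):
    """Report game count, unique seeds, and whether the run matches the reference prefix."""
    n = len(run_seeds)
    return {
        "games": n,
        "unique_seeds": len(set(run_seeds)),
        # True only if the run used exactly the first n reference seeds, in order.
        "matches_reference_prefix": list(run_seeds) == list(reference_seeds[:n]),
    }

def common_seeds(*runs):
    """Seeds that appear in every run; duplicates within a run count once."""
    return set.intersection(*(set(run) for run in runs))
```

A count of 148 common seeds is consistent with the two duplicated seeds having displaced two different reference seeds; if both runs had skipped the same seed, 149 would remain in common.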

## Interpretation

From the previous public snapshot at 2,496 PPO games to this snapshot at 4,136 games, the additional 1,640 games averaged floor 14.78 with a 17.3% Act 2 reach rate. That is better than the cumulative average over the first 2,496 games (13.80 average floor, 13.4% Act 2 reach), so the longer run did raise the overall training average, from 13.80 to 14.19.
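
As a consistency check, the full-run averages in the training table should equal the game-weighted mean of the two segments. A small sketch (the helper name `pooled_mean` is illustrative, and the inputs are the rounded values from the table):

```python
def pooled_mean(segments):
    """Game-weighted mean over (games, average) pairs."""
    segments = list(segments)
    total_games = sum(games for games, _ in segments)
    return sum(games * avg for games, avg in segments) / total_games

# (games, average) for the first 2,496 games and the 1,640 new games.
avg_floor = pooled_mean([(2496, 13.80), (1640, 14.78)])
act2_rate = pooled_mean([(2496, 13.4), (1640, 17.3)])

print(round(avg_floor, 2))  # 14.19
print(round(act2_rate, 1))  # 14.9
```

Both results agree with the "Full run" column, which suggests the segment columns were computed from the same log.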

The short-term trend is less encouraging: the last 500 games averaged floor 14.56, versus 14.92 for the last-500 window of the previous snapshot, and the latest last-100 window is 14.31. This looks like a plateau or mild cooling period, not a clean breakout.

The 150-game eval is the important correction to the prior 25-game read. PPO still beats BC by 1.89 average floors and 2.92 shaped reward, but it trails the heuristic by 1.08 average floors. The previous 25-game PPO result at 16.40 average floor was optimistic; the wider 150-game sample puts PPO at 14.70.

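No policy recorded a win in this snapshot. PPO won more boss fights than BC in aggregate (28 versus 19), but mainly because it reached more of them: reading the Boss W/L column as wins/losses, the per-fight win rate is about 23% for both (28 of 120 versus 19 of 81). PPO also still reaches Act 2 less often than the heuristic (18.7% versus 26.0%).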
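
The aggregate W/L counts can also be read as per-fight win rates. The sketch below assumes the W/L columns mean wins/losses (which fits the Act 2 reach rates: 39 and 28 boss wins out of 150 games are 26.0% and 18.7%); the `win_rate` helper is illustrative only.

```python
def win_rate(wins, losses):
    """Fraction of fights won, given separate win and loss counts."""
    return wins / (wins + losses)

# (wins, losses) copied from the 150-game evaluation table.
boss_wl = {"Heuristic": (39, 100), "BC": (19, 62), "PPO": (28, 92)}
elite_wl = {"Heuristic": (176, 215), "BC": (121, 179), "PPO": (116, 155)}

for policy in boss_wl:
    boss = 100 * win_rate(*boss_wl[policy])
    elite = 100 * win_rate(*elite_wl[policy])
    print(f"{policy}: boss {boss:.1f}%, elite {elite:.1f}%")
# Heuristic: boss 28.1%, elite 45.0%
# BC: boss 23.5%, elite 40.3%
# PPO: boss 23.3%, elite 42.8%
```

On this reading, PPO and BC win bosses at nearly the same rate per fight, and PPO's higher boss win count comes from reaching 120 boss fights against BC's 81.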

Normalized entropy remains fairly flat at about 0.24 (0.241497 in the latest trainer row) while the auto-tuner has lowered `ent_coef` to 0.00074944. That is not automatically bad: the policy is still sampling with moderate spread, and the low KL and clip-fraction values show that updates are conservative. The results suggest the current bottleneck is policy quality and decision distribution rather than a runaway entropy setting.

## Recommended Next Changes

Do not manually raise entropy for the next run. The current normalized entropy is inside the auto-tune healthy band of 0.20 to 0.50, and the last 25 trainer updates averaged roughly 0.251 normalized entropy. Leave auto-tune on unless normalized entropy drops below 0.20 or the policy becomes visibly deterministic too early.
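
The rule of thumb above can be written as a simple band check. This is only an illustration of the decision described here, not the auto-tuner's actual logic; the function name and return labels are hypothetical.

```python
def entropy_band(normalized_entropy, low=0.20, high=0.50):
    """Classify normalized entropy against the healthy band quoted in this report."""
    if normalized_entropy < low:
        return "below_band"  # the case where manual intervention is worth considering
    if normalized_entropy > high:
        return "above_band"
    return "in_band"

print(entropy_band(0.241497))  # in_band (latest trainer row)
print(entropy_band(0.251))     # in_band (rough average of the last 25 updates)
```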

Use a clean fixed-seed eval set for the next comparison. The PPO 150-game eval exactly matched the first 150 seeds, but the heuristic and BC eval logs each contain one duplicated seed from an interrupted/resumed eval sequence. Treat this report as directionally valid, then run a clean 200-seed comparison before declaring a checkpoint better or worse.
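
For the clean comparison, a paired per-seed difference is more informative than two independent averages. A minimal sketch, assuming each policy's results are held as a mapping from seed to final floor (so a duplicated seed keeps only one result); `paired_floor_delta` is a hypothetical helper, not part of the repository.

```python
from statistics import mean

def paired_floor_delta(floors_a, floors_b):
    """Return (shared seed count, mean of a - b) over seeds present in both mappings."""
    shared = sorted(set(floors_a) & set(floors_b))
    if not shared:
        raise ValueError("no shared seeds to compare")
    return len(shared), mean(floors_a[s] - floors_b[s] for s in shared)

# Toy data only; real inputs would come from the eval logs.
ppo = {"s1": 16, "s2": 20, "s3": 9}
heuristic = {"s1": 14, "s2": 21, "s4": 30}
print(paired_floor_delta(ppo, heuristic))  # (2, 0.5)
```

Reporting the shared seed count alongside the delta makes it obvious when a run silently dropped or repeated seeds.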

Run the next PPO experiment as a controlled 500 to 1,000 game test instead of another long blind run. The actor updates are conservative: recent KL is closer to 0.003 to 0.005 than the 0.006 to 0.010 range that would indicate stronger but still sane policy movement. Prefer a small update-strength experiment before touching entropy.
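
To track this during the controlled run, each update's approximate KL can be bucketed against the ranges quoted above. The thresholds come from this paragraph; the function name and labels are illustrative, and the report does not characterize values above 0.010.

```python
def kl_regime(approx_kl, low=0.006, high=0.010):
    """Bucket an approximate KL value using the ranges quoted in this report."""
    if approx_kl < low:
        return "conservative"
    if approx_kl <= high:
        return "stronger_but_sane"
    return "above_described_range"

print(kl_regime(0.00362303))  # conservative (latest trainer row)
```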

Prioritize data and diagnostics around Act 1 boss and elite outcomes. PPO is above BC but below the heuristic, and many eval games still die around floor 16. The most useful next data improvements are cleaner boss/elite decision coverage, especially Guardian, Hexaghost, and Lagavulin situations, rather than more generic rollout volume.