Behavioral Verification
Continuous runtime verification derived from intent, not human-written tests. How do you know generated software is correct when there's no spec?
Current Frontier
Runtime invariant generation: automatically deriving behavioral invariants from intent specifications.
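As a minimal sketch of the idea (all names and helpers here are hypothetical illustrations, not drawn from any of the papers below), an intent phrase such as "the score never decreases" can be lowered into a runtime invariant: a named predicate over successive states that can be checked continuously.

```python
# Hypothetical sketch: representing intent-derived behavioral invariants
# as named predicates over (previous_state, current_state) pairs.
from dataclasses import dataclass
from typing import Any, Callable

State = dict[str, Any]

@dataclass
class Invariant:
    name: str
    check: Callable[[State, State], bool]  # (prev, curr) -> holds?

def monotone_nondecreasing(field: str) -> Invariant:
    """Invariant for intents like 'the score never decreases'."""
    return Invariant(
        name=f"{field} never decreases",
        check=lambda prev, curr: curr[field] >= prev[field],
    )

def bounded(field: str, lo: float, hi: float) -> Invariant:
    """Invariant for intents like 'health stays between 0 and 100'."""
    return Invariant(
        name=f"{field} in [{lo}, {hi}]",
        check=lambda prev, curr: lo <= curr[field] <= hi,
    )

score_inv = monotone_nondecreasing("score")
assert score_inv.check({"score": 3}, {"score": 5})      # holds
assert not score_inv.check({"score": 5}, {"score": 3})  # violated
```

The open research question is the step this sketch assumes away: mapping free-form intent to the right predicate constructors automatically.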
Key Questions
How do you verify correctness when there's no traditional spec to test against?
Can behavioral invariants be derived automatically from intent?
What does 'correct' even mean for generated software?
Key Papers
World-in-World: Closed-Loop WM Evaluation
ICLR 2026 Oral
First benchmark platform for closed-loop world model evaluation. Key finding: controllability matters more than visual quality.
MaaG: Model as a Game
Microsoft (ICCV 2025 Workshop)
Introduces LogicNet for numerical consistency and spatial memory. Directly attacks score-tracking and physics verification.
PIWM: Physics-Informed World Model
Sep 2025
Reports a 60.6% improvement in physics consistency via soft-mask training and warm-start inference.
Critiques of World Models as Planning
Jul 2025
Systematically catalogs failure modes: shallow coherence, error explosion, and generality limits.
Current Insights
In The Last Computer, 'correct' means 'matches intent.' Verification becomes: does the generated experience do what the user wanted?
Continuous verification, checking invariants at runtime on every frame, is more powerful than traditional testing: it exercises the actual generated behavior as it runs rather than a fixed set of pre-written cases.
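A minimal sketch of what per-frame checking could look like (a hypothetical monitor, not an implementation from the papers above): every frame's state is checked against each derived invariant, and violations are surfaced immediately with the frame where they occurred.

```python
# Hypothetical per-frame invariant monitor. Each invariant is a
# predicate over (previous_state, current_state); violations are
# recorded with the frame index so generation can be corrected.
from typing import Any, Callable

State = dict[str, Any]
Check = Callable[[State, State], bool]

class FrameMonitor:
    def __init__(self, invariants: dict[str, Check]):
        self.invariants = invariants
        self.prev: State | None = None
        self.frame = 0
        self.violations: list[tuple[int, str]] = []

    def observe(self, state: State) -> None:
        """Check every invariant against this frame's state."""
        if self.prev is not None:
            for name, check in self.invariants.items():
                if not check(self.prev, state):
                    self.violations.append((self.frame, name))
        self.prev = state
        self.frame += 1

monitor = FrameMonitor({
    "score never decreases": lambda p, c: c["score"] >= p["score"],
})
for s in [{"score": 0}, {"score": 2}, {"score": 1}]:  # frame 2 regresses
    monitor.observe(s)
print(monitor.violations)  # -> [(2, 'score never decreases')]
```

Because the monitor sees the real generated stream, a violation is evidence that the experience diverged from intent at a specific moment, which is exactly the signal a one-shot test suite cannot provide.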