
We've spent the last few years talking about language models. They read. They summarize. They write. But most real work is not reading and writing in isolation. Most real work looks more like playing in a band, cooking in a busy kitchen, or managing a live project. Roles shift. Plans change. People adjust to each other.

That kind of work needs a system that can live inside a situation. Right now, LLMs can only describe it. That's where world models come in. These models are designed to learn representations of environments and, through them, the dynamics of how the real world changes.

I'm not saying language models are "over." They are still central. What I'm saying is if we care about human-AI collaboration and interaction, we should start thinking beyond "better chat" and toward tools that can share the world with us.

Readers vs players

Here's a simple way to frame it:

  • A language model is like a person who has read every book in the library but never left the building.
  • A world model is like a person who has spent years playing in one complex environment. Think of a pilot in a flight simulator. Or a gamer who knows every corner of a map.

Language models learn patterns in text. They are great at compressing information, drafting and editing, and reasoning. World models learn what happens when you act. They are trained to predict how a state changes over time and in response to actions, which lets them support planning and try things out in imagination. Some current research systems can already do pieces of this, with impressive results in games and simulated worlds. They can see video, take actions, and plan inside their own internal simulation before acting for real. That is different from just predicting words.
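To make "plan inside your own internal simulation" concrete, here is a minimal sketch in Python. It assumes a toy one-dimensional world, and a hand-written transition function stands in for what would normally be a learned model; the names (`predict`, `imagine_rollout`, `plan`) are illustrative, not any real system's API.

```python
import random

# Toy "world model": given a state and an action, predict the next
# state and a reward. The state is a position on a 1-D track and the
# goal is to reach position 5. In a real system this function would
# be a learned model, not hand-written rules.
GOAL = 5

def predict(state, action):
    """Imagined transition: action is -1 (left) or +1 (right)."""
    next_state = max(0, state + action)
    reward = 1.0 if next_state == GOAL else 0.0
    return next_state, reward

def imagine_rollout(state, actions):
    """Roll a candidate action sequence forward inside the model,
    without touching the real environment."""
    total = 0.0
    for a in actions:
        state, r = predict(state, a)
        total += r
    return total

def plan(state, horizon=5, candidates=200, seed=0):
    """Random-shooting planner: sample action sequences, score each
    one in imagination, and return the best-scoring sequence."""
    rng = random.Random(seed)
    best, best_score = None, float("-inf")
    for _ in range(candidates):
        seq = [rng.choice([-1, 1]) for _ in range(horizon)]
        score = imagine_rollout(state, seq)
        if score > best_score:
            best, best_score = seq, score
    return best
```

The key structural point is that `plan` only ever calls `predict`: the system tries things out in its own simulation and commits to real actions afterward, which is exactly what next-word prediction alone does not give you.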

Why this matters for collaboration

Think about how you work with other people on a real task. You watch what others are doing, you guess what they intend, you adjust your plan in real time, you change roles when the situation shifts.

A recent paper on "fluid collaboration" makes this very clear. It studies people working together in a fast, Overcooked-style cooking game. People rarely sit down and assign fixed roles. Instead, patterns of collaboration form and shift as the game unfolds. Coordination often happens with minimal speech. People read each other from movement, timing, and context.

That works because humans carry a rich world model in their heads. We know what the environment affords and what others are likely to do next. We can tell when someone is making a mistake, and we can repair the plan on the fly. A pure language model does not live in that loop. It can comment on the game. It can propose a strategy. But it does not share the situation with you as you act. A world-model-based system at least has the right shape for the problem: it can track state over time, predict what happens when it acts, and run rollouts to see consequences. That is much closer to what you need for tight, real-time collaboration.

A simple mental model

Here's one way I am organizing this in my head. Humans build knowledge through:

  • Experience - interacting with the world and seeing what happens.
  • Embodiment - having a body that can move, fail, and get hurt.
  • Emotion - tagging memories with "this mattered."
  • Social life - learning from others, copying, telling stories.
  • Identity - caring about what we know because it shapes who we are.

AI systems build knowledge today mostly through:

  • Data - large corpora of text, images, video, logs.
  • Compression - fitting patterns into weights.
  • Optimization - minimizing loss functions or maximizing reward.
  • Feedback - upvotes, downvotes, A/B tests, RL signals.

Language models live almost entirely in the second list. World models pull AI a bit closer to the first list, because they add interaction and consequence. Not just "what usually comes next in text," but "what tends to happen in this environment if I do X, then Y, then Z." That said, I believe this still does not make them human or conscious, or anything close to it. They still lack drive, values, and lived experience. But they do move us from "reading about the world" to "learning how the world reacts."

A framework: three layers to watch

If we want a simple way to think about this shift, we can break it into three layers:

  1. Language layer - how we describe goals, constraints, and meaning.
  2. World layer - how the environment evolves under actions.
  3. Policy layer - how an agent chooses what to do next.

Language models sit mainly in layer 1. World models sit mainly in layer 2. Decision-making lives in layer 3.

Today, a lot of systems skip layer 2, and we've all seen the issues that causes. They go straight from language ("here is the request") to action ("call these tools," "write this code"). The model imagines consequences in words, not in a grounded simulation. The next wave I see is one where language models help us state goals, rules, and explanations, and world models provide a shared environment we can interact with. Policies could be trained inside those worlds, with humans guiding and correcting them as they learn. Instead of sending prompts back and forth, humans and AI would co-own the situation and work inside the same space. This is what I mean by shared worlds.
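The three layers can be sketched in code. This is a toy illustration under strong assumptions (a one-dimensional world, goals phrased as "go to N"); the function names are hypothetical, not a real framework. The point is the wiring: the policy consults the world layer before acting, instead of going straight from language to action.

```python
def language_layer(request):
    """Layer 1: turn a natural-language goal into a target state.
    Hypothetical parsing: "go to 3" -> 3."""
    return int(request.split()[-1])

def world_layer(state, action):
    """Layer 2: predict how the environment evolves under an action.
    Here, trivially: position shifts by the action."""
    return state + action

def policy_layer(state, target):
    """Layer 3: choose the next action by checking imagined outcomes
    against the goal, rather than acting straight from language."""
    for action in (-1, 0, 1):
        if abs(world_layer(state, action) - target) < abs(state - target):
            return action
    return 0  # no action improves things; stay put

def run(request, state=0, max_steps=10):
    """Wire the layers together: parse the goal once, then loop
    policy -> imagined world -> new state until the goal is reached."""
    target = language_layer(request)
    for _ in range(max_steps):
        action = policy_layer(state, target)
        state = world_layer(state, action)
        if state == target:
            break
    return state
```

A language-only system would collapse `world_layer` into the text itself, imagining consequences in words; making layer 2 an explicit, checkable component is what the shift described above amounts to.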

Concerns worth addressing early

I expect pushback from several directions. Some of it is fair.

"Language models already have world models inside them."

Some may argue that large language models already internalize many facts about the world and can simulate scenarios in text. That's true in a weak sense. They can often reason about simple physics, social situations, and plans. My view is that this is useful, but it has limits. Reasoning in text alone tends to break under long-horizon, stateful tasks. There is no strong guarantee that the model's internal "world" stays consistent over many steps. When you need precise control or safety, guessing in natural language is risky. So I don't see world models as replacing this. I see them as tightening it. They give the system a more structured space to reason in, especially for tasks that involve real environments, tools, or other agents.

"This sounds like hype. We don't know if world models will scale."

That's fair. The current systems are still early. Many are trained in games or simple simulators. We don't yet know which approaches will work in open, messy domains. So I'm not claiming we've solved it. I'm saying language-only systems hit clear walls in collaboration and control. Adding explicit world models is one reasonable way to push those walls out, and we should experiment with this direction rather than treat chat as the endpoint.

"You are anthropomorphizing. AI will not 'learn like humans.'"

I wholeheartedly agree. These systems do not have human emotions, bodies, or social lives. Their optimization pressures are very different from ours. When I say "learn like a player, not just a reader," I mean something narrower: if we want more productive AI, we should consider giving these systems access to interaction and consequence, not only text, and let them train policies inside that interactive space. That is an analogy for structure, not for consciousness.

"This ignores safety and misuse."

This is fair. World models make some safety problems worse and others better. I think they increase the need for alignment work, not reduce it. We'll need careful reward design, clear limits on what gets modeled, and close oversight on which policies we allow into the real world. These are only a few examples, and there are people who understand these issues far better than I do. We should pay more attention to their work. My hope is that efforts like this, trying to think clearly about world models, push us to face these problems directly.

"This underplays the value of humans."

This is the most important criticism. World models do not reduce the need for humans; they increase it. These systems still need human goals and norms. Someone has to interpret trade-offs and decide what "good" looks like. Someone has to say "no" when a high-reward plan is still wrong in a human sense. And the more capable the model becomes, the more important that human layer is.

Humans at the center

We can't predict exactly how AI will evolve, but we can be honest about a simple fact: people still do things these systems cannot, and those things matter. People bring goals that come from lived experience. We decide what we care about because we feel the impact of decisions in our own lives. We care about safety, fairness, comfort, and dignity because they affect our relationships, our work, and our finite bodies. A model can repeat those words, but it doesn't feel the weight behind them.

We also bring meaning. We don't just see outcomes; we live with them. We remember how moments felt, and we think about how our choices land on the people around us. That changes how we judge an action, even when a system frames something as a "high-reward" plan on paper. Life doesn't have a win condition. It isn't a score to maximize. It's something we move through, with all the weight and responsibility that comes with being the ones who experience the results.

We also break patterns. World models learn from what has already happened. People can look at the same pattern and say, "this should stop," even when the past points in the other direction. We can refuse to repeat something harmful. We can choose a new path.

Humans also guide and correct. When we work with AI systems, we shape them. We question their assumptions, point out what they missed, and give context they can't infer. We bring values they don't have. We set boundaries they can't imagine on their own.

And people remain tied to the real world. We have to live with the consequences of what these systems do. That simple fact changes how we make judgments. It also explains why human involvement is not optional, even if the tools get stronger.

So even as AI becomes more capable, people still bring the things that keep decisions grounded in reality: goals, meaning, judgment, context, and responsibility. These are not features you can add to a model. They come from being human.

Questions that matter going forward

I don't have answers here. At this stage, the questions feel more important than any conclusions. Here are some I am sitting with:

  • How much of our work should live in simulation, and how much should stay grounded in real human experience?
  • How do we design world models that make room for social norms, institutions, and context, not just physics or reward?
  • What does "good collaboration" mean when an AI system can act alongside us, not just talk to us from the outside?
  • How do we keep people learning and curious when an agent can practice inside a simulated world for hours while we sleep?
  • What happens to expertise when humans and tools share the same internal model of a task or environment?
  • And more broadly, which parts of judgment, meaning, and responsibility remain on the human side, no matter how capable these systems become?

If we stay honest about what we don't know yet, the shift from "reader" to "player" might open up better ways of working.