Transparency label: AI-assisted
We’ve spent the last few years watching Large Language Models get remarkably good at text. But something more fundamental is emerging: models that don’t just understand language, but reality itself. These are Large World Models, and they represent a different approach entirely. Where LLMs predict the next word in a sentence, LWMs predict the next state of a dynamic environment.
There is an excellent documentary about DeepMind and its extraordinary achievements both before and after its acquisition by Google. Towards the end, it explains that Demis Hassabis’s team at Google DeepMind has developed Genie, a model that can generate playable games from a single image, having learnt game physics and mechanics entirely from video. Fei-Fei Li’s World Labs and companies like Runway are pursuing similar approaches, training models on vast amounts of video to learn not how to arrange words, but how the physical world actually works: 3D geometry, object permanence, the laws of physics.
The distinction matters: a Large Language Model knows how scenes are described, while a Large World Model knows how they behave. The shift from static text processing to dynamic world simulation opens a path towards embodied AI: systems that can act in and navigate physical environments rather than simply generate text. Applications range from advanced robotics to fully interactive virtual simulations.
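To make the contrast concrete, here is a purely illustrative sketch. Neither class reflects the API of any real model (Genie, World Labs or otherwise); it simply contrasts the two prediction problems, next word versus next state of an environment.

```python
# Illustrative only: these interfaces are assumptions for the sake of the
# contrast, not the API of any actual LLM or world model.
from dataclasses import dataclass


@dataclass
class WorldState:
    """A toy environment state: object positions and velocities."""
    positions: list[tuple[float, float, float]]
    velocities: list[tuple[float, float, float]]


class LanguageModel:
    def next_token(self, tokens: list[str]) -> str:
        """Predict the next word, given the words so far."""
        raise NotImplementedError


class WorldModel:
    def next_state(self, state: WorldState, action: str) -> WorldState:
        """Predict how the environment evolves after an action is taken."""
        raise NotImplementedError
```

The point of the sketch is the signature difference: one maps a sequence of words to a word, the other maps a state and an action to a new state, which is what lets a world model be rolled forward to simulate an environment.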