
Apr 1, 2026



World Models & Robotics: What the Frontier Is Actually Thinking



Qualcomm Ventures recently hosted an intimate dinner with researchers and founders from some of the most influential labs in AI and robotics. The conversation centered on one question: What is overhyped and what is underhyped in world models and robotics?

Here’s what we learned.


The term “world model” is being used loosely

The most consistent theme of the evening was that “world model” has become a catch-all. Multiple researchers independently noted that most systems described as world models are, in practice, video prediction models. They can process what a camera sees, but they cannot feel pressure, sense weight, or understand the physics of how objects behave when touched. A robot trying to fold laundry, for example, needs to understand fabric tension and grip force, not just what the shirt looks like on screen.

As one person put it directly: the name says “world model,” but the reality is still a video model. Until these systems can process the full range of sensory information that humans take for granted, the gap between the name and the capability remains wide.

For investors, this creates a filtering problem. A growing number of startups are adopting the term to differentiate themselves from the crowded language model space. Separating genuine technical advances from rebranding exercises requires looking under the hood.


Benchmarking is the biggest structural bottleneck in robotics

If one insight defined the evening, it was this: the field of robotics lacks a universal way to measure progress. In language models, researchers can run standardized tests and say with confidence that one model is better than another. In robotics, no equivalent currently exists. Without that measuring stick, it is difficult to reliably predict when robots will be good enough to deploy commercially, or how much investment in training will translate into real-world improvement.

One framework discussed was an “evaluation cascade”: think of it like a hiring funnel where you screen resumes before scheduling interviews before flying candidates in. Similarly, you want cheap, fast tests to filter out bad robot behaviors before running expensive real-world trials. Structuring this funnel well is itself a competitive advantage, and very few teams are investing in it seriously.
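To make the funnel concrete, here is a minimal sketch of what such a cascade could look like in code. Every stage name, score, threshold, and cost below is a made-up illustration for this post, not any lab’s actual pipeline.

```python
from dataclasses import dataclass
from typing import Any, Callable

@dataclass
class Stage:
    name: str
    run: Callable[[Any], float]   # evaluator: returns a score in [0, 1]
    threshold: float              # minimum score needed to advance
    cost: float                   # rough cost per evaluation, in dollars

def evaluate_cascade(policy: Any, stages: list[Stage]) -> tuple[bool, float]:
    """Run stages cheapest-first; stop at the first failure."""
    spent = 0.0
    for stage in sorted(stages, key=lambda s: s.cost):
        spent += stage.cost
        if stage.run(policy) < stage.threshold:
            return False, spent   # filtered out early; costly trials never run
    return True, spent

# Hypothetical funnel: replay checks -> simulation -> real-world trial.
# The lambdas stand in for real evaluators and just return fixed scores.
funnel = [
    Stage("replay checks on logged data", lambda p: 0.90, threshold=0.80, cost=0.01),
    Stage("sim rollout success rate",     lambda p: 0.72, threshold=0.60, cost=1.00),
    Stage("supervised hardware trial",    lambda p: 0.65, threshold=0.50, cost=500.0),
]
passed, spent = evaluate_cascade(policy=None, stages=funnel)
```

The value of the structure is that the expensive final stage only ever runs on candidates that survived the cheap filters, which is what makes large-scale policy evaluation economically feasible.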

The verification problem extends to video and world models as well. When ChatGPT gives you two written answers, you can quickly compare them and pick the better one. But when a video model generates a 30-second clip of a robot performing a task, checking whether the physics are realistic requires watching it multiple times, frame by frame. Without a reliable way to score these outputs automatically, the training techniques that made language models so powerful cannot be applied to video generation. This layer remains critically underfunded.
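To see why automatic scoring is the unlock, consider a rough sketch of the missing interface. The `physics_score` stub below stands in for the hard part; building a reliable version of it is exactly the open problem described above.

```python
def physics_score(clip) -> float:
    """Rate a generated clip's physical plausibility on [0, 1]. Stubbed with
    a constant here: today this judgment comes from a human watching frame
    by frame, and automating it reliably is the unsolved problem."""
    return 0.5  # placeholder for the hard part

def better_clip(clip_a, clip_b):
    """Given a trustworthy scorer, preference pairs like this one could drive
    RLHF-style training for video models, as they did for language models."""
    return clip_a if physics_score(clip_a) >= physics_score(clip_b) else clip_b
```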


Text-based planning may be the dark horse of robotics

A surprising point of convergence at dinner was the potential of planning in plain language rather than in video. One researcher argued that when a model tries to predict exact future video frames, small errors compound and become catastrophic for precise tasks. But high-level task planning in text, such as breaking down “clean my kitchen” into “first clear the counter, then load the dishwasher, then wipe down surfaces,” is no harder than writing code, something language models already do well.

The proposed architecture: a text-based planner for high-level reasoning paired with a fast, lightweight policy for low-level control. This decouples the hard problem from the easier one and allows robotics companies to leverage existing LLM infrastructure rather than building entirely new stacks.
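As a loose sketch of that split, with a stubbed-out planner and a toy controller (none of these function names come from a real system):

```python
def plan_in_text(task: str) -> list[str]:
    """High-level reasoning in plain language. In a real system this would
    be one LLM call; the response is hard-coded to keep the sketch runnable."""
    # response = llm.complete(f"Break '{task}' into ordered robot subtasks.")
    return ["clear the counter", "load the dishwasher", "wipe down surfaces"]

class LowLevelPolicy:
    """Fast, lightweight controller mapping (subtask, observation) to a motor
    command. Runs at control frequency, with no language model in the loop."""
    def act(self, subtask: str, observation: dict) -> dict:
        return {"joint_velocities": [0.0] * 7}  # stub motor command

def subtask_done(subtask: str) -> bool:
    return True  # placeholder; a real system would check success via perception

def run(task: str, policy: LowLevelPolicy) -> None:
    for subtask in plan_in_text(task):     # slow outer loop: text planning
        while not subtask_done(subtask):   # fast inner loop: control
            command = policy.act(subtask, observation={})
            # ...send command to the robot here...
        print(f"done: {subtask}")

run("clean my kitchen", LowLevelPolicy())
```

The point of the decoupling is that the slow, error-tolerant reasoning loop and the fast, precise control loop can be improved independently.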

Several attendees echoed this view, arguing that the robotics community may be distancing itself from language model techniques too early. Many of the breakthroughs that made ChatGPT possible, such as learning from feedback, compressing context efficiently, and generating outputs step by step, are directly applicable to robotics. The field may not need to reinvent the wheel.


Hardware is not commoditized

One of the more emphatic arguments of the evening came from a researcher who pushed back hard on the idea that robot hardware is a solved problem. Entire research lines, including simulation-to-real transfer, are either enabled or blocked by the quality of the underlying hardware. If a robot’s physical design cannot be accurately simulated, the most promising training techniques simply don’t apply.

To that end, the concept of morphological intelligence came up repeatedly: the idea that intelligence can be built into the physical design itself. One example raised was the Strandbeest, a kinetic sculpture that walks using only wind power, with no electronics at all. The “computation” is encoded entirely in the geometry of its legs and joints. Similarly, a well-designed robot hand that naturally conforms to the objects it grips requires far less software to control than a rigid one. This suggests that hardware design choices have downstream consequences for what kinds of intelligence are possible, a point that seems underappreciated in a field that tends to focus on software.


Data is still underrated. Model architecture may be overrated.

One of the sharpest observations of the evening was also one of the simplest: data is still underrated, and model architecture is overrated. Any reasonable architecture with enough compute performs similarly at the limit. The differentiator is the data you train on, not the model you build. And the teams with the most resources will accumulate the most data, compounding their advantage over time.

This has implications for how the field studies progress. In language models, there is a well-understood relationship: invest X dollars in compute and data, and you can predict roughly how much better the model will get. In robotics and video generation, that relationship does not yet exist. Models improve with more investment, but nobody can predict by how much. The reason is circular: without good benchmarks, you cannot establish these predictive relationships, and without those relationships, you cannot make a confident economic case for large-scale investment.

Whoever solves the data representation problem, particularly how to encode the experience of different types of robots and human demonstrations into a common format that transfers across machines, is likely to have a structural advantage that no model design innovation can match.
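As a loose illustration of what a “common format” could mean, here is a minimal schema sketch. The field names are our own illustrative assumptions, not an existing or proposed standard.

```python
from dataclasses import dataclass, field

@dataclass
class Step:
    observation: dict            # e.g. {"rgb": image, "proprio": joint_angles}
    action: list[float]          # normalized into a shared action space
    instruction: str = ""        # optional language annotation

@dataclass
class Episode:
    embodiment: str              # e.g. "7dof_arm", "humanoid", "human_video"
    action_space: str            # recipe for mapping actions back to motors
    steps: list[Step] = field(default_factory=list)

# The hard part is not the container but the normalization: deciding how a
# parallel gripper, a five-fingered hand, and a human demonstration all map
# into (and back out of) the same `action` vector.
```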


The path from demos to deployment is longer than it looks

Listening to the conversation, one thing stood out to us as investors: there is still a meaningful gap between what robots can do in controlled demos and what it takes to sustain them in real-world operations. From our vantage point, the industry is still in the early innings of figuring out deployment economics, including the operational overhead of maintaining robots in the field, the supervision requirements, and the timeline to positive unit economics.

That said, we remain genuinely optimistic. The underlying technical progress is real, scaling laws appear to hold, and hardware continues to improve. We believe the companies and investors who think seriously about deployment challenges now, rather than after the first wave of field deployments, will be better positioned for the long term.


Where we go from here

At Qualcomm Ventures, conversations like these shape how we think about the robotics landscape. The signal from the frontier is clear: the excitement is justified by genuine technical progress, but the timeline from demos to deployment is longer and harder than most narratives suggest.

We believe the infrastructure layers of robotics, including evaluation systems, data representation, edge inference, and hardware-software co-design, are where durable value will be created. We are actively mapping this landscape and deepening our relationships with the teams building at the frontier.

More to come as our thesis develops.

Disclaimer: This post is for informational purposes only and reflects the authors’ views as of the date of publication. It is not investment advice and should not be relied upon as such. References to companies, technologies, or market developments are illustrative. Any forward-looking statements are subject to risks and uncertainties and may differ materially from actual results.