The newest physical AI systems can inspect their environment, connect what they see to a goal, and adjust their behaviour in response. Capgemini terms this ability Vision-Language-Action (VLA), and expanded on the subject in a recent blog post. VLA links perception and action in an operational loop, the company states.
Vision-language models (VLMs) give AI systems a way to relate images to language and vice versa. The company claims robots could identify objects and answer questions about the items and activities they can see. Robots equipped only with such perception models can describe a defect or an item, but can't decide what to do next based purely on that perception; machines that can do so typically rely on systems hosted elsewhere in the facility.
The vision presented by Capgemini is one of robots receiving instructions in human language, interpreting every scene in which they operate, and choosing actions that fit instructions and context.
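To make that loop concrete, the following is a minimal, illustrative Python sketch of how perception, a language instruction, and action could be linked. The names `camera`, `robot`, and `vla_policy` are hypothetical stand-ins for a sensor driver, an actuator interface, and a VLA model; none of them come from Capgemini's post.

```python
# Minimal sketch of a Vision-Language-Action loop (illustrative only).
# `camera`, `robot`, and `vla_policy` are hypothetical stand-ins, not a real API.

def run_vla_loop(camera, robot, vla_policy, instruction: str):
    """Continuously link perception, a language instruction, and action."""
    while not robot.task_complete():
        frame = camera.capture()                         # perception: current scene
        action = vla_policy.predict(frame, instruction)  # grounding: image + language -> action
        if robot.is_safe(action):                        # safety gate before actuation
            robot.execute(action)                        # action: move, grasp, inspect, etc.
        else:
            robot.stop_and_alert()                       # hand off to a human operator
```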
As a term, VLA doesn't describe a new, standalone product category, but rather a device equipped with an additional compute layer. The success of VLA deployments depends on sensors, control systems, simulation, safety mechanisms, and infrastructure, the company says.
Constraints on robots operating in the physical world are rightly stricter than in digital domains. Latency, energy consumption, and safety matter to a far greater degree, and Capgemini states that digital twins are an important stage in the development process, exposing systems to the range of conditions they might meet. Any test of practicality also involves a host of external factors: efficient data infrastructure into and out of physical AI devices, plus on-edge inference, training, and safety controls aligned with the VLA system, with every element performing as expected to ensure proper input and output. Without those surrounding capabilities, the model alone has limited value and could put safety and operational results at risk.
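As one illustration of those constraints, a deployment might wrap on-edge inference in explicit latency and safety gates before any command reaches the actuators. The sketch below is an assumption-laden example: the 100 ms budget and every function name are invented for illustration, not drawn from Capgemini.

```python
import time

# Illustrative guardrails around on-edge inference; all numbers and names are
# assumptions for this sketch.
LATENCY_BUDGET_S = 0.10   # e.g. a 100 ms actuation deadline

def infer_with_guardrails(vla_policy, frame, instruction, robot):
    start = time.monotonic()
    action = vla_policy.predict(frame, instruction)   # on-edge inference
    elapsed = time.monotonic() - start

    if elapsed > LATENCY_BUDGET_S:
        robot.hold_position()            # too slow for this control cycle: do nothing risky
        return None
    if not robot.is_within_safety_envelope(action):
        robot.stop_and_alert()           # reject actions outside certified limits
        return None
    return action
```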
Industrial automation is built to be predictable. Systems perform well when the surrounding processes are stable and there is little moment-by-moment variation. When an environment changes or components vary, the costs appear as downtime and re-engineering effort, which VLA hopes to address.
Giving robots the flexibility to interpret situations and choose actions is the promise of VLA. Capgemini states that physical robots could progress from fixed logic to an ability to adapt. Engineering teams wouldn't have to code every use case, it says, but would instead allow the AI to handle variation itself through decision-making and on-the-fly adaptation.
Simulation in the form of digital twins has to represent real-world performance and environments, with feedback loops to ensure that drift, failure, and edge cases are correctly acted on. The company refers to a 'data flywheel', describing a loop in which performance improves through repeated interactions. Even so, human operators have to be on hand during training and operation, the company says.
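That flywheel can be read as a deploy, collect, retrain, evaluate cycle. Below is a hedged sketch of that loop; every helper (deploy, collect_interactions, retrain, passes_safety_tests) is assumed purely for illustration and is not a real API.

```python
# Hedged sketch of a data-flywheel loop; all helper names are hypothetical.

def data_flywheel(policy, digital_twin, fleet, cycles: int = 5):
    for _ in range(cycles):
        fleet.deploy(policy)                           # run the current model on real assets
        logs = fleet.collect_interactions()            # capture drift, failures, edge cases
        reviewed = [log for log in logs if log.operator_approved]  # keep humans in the loop
        policy = digital_twin.retrain(policy, reviewed)             # improve in simulation first
        if not digital_twin.passes_safety_tests(policy):
            break                                      # stop the flywheel rather than ship a regression
    return policy
```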
The early focus for business leaders should be on capturing real-life operator workflows, which are likely to contain knowledge that wouldn't necessarily appear in machine or employee manuals. Post-inference adjustments that would normally be required at a code level may matter less, given the inherent, on-board abilities of VLA physical AI. But it would remain the individual facility operator's responsibility to cover safety, cybersecurity, and certification, and to provide transparency into AI actions. Throughout testing and deployment, business metrics like cycle time, yield, downtime, and near misses will need to be gathered and examined carefully.
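Gathering those metrics consistently is largely a data-capture exercise. The record below is a minimal, assumed structure for illustration; none of the field names come from Capgemini or any standard schema.

```python
from dataclasses import dataclass
from datetime import datetime

# Illustrative record for the business metrics the article lists.

@dataclass
class ShiftMetrics:
    timestamp: datetime
    cycle_time_s: float      # average seconds per completed cycle
    yield_pct: float         # good units as a share of total output
    downtime_min: float      # unplanned downtime during the shift
    near_misses: int         # safety events logged by operators or the VLA layer

def flag_regressions(before: ShiftMetrics, after: ShiftMetrics) -> list[str]:
    """Compare pre- and post-deployment shifts and flag metrics that worsened."""
    issues = []
    if after.cycle_time_s > before.cycle_time_s:
        issues.append("cycle time increased")
    if after.yield_pct < before.yield_pct:
        issues.append("yield dropped")
    if after.downtime_min > before.downtime_min:
        issues.append("downtime increased")
    if after.near_misses > before.near_misses:
        issues.append("more near misses")
    return issues
```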
Capgemini attests that a well-integrated VLA layer can improve the performance of existing assets and reduce the cost of changing processes, giving organisations an agility that static installations cannot offer. It predicts that human roles will become supervisory, handling exceptions and orchestrating machines.
VLA could be seen as giving robots a cognitive layer through the combination of perception, natural-language instructions, and physical actions. Prediction, the ability to model what is likely to happen next in a dynamic environment, will be difficult, and companies need to trust that their AI-driven physical devices have the smarts to cope creatively with edge cases. VLA may give robots a way to respond, and their environmental models may give them the ability to anticipate. This transition will shape the next phase of physical AI.
(Image source: “Tillamook Cheese Factory” by CarolMunro is licensed under CC BY-NC 2.0. To view a copy of this license, visit https://creativecommons.org/licenses/by-nc/2.0)



