Chinese Embodied AI & Robotics Foundation Models

Vision-Language-Action (VLA) and embodied foundation models from Chinese labs and robotics companies. These models run on robots rather than behind a token API, so this is a research-and-capability map, not a pricing comparison.

Not an API catalog. Unlike the LLM API models, these are open-weight or hardware-bound research models. You download the weights and run them on robot hardware, so there is no per-token pricing. Every spec below links to its primary source.

Vision-Language-Latent-Action (ViLLA)

Vision-Language-Action (VLA)

Diffusion Transformer (Diffusion)