Software Engineer, Generative AI Research
Excelero Storage
We are now looking for a Senior Software Engineer for Generative AI Research! At NVIDIA, we believe the next generation of AI will be physical AI: systems that perceive, reason, and act in the real world. Building these models requires robust systems that span large-scale compute, multimodal datasets, simulation-driven synthetic data, and real-time reasoning for robots and autonomous systems.
Our Cosmos infrastructure team sits at the heart of this mission. We build the systems that make it possible to train Cosmos, NVIDIA’s world foundation model for physical AI. Cosmos enables robots, autonomous agents, and other AI systems to understand, plan, and act in complex environments. Our team develops the Cosmos platform infrastructure that powers model training, data pipelines, simulation, and deployment at scale, enabling research and production to move faster and more efficiently than ever before. This role is a unique opportunity to work on infrastructure that directly enables physical AI at scale: from optimizing massive data pipelines and designing training workflows that support foundation models, to scaling distributed compute systems and building the backbone for simulation-driven experimentation.
What You’ll Be Doing:
Design, build, and operate scalable infrastructure for training Cosmos and supporting large-scale data pipelines
Develop high-throughput systems for data processing, retrieval, and workflow orchestration
Collaborate across research, optimization, and platform teams to accelerate experiments and deployments
Improve system reliability, performance, and observability across distributed compute environments
Contribute to long-term infrastructure strategy for training, data management, and large-scale compute efficiency
What We Need to See:
A Master’s degree in Computer Science, Computer Engineering, a related STEM field, or equivalent experience
Strong engineering background in distributed systems, ML infrastructure, or large-scale compute/data platforms, with 6 years of relevant work experience
Proficiency in Python and at least one systems language (e.g., C++/Go/Rust)
Experience with orchestration systems, scheduling, and scalable storage or data pipelines
Ability to work across teams, drive technical clarity, and deliver robust solutions in complex environments
Comfortable bridging research workflows and production-grade systems
Ways to Stand Out from the Crowd:
Experience building or optimizing infrastructure for large-scale model training
Hands-on work with distributed compute environments or high-performance systems
Familiarity with synthetic data, simulation pipelines, or large multimodal datasets
Contributions to open-source infrastructure or large-scale internal tooling
You will also be eligible for equity and benefits.