Principal AI Engineer
Choice Ventures
Software Engineering, Data Science
New South Wales, Australia
About the Company
A world-leading AI infrastructure scale-up building the next generation of AI Factories: purpose-built, liquid-cooled environments engineered to maximise AI performance, efficiency, and profitability.
The company operates at the intersection of AI, cloud infrastructure, energy systems, and high-performance computing. They innovate across every layer of the AI stack, from energy grid management and thermal engineering through to GPU orchestration, networking, telemetry, and control systems.
At the heart of the platform is a global AI Cloud that delivers energy-efficient GPU compute to developers, enterprises, universities, and governments worldwide. Their mission is simple: build the world's most efficient AI infrastructure.
The Role and What You'll Be Doing...
As a Principal AI Engineer within the AI & Applications team, you'll own and evolve the control plane powering AI workload orchestration across large-scale GPU infrastructure.
You'll design and build the APIs, services, tooling, and user experiences that enable customers to train, fine-tune, and deploy AI models across Kubernetes and Slurm environments at scale.
This is a highly technical, hands-on role where you'll work across distributed systems, workload scheduling, multi-tenant infrastructure, observability, and AI platform engineering. You'll partner closely with Platform Engineering, ModelOps, LLM Engineering, and Infrastructure teams to build a world-class AI platform used by some of the most demanding customers in the world.
Key Responsibilities
- Design and build unified job submission APIs, CLI tooling, and web interfaces for AI training, inference, and fine-tuning workloads.
- Develop robust workload orchestration capabilities across Kubernetes and Slurm environments.
- Define and maintain scalable job metadata models covering identity, tenancy, lineage, resource allocation, priority, and lifecycle management.
- Implement multi-tenant controls, including RBAC, quota management, isolation policies, and governance frameworks.
- Build intelligent scheduling systems incorporating priority classes, fairness algorithms, pre-emption, quota enforcement, and workload routing.
- Create and maintain an AI Factory template catalogue to enable repeatable deployment patterns across training and inference workloads.
- Expose rich telemetry and analytics APIs covering GPU utilisation, model efficiency, throughput, latency, token generation, and infrastructure costs.
- Partner with cross-functional teams to define platform standards, reliability frameworks, observability practices, and operational excellence.
About You
You are an experienced systems engineer who enjoys solving complex infrastructure and platform challenges at scale.
You'll likely bring:
- 5+ years of experience building large-scale distributed systems, platform services, or cloud infrastructure.
- Strong software engineering capability in Python, Go, Java, or similar.
- Deep Kubernetes expertise, including controllers, scheduling, RBAC, networking, security, resource management, and production troubleshooting.
- Hands-on experience operating and optimising Slurm-based HPC or AI environments.
- Strong understanding of distributed systems principles, workload scheduling, fairness algorithms, resource allocation, and scalability patterns.
- Experience designing APIs, platform abstractions, and developer tooling.
- Strong observability mindset across monitoring, logging, tracing, and production operations.
- Experience with real-time systems, streaming architectures, webhooks, and event-driven platforms.
- A passion for building reliable, elegant infrastructure that enables AI workloads to operate at scale.
Why This Role
- Build infrastructure powering the next generation of AI.
- Work on some of the most complex challenges in AI infrastructure, GPU orchestration, and distributed systems.
- Influence architecture and technical direction at a global scale.
- Collaborate with world-class engineers across AI, cloud, networking, and high-performance computing.
- Join a rapidly growing company shaping the future of AI infrastructure worldwide.