Senior Solutions Architect, Cluster Design and Architecture - Networking
Excelero Storage
NVIDIA is building the world’s most groundbreaking and innovative accelerated computing platforms for AI and HPC. Because of our work, scientists, researchers, and engineers can push the boundaries of what’s possible. We pioneered a supercharged form of computing that powers everything from breakthrough AI research to the world’s fastest supercomputers. We are seeking a highly motivated Senior Solutions Architect to join the Cluster Design and Architecture team with a focus on networking technologies. As AI workloads scale to unprecedented levels, the network is the backbone that makes large compute clusters possible. In this role, you will be at the forefront of designing and architecting next-generation networking solutions that connect thousands of GPUs and enable the world’s most advanced AI supercomputers and enterprise AI infrastructure in the field.
As a Solutions Architect, you will act as a key technical expert for NVIDIA’s new networking technology builds, including InfiniBand, Spectrum-X, NVLink, and all related software solutions. You will work directly between engineering and field teams to support customers with fast-paced requirements. You will work on end-to-end cluster design, network topology and architecture optimization, and performance modeling and validation. Your expertise will directly influence how the world’s leading AI companies, cloud providers, hyperscalers, research institutions, and enterprises build their infrastructure.
What you’ll be doing:
Partner with internal engineering teams on GPU cluster building and networking, and convey architecture and guideline information both directly to customers and to the field teams supporting them
Guide field teams and their customers in cluster design, weighing design principles against complex, situational constraints to build the most performant and supportable GPU clusters possible
Work closely with field teams supporting customers to ensure successful first deployments with new products, including new network architectures and topologies
Feed customer and field perspectives on networking development and workflows back to the engineering teams that build internal clusters and/or compose customer-facing documentation on guidelines and service flows
Perform hands-on work to help field teams debug issues relating to network build, configuration, and performance, drawing on internal engineering expertise and knowledge of known bugs
What we need to see:
BS, MS, or PhD in Computer Science, Electrical Engineering, Computer Engineering, Physics, or related field (or equivalent experience)
8+ years of experience in network architecture, network design, network validation, and troubleshooting
Proven expertise in designing large-scale distributed systems, AI clusters, or HPC infrastructure
Ability to translate complex engineering concepts into customer-ready documentation, diagrams, and reference material
Ways to stand out from the crowd:
Experience leading large-scale AI Factory or HPC cluster bring-ups or builds
Hands-on experience with NVIDIA networking products including, but not limited to, InfiniBand, Spectrum-X, and BlueField
Knowledge of NCCL, MPI, and collective communication patterns in distributed training as they pertain to network traffic patterns and design
Background in network performance optimization, congestion control, and validation at scale
Customer-facing skills and a background working with external customers
NVIDIA is widely considered to be one of the technology world’s most desirable employers with very competitive benefits. We have some of the most forward-thinking and innovative people in the world working for us. If you're creative and autonomous, we want to hear from you!
Your base salary will be determined based on your location, experience, and the pay of employees in similar positions. The base salary range is 184,000 USD - 287,500 USD for Level 4, and 224,000 USD - 356,500 USD for Level 5. You will also be eligible for equity and benefits.