Senior Site Reliability Engineer (SRE) – ML & Data Science Platforms at NVIDIA

Join NVIDIA’s Cutting-Edge AI & Data Science Team!

Are you passionate about large-scale production systems, machine learning (ML), and data science? Do you want to be part of a team that powers NVIDIA’s data-driven decision-making? We’re hiring a Senior Site Reliability Engineer (SRE) to design, build, and optimize platforms that enable real-time analytics, AI training, and inference at scale.

At NVIDIA, you’ll work on innovative cloud infrastructure, Kubernetes, observability tools, and high-performance computing—helping shape the future of AI. If you thrive in a fast-paced, collaborative environment, this role is for you!

What You’ll Be Doing

✅ Design & maintain scalable, reliable systems for ML training, data lakes, and streaming analytics (Kafka, Spark, Kubernetes).
✅ Automate operations—reduce manual tasks, improve efficiency, and enhance system observability (Prometheus, ELK, Grafana).
✅ Define and track SLOs & SLAs using SRE best practices (error budgets, incident management, blameless postmortems).
✅ Collaborate with data scientists & engineers to improve platform performance, latency, and scalability.
✅ Lead capacity planning—ensure seamless scaling across public, private, and hybrid clouds.
✅ Build CI/CD pipelines (Jenkins, GitHub Actions) and Infrastructure as Code (IaC) solutions.

What We Need to See

Minimum Qualifications:

10+ years in SRE, DevOps, or Cloud Engineering with large-scale microservices.
Master’s/Bachelor’s in Computer Science, Electrical Engineering, or related field (or equivalent experience).
Expertise in Python, Go, or Ruby—strong coding skills for automation & tooling.
Deep knowledge of Kubernetes, OpenStack, CI/CD, and IaC.
Experience with observability tools (Prometheus, ELK, Grafana) and distributed systems.
Strong problem-solving, debugging, and performance optimization skills.

Preferred Skills (Stand Out from the Crowd!):

Experience with large-scale ML/data platforms (Spark, Kafka, TensorFlow).
Strong background in AI/ML infrastructure and high-performance computing (HPC).
Leadership in incident management & postmortems.
Excellent communication & collaboration skills.

Why NVIDIA?

🚀 Work on groundbreaking AI & accelerated computing—powering everything from self-driving cars to generative AI.
🌍 Global impact—solve challenges that transform industries.
💡 Innovative culture—blameless postmortems, risk-taking, and continuous learning.
💰 Competitive compensation—base salary range $224,000 – $425,500 USD + equity & benefits.
🌈 Inclusive workplace—NVIDIA is proud to be an equal opportunity employer.

How to Land This Job (Application Tips)

Tailor Your Resume – Highlight SRE, Kubernetes, AI/ML infrastructure, and automation experience.
Showcase Problem-Solving – Prepare examples of incident resolution, system scaling, and optimization.
Demonstrate Coding Skills – Be ready for Python/Go coding assessments.
Research NVIDIA’s Tech Stack – Know their work in AI, CUDA, and high-performance computing.
Prepare for Behavioral Interviews – Emphasize collaboration, leadership, and innovation.
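
As a warm-up for the Python exercises mentioned above, here is a minimal sketch of an error-budget calculation, one of the SRE practices this role lists. The function name, parameters, and sample numbers are illustrative assumptions; they are not taken from NVIDIA's interview process or tooling.

```python
def error_budget_remaining(slo_target, total_requests, failed_requests):
    """Return the fraction of the error budget still unspent.

    slo_target is the availability goal as a fraction (e.g. 0.999).
    A negative result means the budget has been overspent.
    All names here are hypothetical and chosen for illustration.
    """
    if not 0 < slo_target < 1:
        raise ValueError("slo_target must be between 0 and 1, exclusive")
    if total_requests <= 0:
        raise ValueError("total_requests must be positive")

    # The budget is the number of failures the SLO tolerates in this window.
    allowed_failures = (1 - slo_target) * total_requests
    return 1 - failed_requests / allowed_failures


if __name__ == "__main__":
    # Assumed sample window: 1,000,000 requests, 250 failures, 99.9% SLO.
    remaining = error_budget_remaining(0.999, 1_000_000, 250)
    print(f"Error budget remaining: {remaining:.1%}")
```

In an interview setting, being able to explain why a negative result should slow down releases is likely to matter as much as the arithmetic itself.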

FAQs About the Role

1. What does an SRE at NVIDIA do?

NVIDIA SREs ensure high availability, scalability, and performance of ML/data platforms. They automate operations, optimize cloud infrastructure, and collaborate with AI/ML teams.

2. What tech stack is used?

Cloud: Kubernetes, OpenStack, AWS/GCP/Azure
Monitoring: Prometheus, ELK, Grafana
Data/ML: Kafka, Spark, TensorFlow
Automation: Python, Go, Terraform, Jenkins

3. Is remote work available?

NVIDIA offers hybrid/remote options depending on location.

4. What’s the career growth like?

From Senior SRE, you can grow into Principal Engineer, Cloud Architect, or AI Infrastructure Lead.

5. How does NVIDIA support diversity?

NVIDIA is committed to inclusion, offering mentorship programs and employee resource groups (ERGs).

Apply Now & Power the Future of AI!

If you’re ready to build the infrastructure behind AI breakthroughs, apply today! NVIDIA is hiring globally—submit your resume and join a team that’s transforming computing.

🔗 Apply Now on NVIDIA Careers

Freelance Senior Site Reliability Engineer, ML Platforms (Job ID: 9054)