Senior Site Reliability Engineer (SRE) – ML & Data Science Platforms at NVIDIA
Join NVIDIA’s Cutting-Edge AI & Data Science Team!
Are you passionate about large-scale production systems, machine learning (ML), and data science? Do you want to be part of a team that powers NVIDIA’s data-driven decision-making? We’re hiring a Senior Site Reliability Engineer (SRE) to design, build, and optimize platforms that enable real-time analytics, AI training, and inference at scale.
At NVIDIA, you’ll work on innovative cloud infrastructure, Kubernetes, observability tools, and high-performance computing—helping shape the future of AI. If you thrive in a fast-paced, collaborative environment, this role is for you!
What You’ll Be Doing
✅ Design & maintain scalable, reliable systems for ML training, data lakes, and streaming analytics (Kafka, Spark, Kubernetes).
✅ Automate operations—reduce manual tasks, improve efficiency, and enhance system observability (Prometheus, ELK, Grafana).
✅ Optimize SLOs & SLAs—apply SRE best practices (error budgets, incident management, blameless postmortems).
✅ Collaborate with data scientists & engineers to improve platform performance, latency, and scalability.
✅ Lead capacity planning—ensure seamless scaling across public, private, and hybrid clouds.
✅ Build CI/CD pipelines (Jenkins, GitHub Actions) and Infrastructure as Code (IaC) solutions.
What We Need to See
Minimum Qualifications:
- 10+ years in SRE, DevOps, or Cloud Engineering with large-scale microservices.
- Master’s/Bachelor’s in Computer Science, Electrical Engineering, or related field (or equivalent experience).
- Expertise in Python, Go, or Ruby, with strong coding skills for automation & tooling.
- Deep knowledge of Kubernetes, OpenStack, CI/CD, and IaC.
- Experience with observability tools (Prometheus, ELK, Grafana) and distributed systems.
- Strong problem-solving, debugging, and performance optimization skills.
Preferred Skills (Stand Out from the Crowd!):
- Experience with large-scale ML/data platforms (Spark, Kafka, TensorFlow).
- Strong background in AI/ML infrastructure and high-performance computing (HPC).
- Leadership in incident management & postmortems.
- Excellent communication & collaboration skills.
Why NVIDIA?
🚀 Work on groundbreaking AI & accelerated computing—powering everything from self-driving cars to generative AI.
🌍 Global impact—solve challenges that transform industries.
💡 Innovative culture—blameless postmortems, risk-taking, and continuous learning.
💰 Competitive compensation—base salary range $224,000 – $425,500 USD + equity & benefits.
🌈 Inclusive workplace—NVIDIA is proud to be an equal opportunity employer.
How to Land This Job (Application Tips)
- Tailor Your Resume – Highlight SRE, Kubernetes, AI/ML infrastructure, and automation experience.
- Showcase Problem-Solving – Prepare examples of incident resolution, system scaling, and optimization.
- Demonstrate Coding Skills – Be ready for Python/Go coding assessments.
- Research NVIDIA’s Tech Stack – Know their work in AI, CUDA, and high-performance computing.
- Prepare for Behavioral Interviews – Emphasize collaboration, leadership, and innovation.
FAQs About the Role
1. What does an SRE at NVIDIA do?
NVIDIA SREs ensure high availability, scalability, and performance of ML/data platforms. They automate operations, optimize cloud infrastructure, and collaborate with AI/ML teams.
2. What tech stack is used?
- Cloud: Kubernetes, OpenStack, AWS/GCP/Azure
- Monitoring: Prometheus, ELK, Grafana
- Data/ML: Kafka, Spark, TensorFlow
- Automation: Python, Go, Terraform, Jenkins
3. Is remote work available?
NVIDIA offers hybrid/remote options depending on location.
4. What’s the career growth like?
From Senior SRE, you can grow into Principal Engineer, Cloud Architect, or AI Infrastructure Lead.
5. How does NVIDIA support diversity?
NVIDIA is committed to inclusion, offering mentorship programs and employee resource groups (ERGs).
Apply Now & Power the Future of AI!
If you’re ready to build the infrastructure behind AI breakthroughs, apply today! NVIDIA is hiring globally—submit your resume and join a team that’s transforming computing.
Freelance Senior Site Reliability Engineer, ML Platforms (job id 9054)