Ensure production system reliability by implementing SLO-driven practices, building observability platforms, and automating incident response for critical enterprise applications.
Hyderabad, India
Full-time
Cloud & Infrastructure
Responsibilities
Define and maintain SLIs, SLOs, and error budgets for production services with clear escalation policies
Build and operate observability platforms using Prometheus, Grafana, Loki, and distributed tracing tools
Participate in on-call rotations and lead incident response, conducting thorough blameless postmortems
Reduce toil through self-healing systems and runbook automation, and validate resilience with chaos engineering practices
Collaborate with development teams to improve service reliability through design reviews and load testing
Develop and maintain internal SRE tooling for deployment safety, capacity planning, and performance analysis
Requirements
3-5 years of experience in SRE, DevOps, or production engineering roles
Strong understanding of SRE principles including SLOs, error budgets, and toil elimination
Proficiency with monitoring and observability tools (Prometheus, Grafana, ELK/Loki, Jaeger)
Experience with Linux systems administration, networking, and performance troubleshooting
Strong programming skills in Python, Go, or similar languages for building SRE tooling
Experience with incident management processes and blameless postmortem culture
Nice to Have
Experience with chaos engineering tools such as LitmusChaos or Gremlin
Knowledge of capacity planning and traffic forecasting methodologies
Familiarity with AIOps and ML-driven anomaly detection