Lead site reliability engineering efforts for large-scale distributed systems, driving 99.99% availability targets through advanced observability, automation, and resilience engineering.
Seattle, USA
Full-time
Cloud & Infrastructure
Responsibilities
Lead SRE strategy and practices across multiple product teams, ensuring consistent reliability standards
Architect and maintain enterprise-grade observability platforms using OpenTelemetry, Prometheus, and Grafana
Drive chaos engineering programs to proactively identify failure modes and improve system resilience
Mentor SRE team members and embed reliability practices into the software development lifecycle
Lead high-severity incident response and establish postmortem processes that drive continuous improvement
Design capacity planning models and automated scaling strategies for cost-efficient high availability
Requirements
6-9 years of SRE or production engineering experience with large-scale distributed systems
Deep expertise in observability, including metrics, logs, traces, and profiling at scale
Advanced Kubernetes operations experience, including multi-cluster management and custom controllers
Strong software engineering skills in Go, Python, or Rust for building production-grade SRE tooling
Proven experience managing SLO frameworks and driving reliability improvements across organizations
Experience operating cloud infrastructure (AWS or GCP) at scale, including multi-region deployments
Nice to Have
Experience leading SRE teams or guilds in a large engineering organization
Background in performance engineering and profiling distributed systems
Skills
SRE, Kubernetes, Go, OpenTelemetry, Prometheus, AWS, Chaos Engineering, Distributed Systems