STN Inc
Site Reliability Engineer
Remote
Role brief
What this role is asking for.
SITE RELIABILITY ENGINEER
Platform and software · shared across customers

Reports to: Director, Site Reliability
Location: Remote (US)
Department: Cloud Platform Engineering / SRE/Reliability

POSITION SUMMARY
The Site Reliability Engineer (SRE) owns reliability, observability, and incident response for the GPU One (GPUaaS) platform. The SRE defines and enforces SLOs aligned with contractual SLAs, builds the observability stack, and leads major incidents to resolution.

KEY RESPONSIBILITIES
- Define and operate Service Level Objectives (SLOs) aligned with customer SLAs
- Build and maintain the observability stack including metrics, logs, traces, and alerting
- Lead incident response and chair post-incident reviews
- Drive automation to reduce toil and improve mean-time-to-recover (MTTR)
- Author and maintain operational runbooks alongside the NOC
- Manage on-call rotation, escalation paths, and incident-management tooling
- Coordinate cross-functionally with NOC, Platform Engineering, and Network Engineering
- Drive chaos engineering, game days, and reliability testing programs
- Produce SLA performance reports in coordination with the SLA Manager
- Mentor junior engineers and contribute to engineering culture

REQUIRED QUALIFICATIONS
- 5+ years in SRE, DevOps, or production engineering roles
- Strong programming skills in Go, Python, or both
- Hands-on experience operating Kuber...
Company role signals