Sr. TL SRE
Job Description
Roles and Responsibilities
· Own production services end to end, with accountability for reliability, availability, scalability, performance, and operational health.
· Define and manage SLIs and SLOs, using error budgets to guide delivery decisions.
· Influence service and system design to improve fault tolerance, observability, and operational sustainability.
· Debug complex production issues across application code, services and infrastructure using software engineering practices.
· Perform root cause analysis using logs, metrics, traces, and code-level investigation.
· Build automation and self-healing mechanisms to prevent repeat failures.
· Execute production changes (patching, certificate management, software releases) with safety, automation, and observability.
· Design and operate production observability aligned to service health and customer impact.
· Lead and participate in incident response for high-severity events.
· Collaborate with engineering, product, architecture, and operations teams.
· Operate with autonomy and sound judgment in reliability decisions.
Skills & Requirements
Qualifications:
· 8 to 12 years of hands-on Site Reliability Engineering or reliability-focused engineering experience with end-to-end service ownership.
· Proven track record of operating at senior engineering scope, with accountability for reliability outcomes.
· Strong software engineering skills in C#, .NET, Java, Python, React, or similar technologies.
· Practical experience applying SRE principles (SLIs, SLOs, error budgets).
· Hands-on experience with AWS, Kubernetes, CI/CD, infrastructure as code and hybrid environments.
· Strong knowledge of Linux and Windows systems, application platforms and relational databases.
· Bachelor’s or master’s degree in computer science or equivalent experience.
· Willingness to participate in an on-call rotation and to work flexible hours as required.