Lead Systems Operations Engineer

wellsfargojobs

Bengaluru, India 5 Years Exp Posted 15d ago

Job Description

 

In this role, you will:

  • Lead complex, broad-impact initiatives, including providing high-level systems consultation to the technology teams
  • Work as a key participant in large-scale planning of computer systems and network infrastructure for the Systems Operations functional area
  • Review and analyze complex technical challenges and escalated support issues related to core business solutions that require in-depth evaluation of multiple factors, such as alternatives, enhancements, periodic systems reviews, or improvements to existing systems
  • Make decisions on technical changes and enhancements
  • Consult with the engineering team on change design that requires a solid understanding of the technical process controls or standards that influence and drive new initiatives
  • Collaborate and consult with technical peers, colleagues, and mid-level to more experienced managers to resolve systems support issues and achieve goals
  • Lead the transformation of traditional platform operations into a modern Site Reliability Engineering (SRE) model: driving reliability by design, elevating SLIs/SLOs, automating operational toil, strengthening observability, and maturing incident and problem management. Stay hands-on while mentoring Ops and Engineering teams to adopt SRE practices at scale across the platform ecosystem.

Required Qualifications:

  • 5+ years of Systems Engineering, Technology Architecture experience, or equivalent demonstrated through one or a combination of the following: work experience, training, military experience, education

Job Expectations:

  • Reliability & Performance
    • Define and implement SLIs/SLOs and error budgets for critical platform services; drive SLO adoption across product and operations teams.
    • Build, enhance, and tune end-to-end observability (metrics, logs, traces) with a focus on the golden signals: latency, traffic, errors, saturation.
    • Partner with performance engineering teams to run load, stress, soak, and failover tests; identify and eliminate performance bottlenecks.
  • Platform & Automation
    • Identify and eliminate operational toil; implement automation and AI-driven workflows for reliability and operational excellence.
    • Generate AI-based observability assessments, maturity scoring, and gap analysis for all platform applications.
    • Build self-service reliability tooling: automated runbooks, readiness checkers, golden paths, and standard reliability patterns.
  • Incident, Problem & Change
    • Lead major incidents as Incident Commander; ensure clear communication, rapid triage, and timely restoration.
    • Facilitate blameless postmortems, document corrective actions, and ensure follow-through.
    • Strengthen platform-level problem management through trend analysis, recurring issue elimination, and proactive risk reduction.
  • Culture & Enablement
    • Coach and mentor platform engineering, ops, and product teams on SRE principles and reliability-first mindset.
    • Define and maintain SRE maturity models, track adoption, and provide continuous improvement recommendations.
    • Ensure documentation—runbooks, dashboards, readiness checklists, reliability reviews—remains current, actionable, and standardized.
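
For candidates less familiar with the SLO and error-budget terminology used above, the arithmetic can be sketched roughly as follows. This is only an illustrative sketch, not a description of the team's tooling: the `ErrorBudget` class, its field names, and the example numbers are all hypothetical, and it assumes a simple request-based availability SLI.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ErrorBudget:
    """Hypothetical request-based error budget for one SLO window."""

    slo_target: float     # e.g. 0.999 means 99.9% of requests should succeed
    total_requests: int   # requests observed during the SLO window
    failed_requests: int  # requests that violated the SLI in that window

    @property
    def allowed_failures(self) -> float:
        # The budget is the share of requests the SLO permits to fail.
        return (1.0 - self.slo_target) * self.total_requests

    @property
    def remaining_fraction(self) -> float:
        # 1.0 means the budget is untouched; a value at or below 0.0
        # means the budget is exhausted.
        allowed = self.allowed_failures
        if allowed == 0:
            return 0.0 if self.failed_requests else 1.0
        return 1.0 - self.failed_requests / allowed


# Assumed example: a 99.9% SLO over 1,000,000 requests with 250 failures.
budget = ErrorBudget(slo_target=0.999, total_requests=1_000_000, failed_requests=250)
print(f"allowed failures: {budget.allowed_failures:.0f}")
print(f"budget remaining: {budget.remaining_fraction:.0%}")
```

In practice a team might compare the remaining fraction against a policy threshold to decide whether to slow down releases; where that threshold sits is a judgment call rather than anything this posting specifies.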

Required Qualifications:

  • Experience: 5+ years in large-scale distributed systems, including at least 5 years of hands-on experience in SRE, DevOps, or Platform Engineering.
  • Cloud: Expertise in one or more: AWS, Azure, GCP (cloud certifications preferred).
  • IaC & Automation: Terraform, Ansible/Chef; strong Git and GitOps practices.
  • Observability: Hands-on experience with Prometheus, Grafana, OpenTelemetry, ThousandEyes, AppDynamics, Aternity.
  • CI/CD: Azure DevOps, GitHub Actions, Jenkins, or GitLab CI; strong understanding of artifact management & environment promotion workflows.
  • Programming: Proficiency in Python/Go/Java for scripting, automation, and API integrations.
  • Reliability Practices: SLIs/SLOs, error budgets, capacity planning, canary/blue‑green deployments, chaos engineering, DR testing.
  • Processes: Strong knowledge of Incident/Problem/Change management, blameless postmortems, on‑call operations, and runbook development.
  • Excellent communication, documentation, and cross-team collaboration
