Associate Director Application Support Engineering

oraclecloud

Chennai, India 8 Years Exp Posted 1d ago

Job Description

  • Scrum Participation: Join all project collaborators planning and design sessions, sprint zero and stand-ups for all new delivery, to champion SRE requirements reflective of a strong observability and resiliency traits.
  • System Reliability Architecture: Drive Design and help implement reliable, resilient, and scalable systems, considering redundancy, fault tolerance, and disaster recovery strategies. Make design recommendations that will allow the application to recover without cleanup activities or create a recovery runbook for application support team to follow for improved application recovery times.
  • Monitoring and Alerting: Develop comprehensive monitoring systems to identify potential issues proactively, define actionable alerts, and establish SLIs (Service Level Indicators) and SLOs (Service Level Objectives).
  • Incident Management: Lead incident response during critical system outages, facilitating timely problem diagnosis and resolution, conducting post-mortem analysis to identify root causes and prevent future occurrences.
  • Automation and Tooling: Develop and maintain automation scripts to streamline operational tasks, including self-healing, application deployments, scaling, and infrastructure management.
  • Collaboration with Development Teams: Work closely with development teams to integrate SRE practices into the software development lifecycle, promoting code quality, reliability, and observability.
  • Security Integration: Collaborate with security teams to ensure system resilience against cyber threats, implementing security best practices and supervising for vulnerabilities.
  • Technical Expertise: Stay updated on emerging technologies and industry trends related to cloud computing, distributed systems, and reliability engineering.
  • Operational Readiness: Attend and present operational readiness with application support (EAS L2) at each project management meeting - raise any operational risks and concerns. Test SRE requirements in UAT environments to validate effectiveness and completeness of operational capabilities.
  • Risk Management: Partner with IT Embedded Risk Managers to identify strategic solutions for risk incidents.
  • Metrics and Reporting: Demonstrate operational improvements through defined KPIs.
  • Capacity Planning: Proactively assess system capacity needs, plan for future growth, and implement scaling strategies to ensure optimal performance under high load.
    • Performance Optimization: Analyze system performance metrics to identify bottlenecks and implement optimization strategies to improve system responsiveness and efficiency.

Similar Openings for You