Site Reliability Engineer II

myworkdayjobs

Bengaluru, India NM Years Exp Posted 31d ago

Job Description

  • Monitor, measure and improve the reliability, availability and scalability of Forcepoint products and infrastructure
  • Engage in incident response and participate in post-mortem analysis to investigate root causes and capture contributing factors for remediation
  • Perform analytics on previous incidents and trend/usage patterns to better predict issues and take proactive actions
  • Design and build custom tools as needed to support process optimization, challenging the status quo and improving operational efficiency
  • Participate in 24x7 rotational shifts and on-call duty to handle production operations issues
  • Identify routine manual operational practices and automate them using code and modern tools
  • Review and create dashboards/reports for application telemetry and infrastructure health to proactively identify performance constraints and bottlenecks
  • Monitor product performance and availability, and provide feedback to develop, test, and implement robust monitoring, alerting, and logging solutions.
  • Work collaboratively with software developers to promote best practices in reliability and operability, including code reviews and architectural discussions.
  • Work with stakeholders to monitor our products and ensure that they meet architecture and observability design requirements

Requirements:

  • Strong understanding of cloud-based architecture and operations. Hands-on experience with Amazon Web Services is preferred.
  • Experience building, administering, and managing Linux systems
  • Foundational understanding of infrastructure and platform technology stacks
  • Strong understanding of networking concepts, including protocols (TCP/IP, UDP, routing protocols, etc.), VLAN configuration, DNS, the OSI layers, and load balancing
  • Understanding of security architecture and certificate management
  • Working knowledge of infrastructure and application monitoring platforms such as Grafana Cloud, Xymon, LibreNMS, etc.
  • Working knowledge of incident response and alerting platforms such as PagerDuty, Opsgenie, xMatters, etc.
  • Understanding of core DevOps practices (CI/CD pipelines, release management, etc.)
  • Ability to write code in at least one modern programming language (Python, JavaScript, Ruby, etc.). Additional scripting skills are preferred
  • Understanding of and experience with configuration management platforms (Chef/Puppet/Ansible)
  • Prior experience with cloud management automation tools (Terraform, CloudFormation, etc.)
  • Experience with source code management software and API automation is crucial
  • Cloud certifications or equivalent experience are highly regarded
  • Service-availability-oriented mindset with a proactive approach to problem solving. The ideal candidate should be able to develop automated solutions to prevent recurring problems
  • Ability and willingness to challenge the status quo and optimize current procedures and processes
  • Strong sense of ownership and an ability to drive cross-functional process improvement
  • Excellent interpersonal, written, and verbal communication skills
  • Analytical and logical approach to problem solving and a willingness to automate repetitive tasks and reduce manual/reactive workload
  • Ability and willingness to coach and mentor team members and colleagues
