Resiliency Engineer
digitalxnode
Job Description
Key Responsibilities:
- DR Automation: Develop and implement automated disaster recovery (DR) solutions for infrastructure components to streamline failover processes and meet or shorten recovery time objectives (RTOs).
- Recovery Plan Development: Create, maintain, and test comprehensive recovery plans for critical applications and systems.
- DR Testing and Validation: Conduct regular DR drills and tests to validate recovery procedures and identify areas for improvement.
- Infrastructure Automation: Utilize automation tools (e.g., Ansible) to automate infrastructure tasks and enhance resiliency.
- Collaboration: Work closely with infrastructure, application, and business teams to ensure alignment with DR and business continuity planning (BCP) strategies.
- Continuous Improvement: Identify opportunities to improve DR processes, reduce recovery times, and enhance overall system resilience.
Qualifications and Experience:
- Bachelor’s degree in engineering, computer science, or a related discipline
- 5+ years of experience in infrastructure engineering and automation
- 3+ years of experience with cloud computing (AWS, Azure)
- Strong proficiency in scripting languages (Python, PowerShell)
- Familiarity with configuration management tools (Ansible, Puppet, Chef)
- Knowledge of ITIL frameworks and best practices
- Excellent problem-solving and troubleshooting skills
- Strong communication and collaboration skills