Expert Site Reliability Engineer
Finastra
Job Description
Objectives of this Role
- Work in tandem with our engineering team to identify and implement the best cloud-based solutions for the company.
- Define and document best practices and strategies regarding application deployment and infrastructure maintenance.
- Provide guidance, thought leadership, and mentorship to development teams to build cloud competencies.
- Ensure application performance, uptime, and scale, maintaining high standards of code quality and thoughtful design.
- Manage cloud environments in accordance with company security guidelines.
- Stay current with industry trends, making recommendations as needed to help the organization innovate and excel.
Responsibilities
- Develop, deploy, and maintain infrastructure on Azure using Docker and Kubernetes.
- Implement automation tools and frameworks (CI/CD pipelines).
- Collaborate with team members to improve the company’s engineering tools, systems and procedures, and data security.
- Optimize the company’s computing architecture.
- Conduct systems tests for security, performance, and availability.
- Develop and maintain design and troubleshooting documentation.
- Collaborate with the engineering teams to enable their applications to run on cloud infrastructure.
- Debug technical issues inside a complex stack involving virtualization, containers, microservices, etc.
- Troubleshoot incidents, identify root causes, fix and document problems, and implement preventive measures.
- Employ exceptional problem-solving skills, with the ability to see and solve issues before they snowball into problems.
Requirements
- Bachelor’s degree in computer science, information technology, or mathematics.
- 8+ years of proven experience as a Site Reliability Engineer or in a similar role in software development and system administration.
- Experience with Docker for containerization and application deployment.
- Experience with Kubernetes and Helm for orchestration of Docker containers.
- Experience with Azure cloud services and understanding of their offerings and architecture.
- Knowledge of databases and operating systems.
- Ability to troubleshoot complex software and hardware issues.
- Knowledge of best practices related to data encryption and cybersecurity.
- Excellent problem-solving and communication skills.
- Experience in network, server, and application-status monitoring.
- Operating systems: any Linux/Unix flavor
- Monitoring: Prometheus, Grafana