Cloud DevOps
Citi
Job Description
Responsibilities:
- Perform full lifecycle DevOps activities (design, develop, test, implement, maintain) for new and existing observability tools/platforms in an enterprise environment with on-premises and cloud-integrated systems/applications/infrastructure.
- Gather functional/business requirements, establish success criteria, and develop use cases in support of Observability Platform proofs of concept, pilots, and production implementations.
- Proactively design telemetry strategies to gain real-time insights and identify potential issues before they escalate.
- Apply knowledge and experience to the following:
  - Telemetry data collection, analysis, and implementation to derive meaningful insights from different sources including metrics, events, logs, and traces.
  - Distributed systems including those with microservices and hybrid infrastructure (cloud/on-premises) to effectively design telemetry pipelines, build monitoring systems, and implement observability practices.
- Identify patterns, detect anomalies, troubleshoot incidents, and build a holistic understanding of system/application/infrastructure behavior to optimize resource allocation, enhance user experience, and support compliance and security requirements.
- Collaborate across observability domains, including Infrastructure, Applications (APM), and Networking, and use cross-functional skills to close the gaps between them.
- Assist in developing backup and recovery methodologies for observability platforms.
- Work with the Team Lead and/or Observability Project Manager to prioritize efforts and meet deliverable timelines; participate in briefing program leadership and liaise with government customers and other stakeholders.
- Use approved systems for incident/change management (ServiceNow), work items (Jazz/Jira), documentation (Confluence), and other functions.
Qualifications:
- 5+ years of relevant experience.
- 3+ years of experience with observability platforms/tools.
- Subject matter expertise in telemetry.
- Familiarity with integration architecture methods (RESTful, RPC).
- Familiarity with Java or Python programming and/or code debugging/testing.
- Familiarity with software development lifecycle.
- Proficiency in scripting languages (Bash).
- Experience with underlying databases used by observability platforms.
- Experience utilizing container technologies like Kubernetes, Docker, or similar.
- Solid understanding of networking concepts, protocols, and troubleshooting techniques.
- Experience using logging and metrics tools (Prometheus, Grafana, Elastic/Kibana) for debugging.
- Proficiency in production cloud infrastructure (AWS, GCP, or Azure).