Senior Software Engineer - Platform Engineering & SRE
Equinix
Job Description
Responsibilities
Reliability and Performance
-
Ensure the high availability, reliability, and performance of production systems and services
-
Implement and maintain disaster recovery plans and procedures
-
Monitor and manage system health using metrics, logs, and tracing to proactively identify and resolve issues
Automation and Infrastructure
-
Automate repetitive tasks, including deployment, scaling, monitoring, and remediation of systems
-
Build and maintain infrastructure as code (IaC) using tools like Terraform, CloudFormation, or similar
Incident Management
-
Participate in incident response and troubleshooting efforts to minimize downtime and resolve issues quickly
-
Conduct root cause analysis for system failures and implement preventive measures to avoid future incidents
-
Respond to incidents, perform root cause analysis, and implement solutions to prevent recurrence
-
Maintain incident response playbooks and ensure efficient on-call rotations
Observability and Monitoring
-
Design and implement monitoring solutions using tools like Prometheus, Grafana, Datadog, or similar
-
Define and track SLIs, SLOs, and SLAs to measure and improve system performance
Collaboration
-
Work closely with development, QA, and operations teams to ensure smooth delivery of applications
-
Act as a bridge between software engineering and operations, advocating for DevOps best practices
-
Document system configurations, processes, and procedures to ensure knowledge sharing and maintain system integrity
Capacity and Scalability
-
Conduct capacity planning and optimize system scalability to meet future demands
-
Implement strategies for horizontal and vertical scaling of applications
Security and Compliance
-
Ensure infrastructure security by implementing best practices and addressing vulnerabilities
-
Collaborate with the security team to meet compliance standards and pass audits
Data Engineering & Automation
-
Design, develop, and maintain scalable and efficient data pipelines
-
Automate data workflows for ETL/ELT processes, integrating data from various sources into data warehouses and other storage solutions
-
Develop and maintain solutions for data transformation and data modelling, and automate the orchestration of data processing
Data Warehouse Management
-
Design, implement, and maintain modern data warehouse architectures, ensuring effective data storage, retrieval, and accessibility
-
Work with cloud-based data warehouses (e.g., BigQuery, Snowflake, Redshift) and optimize data models for analytics and reporting
-
Develop and manage dimensional models, star/snowflake schemas, and data marts for operational and analytical use cases
Real-time and Batch Data Processing
-
Build and manage real-time and batch data pipelines for high-volume data ingestion, processing, and analytics
-
Leverage technologies such as Apache Kafka, Apache Beam, Apache Spark, and Google Cloud Dataflow for streaming and batch processing
Qualifications
Experience
-
8+ years of experience in a Data Platform role, including Site Reliability Engineering, DevOps, or Systems Engineering
Technical Skills
-
Strong programming skills in languages such as Python, Java, or similar
-
Experience in developing data ingestion pipelines, data governance, data quality, and automation
-
Proficiency in cloud platforms such as Google Cloud (mandatory), AWS, or Azure
-
Experience in leveraging AI/ML models to enhance efficiency in da