Senior Data Engineer
alight
Job Description
Core Responsibilities
- Design, build, and maintain high‑volume ETL/ELT pipelines across Hadoop (HDFS, Hive, Spark, Kafka) and AWS (Glue, EMR, Lambda, Step Functions, Redshift).
- Develop distributed data processing solutions using PySpark, Spark SQL, and scalable cloud serverless patterns.
- Implement reusable data ingestion frameworks for batch (Sqoop, Hive, Spark) and streaming (Kafka, Kinesis).
- Optimize data workflows using partitioning, bucketing, compression, and columnar file formats (Parquet/ORC).
- Understand hybrid data lake architectures using S3 and HDFS, ensuring consistent governance (Atlas, Ranger, Lake Formation).
- Understand reporting requirements, perform data profiling, and create designs to meet them.
- Create data flow diagrams and perform data modelling.
- Orchestrate jobs using Airflow, Control‑M, Step Functions, or event-driven triggers.
- Understand auto-scaling, capacity planning, and performance tuning on EMR and Spark clusters.
- Ensure data is protected and compliant with regulatory standards.
- Work closely with business stakeholders to enable high‑quality datasets.
- Provide technical leadership in architecture decisions, code reviews, and best‑practice adoption, and give technical guidance to peers and junior team members.
- Improve reliability, scalability, and performance through automation, autoscaling, and capacity planning.
- Own deployment, incident response, and post-incident reviews for production environments, including troubleshooting Spark performance issues, job failures, and cluster bottlenecks.
- Understand and apply security best practices (IAM, KMS, security groups, WAF, parameter/secret management).
- Optimize cost and usage of AWS resources and recommend architecture improvements.
- Collaborate closely with developers, QA, and product teams to streamline release processes.
Requirements
Technical Skills
- 5-8 years of strong experience with the Hadoop ecosystem (HDFS, Hive, Spark, YARN, Kafka).
- Strong hands-on expertise in Scala, PySpark, Spark optimization techniques, HiveQL, and distributed computing.
- Good working experience with SQL in Hive and Impala.
- Good understanding of AWS data stack (S3, Glue, EMR, Lambda, Kinesis, Redshift, Step Functions).
- Proficiency in at least one scripting/programming language (Python or shell scripting).
- Strong experience with CI/CD, GitHub, and Git commands.
- Expertise in ETL, data warehousing, and cloud concepts.
- Good understanding of data modelling (star/snowflake), partitioning strategies, and schema evolution.
- Expertise in data profiling and decision making.
- Able to understand, design, and create data flow diagrams and perform data modelling (knowledge of Miro is an added advantage).
- Able to understand the architecture and design end-to-end data flow.
- Hands-on experience with Airflow, Control‑M, or other orchestrators.
- Able to monitor and support BAU and year-end activities when needed.
- Well versed in cloud security and compliance.
- Good understanding of AWS networking (VPC, subnets, routing, SGs, NACLs).
- Familiarity with serverless patterns and containerization (Docker, ECS/EKS).
- Experience with monitoring/logging tools and incident management practices.
Other Requirements
- Strong logical, analytical, problem-solving, and communication skills.
- Able to communicate effectively and concisely with multiple stakeholders and to collaborate with cross-functional teams.
- Ability to support both legacy Hadoop workloads and cloud-first architectures.
- AWS certifications (Data Engineer, Solutions Architect, or Developer) are a plus.
- Healthcare domain knowledge is good to have.