Senior Data Engineer

alight

Chennai, India 5 Years Exp Posted 22d ago

Job Description

Core Responsibilities

  • Design, build, and maintain high‑volume ETL/ELT pipelines across Hadoop (HDFS, Hive, Spark, Kafka) and AWS (Glue, EMR, Lambda, Step Functions, Redshift).
  • Develop distributed data processing solutions using PySpark, Spark SQL, and scalable cloud serverless patterns.
  • Implement reusable data ingestion frameworks for batch (Sqoop, Hive, Spark) and streaming (Kafka, Kinesis).
  • Optimize data workflows using partitioning, bucketing, compression, and file formats (Parquet/ORC).
  • Understand hybrid data lake architectures using S3 + HDFS, ensuring governance consistency (Atlas, Ranger, Lake Formation).
  • Understand reporting requirements, perform data profiling, and create the corresponding design.
  • Create data flow diagrams and perform data modelling.
  • Orchestrate jobs using Airflow, Control‑M, Step Functions, or event-driven triggers.
  • Understand auto-scaling, capacity planning, and performance tuning on EMR and Spark clusters.
  • Ensure data is protected and compliant with regulatory standards.
  • Work closely with business stakeholders to enable high‑quality datasets.
  • Provide technical leadership in architecture decisions, code reviews, and best‑practice adoption, and give technical guidance to peers and junior team members.
  • Improve reliability, scalability, and performance through automation, autoscaling, and capacity planning.
  • Own deployment, incident response, and post-incident reviews for production environments, including troubleshooting Spark performance issues, job failures, and cluster bottlenecks.
  • Understand security best practices (IAM, KMS, security groups, WAF, parameter/secret management).
  • Optimize cost and usage of AWS resources and recommend architecture improvements.
  • Collaborate closely with developers, QA, and product teams to streamline release processes.

Requirements

Technical Skills

  • 5-8 years of strong experience with the Hadoop ecosystem (HDFS, Hive, Spark, YARN, Kafka).
  • Strong hands-on expertise in Scala, PySpark, Spark optimization techniques, HiveQL, and distributed computing.
  • Good working experience with SQL in Hive and Impala.
  • Good understanding of AWS data stack (S3, Glue, EMR, Lambda, Kinesis, Redshift, Step Functions).
  • Proficiency in at least one scripting/programming language: Python or shell scripting.
  • Strong experience with CI/CD, GitHub, and Git commands.
  • Expertise in ETL, data warehousing, and cloud concepts.
  • Good understanding of data modelling (star/snowflake), partitioning strategies, and schema evolution.
  • Expertise in data profiling and decision making.
  • Able to understand, design, and create data flow diagrams and perform data modelling (knowledge of Miro is an added advantage).
  • Able to understand the architecture and design end-to-end data flow.
  • Hands-on experience with Airflow, Control‑M, or other orchestrators.
  • Willingness to monitor and support BAU and year-end activities, if needed.
  • Well versed in cloud security and compliance.
  • Good understanding of AWS networking (VPC, subnets, routing, SGs, NACLs).
  • Familiarity with serverless patterns and containerization (Docker, ECS/EKS).
  • Experience with monitoring/logging tools and incident management practices.

Other Requirements

  • Strong logical, analytical, problem-solving, and communication skills.
  • Ability to communicate effectively and concisely with multiple stakeholders, and to coordinate and collaborate with cross-functional teams.
  • Ability to support both legacy Hadoop workloads and cloud-first architectures.
  • AWS certifications (Data Engineer, Solutions Architect, or Developer) are a plus.
  • Healthcare domain knowledge is good to have.
