Lead Data Engineer
irissoftware
Job Description
• Develop and maintain robust, scalable ETL/ELT pipelines using PySpark on AWS EMR
• Build data ingestion and transformation workflows from diverse sources (S3, EMR, RDS, Kafka, APIs) into AWS-based data lakes and warehouses
• Write clean, modular, testable Python code following best practices and coding standards
• Implement comprehensive unit tests using pytest/unittest with mocking, fixtures, and high code coverage
• Design and build production-grade Airflow DAGs for workflow orchestration, scheduling, and monitoring
• Optimize Spark jobs for performance, memory efficiency, and cost reduction
• Implement CI/CD pipelines for automated testing and deployment using Jenkins, GitHub Actions, or AWS CodePipeline
• Troubleshoot and debug complex data pipeline issues in production environments
• Collaborate with Data Scientists, Analysts, and Platform Engineers to deliver data solutions
• Ensure data quality, security, and compliance standards are met
Required Skills & Qualifications
• 9+ years of hands-on data engineering experience (no management responsibilities required)
• Bachelor's degree in Computer Science, Engineering, or equivalent practical experience
• Expert-level Python programming – OOP, design patterns, clean code practices
• Advanced PySpark/Spark skills – partitioning strategies, shuffle optimization, memory tuning, broadcast joins
• Strong unit testing expertise using pytest/unittest – mocking, parametrized tests, fixtures, TDD mindset
• Hands-on Airflow experience – DAG design, custom operators, sensors, XComs, debugging failed tasks
• Deep AWS experience – S3, EMR, Glue, Redshift, Lambda, Step Functions, IAM, CloudWatch
• Solid understanding of data lake and warehouse architectures (medallion architecture, Delta Lake)
• Strong SQL skills – complex queries, window functions, query optimization
• Proficiency with Git, code reviews, and collaborative development workflows
• Experience with CI/CD pipelines and automated testing frameworks
Nice to Have (Preferred)
• Familiarity with Docker for containerized data workloads
• Exposure to streaming data (Kafka, Spark Streaming)
• Knowledge of data quality frameworks
• Background in financial services or regulated industries
• Understanding of data security and privacy practices (e.g., GDPR)