Advanced Data Engineer

oraclecloud

Bengaluru, India 2 Years Exp Posted 7d ago

Job Description

Data Pipelines & Ingestion

  • Implement end-to-end ingestion pipelines from heterogeneous sources (e.g., Snowflake, SQL Server, Excel, REST APIs, and unstructured files) into Azure Databricks, following defined architecture patterns
  • Build and maintain Bronze → Silver → Gold Medallion layers, applying transformation logic, business rules, and quality checks at each stage
  • Implement incremental loading patterns (e.g., CDC, watermarking, Delta Lake MERGE/UPSERT) to ensure efficient, scalable, and reliable data delivery
  • Develop pipelines for structured and unstructured data (e.g., documents, JSON, Parquet, Excel) that support downstream AI and ML consumption

Data Modeling & Semantic Layer

  • Implement and extend data models (e.g., fact/dimension tables, domain data marts) following designs defined by the Senior DE and AI team
  • Write clean, modular, reusable PySpark and SQL transformation logic that is testable, documented, and deployable via CI/CD
  • Contribute to the semantic layer that powers Power BI dashboards and GCP-connected analytics consumers
  • Maintain and improve existing models as business requirements evolve

Orchestration & Data Ops

  • Build and manage Databricks Workflows, including task dependencies, retry policies, and failure alerting
  • Follow and contribute to CI/CD practices: version control, pull requests, automated testing, and deployment to Dev/QA/Prod environments using Azure DevOps or GitHub Actions
  • Package and deploy reusable logic as Python libraries following team standards
  • Monitor pipeline health, investigate failures, and resolve data issues within SLA

Data Governance & Quality

  • Apply data quality rules (e.g., validation, deduplication, null checks, reconciliation) within pipelines to ensure data arrives fit for purpose
  • Operate within the Unity Catalog governance framework, respecting the RBAC, namespace structure, and tagging standards defined by platform leads
  • Ensure data delivered to GCP is schema-consistent, validated, and documented
  • Flag and escalate data quality issues proactively, not reactively

FinOps Awareness

  • Write cost-conscious PySpark: avoid unnecessary full scans, optimize joins, and use appropriate cluster types
  • Apply Delta table best practices (e.g., VACUUM, OPTIMIZE, compaction) to manage storage costs
  • Follow cluster policies defined by platform leads and flag unusual resource consumption

 

Must Have

  • Databricks: 2+ years hands-on with PySpark, Delta Lake, Workflows, and Unity Catalog
  • Demonstrated expertise in data strategy, for example Medallion Architecture, Domain Data Modeling, and Functional Data Architecture
  • Data Quality Frameworks (e.g., rule-based validation, anomaly detection)
  • Data Pipelines: incremental loading, CDC, CI/CD, Observability
  • Advanced Python/PySpark and Advanced SQL
  • Strongly preferred: Delta Live Tables (DLT), Unity Catalog (UC), GCP, Azure, Kafka
    • Databricks Certified Professional certification is highly valued
