Advanced Data Engineer
Job Description
Data Pipelines & Ingestion
- Implement end-to-end ingestion pipelines from heterogeneous sources (e.g. Snowflake, SQL Server, Excel, REST APIs, and unstructured files) into Azure Databricks, following defined architecture patterns
- Build and maintain Bronze → Silver → Gold Medallion layers, applying transformation logic, business rules, and quality checks at each stage
- Implement incremental loading patterns (e.g. CDC, watermarking, Delta Lake MERGE/UPSERT) to ensure efficient, scalable, and reliable data delivery
- Develop pipelines for structured and unstructured data (e.g. documents, JSON, Parquet, Excel) that support downstream AI and ML consumption
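
As a rough illustration of the incremental loading pattern named above, the sketch below combines a watermark filter with an upsert. It is plain Python over in-memory dicts, not the Delta Lake MERGE API; the function name `incremental_upsert` and the column names `id` and `updated_at` are hypothetical and chosen only for the example.

```python
from datetime import datetime


def incremental_upsert(source_rows, target, watermark, key="id", ts="updated_at"):
    """Merge rows newer than `watermark` into `target` (a dict keyed by `key`).

    Returns the new watermark: the latest timestamp loaded, or the old
    watermark if no row was newer.
    """
    new_watermark = watermark
    for row in source_rows:
        if row[ts] <= watermark:
            continue  # already loaded by a previous run
        existing = target.get(row[key])
        # Upsert: insert a new key, or overwrite only with a newer version.
        if existing is None or row[ts] >= existing[ts]:
            target[row[key]] = row
        if row[ts] > new_watermark:
            new_watermark = row[ts]
    return new_watermark


# Hypothetical data: one existing record, one update, one new record.
target = {1: {"id": 1, "name": "a", "updated_at": datetime(2024, 1, 1)}}
source = [
    {"id": 1, "name": "a", "updated_at": datetime(2024, 1, 1)},   # skipped
    {"id": 1, "name": "a2", "updated_at": datetime(2024, 1, 3)},  # update
    {"id": 2, "name": "b", "updated_at": datetime(2024, 1, 2)},   # insert
]
watermark = incremental_upsert(source, target, datetime(2024, 1, 1))
```

Persisting the returned watermark between runs is what keeps each run from rescanning rows it has already loaded.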
Data Modeling & Semantic Layer
- Implement and extend data models (e.g. fact/dimension tables, domain data marts) following designs defined by the Senior DE and AI team
- Write clean, modular, reusable PySpark and SQL transformation logic that is testable, documented, and deployable via CI/CD
- Contribute to the semantic layer that powers Power BI dashboards and GCP-connected analytics consumers
- Maintain and improve existing models as business requirements evolve
Orchestration & DataOps
- Build and manage Databricks Workflows, including task dependencies, retry policies, and failure alerting
- Follow and contribute to CI/CD practices: version control, pull requests, automated testing, and deployment to Dev/QA/Prod environments using Azure DevOps or GitHub Actions
- Package and deploy reusable logic as Python libraries following team standards
- Monitor pipeline health, investigate failures, and resolve data issues within SLA
Data Governance & Quality
- Apply data quality rules (e.g. validation, deduplication, null checks, reconciliation) within pipelines to ensure data arrives fit for purpose
- Operate within the Unity Catalog governance framework, respecting the RBAC, namespace structure, and tagging standards defined by platform leads
- Ensure data delivered to GCP is schema-consistent, validated, and documented
- Flag and escalate data quality issues proactively, not reactively
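
To make the data quality rules above concrete, here is a minimal sketch of null checks, key-based deduplication, and a row-count reconciliation. It is plain Python rather than PySpark, and the names `apply_quality_rules` and `reconcile`, along with the sample columns, are hypothetical.

```python
def apply_quality_rules(rows, key, required):
    """Split rows into (clean, rejected) using null checks and deduplication.

    The key column is always treated as required. Each rejected entry is a
    (row, reason) pair so that failures can be reported, not silently dropped.
    """
    columns = [key] + [c for c in required if c != key]
    clean, rejected, seen = [], [], set()
    for row in rows:
        missing = [c for c in columns if row.get(c) is None]
        if missing:
            rejected.append((row, "null in " + ", ".join(missing)))
        elif row[key] in seen:
            rejected.append((row, "duplicate " + key))
        else:
            seen.add(row[key])
            clean.append(row)
    return clean, rejected


def reconcile(source_count, clean, rejected):
    """Every source row must be accounted for as either clean or rejected."""
    return source_count == len(clean) + len(rejected)


# Hypothetical batch with one duplicate and one null value.
rows = [
    {"id": 1, "amount": 10},
    {"id": 1, "amount": 10},
    {"id": 2, "amount": None},
    {"id": 3, "amount": 5},
]
clean, rejected = apply_quality_rules(rows, key="id", required=["amount"])
balanced = reconcile(len(rows), clean, rejected)
```

Keeping the rejection reason next to each row is one way to support the proactive escalation mentioned above, since the rejected set can be routed to an alert or a quarantine table.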
FinOps Awareness
- Write cost-conscious PySpark: avoid unnecessary full scans, optimize joins, and use appropriate cluster types
- Apply Delta table best practices (e.g. VACUUM, OPTIMIZE, compaction) to manage storage costs
- Follow cluster policies defined by platform leads and flag unusual resource consumption
Must Have
- Databricks: 2+ years of hands-on experience with PySpark, Delta Lake, Workflows, and Unity Catalog
- Demonstrated expertise in data strategy, e.g. Medallion Architecture, Domain Data Modeling, and Functional Data Architecture
- Data Quality Frameworks (e.g. rule-based validation, anomaly detection)
- Data Pipelines: incremental loading, CDC, CI/CD, observability
- Advanced Python/PySpark and advanced SQL
- Strongly preferred: Delta Live Tables (DLT), Unity Catalog (UC), GCP, Azure, Kafka
- Highly valued: Databricks Certified Professional certification