Data Engineer

appliedcomputing

Bengaluru, India | NM Years Exp | Posted 9h ago

Job Description

  • Deep expertise in PostgreSQL (partitioning, indexing, query optimisation, storage design).

  • Strong proficiency in Python for data processing, scripting, and pipeline orchestration.

  • Hands-on experience with AWS (EKS, S3, EBS, IAM, KMS, CloudWatch, etc.) for secure and scalable data pipelines.

  • Proven ability to work with Databricks and PySpark for large-scale distributed data processing.

  • Familiarity with time-series industrial data (control systems, DCS/SCADA logs, process historians).

  • Experience in unstructured data sync and management within hybrid cloud/on-prem environments.

  • Bonus: Experience working as a data engineer in oil and gas or energy environments.

  • Bonus: Knowledge of streaming frameworks (Kafka, Flink, Spark Streaming) or MLOps stacks for data versioning and lineage.

Core Responsibilities

1. Ingest & Contextualise Data

  • Ingest from OPC UA servers, process historians, IoT sensors, LIMS systems, alarms/events, and P&IDs.

  • Map signals to their physical processes (tags, units, hierarchies) for interpretability in AI pipelines.

2. Data Movement & Accessibility

  • Build pipelines that handle real-time streaming and batch ingestion into the Lakehouse.

  • Manage synchronisation between historian archives, unstructured files, and AWS storage (S3/EBS).

  • Orchestrate Databricks Lakeflow/Connectors for integrating data into Lakebase/Lakehouse.

  • Handle secure, high-throughput transfers between historian archives and sandbox/live environments.

3. Change Tracking & Integrity

  • Detect and manage schema changes, signal drift, and inconsistencies across time.

  • Implement lineage and audit trails across Spark/Databricks and AWS pipelines.

4. Data Preparation for AI

  • Build and maintain dual pipelines:

    • Training → large-scale historical data prep for time-series + LLM training.

    • Inference → low-latency, real-time pipelines for anomaly detection, optimisation, and LLM search.

  • Support heterogeneous AI workloads (time-series forecasting and retrieval-augmented LLMs).

5. Database Performance & Optimisation

  • Tune PostgreSQL and Spark for high-throughput time-series workloads (partitioning, indexing, query optimisation).

  • Optimise pipelines for both fast analytical queries and high-efficiency model training.

  • Deploy and manage data pipelines in AWS EKS (Kubernetes) with persistent EBS-backed storage.
