Data Engineer

appliedcomputing

Bengaluru, India

Deep expertise in PostgreSQL (partitioning, indexing, query optimisation, storage design).
Strong proficiency in Python for data processing, scripting, and pipeline orchestration.
Hands-on experience with AWS (EKS, S3, EBS, IAM, KMS, CloudWatch, etc.) for secure and scalable data pipelines.
Proven ability to work with Databricks and PySpark for large-scale distributed data processing.
Familiarity with time-series industrial data (control systems, DCS/SCADA logs, process historians).
Experience in unstructured data sync and management within hybrid cloud/on-prem environments.
Bonus: Experience working as a data engineer in oil and gas or energy environments.
Bonus: Knowledge of streaming frameworks (Kafka, Flink, Spark Streaming) or MLOps stacks for data versioning and lineage.

Core Responsibilities

1. Ingest & Contextualise Data

Ingest data from OPC UA servers, process historians, IoT sensors, LIMS, alarms/events, and P&IDs.
Map signals to their physical processes (tags, units, hierarchies) for interpretability in AI pipelines.

2. Data Movement & Accessibility

Build pipelines that handle real-time streaming and batch ingestion into the Lakehouse.
Manage synchronisation between historian archives, unstructured files, and AWS storage (S3/EBS).
Orchestrate Databricks Lakeflow/Connectors for integrating data into Lakebase/Lakehouse.
Handle secure, high-throughput transfers between historian archives and sandbox/live environments.

3. Change Tracking & Integrity

4. Data Preparation for AI

Build and maintain dual pipelines:
- Training → large-scale historical data prep for time-series + LLM training.
- Inference → low-latency, real-time pipelines for anomaly detection, optimisation, and LLM search.
Support heterogeneous AI workloads (time-series forecasting and retrieval-augmented LLMs).

5. Database Performance & Optimisation

Tune PostgreSQL and Spark for high-throughput time-series workloads (partitioning, indexing, query optimisation).
Optimise pipelines for both fast analytical queries and high-efficiency model training.
Deploy and manage data pipelines on AWS EKS (Kubernetes) with persistent EBS-backed storage.