Data Engineer
appliedcomputing
Job Description
-
Deep expertise in PostgreSQL (partitioning, indexing, query optimisation, storage design).
-
Strong proficiency in Python for data processing, scripting, and pipeline orchestration.
-
Hands-on experience with AWS (EKS, S3, EBS, IAM, KMS, CloudWatch, etc.) for secure and scalable data pipelines.
-
Proven ability to work with Databricks and PySpark for large-scale distributed data processing.
-
Familiarity with time-series industrial data (control systems, DCS/SCADA logs, process historians).
-
Experience in unstructured data sync and management within hybrid cloud/on-prem environments.
-
Bonus: Experience working as a data engineer in oil and gas or energy environments.
-
Bonus: Knowledge of streaming frameworks (Kafka, Flink, Spark Streaming) or MLOps stacks for data versioning and lineage.
Core Responsibilities
1. Ingest & Contextualise Data
-
Ingest from OPC UA servers, process historians, IoT sensors, LIMS systems, alarms/events, and P&IDs.
-
Map signals to their physical processes (tags, units, hierarchies) for interpretability in AI pipelines.
2. Data Movement & Accessibility
-
Build pipelines that handle real-time streaming and batch ingestion into the Lakehouse.
-
Manage synchronisation between historian archives, unstructured files, and AWS storage (S3/EBS).
-
Orchestrate Databricks Lakeflow/Connectors for integrating data into Lakebase/Lakehouse.
-
Handle secure, high-throughput transfers between historian archives and sandbox/live environments.
3. Change Tracking & Integrity
-
Detect and manage schema changes, signal drift, and inconsistencies across time.
-
Implement lineage and audit trails across Spark/Databricks and AWS pipelines.
4. Data Preparation for AI
-
Build and maintain dual pipelines:
-
Training → large-scale historical data preparation for time-series and LLM training.
-
Inference → low-latency, real-time pipelines for anomaly detection, optimisation, and LLM search.
-
Support heterogeneous AI workloads (time-series forecasting and retrieval-augmented LLMs).
5. Database Performance & Optimisation
-
Tune PostgreSQL and Spark for high-throughput time-series workloads (partitioning, indexing, query optimisation).
-
Optimise pipelines for both fast analytical queries and high-efficiency model training.
-
Deploy and manage data pipelines on AWS EKS (Kubernetes) with persistent EBS-backed storage.