Staff Machine Learning Engineer
Job Description
Why this Role Matters
- Accelerate the rollout of LLM-powered and agent-driven features across Tekion products.
- Enable agentic workflows that automate, reason, and interact on behalf of users and internal stakeholders.
- Operationalize secure, compliant, and explainable LLM and agentic services at scale.
- Convert Applied Sciences models into scalable, compliant, cost‑efficient production services.
- Standardize how models are trained, validated, deployed, and monitored across Tekion products.
- Power real-time, context-aware experiences by integrating batch/stream features, graph context, and online inference.
What You’ll Do
- Turn Applied Sciences prototype models (tabular, NLP/LLM, recommendation, forecasting) into fast, reliable services with well-defined API contracts.
- Integrate services with the LLM Gateway/MCP and with prompt/config versioning.
- Build and orchestrate CI/CD pipelines.
- Review data science models; refactor and optimize code; containerize; deploy; version; and monitor for quality.
- Collaborate with data scientists, data engineers, product managers, and architects to design enterprise systems.
- Monitor, detect, and mitigate risks unique to LLMs and agentic systems.
- Implement prompt management: versioning, A/B testing, guardrails, and dynamic orchestration based on feedback and metrics.
- Design batch/stream pipelines (Airflow/Kubeflow, Spark/Flink, Kafka) and online features linked to our domain graph.
- Build inference microservices (REST/gRPC) with schema versioning, structured outputs, and stringent p95 latency targets.
- Manage the model/feature lifecycle: feature store strategy, model/agent registry, versioning, and lineage.
- Instrument deep observability: traces/logs/metrics, data/feature drift, model performance, safety signals, and cost tracking.
- Ensure real-time reliability: autoscaling, caching, circuit breakers, retries/fallbacks, and graceful degradation.
- Develop templates/SDKs/CLIs, sandbox datasets, and documentation that make shipping ML the default path.
Desired Skills and Experience
- 8-10 years in ML engineering/MLOps or backend/platform engineering with production ML.
- Experience with LLMs, retrieval systems, vector stores, and graph/knowledge stores.
- Strong software engineering fundamentals: Python plus one of Java/Go/Scala; API design; concurrency; testing.
- Hands-on with orchestration frameworks and libraries (LangChain, LlamaIndex, OpenAI Function Calling, AgentKit, etc.).
- Knowledge of agent architectures (reactive, planning, retrieval-augmented) and safe execution patterns.
- Pipelines and data: Airflow/Kubeflow or similar; Spark/Flink; Kafka/Kinesis; strong data quality practices.
- Microservices and runtime: Docker/Kubernetes, service meshes, REST/gRPC; performance and reliability engineering.
- Model ops: experiment tracking, registries (e.g., MLflow), feature stores, A/B and shadow testing, drift detection.
- Observability: OpenTelemetry/Prometheus/Grafana; debugging latency, tail behavior, and memory/CPU hotspots.
- Cloud: AWS preferred (IAM, ECS/EKS, S3, RDS/DynamoDB, Step Functions/Lambda), with cost optimization experience.
- Security/compliance: secrets management, RBAC/ABAC, PII handling, auditability.
Preferred Mindset
- Product-oriented: You measure success by dealer and consumer outcomes, not just technical metrics.
- Reliability- and safety-first: You move fast with guardrails, rollbacks, and clear SLOs.
- Systems thinker: You design for multi-tenant scale, portability, and cost efficiency.
- Collaborative: You translate between Applied Sciences, Product, and the Data & AI Platform; you document and teach.
- Pragmatic: You automate the 80% and leave room for rapid experimentation.