Senior Software Engineer
topgeek
Job Description
Key Responsibilities
Architecture & System Design
- Architect and deploy end-to-end AI systems, from data pipelines to model serving.
- Design modular SDKs for multi-provider AI integration (OpenAI, Claude, Gemini, LLaMA).
- Lead decision-making on cloud vs self-hosted LLM deployment (Ollama, vLLM, TGI).
- Guide infrastructure design for scalability, observability, and cost efficiency using GPU clusters, Ray, or KServe.
- Collaborate with backend, MLOps, and infra teams to ensure high availability and low latency across AI workloads.
Core ML / DL Development
- Train and fine-tune models (CNNs, RNNs, Transformers) across text, vision, and speech domains.
- Implement LoRA / PEFT fine-tuning for custom LLMs, embedding models, and instruction-tuned variants.
- Work with open-source and proprietary model repositories (Hugging Face Hub, Kaggle, Hugging Face Spaces).
- Optimize model architectures for inference performance and memory efficiency, including through quantization.
- Conduct A/B testing, cross-validation, and human evaluation on model outputs.
- Build internal evaluation benchmarks and dataset management pipelines for consistent model scoring and comparison.
Data & Dataset Engineering
- Curate, clean, and version-control datasets for text, image, and audio modalities.
- Build pipelines for data labelling, augmentation, and validation using Airflow / Prefect.
- Create and manage feature stores, embedding repositories, and dataset registries.
- Leverage open datasets (e.g., Common Crawl, LAION, OpenImages, LibriSpeech) and integrate custom enterprise datasets.
- Ensure data governance, bias checks, and PII anonymization using Presidio or custom filters.
AI Ops & Deployment
- Automate model workflows with MLflow, Kubeflow, or Vertex AI for experiment tracking and versioning.
- Lead model deployment with vLLM, TGI, or TorchServe, ensuring optimized GPU/TPU utilization.
- Set up continuous evaluation pipelines for model drift, bias, and quality decay using EvidentlyAI and Prometheus.
- Drive adoption of model registries and model cards for transparency and reproducibility.
Team & Technical Leadership
- Mentor and review the work of AI/ML Engineers I & II.
- Collaborate with product, design, and research teams to translate business needs into AI roadmaps.
- Lead POCs and experiments for emerging AI verticals (e.g., multimodal, video, robotics, IoT intelligence).
- Present internal demos, AI reports, and architectural documentation to leadership and clients.
Core Skills Required
- Programming: Expert-level Python, with a deep understanding of OOP, async, and design patterns.
- Frameworks: PyTorch, TensorFlow, Hugging Face Transformers, LangChain, LlamaIndex.
- Model Ops: MLflow, KServe, TorchServe, vLLM, TGI.
- Data Stack: Airflow / Prefect, pgvector, Milvus, Pinecone, FAISS, PostgreSQL.
- Infra: Docker, Kubernetes, Ray, GPU servers, Cloud AI (Vertex AI, Bedrock, Azure).
- Evaluation & Metrics: Familiarity with BLEU, ROUGE, and latency/throughput metrics for AI models.
- Security: Secrets management with secure vaults, Microsoft Presidio for PII handling, and awareness of Fairlearn / AIF360 for data and bias governance.
Good-to-Have Skills
- Experience with distributed training, quantization, and mixed-precision optimization.
- Experience with model compression, distillation, or low-rank adaptation for efficiency.
- Contribution to open-source AI frameworks or Hugging Face Spaces.
- Research exposure in LLM alignment, prompt optimization, or multimodal reasoning.
- Understanding of AI cost governance, observability, and MLOps automation.