SENIOR SCIENTIST - Machine Learning

happiestminds

Bangalore, 5 Years Exp, Posted 69d ago

Job Description

  • Develop and maintain statistical/ML modules (Difference-in-Differences (DID), Synthetic Control, A/B Testing, Multi-Treatment Effects) in Python
  • Build and extend FastAPI services and integrate them with our web application via SDK wrappers
  • Design and optimize large-scale data pipelines using PySpark, Delta Lake, and Azure Data Lake
  • Profile and resolve OOM issues in PySpark jobs: optimize memory allocation, partitioning, broadcast joins, caching strategies, and Spark configurations
  • Deploy and manage workloads on Databricks, including job clusters, notebooks, and Delta Lake tables
  • Containerize and deploy services using Docker, Kubernetes, and CI/CD pipelines
  • Ensure code quality and security via SonarCloud, Snyk, and pytest
  • Collaborate with data scientists and product teams to translate research into production-ready modules

Must-Have Skills

  • Python (3.9+): 3+ years of production experience
  • PySpark & Spark Internals: strong experience with the Spark memory model, executor tuning, shuffle optimization, and diagnosing/resolving OOM errors (broadcast thresholds, partition skew, spill-to-disk, GC tuning)
  • Databricks: hands-on experience with job orchestration, cluster configuration, notebook workflows, and Delta Lake optimization (Z-ordering, compaction, caching)
  • Causal Inference & Experimentation: DID, synthetic control, A/B testing, hypothesis testing, panel data methods
  • Statistics/ML Libraries: statsmodels, scikit-learn, scipy, pandas, numpy
  • API Development: building RESTful services with FastAPI (or similar)
  • Cloud (Azure): Azure Storage, Azure ML, Data Lake
  • Docker & Kubernetes: containerization and orchestration for ML workloads
  • Testing: writing robust unit/integration tests with pytest

Nice-to-Have

  • Experience with Celery/Redis for async task orchestration
  • Familiarity with Polars, PyArrow, or SQLAlchemy
  • Background in econometrics or experimental design
  • Spark UI profiling and performance benchmarking
  • CI/CD tooling (SonarCloud, Snyk, GitHub Actions)

What Sets You Apart

  • You can look at a Spark execution plan and pinpoint why a job is OOM-ing
  • You think in modules: clean separation of data processing, inference, and post-processing
  • You can go from a Jupyter notebook prototype to a production-grade, testable service
  • You're comfortable with both statistical rigor and software engineering best practices
