Harish Kesava Rao
I build the data infrastructure that powers AI applications in production: lakehouse architecture, embedding pipelines, vector storage, and large-scale Spark data platforms. Over 12 years at Atlassian, Databricks, Amazon, Salesforce, and Indeed, I have shipped systems handling petabyte-scale data across AWS and Azure. My work spans data engineering and AI/ML enablement, from high-throughput ingestion to RAG frameworks, semantic search, and LLM data pipelines. I am an active open-source contributor to Apache Airflow, Delta Lake, and DataHub, and I write about data engineering and AI infrastructure on this site.
Research interests. I am drawn to open questions at the boundary between data engineering and machine learning: How do we design lakehouse storage formats and compaction strategies that remain efficient as embedding dimensions grow and vector indices must be refreshed at streaming latencies? What consistency and fault-tolerance guarantees does a distributed retrieval layer need to serve RAG pipelines reliably under production skew? How do scheduling and resource-allocation decisions in a Spark cluster change when the downstream consumer is an LLM endpoint? These questions connect systems research with applied ML infrastructure, and they motivate both my open-source work and the problems I choose to write about.
news
| Date | News |
|---|---|
| Oct 29, 2025 | |
| Mar 29, 2025 | [Talks] Guest lecture to undergraduate students and faculty of an engineering college’s Department of Artificial Intelligence and Data Science. Topic: Building a career in Data |
| Apr 29, 2024 | [Update] Joined Atlassian India as Principal Data Engineer & Data Architect. |
| Apr 30, 2023 | [Open Source] Created the Databricks Partition Sensor (for the Databricks Provider) for Apache Airflow. |
| Apr 2, 2023 | [Open Source] First major contribution to Apache Airflow: Databricks SQL Sensor for Airflow. |