Workshop Program

Session 1: Storage Layout, Compaction, and Pruning

Session Chair: Muthunagappan Muthuraman (Snowflake)

9:15 – 9:45

Invited Talk

Pedro Pedreira

Software Engineer at Meta

Scaling AI Training: Storage Layout Challenges

Abstract. Modern AI training pipelines demand efficient data loading and preparation, and the storage layout layer plays an important role in the performance of both the read and write paths. This talk explores challenges in designing and evolving storage formats optimized for training workloads, and discusses feature storage and processing, the implications of wide tables, and challenges with normalization. It also presents Nimble, Meta’s columnar file format, along with some of its main design decisions and features, and discusses future challenges and areas of exploration.

About the speaker. Pedro Pedreira is a Software Engineer at Meta, where he has spent over 13 years working on large-scale compute, storage, and query processing systems. Pedro leads Velox, a cross-organizational effort involving 20+ companies aimed at unifying execution engines using an open-source library, in addition to a variety of related efforts aimed at modernizing compute engines, more recently focused on AI training. Prior to Velox, Pedro led the creation of Cubrick and has contributed to a series of open-source data infrastructure projects, including Presto, Spark, Gluten, Nimble, and Arrow. His work focuses on building high-performance, composable execution libraries that serve as the foundation for analytical and AI workloads at scale.

9:45 – 10:00

Commutative Compaction

Chris Douglas (UC Berkeley), Joseph Hellerstein (UC Berkeley)

10:00 – 10:15

Amethyst: Adaptive Compaction for LSM Trees via Segment-Level Policy Selection

Suchitra Shankar (PES University), Nilin Rose (Jain University)

10:15 – 10:30

What No One Tells You About Page Pruning in Parquet: The Real Cost of Page Index Parsing

Faeze Faghih (Technical University of Darmstadt), Si Jun Kwon (Technical University of Darmstadt), Zsolt István (Technical University of Darmstadt)

Session 2: Lakehouse Architecture and Cross-Format Interoperability

Session Chair: Jignesh Patel (Carnegie Mellon University)

11:00 – 12:00

Keynote

Raghu Ramakrishnan

CTO for Azure Data and Technical Fellow at Microsoft

Evolving Data Lakes for an Agentic Future

Abstract. Data lakes have become the hub of modern integrated analytics platforms, supporting workloads that span interactive exploration to large-scale warehousing and real-time monitoring. This landscape is evolving yet again: alongside human-driven analytics, data lakes increasingly serve agents performing a wide spectrum of tasks. This shift is driving query rates much higher, with workloads characterized by highly concurrent, often overlapping requests and by fine-grained, short-lived queries. At the same time, expectations around performance and reliability have risen considerably. These trends are reshaping the architecture of data lakes. In this talk, we look at how data lakes are evolving to meet these new demands, and paint a forward-looking picture of where they are headed.

About the speaker. Raghu Ramakrishnan is CTO for Data and a Technical Fellow at Microsoft. From 1987 to 2006, he was a professor at the University of Wisconsin-Madison, where he wrote the widely used text “Database Management Systems”. In 1999, he founded QUIQ, a company powering crowd-sourced question-answering as a cloud service. He joined Yahoo! in 2006 as a Yahoo! Fellow, and served as Chief Scientist for the portal, cloud, and search divisions. Ramakrishnan has received several awards, including the ACM SIGMOD Edgar F. Codd Innovations Award, the ACM SIGKDD Innovations Award, the IEEE TCDE Elmasri Outstanding Database Education Award, the ACM SIGMOD Contributions Award, 10-year Test-of-Time Awards from the ACM SIGMOD, ACM SOCC, ICDT, and VLDB conferences, the IIT Madras Distinguished Alumnus Award, the NSF Presidential Young Investigator Award, and the Packard Fellowship in Science and Engineering. He is a Fellow of the ACM and IEEE and has served as Chair of ACM SIGMOD.

12:00 – 12:15

The Lance Lakehouse Format

Ayush Chaurasia (LanceDB), Jack Ye (LanceDB), Lu Qiu (LanceDB), Lei Xu (LanceDB), Weston Pace (LanceDB)

12:15 – 12:30

Polyglot: An LLM-Driven Semantic Control Plane for Cross-Format Type Interoperability Across Lakehouse Formats

Aastha Agrrawal (LinkedIn), Sumedh Sakdeo (LinkedIn), Afzal Afzal (LinkedIn), Ruolin Fan (LinkedIn), Kunal Narula (LinkedIn), Lenisha Gandhi (LinkedIn)

Session 3: Metadata, Consistency, and Maintenance in Production Lakehouses

Session Chair: Harshad Deshmukh (Google)

1:30 – 2:00

Invited Talk

Sumedh Sakdeo

Principal Staff Software Engineer at LinkedIn

Full-Stack Lakehouse Architecture at Exabyte Scale: Lessons and Research Opportunities from LinkedIn's OpenHouse-Managed Iceberg

Abstract. The lakehouse promise is simple: persist data once in an open format, query it from anywhere. Lakehouses are a real step forward — open formats, ACID transactions, multi-engine access. But persisting data once is solved. It breaks when that data gets served.

In open lakehouses, there is no narrow API enforcing correctness inline. The data plane is format libraries embedded in N engines; the catalog sees metadata but never the bytes. Every gap in this architecture compounds silently at scale.

We describe five research frontiers surfaced by operating OpenHouse — LinkedIn's open-source declarative control plane for Apache Iceberg — across 300,000+ tables and over one exabyte of managed data. (1) Interoperability: the same bits on disk can look different when queried from different engines — types, schemas, and semantics diverge silently across the stack. (2) The online/offline divide: the industry-wide cost of duplicating data between operational and analytical systems via CDC pipelines, and the path toward WAL-centric convergence. (3) Agent guardrails: as LLM-driven agents write to production data, data engineering needs the same branch-validate-merge safety model that software engineering solved decades ago. (4) Data layout optimization: compaction, sort order, partitioning, and write throughput are coupled objectives with no single global optimum, building on our AutoComp framework (SIGMOD '25). (5) Incremental processing: real-time freshness is well-served by Kafka stream processing, and completeness is well-served by Spark batch over verified data — but the middle ground of near-real-time freshness remains underserved, where streaming becomes prohibitively expensive and batch lacks the tooling to consume incremental deltas efficiently.

For each frontier, we present the problem at production scale and pose open research questions for the data community.

About the speaker. Sumedh Sakdeo is a Principal Staff Software Engineer at LinkedIn. He has 15+ years of experience working in data infrastructure. He previously contributed to a log-structured file system at Tintri and led data infrastructure, engineering, and visualization for Lyft's self-driving division. Notably, at LinkedIn, he spearheaded the development of OpenHouse, a pioneering control plane for open-source data lakehouse deployments.

2:00 – 2:15

Zero-Scan Data Quality: Leveraging Table Format Metadata for Continuous Observability at Scale

Mohit Verma (LinkedIn), Shantanu Rawat (LinkedIn), Christian Bush (LinkedIn), Sumedh Sakdeo (LinkedIn), Lokesh Amarnath Ravindranathan (LinkedIn), Dwarak Bakshi (LinkedIn)

2:15 – 2:30

How Consistent and Fresh are Lake Tables Really

Zinuo Li (Renmin University of China), Dongyang Geng (Renmin University of China), Haoyue Li (Renmin University of China), Hailong Yu (eDaijia Automobile Technology), Qi Lei (Renmin University of China), Haoqiong Bian (Renmin University of China)

2:30 – 2:45

Metadata-as-Data in Apache Hudi: A Multi-Modal Index Substrate for Lakehouse Tables

Sagar Sumit (Anyscale), Prashant Wason (Uber), Sivabalan Narayanan (Onehouse)

2:45 – 3:00

Table Format Optimizations in Managed Spark for Dataproc

Isha Tarte (Google), Jayadeep Jayaraman (Google), Abhishek Modi (Google), Vishal Karve (Google), Jingwei Lu (Google), Sourabh Badhya (Google), Rajarshi Sarkar (Google), Aditya Shah (Google), Haymant Mangla (Google), Zihan Cao (Google), Huadong Liu (Google), Warren Zhu (Google), Wei Yan (OpenAI)

Session 4: Emerging Directions and the Road Ahead

Session Chair: Ashvin Agrawal (Microsoft)

3:30 – 4:00

Invited Talk

Will Manning

Co-founder & CEO at Spiral

Turtles All the Way Down: Composability & File Formats

Abstract. The Composable Data Management System Manifesto outlined a framework for decomposing data systems into reusable parts. This talk will argue that the same framework should be extended downwards into file formats themselves, and explain how these principles have shaped the evolution of Vortex.

About the speaker. Will Manning is co-founder & CEO of Spiral, and the chair of the Vortex project (hosted by the Linux Foundation). Prior to Spiral, Will spent a decade at Palantir, where he was one of the creators of Palantir Foundry and built/scaled "everything that read or wrote bytes." In a past life, he also was a Reinforcement Learning researcher and a derivatives quant.

4:00 – 4:55

Panel Discussion

Does One Size Still Fit All? Data Formats in a Multi Engine, Agentic Stack

Moderator

Ashvin Agrawal

Principal RSDE at Microsoft Gray Systems Lab

Background

Ashvin has over two decades of experience in building large-scale distributed systems. He has contributed to leading products and open-source projects. He is a PMC member of Apache XTable and Polaris, and has previously served on the PMC of Apache Geode and Heron. Currently, he is a Research Engineer at Microsoft Gray Systems Lab (GSL), where he focuses on data storage, databases, streaming technologies, and data provenance. Ashvin has previously held senior positions at VMware, Yahoo, and AirTight Networks. He has authored numerous scholarly articles and holds an M.Tech. in Computer Science from IIT Kanpur.

Panelists

Jignesh Patel

Professor at Carnegie Mellon University

Background

Jignesh Patel is a professor in the Computer Science Department at Carnegie Mellon University. His research focuses on scalable data platforms and the use of AI to improve how people interact with those platforms. He is a Fellow of AAAS, ACM, and IEEE. He is also deeply interested in translating university research into real-world impact, having spun out four startups from his research group.

Madhan Gajendran

Partner Architect at Microsoft

Background

Since 2006, Madhan has worked on various projects at Microsoft spanning Windows, Hyper-V (virtualization platform), SQL Server, and Azure Cosmos DB. Recently, he has been innovating at the intersection of modern OLTP, search, and OLAP systems. His primary interest at work is engineering large-scale data systems and platforms that empower Microsoft partners and customers.

Nitin Agrawal

Software Engineer at Google

Background

Nitin Agrawal is a Software Engineer at Google working on analytical storage and database systems. His past work on distributed and storage systems has received multiple best-paper awards, a USENIX Test of Time award, an outstanding patent award, and coverage in popular media. He served as the program-committee chair for USENIX FAST ’18 and earned his doctorate in Computer Science from the University of Wisconsin-Madison.

Pedro Pedreira

Software Engineer at Meta

Background

Pedro Pedreira is a Software Engineer at Meta, where he has spent over 13 years working on large-scale compute, storage, and query processing systems. Pedro leads Velox, a cross-organizational effort involving 20+ companies aimed at unifying execution engines using an open-source library, in addition to a variety of related efforts aimed at modernizing compute engines, more recently focused on AI training. Prior to Velox, Pedro led the creation of Cubrick and has contributed to a series of open-source data infrastructure projects, including Presto, Spark, Gluten, Nimble, and Arrow. His work focuses on building high-performance, composable execution libraries that serve as the foundation for analytical and AI workloads at scale.

Will Manning

Co-founder & CEO at Spiral

Background

Will Manning is co-founder & CEO of Spiral, and the chair of the Vortex project (hosted by the Linux Foundation). Prior to Spiral, Will spent a decade at Palantir, where he was one of the creators of Palantir Foundry and built/scaled "everything that read or wrote bytes." In a past life, he also was a Reinforcement Learning researcher and a derivatives quant.