Join Processing: When Big Data was Just Very Large

Big Data Series

Christopher Re
Professor, Stanford University
Given on: Jan. 16th, 2014


Most data processing systems are based on the relational model. An operation called the relational join is at the heart of relational data processing systems. For the last four decades, join processing has been the subject of intense academic and commercial interest. Recently, with Ngo, Porat, and Rudra, I developed the first join algorithm with worst-case optimal running time. Our algorithm is simple, but its analysis exploits a connection between join processing, constraint satisfaction, and discrete geometry, including generalizations of the Loomis-Whitney inequality. The first part of the talk will give an overview of relational database research, its associated industry, and join processing.
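
To make the flavor of these algorithms concrete, here is a minimal sketch (not the algorithm from the paper, just an illustration in its spirit) of a variable-at-a-time triangle join: find all (a, b, c) with (a, b) in R, (b, c) in S, and (a, c) in T. Binding one variable at a time and intersecting candidate sets, rather than joining a pair of relations at a time, is the key idea behind worst-case optimal join algorithms; all names here are hypothetical.

```python
from collections import defaultdict

def triangle_join(R, S, T):
    """Enumerate triangles (a, b, c) with (a,b) in R, (b,c) in S, (a,c) in T.

    Illustrative variable-at-a-time join: bind a, then b, then c,
    intersecting the candidate sets for each variable at each step.
    """
    # Index each relation by its first attribute.
    R_a = defaultdict(set)   # a -> {b : (a,b) in R}
    for a, b in R:
        R_a[a].add(b)
    S_b = defaultdict(set)   # b -> {c : (b,c) in S}
    for b, c in S:
        S_b[b].add(c)
    T_a = defaultdict(set)   # a -> {c : (a,c) in T}
    for a, c in T:
        T_a[a].add(c)

    out = []
    for a in R_a.keys() & T_a.keys():      # candidates for a
        for b in R_a[a] & S_b.keys():      # candidates for b, given a
            for c in S_b[b] & T_a[a]:      # candidates for c, given a and b
                out.append((a, b, c))
    return out
```

A pairwise plan (join R with S first, then filter by T) can materialize far more intermediate tuples than the final output contains; intersecting per-variable candidate sets avoids that blowup, which is roughly why such algorithms can match the worst-case output bound.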

While theoretically interesting (and the de facto complexity measure in computer science), worst-case analysis may not accurately reflect practice in all cases. Inspired by conditioning from numerical optimization and the “beyond worst-case analysis” push in the theoretical computer science community, we describe a notion called certificate complexity that allows us to measure the runtime of join algorithms at a much finer level than worst-case analysis does. I will describe an algorithm called Minesweeper that provides stronger optimality guarantees based on certificate complexity. Assuming the 3SUM conjecture, Minesweeper achieves our strongest notion of optimality for the largest possible class of join queries (the beta-acyclic queries).
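
One way to get intuition for certificate-style guarantees (this is an illustrative analogy, not the Minesweeper algorithm itself): intersect two sorted lists with galloping (exponential) search, so that long non-matching runs are skipped in logarithmic time. The work done is then governed by the number of boundaries between matching and non-matching regions, a small "certificate" of the output, rather than by the total input size. The function names below are hypothetical.

```python
from bisect import bisect_left

def gallop(arr, target, lo):
    """Exponential search: first index >= lo with arr[index] >= target."""
    step = 1
    hi = lo + 1
    while hi < len(arr) and arr[hi] < target:
        hi += step
        step *= 2
    return bisect_left(arr, target, lo, min(hi + 1, len(arr)))

def galloping_intersect(xs, ys):
    """Intersect two sorted lists, skipping large gaps via galloping search.

    Runtime is driven by the number of alternations between matching and
    non-matching runs, not the total length of the inputs.
    """
    out = []
    i = j = 0
    while i < len(xs) and j < len(ys):
        if xs[i] == ys[j]:
            out.append(xs[i])
            i += 1
            j += 1
        elif xs[i] < ys[j]:
            i = gallop(xs, ys[j], i)   # skip the run of xs below ys[j]
        else:
            j = gallop(ys, xs[i], j)   # skip the run of ys below xs[i]
    return out
```

On inputs like [1..1000000] and [1000000..2000000] this does logarithmically many comparisons, while a plain merge would do a million, which is the kind of gap certificate complexity is designed to capture.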

Due to my upbringing, these algorithms have been (or are being) implemented with help from industry. I may describe some preliminary experimental results.

One of the top conferences in my research area is called “VLDB” for Very Large Databases. VLDB 2014 will be the fortieth annual meeting.