Database-Supported Haskell, DSH for short, is a Haskell library for database-supported program execution. Using the DSH library, a relational database management system (RDBMS) can be used as a coprocessor for the Haskell programming language, especially for those program fragments that carry out data-intensive and data-parallel computations. Rather than embedding a relational language into Haskell, DSH turns idiomatic Haskell programs into SQL queries.
We have used DSH for large-scale data analysis. Specifically, in collaboration with researchers working in the social and economic sciences, we used DSH to analyse the entire history of Wikipedia (terabytes of data) and a number of online forum discussions (gigabytes of data). Because of the scale of the data, it would be unthinkable to conduct the analysis in Haskell without the database-supported program execution technology featured in DSH. We have also formulated several of the DSH queries directly in SQL and found the DSH versions to be much more concise and easier to write and maintain (mostly due to DSH’s support for nesting, Haskell’s abstraction facilities and the monad comprehension notation, see below). One long-term goal is to allow researchers who are not necessarily expert programmers or database engineers to conduct large-scale data analysis themselves.
As of today, DSH relies on a query compilation strategy known as loop-lifting. Loop-lifting comes with important and desirable properties (e.g., the number of SQL queries issued for a given DSH program depends only on the static type of the program’s result). The strategy, however, relies on a rather complex and monolithic mapping of programs to the relational algebra. To remedy this, we are currently exploring a new strategy based on the flattening transformation as conceived by Guy Blelloch. Flattening was originally designed to implement the data-parallel declarative language NESL; we revisit it in the context of query compilation (which targets database kernels, one particular kind of data-parallel execution environment). Initial results are promising, and DSH might switch over in the not too distant future. We hope to further improve query quality and also to address the formal correctness of DSH’s program-to-queries mapping.
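The core idea behind flattening can be illustrated in a few lines of plain Haskell. The sketch below is not DSH's compiler and uses no DSH API; all names (`Nested`, `flatten`, `unflatten`, `mapInner`, `segSum`) are hypothetical and chosen for illustration only. It assumes the common encoding in which a nested list is stored as a vector of segment lengths plus one flat data vector, so that an operation over the inner lists becomes a single pass over flat data.

```haskell
-- Illustrative sketch only: a nested list encoded as segment lengths
-- plus one flat vector of all elements (hypothetical names throughout).
data Nested a = Nested
  { segLens  :: [Int]  -- length of each inner list, in order
  , flatData :: [a]    -- all elements, concatenated
  } deriving (Eq, Show)

-- Encode an ordinary nested list in the flat representation.
flatten :: [[a]] -> Nested a
flatten xss = Nested (map length xss) (concat xss)

-- Decode the flat representation back into a nested list.
unflatten :: Nested a -> [[a]]
unflatten (Nested lens xs) = go lens xs
  where
    go []       _  = []
    go (n : ns) ys = let (seg, rest) = splitAt n ys
                     in seg : go ns rest

-- "map (map f)" on the nested list becomes one flat map over the data;
-- the segment lengths are carried along unchanged.
mapInner :: (a -> b) -> Nested a -> Nested b
mapInner f (Nested lens xs) = Nested lens (map f xs)

-- "map sum" on the nested list becomes a segmented sum over the data.
segSum :: Num a => Nested a -> [a]
segSum (Nested lens xs) = go lens xs
  where
    go []       _  = []
    go (n : ns) ys = let (seg, rest) = splitAt n ys
                     in sum seg : go ns rest
```

For example, `flatten [[1,2],[3],[]]` yields segment lengths `[2,1,0]` and data `[1,2,3]`, and `unflatten (mapInner (*2) (flatten xss))` agrees with `map (map (*2)) xss`. Operations of this flat, bulk kind are the sort that a data-parallel back end such as a database kernel can evaluate over whole columns at once.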
Motivated by DSH, we reintroduced the monad comprehension notation into GHC and also extended it to cover parallel and SQL-like comprehensions. The extension is available since GHC 7.2.
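As a rough illustration of the notation, the following sketch uses the GHC extensions `MonadComprehensions`, `ParallelListComp` and `TransformListComp` on ordinary Haskell values. It does not use the DSH library; the function names and the sample data are made up for this example, and `the`, `groupWith` and `sortWith` come from `GHC.Exts`.

```haskell
{-# LANGUAGE MonadComprehensions, ParallelListComp, TransformListComp #-}

import Control.Monad.Zip ()                  -- MonadZip instances for parallel comprehensions
import GHC.Exts (groupWith, sortWith, the)

-- Monad comprehension: the same notation works for any monad, here Maybe.
addMaybe :: Maybe Int -> Maybe Int -> Maybe Int
addMaybe mx my = [ x + y | x <- mx, y <- my ]

-- Parallel comprehension: the two generators are zipped, not nested.
zipped :: [(Int, Char)]
zipped = [ (n, c) | n <- [1, 2, 3] | c <- "abc" ]

-- SQL-like comprehension: group rows by department, sum the salaries,
-- and order the result by the summed salary (ascending).
-- After "then group", dept and salary are bound to lists of values.
totals :: [(String, Int)] -> [(String, Int)]
totals rows =
  [ (the dept, sum salary)
  | (dept, salary) <- rows
  , then group by dept using groupWith
  , then sortWith by (sum salary)
  ]
```

Here `addMaybe (Just 1) (Just 2)` evaluates to `Just 3`, `zipped` to `[(1,'a'),(2,'b'),(3,'c')]`, and `totals [("eng",10),("ops",5),("eng",20),("ops",1)]` to `[("ops",6),("eng",30)]`. The grouping and ordering clauses correspond loosely to SQL's GROUP BY and ORDER BY, which is why comprehensions of this shape are a natural surface syntax for queries.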
Query Compilation Based on the Flattening Transformation (Dagstuhl Seminar 14511 - Programming Languages for Big Data, Dec 14th, 2014). [Slides]
Proceedings of the 34th ACM SIGMOD Int'l Conference on the Management of Data (SIGMOD 2015), Melbourne, Australia, June 2015.
Proceedings of the 1st International Workshop on Data Driven Functional Programming (DDFP 2013), Rome, Italy. ACM, January 2013.
Proceedings of the 15th International Symposium on Practical Aspects of Declarative Languages (PADL 2013), Rome, Italy. Springer, January 2013.
Proceedings of the ACM SIGPLAN Haskell Symposium (Haskell 2011), Tokyo, Japan. ACM, 2011.
Revised selected papers of the 22nd International Symposium on Implementation and Application of Functional Languages (IFL 2010), Alphen aan den Rijn, Netherlands, volume 6647 of Lecture Notes in Computer Science. Springer, 2011.