retrospect-lang

Retrospect

Retrospect is a system for scalable data analysis.

What?

“Scalable data analysis” just means computing an output from input data, in a way that keeps working as the input grows.

The alternatives

If you are doing data analysis today you are likely choosing one of three approaches:

  1. Writing a program in a language like Python (or R, Matlab, Julia, etc.). These tools are expressive and easy to use, but they do not scale; programs are limited to the resources of a single machine.

  2. Writing a SQL script. SQL is easy to use, and systems like Presto or BigQuery can run SQL queries across large datasets using many machines. But SQL is significantly less expressive than conventional programming languages, and as the required analysis gets more complex, SQL quickly becomes unusable.

  3. Using a scalable data processing system like Spark or Dask. These frameworks are designed to support distributed data analysis, and combine a conventional programming language with libraries that manage pools of machines and data. But they are significantly more difficult to use than either of the first two options, and are intended more for developing large-scale pipelines than for exploratory research.

So you get to choose two of “easy-to-use”, “expressive”, and “scalable”. If you pick one of the easy-to-use options but later decide that you need “expressive” or “scalable” after all, you’ll have to start over with a new and much less easy-to-use approach.

As you might have guessed by now, Retrospect’s goal is to provide all three.

How?

To deliver easy-to-use, expressive, and scalable data analysis we are building:

  1. A suitable language. The goal is to create something with the simplicity and expressiveness of Python, but with map-reduce built in. Map-reduce provides a way to express a large class of computations that can be efficiently distributed; building it into the language ensures that processing a collection in parallel is as easily expressed and efficient as processing it sequentially (“loops should be parallel by default”). Compiling scripts in this language to a simple abstract machine insulates the rest of the system from the details of language design.

  2. A fast abstract-machine implementation. Large-scale data analysis will often involve processing millions, billions, or even trillions of individual data items, so we need the inner loops to be fast. Fortunately, dynamic code generation lets us execute even a high-level abstract machine at speeds comparable to traditional performance-oriented languages.

  3. A framework for distributed computation. Fast execution on a single machine is necessary but not sufficient for the largest analyses. The language and the abstract machine are designed to enable transparently distributing any Retrospect program across multiple machines.
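
To make the map-reduce idea in item 1 concrete, here is a minimal sketch in Python (not Retrospect syntax, which this README does not show). It computes the same result two ways: a sequential fold over the whole collection, and a chunked version where each chunk is reduced independently and the partial results are combined. The function names (`map_reduce_sequential`, `map_reduce_parallel`) and the `workers` parameter are hypothetical, chosen only for illustration. Threads are used here just to show the structure; this is not a claim about how Retrospect distributes work.

```python
from concurrent.futures import ThreadPoolExecutor
from functools import reduce


def map_reduce_sequential(items, mapper, combine, initial):
    """Apply mapper to each item and fold the results left to right."""
    return reduce(combine, (mapper(x) for x in items), initial)


def map_reduce_parallel(items, mapper, combine, initial, workers=4):
    """Reduce each chunk independently, then combine the partial results.

    This matches the sequential version only when `combine` is associative
    and `initial` is its identity element (e.g. addition with 0).
    """
    items = list(items)
    if not items:
        return initial
    size = -(-len(items) // workers)  # ceiling division
    chunks = [items[i:i + size] for i in range(0, len(items), size)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = list(pool.map(
            lambda chunk: map_reduce_sequential(chunk, mapper, combine, initial),
            chunks,
        ))
    return reduce(combine, partials, initial)


def square(x):
    return x * x


def add(a, b):
    return a + b


# Sum of squares of 1..1000, computed both ways.
data = range(1, 1001)
sequential_total = map_reduce_sequential(data, square, add, 0)
parallel_total = map_reduce_parallel(data, square, add, 0)
```

The two versions agree because addition is associative, which is the property that lets a map-reduce computation be split across workers (or machines) without changing its result. In Python the parallel form needs extra plumbing; the stated goal of building map-reduce into the language is that the parallel form should be no harder to write than the sequential one.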

Does it work?

So far we have a candidate language and an abstract machine implementation that can run a wide variety of complex programs with very good performance. We have plans for distributed computation, but nothing is implemented yet.

Who is building this?

Retrospect began as an exploratory project within the Google Earth Engine team, but is now independent of Google. If you would like to help, please take a look at our repo and email retrospectlang@gmail.com.