… This must be the best blog post title around …
Based on the Locustree concept by Jan Aerts, I have been working on a tree representation for storing zoom and pan/shift information. This post aims to describe the context, a possible implementation in Scala and its use in visualisation.
Let’s assume for a while that we have a large amount of data that can be mapped onto one dimension. One possible use-case is genome data, but all sorts of problems can be rephrased like this. The idea is to represent the data on a screen for visual analysis. In other words, one dimension defines the scale and on top of that we have information: gene expressions, transcription factor binding, amino acids, …
Please note that a multi-dimensional version of the locustree is in the making, stay tuned.
The concept of a locustree stems from the fact that it does not make sense for us to represent the full (remember 1D) dataset on a screen, it is simply too small for that. And even if we would draw all the points, we would not be able to notice any remarkable aspects of our data. In other words, it makes sense for us to look at the data at different zoom levels. At the highest zoom level we see individual data points, the lowest zoom level gives an aggregate idea of the full dataset and different zoom levels can exist in-between. A tree representation with a storage backend as a binary file representation has been created by Jan Aerts (see here for the source) where every level in the tree represents a zoom level. At all but the largest zoom level, we are interested in aggregate information about the underlying data. In his version, Jan did not have the computational power to render the genome data from the raw data on-the-fly. This meant he had to pre-process the data for the different resolutions and store this intermediate data on disk. We are investigating whether it is possible to avoid this intermediate step and immediately start from the raw data.
Another requirement we have put forward is to employ a functional approach to programming whenever possible. The primary reason being that functional programming (see here for a fun way to learn about FP) leads to immutable data structures which in turn leads to easier distribution and clustering of the algorithms and data. And less headaches while developing…
But wait… if data structures are immutable, how can my program do something useful? Take a look at the Spark examples (they are actually Scala examples) in a previous post. Did you notice that the variables (or actual values) do not change? Since these are only pointers to the data anyway, no computational or memory overhead is generated.
It turns out that creating a (locus)tree in a functional way is not all that hard. But in order for the tree to have some awareness of where the active node is, is a different story especially when we want to avoid copying memory blocks all the time. The name that is usually used for a functional datastructure that is aware of location is a zipper. We’re almost there… one more concept needs to be introduced…
One concept we still need to introduce is lazy evaluation. For us humans, it may as much as: calculate or evaluate only when necessary. The most obvious example is when generating an infinite sequence of numbers or events. In a lazy sense, this is possible. Only when traversing the sequence will there be an evaluation of the entries.
We need something similar for our tree. Only when we access a certain node in the tree do we want to calculate or generate the data for that node.
Building a simple binary zipper
Pasting the following code in a Scala worksheet does the trick:
This implementation is not perfect, there could be some sealing and other things, but it gets the message across.
The first part takes care of constructing the tree. The location is a way to denote a spot in the tree by means of a path to the spot from the root node (in reverse order). Basically, it can be seen as a trace of the path to the current (active) node.
The zipper is then simply the combination of the current node/tree and the path to it.
Initialisation is can easily be done:
The history for the root node (and the full tree) is the zero list. Starting from
root and selecting the left child is easy:
Can you appreciate the coolness of this? In principle, we should write
zipper.left(), but Scala allows us to write in human readable text…
This is not all, however. The result, given the class definitions above results in an encapsulated result of type
Option[Zipper]. We could have omitted the
Option collection, but we would have to carefully handle exceptions like taking the parent of the root node or taking the child of an empty node.
Option allows us to the deal with these exceptions in a graceful way.
In order to extract the value from an Option, we can simply use the
Conclusion for now
To wrap up: we have introduced the different concepts that are necessary to understand the concept of a lazy functional tree zipper. Additionally, we have presented a very simple (binary) implementation in Scala.