Spark as a service
As mentioned earlier on this blog, we are using Spark as a backend for our data processing. In our earlier work, we tied Processing and Spark together through an intermediate layer consisting of a lazy tree zipper data structure that represents the different zoom levels in our data.
This approach has one big disadvantage: all code is tied together in a single set of classes and objects. Although largely functional in nature (except for the visualisation part using Processing), it is not a very scalable solution.
Another approach would be to present Spark as a service and connect to it via REST calls. Luckily, the guys from Ooyala have made available an extension to Spark that allows for just that. In this post, we experiment with the service. In a later post, we will create a visualisation based on it.
Installation
The installation instructions can be found here, although we only need a subset of them. First, download the code from GitHub:
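Something along these lines should do (the repository URL may have changed since this was written):

```
git clone https://github.com/ooyala/spark-jobserver.git
cd spark-jobserver
```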
Then, compile:
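A sketch of the build-and-run step, assuming the standard spark-jobserver sbt setup (exact task names may differ for your version):

```
sbt
> compile
> re-start
```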
This launches the jobserver, listening on port 8090 by default. Please note that the first command starts `sbt` (the Scala Build Tool); the other two are `sbt` tasks.
Our first genome service
In order to test it, we can use the built-in test or create one of our own.
We changed the existing word count test file into:
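A minimal sketch of what such a job could look like, assuming a tab-separated input file with chromosome and transcription factor columns (the file path and column layout below are placeholders, not the actual data file):

```scala
package spark.jobserver

import com.typesafe.config.Config
import org.apache.spark._
import org.apache.spark.SparkContext._

object TransFactors extends SparkJob {

  // No real error handling (yet): every request is considered valid.
  override def validate(sc: SparkContext, config: Config): SparkJobValidation = SparkJobValid

  override def runJob(sc: SparkContext, config: Config): Any = {
    // Hypothetical input: tab-separated lines with a chromosome and a
    // transcription factor column.
    sc.textFile("/path/to/transcription_factors.tsv")
      .map(_.split("\t"))
      .map(fields => ((fields(0), fields(1)), 1))   // ((chromosome, factor), 1)
      .reduceByKey(_ + _)                           // count occurrences per pair
      .map { case ((chromosome, factor), count) => (chromosome, factor, count) }
      .collect()
      .toList                                       // List of (Chromosome, TranscriptionFactor, frequency)
  }
}
```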
We use the same example file as before. Please note that error handling is avoided by always returning `SparkJobValid` as the result of the validation step. Also note that this is a very simple service that does not take arguments (yet). The `runJob` method returns a `List` of `(Chromosome, TranscriptionFactor, frequency)` triples.
Compiling can be done using `sbt`, after which we restart the job server:
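Roughly like this, assuming the same sbt session as before (the exact packaging task, e.g. `package` or `test:package`, depends on where the job code lives):

```
sbt
> test:package
> re-start
```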
Interacting with the job server can be done using, e.g., `curl`. We first deploy our code for handling requests:
curl --data-binary @jobserver/target/scala-2.9.3/spark-job-server_2.9.3-0.9.0-incubating-SNAPSHOT-test.jar localhost:8090/jars/test
Please note that you might have to adapt the jar name to your local Scala and Spark versions.
Almost ready… We can now query our job server using the following `POST` request:
curl -d "" 'localhost:8090/jobs?appName=test&classPath=spark.jobserver.TransFactors&sync=true'
In the request, we specify that we want to wait for the result (`sync=true`). The result is a JSON document with two fields: `status` and `result`.
Interacting with the Spark job server
The full API is explained here. It allows you to create Spark contexts, start jobs (a)synchronously, query jobs, etc. In a follow-up post, I will describe how the Spark job server can be used as a backend for the visualisations.
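To give an idea, a few of those calls could look roughly like this (endpoints as documented in the job server README at the time; they may differ for your version):

```
# list the jars that have been uploaded
curl localhost:8090/jars

# list the jobs and their status
curl localhost:8090/jobs

# create a named Spark context that can be reused across jobs
curl -d "" 'localhost:8090/contexts/genome-context'
```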