Data transformations | Data Visualisation in Data Science

Sometimes we’ll want to do some calculations on the data before we actually visualise them. For example, we want to make a barchart that shows the average miles per gallon for each number of cylinders.

To do this, we’ll have to add a transform part to our specification:

{
  "data": ...,
  "transform": ...,
  "mark": ...,
  "encoding": ...
}

There is extensive documentation available for these transforms at https://vega.github.io/vega-lite/docs/transform.html. Possible transformations that we can apply are: aggregate, bin, calculate, density, filter, flatten, fold, impute, join aggregate, lookup, pivot, quantile, regression and loess regression, sample, stack, time unit, and window.

Filtering

In the case of filtering, it is quite clear what will happen: only the objects that match will be displayed. We can for example show a barchart of acceleration only for those cars that have 5 or fewer cylinders. One of the problems that we run into, is that the specification needs to be in JSON format. To say that we only want cars with 5 or fewer cylinders, we’ll use "filter": {"field": "Cylinders", "lte": "5"}. The lte stands for “less than or equal to”. There is also:

equal
lt (less than)
gt (great than)
gte (greater than or equal to)
range
oneOf

{
  "$schema": "https://vega.github.io/schema/vega-lite/v4.json",
  "data": {
    "url": "https://raw.githubusercontent.com/vega/vega/master/docs/data/cars.json"
  },
  "transform": [
    {
      "filter": {"field": "Cylinders", "lte": "5"}
    }
  ],
  "mark": "bar",
  "encoding": {
    "x": {"field": "Cylinders", "type": "quantitative"},
    "y": {"field": "Acceleration", "type": "quantitative"},
    "size": {"value": 20}
  }
}

Another option is to use a filter like this: {"filter": "datum.Cylinders <= 5"} where datum stands for the single object, and .Cylinders will get the value for that property (“Cylinders” is capitalised in the data that we are loading, so we have to capitalise it here as well).

Both will give the following image:

A filter does not change the data objects itself. This is different for many other transformations. For example, we can calculate as well. For example, the “Year” attribute in each object is now a string, e.g. “1970-01-01”. It’d be good if this would be a number. We’ll need to look into vega expressions on how to do this here. There seem to be date-time functions, and it appears we can extract the year with year(datum.Year).

What does this do? This effectively adds a new field to each object, called yearonly. We can now use this new field as any other.

{
  "$schema": "https://vega.github.io/schema/vega-lite/v4.json",
  "data": {
    "url": "https://raw.githubusercontent.com/vega/vega/master/docs/data/cars.json"
  },
  "transform": [
      {"calculate": "year(datum.Year)", "as": "yearonly"}
  ],
  "mark": "point",
  "encoding": {
    "x": {"field": "Miles_per_Gallon", "type": "quantitative"},
    "y": {"field": "Acceleration", "type": "quantitative"},
    "color": {"field": "yearonly", "type": "ordinal"}
  }
}

Exercise - Create an image that plots the original Year versus the new yearonly.

Aggregation

So with calculations, we get an additional field. What if we want to aggregate? Let’s go back to our initial question: we want to have a barchart that shows the average miles per gallon for each number of cylinders. Below is the specification:

{
  "$schema": "https://vega.github.io/schema/vega-lite/v4.json",
  "data": {
    "url": "https://raw.githubusercontent.com/vega/vega/master/docs/data/cars.json"
  },
  "transform": [
    {
      "aggregate": [{
         "op": "mean",
         "field": "Acceleration",
         "as": "mean_acc"
      }],
      "groupby": ["Cylinders"]
    }
  ],
  "mark": "bar",
  "encoding": {
    "x": {"field": "Cylinders", "type": "quantitative"},
    "y": {"field": "mean_acc", "type": "quantitative"}
  }
}

In the documentation, we see that aggregate takes a AggregatedFieldDef[], and groupby takes a String[]. The [] after each of these indicates that they should be arrays, not single values. That is why we use "aggregate": [{...}] instead of "aggregate": {...} and "groupby": ["Cylinders"] instead of "groupby": "Cylinders".

Exercise - See if you can create a plot that shows the mean acceleration per year. So you’ll have to combine two transforms to do this. Your output picture should look like this:

Binning

As another example, let’s create a histogram of the miles per gallon. Looking at the documentation at https://vega.github.io/vega-lite/docs/bin.html, it seems that the easiest way to do this is to do this in the encoding section:

{
  "$schema": "https://vega.github.io/schema/vega-lite/v4.json",
  "data": {
    "url": "https://raw.githubusercontent.com/vega/vega/master/docs/data/cars.json"
  },
  "mark": "bar",
  "encoding": {
    "x": {"bin": true, "field": "Miles_per_Gallon", "type": "quantitative"},
    "y": {"aggregate": "count", "type": "quantitative"}
  }
}

The only thing to do was to add "bin": true to the field that you want to bin, and "aggregate": "count" to the other dimension. However, this approach is not very flexible, and for any use that is not this straightforward you will have to define the binning as a transform instead, like this:

{
  "$schema": "https://vega.github.io/schema/vega-lite/v4.json",
  "data": {
    "url": "https://raw.githubusercontent.com/vega/vega/master/docs/data/cars.json"
  },
  "transform": [
    {"bin": true, "field": "Miles_per_Gallon", "as": "binned_mpg"}
  ],
  "mark": "bar",
  "encoding": {
    "x": {"field": "binned_mpg", "bin": {"binned": true,"step": 1},"type": "quantitative"},
    "x2": {"field": "binned_mpg_end"},
    "y": {"aggregate": "count", "type": "quantitative"}
  }
}

Additional fields are added to the data object

When defining bin in a transform, it will create two new fields for each object: binned_mpg and binned_mpg_end. These indicate the boundaries of the bin that that object fits into. For example, the object

{
   "Name":"chevrolet chevelle malibu",
   "Miles_per_Gallon":18,
   "Cylinders":8,
   "Displacement":307,
   "Horsepower":130,
   "Weight_in_lbs":3504,
   "Acceleration":12,
   "Year":"1970-01-01",
   "Origin":"USA"
}

becomes

{
   "Name":"chevrolet chevelle malibu",
   "Miles_per_Gallon":18,
   "Cylinders":8,
   "Displacement":307,
   "Horsepower":130,
   "Weight_in_lbs":3504,
   "Acceleration":12,
   "Year":"1970-01-01",
   "Origin":"USA",
   "binned_mpg": 15,
   "binned_mpg_end": 20
}

Yet another way of creating a histogram is to work with two transforms: one to bin the data, and one to count the number of elements in the bin. This basically takes the output of the binning transform (i.e. the new binned_mpg field from above) and calculates the count on that. This way, the encoding is simpler to understand and we don’t have to do magic incantations within the definition of x and y.

{
  "$schema": "https://vega.github.io/schema/vega-lite/v4.json",
  "data": {
    "url": "https://raw.githubusercontent.com/vega/vega/master/docs/data/cars.json"
  },
  "transform": [
    {"bin": true, "field": "Miles_per_Gallon", "as": "binned_mpg"},
    {
      "aggregate": [{
         "op": "count",
         "field": "binned_mpg",
         "as": "count_mpg"
      }],
      "groupby": ["binned_mpg"]
    }
  ],
  "mark": "bar",
  "encoding": {
    "x": {"field": "binned_mpg","type": "quantitative"},
    "y": {"field": "count_mpg", "type": "quantitative"}
  }
}

Exercise - Create a plot showing the mean acceleration per bin of miles per gallon.