This post is part of a collection for our students in “Managing Large Omics Datasets” (I0U19A) at the KU Leuven. In this exercise, we will perform queries on a MongoDB database that has been populated with the beer dataset.

Preparation

As with the Hadoop exercises, we’ll use Docker containers. See this blog post with the hadoop exercise for a refresher.

To run, type docker run -d -p 27017:27017 jandot/mongo-i0u19a, and then:

  • if you have the mongo client locally: mongo --host 192.168.99.100
  • if you don’t: docker run -it --rm jandot/mongo-i0u19a /bin/bash, and then mongo --host 192.168.99.100

In case you’re running this on a linux machine, you can remove the --host 192.168.99.100 part.

1. Simple Queries using the Mongo Shell

In doing these exercises, please refer to the MongoDB documentation at the following sites:

Connect to the mongoDB instance using the mongo shell as indicated above.

Jans-MacBook-Air-2:~ jaerts$ mongo --host 192.168.99.100
MongoDB shell version: 3.0.4
connecting to: 192.168.99.100:27017/test
>

You now get the MongoDB prompt in which we will work. At the end, you can escape the MongoDB shell by typing exit. Type

> show dbs

to know what databases are available. Connect to the database for this exercise:

> use i0u19a

The command

> show collections

returns the list of collections (“tables”) that are stored in the collection

To exit the MongoDB shell and return to your linux prompt, use

> quit()

Now what does the data look like? Each document covers a single beer. Let’s check what a document looks like:

> db.beers.findOne()

The output shows us that one beer can have more than one type, for example.

{
	"_id" : ObjectId("57193db714ab80806edea386"),
	"beer" : "Postel Tripel",
	"brewery" : "Affligem Brouwerij",
	"type" : [
		"blonde tripel",
		"abdijbier"
	],
	"alcoholpercentage" : 8.5
}

Exercises

To get our feet wet, we’ll first try some very simple queries. Using the beers collection:

  • How many beers are there in the database?
  • Return the first 5 beers.
  • How many beers in the database are of type “blond troebel”?
  • Of these “blond troebel” beers, only return the name of the beer.
  • How many beers have a percentage alcohol of more than 8 degrees?
  • How many beers have low alcohol (“alcoholarm”)?

2. Aggregate in MongoDB

It is possible in MongoDB to create pipelines to process data while querying by sending a stream of data through a list of commands, similar to how you would write a pipeline on the linux command line. This comes in very handy when wanting to aggregate data. Specific commands that you can use include (but are not limited to):

  • $project: reshape each document
  • $match: filter the stream
  • $limit: return only the first n documents
  • $unwind: deconstruct a list in each document into separate documents
  • $group: group documents by a given identifier
  • $sample: take a random sample
  • $out: write the results to a new collection. If used, this should be the last step of the pipeline.

For a full list of commands to use in aggregation in MongoDB, see https://docs.mongodb.org/manual/reference/operator/aggregation-pipeline/

A full list of accumulators for $group can be found here: https://docs.mongodb.org/manual/reference/operator/aggregation/group/, and includes $sum, $avg, $max, $min and others.

These commands can be combined in different ways to alter the stream as it passes through them. For example, let’s select those beers that have a percentage of more than 8 degrees, get the average of these per brewery, and finally take a sample.

db.beers.aggregate([
  {$match: {alcoholpercentage: {$gt: 8}}},
  {$group: {_id: "$brewery", avg: {$avg: "$alcoholpercentage"}}},
  {$sample: {size: 5}}
])

The output of this command looks like this:

{ "_id" : "Brouwerij De Graal voor t'Drankorgel", "avg" : 8.3 }
{ "_id" : "Brasserie Saint Feuillien", "avg" : 9.5 }
{ "_id" : "Brouwerij Lefebvre", "avg" : 8.3 }
{ "_id" : "Brouwerij du Bocq voor Corsendonk nv", "avg" : 8.5 }
{ "_id" : "Brouwerij Liefmans", "avg" : 8.5 }

Exercises

  • What is the average alcoholpercentage per brewery?
  • Which breweries have an average alcohol percentage higher than 10 degrees? Return these in descending order of alcoholpercentage.
  • What is the average alcoholpercentage per type of beer? Sort by alcoholpercentage.
  • What is the range (max - min) of alcoholpercentage for beers per brewery that brews more than 1 beer?

3. MapReduce in MongoDB

Although aggregate in MongoDB is very useful and easy to use, mapreduce does provide some more flexibility.

To use mapreduce, you will define two functions: a map function and a reduce function. See the hadoop exercise from an earlier post for a refresher. However, the mongo shell uses javascript rather than python as its language.

So remember: one document looks like this:

{
  "_id" : ObjectId("57193db714ab80806edea386"),
  "beer" : "Postel Tripel",
  "brewery" : "Affligem Brouwerij",
  "type" : [
    "blonde tripel",
    "abdijbier"
  ],
  "alcoholpercentage" : 8.5
}

Suppose we want to know the number of beers per brewery.

1. Define the map function to process each input document

  • In the function, the keyword this refers to the document that the map-reduce operation is processing.
  • The function maps the number 1 to the brewery for each document and emits this pair. The 1 is basically the count of number of beers for that brewery for that specific document.

In code for the Mongo shell, this means:

var mapFunction1 = function() {
  emit(this.brewery, 1);
};

2. Define the corresponding reduce function with two arguments brewery and counts:

When we will run the mapReduce command (see 3 below), the values of the map function (in this case: the 1’s) are automatically put in an array before they are handed to the reduce function. In other words:

  • your map function should return something like this:
brewery1   1
brewery2   1
brewery3   1
brewery2   1
brewery2   1
brewery3   1
  • but what your reduce function gets, is this:
brewery1   [1]
brewery2   [1,1,1]
brewery3   [1,1]

So based on this, our reduce function should just sum the elements of the value array for each input that it is handed.

var reduceFunction1 = function(brewery, values) {
  return Array.sum(values)
}

3. Perform the map-reduce on all documents in the beers collection using the mapFunction1 map function and the reduceFunction1 reduce function:

We finally combine the map and reduce functions using the mapReduce function. Note that you need to tell mongo mapReduce where to put the results. You can either put them in a new collection, or have them send to the screen.

db.beers.mapReduce(
  mapFunction1,
  reduceFunction1,
  { out: "numberBeersPerBrewery" }
)

This command will create a new collection named numberBeersPerBrewery that will contain the results. To have the output sent to the screen instead, use { out: {inline: 1} } instead.

Debugging tip: to find out if your map function does what you expect it would do, write a reduce function that does nothing, e.g.

var reduceFunctionTest = function(key, values) {
  return {k: key, v: values}
}

You can find info about MapReduce in MongoDB here: http://docs.mongodb.org/manual/core/map-reduce/ and http://docs.mongodb.org/manual/tutorial/map-reduce-examples/.

Exercises

If you followed along, you will now have an additional collection called numberBeersPerBrewery.

  • Using the numberBeersPerBrewery collection that you just generated, get the top-10 of the breweries. How can we sort from high to low?
  • Find all entries in the collection numberBeersPerBrewery, that contain the word ‘Inbev’ in the brewery field. You will probably get 3 results. However, there should be 9. Why? How can you solve that?
  • Difficult: using a single mapreduce on the beers collection, calculate the maximum alcohol percentage per type of beer.
  • Difficult: using a single mapReduce on the beers collection, calculate the average alcohol percentage per type of beer. Remember that in order to calculate an average, you will first need a sum and a count. Hint: This exercise will require you to define a finalizing step in the MapReduce operation. Revisit the MongoDB examples if this doesn’t ring a bell. Also: watch out: reduce will not run if there is only one element for a given key (see this stackoverflow discussion. You’ll have to capture this in the finalize step.