I0U19A - Management of large-scale omics data | Augmented Intelligence for Data Analytics (AIDA) Lab

Spark exercise session

For the spark exercise, we will use a jupyter notebook with Google Colaboratory. I will send you an invitation to access the notebook.

spark colaboratory

When you receive the invitation,

File -> Save a copy in Drive... (the file that I share is read only => you can only save after you made your own copy!)
Replace the YOURNAMEHERE in the title with your name.
Do the exercises.
When finished, create a PDF (File -> Print -> Print to PDF) and submit to Toledo.

MongoDB exercise session

We’re running a mongoDB instance in the cloud using MongoDB Atlas, which is MongoDB-as-a-service.

For the mongoDB exercise, we’ll also use a jupyter notebook with Google Colaboratory like for the spark exercise above. You’ll find a link on Toledo to a read-only exercise notebook: make a copy on your own Google Drive before starting the exercise.

mongodb colaboratory

When finished, create a PDF (File -> Print -> Print to PDF) and submit to Toledo.

Neo4j exercise session

For the neo4j exercise, we’ll use the sandbox provided by neo4j itself. Go to http://neo4j.com/sandbox-v2 and log in (create a new account if necessary). You’ll be presented with a number of sandbox options (datasets). We’ll use the one named Russian Twitter Trolls. Click on “Launch Sandbox” and next on “Visit the Neo4j Browser”.

The sandboxes at neo4j provide an extensive tutorial on the background of the data, what it was used for, and how to query it. Go through the tutorial (use the right arrow to advance to the next section). If you close the tutorial card by accident, you can open it again with :play https://guides.neo4j.com/sandbox/twitter-trolls/index.html.

Neo4j tutorial

When you’re in the graph view (see image), you can click on a node and then on a property at the bottom of the card. This will show the value for that property in the nodes themselves. Very useful if you e.g. want to have the tweet text shown in the nodes.

neo4j node labels Node labels

Note: The datamodel presented in the tutorial itself is not entirely complete. Check the correct schema with the command CALL db.schema().

neo4j incomplete datamodel The incomplete datamodel

There are 14,273 users in the database, 453 of which are labelled as Troll. For example, the Troll with user_key scottgohard is the same as the User with user_key scottgohard. To find the users that are not trolls, you can use this query: MATCH (u) WHERE u:User AND NOT u:Troll RETURN u;.

There are 2 ways to look at the results from any query: as graph or as table. To switch between the two, click on the button at the left of the output screen. Depending on what information you want, you’ll need to switch between the two.

neo4j output as graph

neo4j output as table

At the end of the tutorial you will be presented with some ideas for further exploration. There’s a link to an NBC News Article: Russian trolls went on attack during key election moments. Read the article, come back to the “ideas for further exploration” page and answer the questions below. Create a PDF report to be uploaded on Toledo where - for each of these questions - you list the answer as well as the Cypher query that you used. Unless mentioned otherwise, each of the questions below should be answered using a single query.

What is the data schema? (Include the image in your report)
How many tweets were retweets (i.e. not original tweets)?
What is the text of the tweet that was retweeted most?
Which 10 hashtags were most prevalent in the retweets?
Who are the 10 trolls that tweet the most?
When trying to list the 10 non-trolls that tweet the most, you’ll get an empty list. Why is this?
Who are the 10 most mentioned non-trolls (and does anyone know Blicqer Media)?
Two of these 10 most mentioned non-trolls are @cnn and @blicqer. On its twitter homepage, @blicqer apparently refers to blicqermedia.com. However, this website cannot be reached. To get an idea of how @blicqer is different from @cnn, list the 10 top hashtags for tweets where @cnn is mentioned, and the 10 top hashtags for tweets where @blicqer is mentioned. (You can use two queries for this)
There is one troll in the database who has mentioned “Brussel, België” as their location. What is his/her name and how many followers does he/she have? Does the combination of their location and time zone make sense?
There is a peak in tweets on 22 March 2016 when three suicide bombers kill 32 people in Brussels. What are the top 10 hashtags on that day (include the number of times they are mentioned)?