Spark exercise session
For the Spark exercise, we will use a Jupyter notebook on Google Colaboratory. I will send you an invitation to access the notebook.
When you receive the invitation:
- Save a copy in Drive... (the file that I share is read-only, so you can only save your work after you have made your own copy).
- Replace the YOURNAMEHERE in the title with your name.
- Do the exercises.
- When finished, create a PDF (
MongoDB exercise session
We’re running a MongoDB instance in the cloud using MongoDB Atlas, which is MongoDB-as-a-service.
For the MongoDB exercise, we’ll also use a Jupyter notebook on Google Colaboratory, as for the Spark exercise above. You’ll find a link on Toledo to a read-only exercise notebook: make a copy on your own Google Drive before starting the exercise.
When finished, create a PDF (
Neo4j exercise session
For the Neo4j exercise, we’ll use the sandbox provided by Neo4j itself. Go to http://neo4j.com/sandbox-v2 and log in (create a new account if necessary). You’ll be presented with a number of sandbox options (datasets). We’ll use the one named Russian Twitter Trolls. Click on “Launch Sandbox” and then on “Visit the Neo4j Browser”.
The Neo4j sandboxes provide an extensive tutorial on the background of the data, what it was used for, and how to query it. Go through the tutorial (use the right arrow to advance to the next section).
If you close the tutorial card by accident, you can open it again with
When you’re in the graph view (see image), you can click on a node and then on a property at the bottom of the card. This shows the value of that property inside the nodes themselves, which is very useful if, for example, you want the tweet text displayed in the nodes.
Note: The data model presented in the tutorial itself is not entirely complete. Check the correct schema with the command
The incomplete data model
There are 14,273 users in the database, 453 of which are labelled as Troll. For example, the Troll with scottgohard is the same as the User with scottgohard. To find the users that are not trolls, you can use this query:
MATCH (u) WHERE u:User AND NOT u:Troll RETURN u;
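To make the label logic behind this query concrete, here is a minimal toy model in plain Python. It does not use the Neo4j API and does not touch the sandbox data: nodes are ordinary dictionaries, each with a set of labels. The property key `handle`, the helper `has_label`, and the two `example_user_*` entries are hypothetical names introduced only for illustration. The sketch assumes, as the text above suggests, that a troll node carries both the User and the Troll label.

```python
# Toy model only: plain Python dicts standing in for graph nodes.
# "handle" and the example_user_* entries are made up for illustration.
nodes = [
    {"handle": "scottgohard", "labels": {"User", "Troll"}},
    {"handle": "example_user_a", "labels": {"User"}},
    {"handle": "example_user_b", "labels": {"User"}},
]


def has_label(node, label):
    """Mimic the Cypher label predicate `n:Label` on a toy node."""
    return label in node["labels"]


# Same idea as: MATCH (u) WHERE u:User AND NOT u:Troll RETURN u
non_trolls = [
    n["handle"]
    for n in nodes
    if has_label(n, "User") and not has_label(n, "Troll")
]

# Matching on the User label alone still includes the troll,
# because one node can carry several labels at once.
all_users = [n["handle"] for n in nodes if has_label(n, "User")]

print(non_trolls)
print(all_users)
```

The point of the sketch is that a label test is a membership check on a single node, so "User but not Troll" needs both conditions on the same node rather than a lookup in two separate collections.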
There are two ways to look at the results of any query: as a graph or as a table. To switch between the two, click on the button at the left of the output screen. Which view you need depends on the information you are after.
At the end of the tutorial you will be presented with some ideas for further exploration. There’s a link to an NBC News article: Russian trolls went on attack during key election moments. Read the article, come back to the “ideas for further exploration” page and answer the questions below. Create a PDF report to be uploaded on Toledo in which, for each of these questions, you list the answer as well as the Cypher query that you used. Unless mentioned otherwise, each of the questions below should be answered using a single query.
- What is the data schema? (Include the image in your report)
- How many tweets were retweets (i.e. not original tweets)?
- What is the text of the tweet that was retweeted most?
- Which 10 hashtags were most prevalent in the retweets?
- Who are the 10 trolls that tweet the most?
- When trying to list the 10 non-trolls that tweet the most, you’ll get an empty list. Why is this?
- Who are the 10 most mentioned non-trolls (and does anyone know Blicqer Media)?
- Two of these 10 most mentioned non-trolls are @cnn and @blicqer. On its Twitter homepage, @blicqer apparently refers to blicqermedia.com. However, this website cannot be reached. To get an idea of how @blicqer differs from @cnn, list the top 10 hashtags for tweets where @cnn is mentioned, and the top 10 hashtags for tweets where @blicqer is mentioned. (You can use two queries for this.)
- There is one troll in the database who has mentioned “Brussel, België” as their location. What is their name and how many followers do they have? Does the combination of their location and time zone make sense?
- There is a peak in tweets on 22 March 2016, when three suicide bombers killed 32 people in Brussels. What are the top 10 hashtags on that day (include the number of times they are mentioned)?