Data Management

5. Graph databases

5.1. Introduction

Graphs are used in a wide range of applications, from fraud detection (see the Panama Papers) and anti-terrorism and social marketing to drug interaction research and genome sequencing.

hairball

Graphs or networks are data structures where the most important information is the relationship between entities rather than the entities themselves, such as friendship relationships. Whereas in relational databases you typically aggregate operations on sets, in graph databases you’ll more typically hop around relationships between records. Graphs are very expressive, and any type of data can be modelled as one (although that is no guarantee that a particular graph is fit for purpose).

Graphs come in all shapes and forms. Links can be directed or undirected, weighted or unweighted. They can be directed acyclic graphs (where no loops exist), consist of one or more connected components, and actually consist of multiple graphs themselves. The latter (so-called multi-layer networks) can e.g. be a first network representing friendships between people, a second network representing cities and how they are connected through public transportation, and both being linked by which people work in which cities.

graph types

5.2. Nomenclature

A graph consists of vertices (aka nodes, aka objects) and edges (aka links, aka relations), where an edge is a connection between two vertices. Both vertices and edges can have properties.

\[G = (V,E)\]

Any graph can be described using different metrics:

order of a graph = number of nodes
size of a graph = number of edges
graph density = how much its nodes are connected. In a dense graph, the number of edges is close to the maximal number of edges (i.e. a fully-connected graph).
- for undirected graphs, this is: \[\frac{2 |E|}{|V|(|V|-1)}\]
- for directed graphs, this is: \[\frac{|E|}{|V|(|V|-1)}\]
the degree of a node = how many edges are connected to the node. This can be separated into in-degree and out-degree, which are - respectively - the number of incoming and outgoing edges.
the distance between two nodes = the number of edges in the shortest path between them
the diameter of a graph = the maximum distance in a graph
a d-regular graph = a graph where the maximum degree is the same as the minimum degree d
a path = a sequence of edges that connects a sequence of different vertices
a connected graph = a graph in which there exists a direct connection between any two vertices

5.3. Centralities

Another important way of describing nodes is based on their centrality, i.e. how central they are in the network. There exist different versions of this centrality:

degree centrality: how many other vertices a given vertex is connected to. This is the same as node degree.
betweenness centrality: how many critical paths go through this node? In other words: without these nodes, there would be no way for to neighbours to communicate. \[ C_{B}(i)=\frac{\sum\limits_{j \neq k} g_{jk} (i)}{g_{jk}} \xrightarrow[]{normalize} C'_B = \frac{C_B(i)}{(n-1)(n-2)/2}\] , where the denominator is the number of vertex pairs excluding the vertex itself. \(g_jk(i)\) is number of shortest paths between \(j\) and \(k\), going through i; \(g_jk\) is the total number of shortest paths between \(j\) and \(k\).
closeness centrality: how much is the node in the "middle" of things, not too far from the center. This is the inverse total distance to all other nodes. \[C_C(i) = \frac{1}{\sum\limits_{j=1}^N d(i,j)} \xrightarrow[]{normalize} C'_C(i) = \frac{C_C(i)}{N-1}\]

In the image below, nodes in A are coloured based on betweenness centrality, in B based on closeness centrality, and in D on degree centrality.

centralities

5.4. Graph mining

Graphs are very generic data structures, but are amenable to very complex analyses. These include the following.

5.4.1. Community detection

A community in a graph is a group of nodes that are densely connected internally. You can imagine that e.g. in social networks we can identify groups of friends this way.

graph communities

Several approaches exist to finding communities:

null models: a community is a set of nodes for which the connectivity deviates the most from the null model
block models: identify blocks of nodes with common properties
flow models: a community is a set of nodes among which a flow persists for a long time once entered

The infomap algorithm is an example of a flow model (for a demo, see http://www.mapequation.org/apps/MapDemo.html).

5.4.2. Link prediction

When data is acquired from a real-world source, this data might be incomplete and links that should actually be there are not in the dataset. For example, you gather historical data on births, marriages and deaths from church and city records. There is therefore a high chance that you don’t have all data. Another domain where this is important is in protein-protein interactions.

Link prediction can be done in different ways, and can happen in a dynamic or static setting. In the dynamic setting, we try to predict the likelihood of a future association between two nodes; in the static setting, we try to infer missing links. These algorithms are based on a similarity matrix between the network nodes, which can take different forms:

graph distance: the length of the shortest path between 2 nodes
common neighbours: two strangers who have a common friend may be introduced by that friend
Jaccard’s coefficient: the probability that 2 nodes have the same neighbours
frequency-weighted common neighbours (Adamic/Adar predictor): counts common features (e.g. links), but weighs rare features more heavily
preferential attachment: new link between nodes with high number of links is more likely than between nodes with low number of links
exponential damped path counts (Katz measure): the more paths there are between two nodes and the shorter these paths are, the more similar the nodes are
hitting time: random walk starts at node A ⇒ expected number of steps required to reach node B
rooted pagerank: idem, but periodical reset to prevent that 2 nodes that are actually close are connected through long deviation

5.4.3. Subgraph mapping

Subgraph mining is another type of query that is very important in e.g. bioinformatics. Some example patterns:

[A] three-node feedback loop
[B] tree chain
[C] four-node feedback loop
[D] feedforward loop
[E] bi-parallel pattern
[F] bi-fan

network motifs

It is for example important when developing a drug for a certain disease by knocking out the effect of a gene that that gene is not in a bi-parallel pattern (V2 in image E above) because activation of node V4 is saved by V3.

5.5. Data modeling

In general, vertices are used to represent things and edges are used to represent connections. Vertex properties can include e.g. metadata such as timestamp, version number etc; edges properties often include the weight of a connection, but can also cover things like the quality of a relationship and other metadata of that relationship.

Below is an example of a graph: graph

Basically all types of data can be modelled as a graph. Consider our buildings table from above:

id	name	address	city	type	nr_rooms	primary_or_secondary
1	building1	street1	city1	hotel	15
2	building2	street2	city2	school		primary
3	building3	street3	city3	hotel	52
4	building4	street4	city4	church
5	building5	street5	city5	house
…	…	…	…	…	…	…

name

address

city

type

nr_rooms

primary_or_secondary

building1

street1

city1

hotel

building2

street2

city2

school

primary

building3

street3

city3

hotel

building4

street4

city4

church

building5

street5

city5

house

…

This can also be represented as a network, where:

every building is a vertex
every value for a property is a vertex as well
the column becomes the relation

For example, the information for the first building can be represented as such:

examplegraph

There is actually a formal way of describing this called RDF, but we won’t go into that here…