This is work performed by Daniel Alcaide, unless otherwise mentioned. It is currently being written up.

Patient profiling and selection are receiving growing attention because of their large economic and societal value. Analytical methods that can handle the increasing amount of healthcare data can make this process more agile and facilitate, for example, patient recruitment for clinical trials. However, these processes are currently extremely labor-intensive. Here we present the application of STAD to intensive care unit patients.

A proof-of-principle interface can be found at The code underlying this interface is available on github at

What’s the distance between diagnoses?

The MIMIC-III critical care database (described in this paper) contains deidentified health data for almost 60,000 intensive care unit patients. A lot of information is available for each patient, including a list of diagnoses (encoded using ICD-9). To see if we can find substructures in this patient population, we need to calculate distances between patients, and we’ll focus on the diagnoses to do this.

Unfortunately, there is an issue: no simple distance metric exists for lists of diagnoses for patients. This is because they are categorical data (i.e. each diagnosis is a category) that are put in a specific order (i.e. the first diagnosis in the list is the most important, and importance drops as you go down the list).
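To illustrate the problem, here is a minimal sketch that assumes the Jaccard distance as an example of a simple set-based metric (it is not taken from the original work, and the function and variable names are hypothetical). It only looks at which codes are present, so two patients with the same diagnoses in reversed order of importance end up at distance zero.

```python
def jaccard_distance(codes_a, codes_b):
    """Set-based distance: 1 - |A intersect B| / |A union B|. Ignores list order."""
    a, b = set(codes_a), set(codes_b)
    if not a and not b:
        return 0.0
    return 1.0 - len(a & b) / len(a | b)


# Hypothetical patients: same ICD-9 codes, opposite order of importance.
primary_sepsis = ["99591", "5990", "4019"]
primary_hypertension = ["4019", "5990", "99591"]

print(jaccard_distance(primary_sepsis, primary_hypertension))  # 0.0
```

A metric for this data should therefore take the position of each code into account, not just its presence.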

| Rank | Patient X: code | Patient X: diagnosis | Patient Y: code | Patient Y: diagnosis |
|---|---|---|---|---|
| 1 | 99662 | Infection and inflammatory reaction due to other vascular device, implant, and graft | 4329 | Unspecified intracranial hemorrhage |
| 2 | 99591 | Sepsis | 4019 | Unspecified essential hypertension |
| 3 | 5990 | Urinary tract infection, site not specified | 99702 | Iatrogenic cerebrovascular infarction or hemorrhage |
| 4 | 4019 | Unspecified essential hypertension | 99591 | Sepsis |
| 5 | | | 5990 | Urinary tract infection, site not specified |
| 6 | | | 43491 | Cerebral artery occlusion, unspecified with cerebral infarction |

Codes 2, 3 and 4 of patient X correspond to codes 4, 5 and 2 of patient Y (in that order). To make sure that not only the presence or absence of a code is considered, but also its position, we can use the following measure for each shared code:

\[M_{c_{X},c_{Y}} = \ln\left(1 + \frac{1}{\max(position_{c_{X}}, position_{c_{Y}})}\right)\]

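where $c_{X}$ and $c_{Y}$ denote the same code in patient X and patient Y, respectively, and the position of a code is its rank in that patient's diagnosis list (1 being the most important).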

To get the distance between two patients, rather than the value for a single code that they share, we sum these values over all codes that occur in both patients:

\[D(X,Y) = 1 - S(X,Y) = 1 - \sum_{c \in X \cap Y} M_{c_{X},c_{Y}}\]
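As a worked example, the following sketch applies the two formulas above to the example patients X and Y. It is an illustration, not the code behind the interface: the function names are hypothetical, and it assumes that each code occurs at most once per patient and that positions start at 1.

```python
import math


def code_similarity(pos_x, pos_y):
    """M for one shared code: ln(1 + 1 / max(position in X, position in Y))."""
    return math.log(1.0 + 1.0 / max(pos_x, pos_y))


def patient_distance(codes_x, codes_y):
    """D(X, Y) = 1 - S(X, Y), with S the sum of M over the shared codes.

    Assumes each list is ordered by importance and holds unique codes.
    """
    pos_x = {code: rank for rank, code in enumerate(codes_x, start=1)}
    pos_y = {code: rank for rank, code in enumerate(codes_y, start=1)}
    shared = pos_x.keys() & pos_y.keys()
    similarity = sum(code_similarity(pos_x[c], pos_y[c]) for c in shared)
    return 1.0 - similarity


# ICD-9 codes of the example patients, in order of importance.
patient_x = ["99662", "99591", "5990", "4019"]
patient_y = ["4329", "4019", "99702", "99591", "5990", "43491"]

# Shared codes: 99591 (max position 4), 5990 (max position 5), 4019 (max position 4)
# S = ln(1.25) + ln(1.2) + ln(1.25)
print(round(patient_distance(patient_x, patient_y), 4))  # 0.3714
```

Note that for two identical lists of $n$ codes the sum telescopes to $S = \ln(n + 1)$, so with the formula exactly as written $D$ drops below zero once patients share two or more codes at matching positions. Whether the values are normalised or clipped before building the network is not stated here, so the sketch leaves them as they are.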

What does such a network look like?

Using this metric, the STAD network for patients in the MIMIC-III database that suffer from a “pathological fracture of vertebrae” looks like this:

As usual, colours are assigned automatically using community detection.

A complete user interface to explore these networks can be found at