Applying STAD to ICD9 diagnosis codes and developing a new distance metric on the way

This is work performed by Daniel Alcaide, unless otherwise mentioned. It is currently being written up.

Patient profiling and selection receive growing attention due to the large economic and societal value. The involvement of analytical methods that are able to handle the increasing amount of healthcare data can make this process more agile and facilitate, for example, patient recruitment in clinical trials. However, these processes are currently extremely labor-intensive. Here we present the application of STAD on intensive care unit patients.

A proof-of-principle interface can be found at https://dalcaide.shinyapps.io/diagnosis_explorer/. The code underlying this interface is available on github at https://github.com/vda-lab/ICD_diagnosis_explorer.

What’s the distance between diagnoses?

The MIMIC-III critical care database (described in this paper) contains deidentified health data for almost 60,000 intensive care unit patients. A lot of information is available for each patient, including a list of diagnoses (encoded using ICD-9). For see if we can find substructures in this patient population, we need to calculate distances between them, and we’ll focus on the diagnoses to do this.

Unfortunately, there is an issue: no simple distance metric exists for lists of diagnoses for patients. This is because they are categorical data (i.e. each diagnosis is a category) that are put in a specific order (i.e. the first diagnosis in the list is the most important, and importance drops as you go down the list).

Patient X			Patient Y
Order	ICD	Description	Order	ICD	Description
1	99662	Infection and inflammatory reaction due to other vascular device, implant, and graft	1	4329	Unspecified intracranial hemorrhage
2	99591	Sepsis	2	4019	Unspecified essential hypertension
3	5990	Urinary tract infection, site not specified (5990)	3	99702	Iatrogenic cerebrovascular infarction or hemorrhage
4	4019	Unspecified essential hypertension	4	99591	Sepsis
			5	5990	Urinary tract infection, site not specified
			6	43491	Cerebral artery occlusion, unspecified with cerebral infarction

Codes 2, 3 and 4 of patient 1 correspond to codes 4, 5 and 2 of patient 2 (in that order). To make sure that not only presence/absence of a code is considered, but also its position, we can use the following distance metric:

\[M_{c_{X},c_{Y}} = ln(1 + \frac{1}{max(position_{c_{X}}, position_{c_{Y}})})\]

where $c_{X}$ and $c_{Y}$ are the same code in patient X or Y, respectively.

To get to the distance between patients rather than between a single code in 2 patients, we sum these values:

\[D(X,Y) = 1 - S(X,Y) = 1 - \sum_{i=1}^{n}M(X \cap Y)\]

What does such network look like?

Using this metric, the STAD network for patients in the MIMIC-III database that suffer from a “pathological fracture of vertebrae” looks like this:

As usual, colours are assigned automatically using community detection.

A complete user interface to explore these networks can be found at https://dalcaide.shinyapps.io/diagnosis_explorer/.