Case Study: Air Quality
This notebook analyses the air-quality dataset collected by the Castile and León initiative in Spain, previously analysed by Alcaide & Aerts (2020). They found several interesting patterns that we attempt to re-discover using lensed UMAP. For example:
Seasonal patterns related to emissions due to heating systems.
Seasonal patterns in ozone due to variations in UV radiation.
Decreases in particulate matter due to new vehicle restrictions (2002).
General decreases in emissions up to 2010.
Setup
Loading libraries:
[1]:
%load_ext autoreload
%autoreload 2
[2]:
# Data
import time
import datetime
import numpy as np
import pandas as pd
from scipy.stats import kstest
from sklearn.impute import KNNImputer
from sklearn.preprocessing import RobustScaler
# UMAP
from umap import UMAP
import lensed_umap as lu
# Plotting (datashader is used by the shading cells below)
import datashader as ds
Custom code:
We use several functions throughout this notebook to simplify working with the data and UMAP models. The (hidden) cells in this section define those functions and constants.
Plotting constants:
Category values for: seasons, periods, features
Tick values for: years
Category to color maps for: seasons, periods, features
Data processing:
store_coordinates
stores embedding coordinates in the dataframe to simplify plotting.
edge_paths
transforms edges into a dataframe to efficiently plot them using datashader.
melt_features
pivots the dataframe to treat features as categories when plotting.
classify_year
splits years into 4 periods / segments.
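For reference, here is a minimal sketch of how two of these helpers could look. This is a hypothetical implementation to clarify the data flow; the hidden cells may differ in naming and details (in particular, the suffixed coordinate column layout is an assumption):
import numpy as np
import pandas as pd

def store_coordinates(df, projector, suffix=""):
    # Copy the model's 2D embedding into the dataframe (assumed column names).
    sep = "_" if suffix else ""
    df[f"x{sep}{suffix}"] = projector.embedding_[:, 0]
    df[f"y{sep}{suffix}"] = projector.embedding_[:, 1]

def edge_paths(projector):
    # List edge endpoints separated by NaN rows, the segment format
    # datashader's line renderer expects.
    graph = projector.graph_.tocoo()  # UMAP's fuzzy simplicial set
    emb = projector.embedding_
    xs = np.full(3 * graph.nnz, np.nan)
    ys = np.full(3 * graph.nnz, np.nan)
    xs[0::3], xs[1::3] = emb[graph.row, 0], emb[graph.col, 0]
    ys[0::3], ys[1::3] = emb[graph.row, 1], emb[graph.col, 1]
    return pd.DataFrame({"x": xs, "y": ys})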
HoloViz plotting:
shade_edges
draws edges, inspired by UMAP’s advanced plotting functionality.
shade_points
colours points by numerical features (one colormap).
shade_categories
colours points by categories.
The data
The data describes the concentration of seven chemicals in the air: CO (mg/m3), NO (ug/m3), NO2 (ug/m3), SO2 (ug/m3), O3 (ug/m3), PM10 (ug/m3), and PM25 (ug/m3). Measurements are taken daily at several locations, with more locations being added over time. The measurements started in 1997 and ran until the end of 2020. We load the data from a .csv file and add columns indicating the year, week, and season:
[7]:
# Load data
df = pd.read_csv("./data/air/calidad-del-aire-datos-historicos-diarios.csv", sep=";")
df.rename(
columns={"Fecha": "date", "Latitud": "latitude", "Longitud": "longitude"},
inplace=True,
)
df.drop(columns=["Provincia", "Estación", "Posición"], inplace=True)
df["year"] = df.date.apply(lambda x: int(x.split("-")[0]))
df["week"] = df.date.apply(
lambda x: datetime.date(*[int(s) for s in x.split("-")]).isocalendar().week
).astype(int)
df["season"] = df.date.apply(lambda x: seasons[int(x.split("-")[1]) % 12 // 3])
df["season"] = df["season"].astype("category").cat.set_categories(seasons, ordered=True)
df["periods"] = df.year.apply(classify_year).astype("category").cat.set_categories(periods, ordered=True)
The data contains numerous missing values, summarised below:
[8]:
# Treat negative readings as missing values
df[vdims] = df[vdims].mask(df[vdims] < 0)
df.isnull().sum() / df.shape[0]
[8]:
date 0.000000
CO (mg/m3) 0.773195
NO (ug/m3) 0.069475
NO2 (ug/m3) 0.072906
O3 (ug/m3) 0.382499
PM10 (ug/m3) 0.227426
PM25 (ug/m3) 0.879412
SO2 (ug/m3) 0.201290
latitude 0.000507
longitude 0.000507
year 0.000000
week 0.000000
season 0.000000
periods 0.000000
dtype: float64
Alcaide & Aerts (2020) dealt with these missing values by aggregating measurements up to weeks. In this notebook, we take another approach, ensuring we keep locations separate and retain a larger number of observations. Specifically, we remove the columns with more than 40% missing values. Then, we remove observations with missing values in columns with more than 10% missing values. Finally, the remaining missing values are imputed using a KNNImputer. The concentrations are then scaled to be in the same value-range using a robust z-score. The resulting dataset contains 181.386 observations.
[ ]:
# Drop columns with more than 40% missing
df = df.drop(columns=["CO (mg/m3)", "PM25 (ug/m3)"])
# Drop observations with missing values in columns with more than 10%
mask = (
df[["O3 (ug/m3)", "PM10 (ug/m3)", "SO2 (ug/m3)", "latitude", "longitude"]]
.isna()
.any(axis=1)
)
df = df.loc[~mask, :]
# Impute remaining missing values (columns with less than 10% missing)
df[vdims] = KNNImputer(n_neighbors=5, weights="distance").fit_transform(df[vdims])
# Scale concentrations to a comparable value range (robust z-score)
df[vdims_norm] = RobustScaler().fit_transform(df[vdims])
Build UMAP
The UMAP model is constructed using a cosine distance. Computing the UMAP model and its embedding takes a bit of time at this dataset size, so we time each step and report the durations below.
[10]:
start_time = time.perf_counter()
projector = UMAP(
n_neighbors=50,
metric="cosine",
n_epochs=1000,
transform_mode="graph",
).fit(df[vdims_norm])
umap_time = time.perf_counter()
projector = lu.embed_graph(projector)
embed_time = time.perf_counter()
store_coordinates(df, projector)
paths = edge_paths(projector)
storage_time = time.perf_counter()
print(f"Build UMAP Model: {umap_time - start_time:.2f}s")
print(f"Embed UMAP Model: {embed_time - umap_time:.2f}s")
print(f"Store UMAP coordinates and edges: {storage_time - embed_time:.2f}s")
Build UMAP Model: 17.74s
Embed UMAP Model: 111.03s
Store UMAP coordinates and edges: 41.94s
The embedding is shown below, once by drawing the edges and three times by drawing the points.
The first points-plot uses datashader to summarize feature values across the embedding, in a way inspired by Silva et al. and Thijssen et al. The resulting figure emphasizes which feature has the highest value across the manifold by blending hues for each feature: the purer a colour, the more dominant that feature is. This technique works well because the data was normalized and a cosine distance was used, making the manifold reflect the features’ relative values.
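A minimal sketch of the idea behind this categorical shading (a hypothetical implementation; the hidden shade_categories builds on HoloViews and also accepts a custom aggregator such as ds.mean('value'), which this sketch omits):
import datashader as ds
import datashader.transfer_functions as tf

def shade_categories_sketch(data, column, color_key, x="x", y="y",
                            width=380, height=380):
    # Count observations per category in every pixel, then blend the
    # category colours: pixels dominated by a single category keep its
    # pure hue, while mixtures blend towards intermediate colours.
    canvas = ds.Canvas(plot_width=width, plot_height=height)
    agg = canvas.points(data, x, y, agg=ds.count_cat(column))
    return tf.shade(agg, color_key=color_key)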
The second points-plot uses the same categorical shading technique to illustrate the observations’ season distribution along the manifold. All seasons appear to occur throughout the manifold, but at some points, particular seasons appear more often, resulting in a purer colour.
The last points-plot colours the manifold by year, indicating how time relates to the manifold. There appear to be three distinct ‘recent’ states and one ‘older’ state.
[9]:
mdf = melt_features(df, vdims_norm[1:]).query("value > -0.5").rename(columns={'feature': 'Highest feature'})
(
shade_edges(paths) +
shade_categories(mdf, 'Highest feature', cat_agg=ds.mean('value'),
color_key=feature_key, width=380) +
shade_categories(df, 'season', color_key=season_key) +
shade_points(df, ['year'], width=380)
).cols(2)
[9]:
A small-multiples view showing each feature indicates that the recent states are separated by their SO2 concentration. Note that the colormap is adjusted to each feature’s value distribution to better visualise its patterns. As a result, a ‘linear’ change in the colormap does not correspond to a ‘linear’ change in the data! Generally, we see the reduction in emissions over time. The seasonal effect in ozone (O3) is also visible, as there are both high and low value observations across all years. The particulate matter (PM10) behaves slightly differently from the nitrogen chemicals; its pattern more closely follows the years.
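In spirit, the eq_hist normalisation used below maps values through their empirical distribution so that each colour bin covers roughly the same number of observations (a simplified illustration, not datashader’s exact implementation):
import numpy as np

values = df["PM10 (ug/m3)"].to_numpy()
ranks = np.argsort(np.argsort(values))  # empirical CDF via ranks
normalised = ranks / (len(values) - 1)  # uniform in [0, 1], fed to the colormap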
[10]:
shade_points(df, vdims, cnorm="eq_hist", colorbar=False, width=200, height=200).cols(3)
[10]:
Year lens
Alcaide & Aerts (2020) found some interesting changes over time in this data. We did not find many of those yet. To better investigate patterns over time, we apply a lens on the year dimensions, allowing only the edges between adjacent years to remain.
[13]:
start_time = time.perf_counter()
lensed_year = lu.apply_lens(
projector,
values=df.year,
resolution=(df.year.max() - df.year.min()) + 1,
repulsion_strength=0.05,
skip_embedding=True,
)
lens_time = time.perf_counter()
# make sure to use previous embedding as initial state
lensed_year.embedding_ = projector.embedding_
lensed_year = lu.embed_graph(lensed_year)
embed_time = time.perf_counter()
store_coordinates(df, lensed_year, suffix="year")
paths_year = edge_paths(lensed_year)
storage_time = time.perf_counter()
print(f"Apply lens: {(lens_time - start_time) * 1000:.2f} microseconds")
print(f"Embed model: {embed_time - lens_time:.2f}s")
print(f"Store coordinates and edges: {storage_time - embed_time:.2f}")
Apply lens: 28.65 microseconds
Embed model: 55.63s
Store coordinates and edges: 11.41
In the resulting embedding, time appears to correlate with the horizontal axis, with earlier years towards the right and more recent years on the left. This makes it easier to interpret changes in the manifold over time. One way to interpret the year-lens is as revealing which distinct states exist in each year and how those states progress across years. This idea is similar to the Mapper algorithm and Reeb graphs more generally. Globally, the embedding’s shape can be described as a leftward-falling droplet with two arms attached to the upper-left side of a \(\Theta\)-like structure.
There are several states that appear to be disconnected over time:
High SO2 observations towards the bottom-right side are not well connected across 2008-2009.
Low NO2 observations after 2008 are not well connected to previous years. Only the high NO2 observations between 2009-2014 form a bridge to older observations.
High NO2 observations after 2014 are not well connected to previous years.
All three observations indicate either a (sufficiently large) change in state or a lack of similar observations in one of the years.
[11]:
mdf = melt_features(df, vdims_norm[1:])\
.query("value > -0.5")\
.rename(columns={'feature': 'Highest feature'})
(
shade_edges(paths_year, suffix='year') +
shade_categories(mdf, 'Highest feature', suffix='year',
cat_agg=ds.mean('value'), color_key=feature_key) +
shade_categories(df, 'season', suffix='year', color_key=season_key) +
shade_categories(df, 'periods', suffix='year', color_key=period_key)
).cols(2)
[11]:
The manifold also appears to separate time periods that have a distinct structure, as shown in the fourth plot above. We can use this structure to interpret how features changed over time; for instance, to see that PM10 started to decrease after 2002.
[12]:
shade_points(
df, vdims, suffix="year", cnorm="eq_hist", colorbar=False, width=200, height=200
).cols(3)
[12]:
Sulfur lens
In both previous embeddings, SO2 appears to explain distinct structures. Here, we apply a local mask over the SO2 values to see if there are other structures within this dimension.
[16]:
start_time = time.perf_counter()
lensed_sulfer = lu.apply_local_mask(
projector,
values=df["SO2 (ug/m3)"],
n_neighbors=20,
repulsion_strength=0.05,
skip_embedding=True,
)
lens_time = time.perf_counter()
# make sure to use previous embedding as initial state
lensed_sulfer.embedding_ = projector.embedding_
lensed_sulfer = lu.embed_graph(lensed_sulfer)
embed_time = time.perf_counter()
store_coordinates(df, lensed_sulfer, suffix="sulfer")
paths_sulfer = edge_paths(lensed_sulfer)
storage_time = time.perf_counter()
print(f"Apply lens: {(lens_time - start_time) * 1000:.2f} microseconds")
print(f"Embed model: {embed_time - lens_time:.2f}s")
print(f"Store coordinates and edges: {storage_time - embed_time:.2f}")
Apply lens: 1028.96 microseconds
Embed model: 48.70s
Store coordinates and edges: 18.74
The resulting embedding appears to consist of (at least) six slices that differ in SO2 value and contain observations throughout the year (and across multiple years). This structure hints at a discrete nature within the SO2 dimension. Looking at the raw values, it turns out that SO2 is measured in whole (ug/m3) increments, resulting in whole values 1.0, 2.0, …. The dimension’s distribution over these values explains the slices: in the distribution’s long tail, individual slices do not occur because each value has too few observations to remain connected.
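This discretisation can be verified directly from the raw values (a quick check, not part of the original notebook; note that KNN-imputed entries may be fractional):
so2 = df["SO2 (ug/m3)"]
print(so2.value_counts().head(8))  # the dominant values are whole numbers
print(f"whole-valued fraction: {(so2 % 1 == 0).mean():.1%}")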
[13]:
mdf = melt_features(df, vdims_norm[1:]).query("value > -0.5").rename(columns={'feature': 'Highest feature'})
(
shade_edges(paths_sulfer, suffix='sulfer') +
shade_categories(mdf, 'Highest feature', suffix='sulfer', width=380,
cat_agg=ds.mean('value'), color_key=feature_key) +
shade_categories(df, 'season', suffix='sulfer', color_key=season_key) +
shade_points(df, ['year'], suffix='sulfer', width=380)
).cols(2)
[13]:
[14]:
shade_points(df, vdims, suffix="sulfer", cnorm="eq_hist", colorbar=False, width=200, height=200).cols(3)
[14]:
Is time roughly continuous along the manifold?
An important question for the interpretation of the manifold is whether time varies continuously along it. In other words, are successive measurements at the same location close together within the embedding? On the one hand, we expect changes to be gradual over time. On the other hand, local air quality may change quickly through variations in local emissions, which can make successive days look quite different.
Here, we draw how locations move over the manifold in time. To do this effectively, we animate how one location moves over time, using short tails to indicate previous positions. Due to the number of days within the data, rendering the animation is quite slow. Therefore, we render the frames in parallel batches using the Animate_Air_Quality_[idx].png notebooks, which each still take more than 2 hours to complete. The resulting videos are shown below. Generally, a location’s observations jump around the manifold quite quickly. Some regions appear to occur more often in particular seasons, but loops in the manifold generally do not match movements over time.
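One way to quantify this impression is to compare consecutive-day embedding distances at a single station with distances between random observation pairs (a rough check, not part of the original notebook; it assumes store_coordinates wrote the default embedding to columns x and y):
import numpy as np

# Pick the station with the most observations and order its days.
busiest = df.groupby(["latitude", "longitude"]).size().idxmax()
station = df[(df.latitude == busiest[0]) & (df.longitude == busiest[1])]
station = station.sort_values("date")  # ISO date strings sort chronologically

# Median step size between consecutive days vs. random pairs.
steps = np.hypot(np.diff(station.x), np.diff(station.y))
rng = np.random.default_rng(0)
i, j = rng.integers(0, len(df), (2, len(steps)))
random_pairs = np.hypot(df.x.values[i] - df.x.values[j],
                        df.y.values[i] - df.y.values[j])
print(np.median(steps) / np.median(random_pairs))  # near 1 -> jumpy, near 0 -> smooth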
[19]:
# Define days since start for easy comparison
timestamps = df.date.apply(pd.Timestamp)  # renamed to avoid shadowing the datetime module
base_date = timestamps.min()
df["days_since_start"] = (timestamps - base_date).dt.days.astype("int")
# Export the data.
df.to_parquet('./data/generated/air_quality.parquet')
paths.to_parquet('./data/generated/air_quality_paths.parquet')
paths_year.to_parquet('./data/generated/air_quality_paths_year.parquet')
paths_sulfer.to_parquet('./data/generated/air_quality_paths_sulfer.parquet')
Default UMAP
Year Lens
Sulfur Lens