Tutorial 2¶

Tutorial Example - Lipid class Prediction (continued)¶

This notebook showcases the use of GLC for lipid class prediction for untargeted lipidomic features for a reverse-phase lipidomics LC-MS assay. Building on the quickstart tutorial we extend the workflow by considering additional ion forms in the database matching.

As before, the dataset used in this notebook originates from the AddNeuroMed cohort. The assays were processed by the National Phenome Centre, and the GLC package includes the corresponding untargeted feature table as an example dataset for the LPOS assay with 4,886 features (reversed-phase chromatography in positive ionization mode). The data was prepcoessed using XCMS and nPYc-Toolbox.

In [1]:

Copied!





# import packages

import glc

import seaborn as sns
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
# import packages

import glc

import seaborn as sns
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np

Feature table requirements¶

For use with GLC, the feature table must contain at minimum:

a unique feature identifier called (peak_id)
mass-to-charge ratio or alternatvely neutral_mass for a grouped feature table (m/z)
retention time in seconds (rt)
intensity information for each sample

An example of the expected structure is shown below:

peak_id	m/z	rt	sample_1_intensity	sample_2_intensity
1	760.585	312.4	1.23e6	8.3e8
2	782.569	298.7	8.45e5	1.14e2
3	734.569	341.2	2.10e6	9.9e5

In [2]:

Copied!

# load the example feature table

ds = glc.LoadExampleData()
ds.feat_table.head()
# load the example feature table

ds = glc.LoadExampleData()
ds.feat_table.head()

Out[2]:

	peak_id	mz	mzmin	mzmax	rt	rtmin	rtmax	ALZ_LPOS_ToF05_S10W01	ALZ_LPOS_ToF05_S10W02	ALZ_LPOS_ToF05_S10W03	...	ALZ_LPOS_ToF05_S18W85	ALZ_LPOS_ToF05_S18W86	ALZ_LPOS_ToF05_S18W87	ALZ_LPOS_ToF05_S18W88	ALZ_LPOS_ToF05_S18W89	ALZ_LPOS_ToF05_S18W90	ALZ_LPOS_ToF05_S18W91	ALZ_LPOS_ToF05_S18W92	ALZ_LPOS_ToF05_S18W95_LTR	ALZ_LPOS_ToF05_S18W96_SR
0	1	149.020788	149.019306	149.023645	216.244505	214.252996	219.310999	1090.250052	4490.669602	29455.387448	...	58331.194080	18272.786685	36686.550679	40101.094219	13895.999509	18124.479888	66918.355458	33292.263065	6.997904e+06	28084.760086
1	2	184.070977	184.069395	184.074130	361.097002	356.983996	365.917997	111695.867233	125453.901289	178557.859139	...	144635.427267	167737.194017	197555.712231	179264.621233	197252.251353	199957.914864	161728.923879	167901.055349	2.503246e+05	187805.589293
2	3	184.070739	184.069179	184.074066	473.921499	470.621996	479.506988	487520.978430	507689.908146	549142.660961	...	552094.714656	572367.493929	585021.811458	607610.827728	548663.568528	608063.451880	587394.901075	551502.965857	5.248675e+05	563743.690681
3	4	184.070894	184.069490	184.074038	314.846992	310.094003	321.786003	241696.633518	243125.770031	338378.917343	...	335716.656628	355086.755620	360378.519575	383559.519936	321654.963278	381142.385594	309565.494361	376864.941987	2.855071e+05	325279.628676
4	5	184.070681	184.069138	184.074102	445.146489	440.267000	448.554010	446536.156909	475157.552228	424077.154410	...	327425.355258	454101.046639	392653.722571	424944.098846	360710.931921	618157.565794	341301.898479	526253.164215	5.093509e+05	525105.873510

5 rows × 763 columns

In [3]:

Copied!





# create a Python object for fast look up of feature rt and m/z
feat_dicts = glc.FeatDicts(ds.feat_table)

# load the estimated GGM for the example data
ggm = glc.GGM(ds.ggm)

# load the LMSD datbase
lm_df = glc.load_lm_database()
# create a Python object for fast look up of feature rt and m/z
feat_dicts = glc.FeatDicts(ds.feat_table)

# load the estimated GGM for the example data
ggm = glc.GGM(ds.ggm)

# load the LMSD datbase
lm_df = glc.load_lm_database()

# features in GGM main subgraph: 4849 out of 4858

In [4]:

Copied!

# UMAP embedding of the GGM adjacency matrix for visualisation

umap_embedding_obj = glc.UMAPEmbedder(G=ggm.G)
# UMAP embedding of the GGM adjacency matrix for visualisation

umap_embedding_obj = glc.UMAPEmbedder(G=ggm.G)

OMP: Info #276: omp_set_nested routine deprecated, please use omp_set_max_active_levels instead.

In [5]:

Copied!





# Plot feature m/z and rt on the UMAP embedding
# each point represents a feature in the dataset

umap_potter = glc.UMAPPlots(umap_embedding_obj)

sns.set_style('darkgrid')
fig, axs = plt.subplots(2, 1, figsize=(6, 10))
axs = axs.flatten()

umap_potter.plot_feature_attribute(feat_dicts=feat_dicts, attribute='rt', ax=axs[0])
umap_potter.plot_feature_attribute(feat_dicts=feat_dicts, attribute='mz', ax=axs[1], vmax=1200)
# Plot feature m/z and rt on the UMAP embedding
# each point represents a feature in the dataset

umap_potter = glc.UMAPPlots(umap_embedding_obj)

sns.set_style('darkgrid')
fig, axs = plt.subplots(2, 1, figsize=(6, 10))
axs = axs.flatten()

umap_potter.plot_feature_attribute(feat_dicts=feat_dicts, attribute='rt', ax=axs[0])
umap_potter.plot_feature_attribute(feat_dicts=feat_dicts, attribute='mz', ax=axs[1], vmax=1200)

Out[5]:

<Axes: xlabel='Emb 1', ylabel='Emb 2'>

No description has been provided for this image

Although GLC does not require prior annotations or MS2 data to predict lipid classes, the example dataset includes 183 features with lipid class–level annotations obtained from targeted feature extraction. We use this ground-truth set to investigate how lipid class information is encoded within the GGM structure.

In [6]:

Copied!

# plot main class annotations in the GGM structure

ground_truth = ds.ground_truth

sns.set_style('whitegrid')
umap_potter.plot_ground_truth(annot_df=ground_truth, class_level='lm_mainclass')
# plot main class annotations in the GGM structure

ground_truth = ds.ground_truth

sns.set_style('whitegrid')
umap_potter.plot_ground_truth(annot_df=ground_truth, class_level='lm_mainclass')

Out[6]:

<Axes: xlabel='Emb 1', ylabel='Emb 2'>

In [7]:

Copied!

umap_potter.plot_ground_truth(annot_df=ground_truth, class_level='lm_subclass')
umap_potter.plot_ground_truth(annot_df=ground_truth, class_level='lm_subclass')

Out[7]:

<Axes: xlabel='Emb 1', ylabel='Emb 2'>

Step 1: Database matching¶

In this tutorial, we extend the basic database matching strategy by considering ion forms beyond [M+H]+.

While including a broader panel of ion forms can increase the number of potential database matches, it also raises the number of incrorect matches. To manage this, we adopt an approach inspired by recent work that focuses on identifying ion forms that are prevalent in the dataset.

The key idea is to examine recurrent mass differences between co-eluting features. These mass differences are often derived from the same compound but differ due to adduct formation or in-source fragmentation.

A simple approach would be to compute all pairwise mass differences between co-eluting features and inspect histograms to find frequently occurring values. For instance, there may be a strong peak corresponding to the mass difference between [M+H]+ and [M+Na]+ ions.

Instead of manual inspection, we formalise this process using kernel density estimation (KDE) combined with a peak-picking approach to systematically detect recurrent mass differences.

For a mass difference between pair of features to be used in this analysis:

elute within one second of each other
show a significant partial correlation

Nash et al. applied a similar strategy across 142 LC–MS datasets and curated a reference table of common electrospray-derived mass differences. We use this table to help interpret the detected mass-difference peaks and guide which ion forms are included in database matching.

This workflow is not perfect and should always be interpreted carefully, considering both mass spectrometry principles and the specifics of the experimental setup.

In [8]:

Copied!

# pick the m/z peak in the GGM structure
mz_peak_picker = glc.PickPeaks(ggm=ggm, feat_dicts=feat_dicts)
# pick the m/z peak in the GGM structure
mz_peak_picker = glc.PickPeaks(ggm=ggm, feat_dicts=feat_dicts)

Interpreting the observed m/z differences¶

We combine the observed m/z differences with the curated list reported by Nash et al. across 142 studies (shown below).

Some mass differences can be ambiguous. For example, the difference between carbon isotopes (12C and 13C; M+1) and that between lithium isotopes (6Li and 7Li) could both correspond to the same detected m/z peak (mz_peak_id = 11).

Careful interpretation requires consideration of the experimental context. In this experiment, lithium was not added to the mobile phase, and the 13C isotopic difference is expected to be much more common. Additionally, some mass differences-such as those corresponding to H₂ or C₂H₂, can arise from either in-source fragmentation or biological transformations.

Because we only have partial evidence for which ion forms may be present for a given feature, we always include the adducts: [M+H]+ in ESI(+) and [M–H]- in ESI(-) in addition.

From the table below, several ions look interesting so we will plot their mz difference peaks.

In [9]:

Copied!

glc.MzPeakLookup().identify_mz_diffs(mz_peak_picker.get_peak_df())
glc.MzPeakLookup().identify_mz_diffs(mz_peak_picker.get_peak_df())

Out[9]:

	mz_peak_id	Annotation	Annotation class	m/z difference (theoretical)	mz_centre	mzmin	mzmax	width	paper_rank	n_edges
0	11.0	Carbon (12C and 13C; M+1)	Isotope	1.0034	1.003999	0.997870	1.010342	0.012472	1	1782.0
1	11.0	Lithium (6Li and 7Li)	Isotope	1.0009 OR 1.0010	1.003999	0.997870	1.010342	0.012472	15	1782.0
2	11.0	Not annotated	Not annotated	Not annotated	1.003999	0.997870	1.010342	0.012472	33	1782.0
3	11.0	H or H loss to form radical ion	Biological transformation and other	1.0078	1.003999	0.997870	1.010342	0.012472	69	1782.0
4	11.0	Not annotated	Not annotated	Not annotated	1.003999	0.997870	1.010342	0.012472	24	1782.0
...	...	...	...	...	...	...	...	...	...	...
60	1.0	13C - 12C difference	Multiple charge state	0.2509	0.249875	0.245685	0.254238	0.008552	45	32.0
61	76.0	C2H4	In-source fragment and biological transformation	28.0313	28.028064	28.021644	28.037301	0.015657	19	28.0
62	76.0	Not annotated	Not annotated	Not annotated	28.028064	28.021644	28.037301	0.015657	260	28.0
63	80.0	C3H2	Other	38.0157	38.015998	38.012121	38.021353	0.009232	192	23.0
64	44.0	[M+H]+ - [M+NH4]+ difference (or NH3 loss)	Charged adduct	17.0265	17.028666	17.026492	17.030855	0.004364	35	23.0

65 rows × 10 columns

In [10]:

Copied!





fig, axs = plt.subplots(2, 4, figsize=(20, 10))
axs = axs.flatten()

pad = 0.05

mz_delta = 1.0034
mz_peak_picker.plot_detailed(xmin=mz_delta - pad, xmax=mz_delta + pad, ymax=6, ax=axs[0])
axs[0].set_title('C13 - C12')

mz_delta = 2.0157
mz_peak_picker.plot_detailed(xmin=mz_delta - pad, xmax=mz_delta + pad, ymax=1, ax=axs[1])
axs[1].set_title('H2')

mz_delta = 4.9554
mz_peak_picker.plot_detailed(xmin=mz_delta - pad, xmax=mz_delta + pad, ymax=1, ax=axs[2])
axs[2].set_title('[M+NH4]+ - [M+39K]+')

mz_delta = 15.9739
mz_peak_picker.plot_detailed(xmin=mz_delta - pad, xmax=mz_delta + pad, ymax=1, ax=axs[3])
axs[3].set_title('[M+Na]+ - [M+39K]+')

mz_delta = 26.0157
mz_peak_picker.plot_detailed(xmin=mz_delta - pad, xmax=mz_delta + pad, ymax=1, ax=axs[4])
axs[4].set_title('C2H2')

mz_delta = 21.9819
mz_peak_picker.plot_detailed(xmin=mz_delta - pad, xmax=mz_delta + pad, ymax=1, ax=axs[5])
axs[5].set_title('[M+H]+ - [M+Na]+')

mz_delta = 18.0106
mz_peak_picker.plot_detailed(xmin=mz_delta - pad, xmax=mz_delta + pad, ymax=1, ax=axs[6])
axs[6].set_title('H2O')

axs[7].axis('off')
fig, axs = plt.subplots(2, 4, figsize=(20, 10))
axs = axs.flatten()

pad = 0.05

mz_delta = 1.0034
mz_peak_picker.plot_detailed(xmin=mz_delta - pad, xmax=mz_delta + pad, ymax=6, ax=axs[0])
axs[0].set_title('C13 - C12')

mz_delta = 2.0157
mz_peak_picker.plot_detailed(xmin=mz_delta - pad, xmax=mz_delta + pad, ymax=1, ax=axs[1])
axs[1].set_title('H2')

mz_delta = 4.9554
mz_peak_picker.plot_detailed(xmin=mz_delta - pad, xmax=mz_delta + pad, ymax=1, ax=axs[2])
axs[2].set_title('[M+NH4]+ - [M+39K]+')

mz_delta = 15.9739
mz_peak_picker.plot_detailed(xmin=mz_delta - pad, xmax=mz_delta + pad, ymax=1, ax=axs[3])
axs[3].set_title('[M+Na]+ - [M+39K]+')

mz_delta = 26.0157
mz_peak_picker.plot_detailed(xmin=mz_delta - pad, xmax=mz_delta + pad, ymax=1, ax=axs[4])
axs[4].set_title('C2H2')

mz_delta = 21.9819
mz_peak_picker.plot_detailed(xmin=mz_delta - pad, xmax=mz_delta + pad, ymax=1, ax=axs[5])
axs[5].set_title('[M+H]+ - [M+Na]+')

mz_delta = 18.0106
mz_peak_picker.plot_detailed(xmin=mz_delta - pad, xmax=mz_delta + pad, ymax=1, ax=axs[6])
axs[6].set_title('H2O')

axs[7].axis('off')

Out[10]:

(np.float64(0.0), np.float64(1.0), np.float64(0.0), np.float64(1.0))

We now extend the database matching step to include additional ion forms, such as sodium (Na), potassium (K), and ammonium (NH₄) adducts, as well as in-source fragment losses of H₂O, C₂H₂, and H₂.

These ion forms correspond to mass-difference peaks identified among partially correlated, co-eluting features, and all have been previously reported by Nash et al. as commonly observed in LC–MS data.

We begin by tabulating ion form information behind the relevant mass differences. Next, we match each feature to the database, considering both [M+H]+ and any ion forms from our panel - but importantly only if the feature is associated with the corresponding m/z difference peak. After this, we account for 13C-12C isotopic differences. Finally, we filter out feature–structure matches with extreme LogP values by modeling retention time versus LogP using monotonic regression.

In [11]:

Copied!





esi_data = {
    'formula': [
        'H',
        'Na',
        'K',
        'NH4',
        'H2O',
        'C2H2',
        'H2'

    ], 
    'type_code': [
        'charged',
        'charged',
        'charged',
        'charged',
        'isf',
        'isf',
        'isf'
    ]
}
esi_df = pd.DataFrame(esi_data)
esi_data = {
    'formula': [
        'H',
        'Na',
        'K',
        'NH4',
        'H2O',
        'C2H2',
        'H2'

    ], 
    'type_code': [
        'charged',
        'charged',
        'charged',
        'charged',
        'isf',
        'isf',
        'isf'
    ]
}
esi_df = pd.DataFrame(esi_data)

In [12]:

Copied!





feat2lmids = glc.map_ungrouped_feature_table(
    proton_only=False,
    db=lm_df, 
    feat_dicts=feat_dicts, 
    ion_mode='pos', 
    esi_df=esi_df, 
    mz_peak_picker_obj=mz_peak_picker, 
    c13_adjustment=True, 
    logp_filter_thresh=3
)
feat2lmids = glc.map_ungrouped_feature_table(
    proton_only=False,
    db=lm_df, 
    feat_dicts=feat_dicts, 
    ion_mode='pos', 
    esi_df=esi_df, 
    mz_peak_picker_obj=mz_peak_picker, 
    c13_adjustment=True, 
    logp_filter_thresh=3
)

/Users/tomrix/Library/CloudStorage/OneDrive-ImperialCollegeLondon/projects/glc/src/glc/ion_processor.py:174: FutureWarning: The behavior of DataFrame concatenation with empty or all-NA entries is deprecated. In a future version, this will no longer exclude empty or all-NA columns when determining the result dtypes. To retain the old behavior, exclude the relevant entries before the concat operation.
  return pd.concat(dfs, axis=0).reset_index(drop=True)
100%|██████████| 4886/4886 [00:00<00:00, 284235.36it/s]
100%|██████████| 47921/47921 [00:06<00:00, 7456.75it/s]
Adjusting for C13 isotope components: 100%|██████████| 1439/1439 [00:00<00:00, 4384.48it/s]

Step 2: GLC Lipid Class Prediction¶

Next, we create the GLC model, which combines multiple feature-to-structure matches with the GGM network to predict lipid classes.

The glc.GLCModel is used to predict lipid subclasses, and we use the LMSD ontology to derive each feature’s predicted lipid main class.

As output, the model produces a dictionary mapping each feature’s peak_id to a ranked list of the top 10 predicted lipid classes.

In [13]:

Copied!





model = glc.GLCModel(
    ggm=ggm, 
    feat_dicts=feat_dicts, 
    node_ids=feat2lmids, 
    db_df=lm_df
)
subclass_predictions = model.predict_all()
mainclass_predictions = model.convert2mainclass(subclass_predictions)
model = glc.GLCModel(
    ggm=ggm, 
    feat_dicts=feat_dicts, 
    node_ids=feat2lmids, 
    db_df=lm_df
)
subclass_predictions = model.predict_all()
mainclass_predictions = model.convert2mainclass(subclass_predictions)

Predicting feature classes: 100%|██████████| 4849/4849 [00:42<00:00, 113.31it/s]

In [14]:

Copied!





# plot predictions for feature 173 which is known to be a fatty acyl carnitine from the ground-truth

subclasses, scores = zip(*subclass_predictions[173])
# Figure size
plt.figure(figsize=(10, 6))

# Horizontal barplot
plt.barh(subclasses, scores)

# Labeling
plt.xlabel("GLC Score")
plt.title(f"GLC Predictions for Feature 173: m/z={feat_dicts.mz[173]:.4f}, rt={feat_dicts.rt[173]:.1f}s")
plt.gca().invert_yaxis()  # Highest value at the top

plt.tight_layout()
plt.show()
# plot predictions for feature 173 which is known to be a fatty acyl carnitine from the ground-truth

subclasses, scores = zip(*subclass_predictions[173])
# Figure size
plt.figure(figsize=(10, 6))

# Horizontal barplot
plt.barh(subclasses, scores)

# Labeling
plt.xlabel("GLC Score")
plt.title(f"GLC Predictions for Feature 173: m/z={feat_dicts.mz[173]:.4f}, rt={feat_dicts.rt[173]:.1f}s")
plt.gca().invert_yaxis()  # Highest value at the top

plt.tight_layout()
plt.show()

Now we will do multi-class classification against the ground-truth

In [15]:

Copied!





# Turn off future warnings
import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)

eval_df = glc.test_annotations(predictions=subclass_predictions, annot_df=ground_truth)
evaluator = glc.PlotConfusionMatrices(eval_df)
metrics = evaluator.peformance_metrics(return_df=True)
# view subclass classification performance
metrics.tail(3)
# Turn off future warnings
import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)

eval_df = glc.test_annotations(predictions=subclass_predictions, annot_df=ground_truth)
evaluator = glc.PlotConfusionMatrices(eval_df)
metrics = evaluator.peformance_metrics(return_df=True)
# view subclass classification performance
metrics.tail(3)

Out[15]:

	precision	recall	f1-score	support
micro avg	0.906977	0.852459	0.878873	183.0
macro avg	0.738854	0.700288	0.705480	183.0
weighted avg	0.885545	0.852459	0.860210	183.0

In [16]:

Copied!

metrics = evaluator.peformance_metrics(return_df=True, level='main')
# view subclass classification performance
metrics.tail(3)
metrics = evaluator.peformance_metrics(return_df=True, level='main')
# view subclass classification performance
metrics.tail(3)

Out[16]:

	precision	recall	f1-score	support
micro avg	0.911602	0.901639	0.906593	183.0
macro avg	0.895772	0.834462	0.837763	183.0
weighted avg	0.929794	0.901639	0.906699	183.0

In [17]:

Copied!

# plot confusion matrix
evaluator.plot_annotation_confusion_matrix()
# plot confusion matrix
evaluator.plot_annotation_confusion_matrix()

Out[17]:

(<Figure size 1000x800 with 2 Axes>,
 <Axes: xlabel='Predicted sub class', ylabel='True sub class'>)

Compared to the quickstart tutorial, this workflow leads to improved prediction performance. Subclass-level accuracy (reported here as micro-average recall) increases by approximately 10%, while main-class accuracy improves by around 7% compared to just considering [M+H]+.

Finally, we compute quality scores and collate all results into a single DataFrame, which can be saved using:

results.to_csv('add_your_save_path/glc_results.csv')

In [18]:

Copied!

qs_df = glc.QualityScorer(subclass_predictions, embedder_obj=umap_embedding_obj).df
qs_df.sample(5)
qs_df = glc.QualityScorer(subclass_predictions, embedder_obj=umap_embedding_obj).df
qs_df.sample(5)

Out[18]:

	feature	subclass	lsi_score	pcor_score	product_score
3080	3104	1-(1Z-alkenyl),2-acylglycerophosphocholines [G...	0.722222	0.149005	0.274338
4366	4404	Triacylglycerols [GL0301]	1.000000	0.989432	0.992003
3596	3627	1-alkyl,2-acylglycerophosphocholines [GP0102]	0.277778	0.498259	0.335611
2316	2334	Diacylglycerophosphocholines [GP0101]	0.722222	0.437505	0.567439
1244	1255	N-acylsphingosines (ceramides) [SP0201]	0.500000	0.012314	0.017413

In [19]:

Copied!





results = glc.build_prediction_dataframe(
    subclass_predictions=subclass_predictions,
    mainclass_predictions=mainclass_predictions,
    quality_score_df=qs_df,
    feat_dicts=feat_dicts
)

results.sample(5)
results = glc.build_prediction_dataframe(
    subclass_predictions=subclass_predictions,
    mainclass_predictions=mainclass_predictions,
    quality_score_df=qs_df,
    feat_dicts=feat_dicts
)

results.sample(5)

Out[19]:

	peak_id	mz	rt	subclass	mainclass	lsi_score	pcor_score	product_score
4413	4451	939.780671	677.527026	Triacylglycerols [GL0301]	Triradylglycerols [GL03]	0.722222	0.870890	0.768384
2228	2245	753.420270	175.654006	Oxidized glycerophosphoinositols [GP2005]	Oxidized glycerophospholipids [GP20]	0.722222	0.478336	0.585182
2393	2411	768.552819	448.804507	Diacylglycerophosphocholines [GP0101]	Glycerophosphocholines [GP01]	1.000000	0.184325	0.419957
4235	4272	914.818451	649.724004	Triacylglycerols [GL0301]	Triradylglycerols [GL03]	1.000000	0.762333	0.852049
680	684	524.902805	148.708005	Monoacylglycerophosphocholines [GP0105]	Glycerophosphocholines [GP01]	0.333333	0.444896	0.358335