Tutorial 2¶
Tutorial Example - Lipid class Prediction (continued)¶
This notebook showcases the use of GLC for lipid class prediction for untargeted lipidomic features for a reverse-phase lipidomics LC-MS assay. Building on the quickstart tutorial we extend the workflow by considering additional ion forms in the database matching.
As before, the dataset used in this notebook originates from the AddNeuroMed cohort. The assays were processed by the National Phenome Centre, and the GLC package includes the corresponding untargeted feature table as an example dataset for the LPOS assay with 4,886 features (reversed-phase chromatography in positive ionization mode). The data was prepcoessed using XCMS and nPYc-Toolbox.
# import packages
import glc
import seaborn as sns
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
Feature table requirements¶
For use with GLC, the feature table must contain at minimum:
- a unique feature identifier called (
peak_id) - mass-to-charge ratio or alternatvely neutral_mass for a grouped feature table (
m/z) - retention time in seconds (
rt) - intensity information for each sample
An example of the expected structure is shown below:
| peak_id | m/z | rt | sample_1_intensity | sample_2_intensity |
|---|---|---|---|---|
| 1 | 760.585 | 312.4 | 1.23e6 | 8.3e8 |
| 2 | 782.569 | 298.7 | 8.45e5 | 1.14e2 |
| 3 | 734.569 | 341.2 | 2.10e6 | 9.9e5 |
# load the example feature table
ds = glc.LoadExampleData()
ds.feat_table.head()
| peak_id | mz | mzmin | mzmax | rt | rtmin | rtmax | ALZ_LPOS_ToF05_S10W01 | ALZ_LPOS_ToF05_S10W02 | ALZ_LPOS_ToF05_S10W03 | ... | ALZ_LPOS_ToF05_S18W85 | ALZ_LPOS_ToF05_S18W86 | ALZ_LPOS_ToF05_S18W87 | ALZ_LPOS_ToF05_S18W88 | ALZ_LPOS_ToF05_S18W89 | ALZ_LPOS_ToF05_S18W90 | ALZ_LPOS_ToF05_S18W91 | ALZ_LPOS_ToF05_S18W92 | ALZ_LPOS_ToF05_S18W95_LTR | ALZ_LPOS_ToF05_S18W96_SR | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 149.020788 | 149.019306 | 149.023645 | 216.244505 | 214.252996 | 219.310999 | 1090.250052 | 4490.669602 | 29455.387448 | ... | 58331.194080 | 18272.786685 | 36686.550679 | 40101.094219 | 13895.999509 | 18124.479888 | 66918.355458 | 33292.263065 | 6.997904e+06 | 28084.760086 |
| 1 | 2 | 184.070977 | 184.069395 | 184.074130 | 361.097002 | 356.983996 | 365.917997 | 111695.867233 | 125453.901289 | 178557.859139 | ... | 144635.427267 | 167737.194017 | 197555.712231 | 179264.621233 | 197252.251353 | 199957.914864 | 161728.923879 | 167901.055349 | 2.503246e+05 | 187805.589293 |
| 2 | 3 | 184.070739 | 184.069179 | 184.074066 | 473.921499 | 470.621996 | 479.506988 | 487520.978430 | 507689.908146 | 549142.660961 | ... | 552094.714656 | 572367.493929 | 585021.811458 | 607610.827728 | 548663.568528 | 608063.451880 | 587394.901075 | 551502.965857 | 5.248675e+05 | 563743.690681 |
| 3 | 4 | 184.070894 | 184.069490 | 184.074038 | 314.846992 | 310.094003 | 321.786003 | 241696.633518 | 243125.770031 | 338378.917343 | ... | 335716.656628 | 355086.755620 | 360378.519575 | 383559.519936 | 321654.963278 | 381142.385594 | 309565.494361 | 376864.941987 | 2.855071e+05 | 325279.628676 |
| 4 | 5 | 184.070681 | 184.069138 | 184.074102 | 445.146489 | 440.267000 | 448.554010 | 446536.156909 | 475157.552228 | 424077.154410 | ... | 327425.355258 | 454101.046639 | 392653.722571 | 424944.098846 | 360710.931921 | 618157.565794 | 341301.898479 | 526253.164215 | 5.093509e+05 | 525105.873510 |
5 rows × 763 columns
# create a Python object for fast look up of feature rt and m/z
feat_dicts = glc.FeatDicts(ds.feat_table)
# load the estimated GGM for the example data
ggm = glc.GGM(ds.ggm)
# load the LMSD datbase
lm_df = glc.load_lm_database()
# features in GGM main subgraph: 4849 out of 4858
# UMAP embedding of the GGM adjacency matrix for visualisation
umap_embedding_obj = glc.UMAPEmbedder(G=ggm.G)
OMP: Info #276: omp_set_nested routine deprecated, please use omp_set_max_active_levels instead.
# Plot feature m/z and rt on the UMAP embedding
# each point represents a feature in the dataset
umap_potter = glc.UMAPPlots(umap_embedding_obj)
sns.set_style('darkgrid')
fig, axs = plt.subplots(2, 1, figsize=(6, 10))
axs = axs.flatten()
umap_potter.plot_feature_attribute(feat_dicts=feat_dicts, attribute='rt', ax=axs[0])
umap_potter.plot_feature_attribute(feat_dicts=feat_dicts, attribute='mz', ax=axs[1], vmax=1200)
<Axes: xlabel='Emb 1', ylabel='Emb 2'>
Although GLC does not require prior annotations or MS2 data to predict lipid classes, the example dataset includes 183 features with lipid class–level annotations obtained from targeted feature extraction. We use this ground-truth set to investigate how lipid class information is encoded within the GGM structure.
# plot main class annotations in the GGM structure
ground_truth = ds.ground_truth
sns.set_style('whitegrid')
umap_potter.plot_ground_truth(annot_df=ground_truth, class_level='lm_mainclass')
<Axes: xlabel='Emb 1', ylabel='Emb 2'>
umap_potter.plot_ground_truth(annot_df=ground_truth, class_level='lm_subclass')
<Axes: xlabel='Emb 1', ylabel='Emb 2'>
Step 1: Database matching¶
In this tutorial, we extend the basic database matching strategy by considering ion forms beyond [M+H]+.
While including a broader panel of ion forms can increase the number of potential database matches, it also raises the number of incrorect matches. To manage this, we adopt an approach inspired by recent work that focuses on identifying ion forms that are prevalent in the dataset.
The key idea is to examine recurrent mass differences between co-eluting features. These mass differences are often derived from the same compound but differ due to adduct formation or in-source fragmentation.
A simple approach would be to compute all pairwise mass differences between co-eluting features and inspect histograms to find frequently occurring values. For instance, there may be a strong peak corresponding to the mass difference between [M+H]+ and [M+Na]+ ions.
Instead of manual inspection, we formalise this process using kernel density estimation (KDE) combined with a peak-picking approach to systematically detect recurrent mass differences.
For a mass difference between pair of features to be used in this analysis:
- elute within one second of each other
- show a significant partial correlation
Nash et al. applied a similar strategy across 142 LC–MS datasets and curated a reference table of common electrospray-derived mass differences. We use this table to help interpret the detected mass-difference peaks and guide which ion forms are included in database matching.
This workflow is not perfect and should always be interpreted carefully, considering both mass spectrometry principles and the specifics of the experimental setup.
# pick the m/z peak in the GGM structure
mz_peak_picker = glc.PickPeaks(ggm=ggm, feat_dicts=feat_dicts)
Interpreting the observed m/z differences¶
We combine the observed m/z differences with the curated list reported by Nash et al. across 142 studies (shown below).
Some mass differences can be ambiguous. For example, the difference between carbon isotopes (12C and 13C; M+1) and that between lithium isotopes (6Li and 7Li) could both correspond to the same detected m/z peak (mz_peak_id = 11).
Careful interpretation requires consideration of the experimental context. In this experiment, lithium was not added to the mobile phase, and the 13C isotopic difference is expected to be much more common. Additionally, some mass differences-such as those corresponding to H₂ or C₂H₂, can arise from either in-source fragmentation or biological transformations.
Because we only have partial evidence for which ion forms may be present for a given feature, we always include the adducts: [M+H]+ in ESI(+) and [M–H]- in ESI(-) in addition.
From the table below, several ions look interesting so we will plot their mz difference peaks.
glc.MzPeakLookup().identify_mz_diffs(mz_peak_picker.get_peak_df())
| mz_peak_id | Annotation | Annotation class | m/z difference (theoretical) | mz_centre | mzmin | mzmax | width | paper_rank | n_edges | |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 11.0 | Carbon (12C and 13C; M+1) | Isotope | 1.0034 | 1.003999 | 0.997870 | 1.010342 | 0.012472 | 1 | 1782.0 |
| 1 | 11.0 | Lithium (6Li and 7Li) | Isotope | 1.0009 OR 1.0010 | 1.003999 | 0.997870 | 1.010342 | 0.012472 | 15 | 1782.0 |
| 2 | 11.0 | Not annotated | Not annotated | Not annotated | 1.003999 | 0.997870 | 1.010342 | 0.012472 | 33 | 1782.0 |
| 3 | 11.0 | H or H loss to form radical ion | Biological transformation and other | 1.0078 | 1.003999 | 0.997870 | 1.010342 | 0.012472 | 69 | 1782.0 |
| 4 | 11.0 | Not annotated | Not annotated | Not annotated | 1.003999 | 0.997870 | 1.010342 | 0.012472 | 24 | 1782.0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 60 | 1.0 | 13C - 12C difference | Multiple charge state | 0.2509 | 0.249875 | 0.245685 | 0.254238 | 0.008552 | 45 | 32.0 |
| 61 | 76.0 | C2H4 | In-source fragment and biological transformation | 28.0313 | 28.028064 | 28.021644 | 28.037301 | 0.015657 | 19 | 28.0 |
| 62 | 76.0 | Not annotated | Not annotated | Not annotated | 28.028064 | 28.021644 | 28.037301 | 0.015657 | 260 | 28.0 |
| 63 | 80.0 | C3H2 | Other | 38.0157 | 38.015998 | 38.012121 | 38.021353 | 0.009232 | 192 | 23.0 |
| 64 | 44.0 | [M+H]+ - [M+NH4]+ difference (or NH3 loss) | Charged adduct | 17.0265 | 17.028666 | 17.026492 | 17.030855 | 0.004364 | 35 | 23.0 |
65 rows × 10 columns
fig, axs = plt.subplots(2, 4, figsize=(20, 10))
axs = axs.flatten()
pad = 0.05
mz_delta = 1.0034
mz_peak_picker.plot_detailed(xmin=mz_delta - pad, xmax=mz_delta + pad, ymax=6, ax=axs[0])
axs[0].set_title('C13 - C12')
mz_delta = 2.0157
mz_peak_picker.plot_detailed(xmin=mz_delta - pad, xmax=mz_delta + pad, ymax=1, ax=axs[1])
axs[1].set_title('H2')
mz_delta = 4.9554
mz_peak_picker.plot_detailed(xmin=mz_delta - pad, xmax=mz_delta + pad, ymax=1, ax=axs[2])
axs[2].set_title('[M+NH4]+ - [M+39K]+')
mz_delta = 15.9739
mz_peak_picker.plot_detailed(xmin=mz_delta - pad, xmax=mz_delta + pad, ymax=1, ax=axs[3])
axs[3].set_title('[M+Na]+ - [M+39K]+')
mz_delta = 26.0157
mz_peak_picker.plot_detailed(xmin=mz_delta - pad, xmax=mz_delta + pad, ymax=1, ax=axs[4])
axs[4].set_title('C2H2')
mz_delta = 21.9819
mz_peak_picker.plot_detailed(xmin=mz_delta - pad, xmax=mz_delta + pad, ymax=1, ax=axs[5])
axs[5].set_title('[M+H]+ - [M+Na]+')
mz_delta = 18.0106
mz_peak_picker.plot_detailed(xmin=mz_delta - pad, xmax=mz_delta + pad, ymax=1, ax=axs[6])
axs[6].set_title('H2O')
axs[7].axis('off')
(np.float64(0.0), np.float64(1.0), np.float64(0.0), np.float64(1.0))
We now extend the database matching step to include additional ion forms, such as sodium (Na), potassium (K), and ammonium (NH₄) adducts, as well as in-source fragment losses of H₂O, C₂H₂, and H₂.
These ion forms correspond to mass-difference peaks identified among partially correlated, co-eluting features, and all have been previously reported by Nash et al. as commonly observed in LC–MS data.
We begin by tabulating ion form information behind the relevant mass differences. Next, we match each feature to the database, considering both [M+H]+ and any ion forms from our panel - but importantly only if the feature is associated with the corresponding m/z difference peak. After this, we account for 13C-12C isotopic differences. Finally, we filter out feature–structure matches with extreme LogP values by modeling retention time versus LogP using monotonic regression.
esi_data = {
'formula': [
'H',
'Na',
'K',
'NH4',
'H2O',
'C2H2',
'H2'
],
'type_code': [
'charged',
'charged',
'charged',
'charged',
'isf',
'isf',
'isf'
]
}
esi_df = pd.DataFrame(esi_data)
feat2lmids = glc.map_ungrouped_feature_table(
proton_only=False,
db=lm_df,
feat_dicts=feat_dicts,
ion_mode='pos',
esi_df=esi_df,
mz_peak_picker_obj=mz_peak_picker,
c13_adjustment=True,
logp_filter_thresh=3
)
/Users/tomrix/Library/CloudStorage/OneDrive-ImperialCollegeLondon/projects/glc/src/glc/ion_processor.py:174: FutureWarning: The behavior of DataFrame concatenation with empty or all-NA entries is deprecated. In a future version, this will no longer exclude empty or all-NA columns when determining the result dtypes. To retain the old behavior, exclude the relevant entries before the concat operation. return pd.concat(dfs, axis=0).reset_index(drop=True) 100%|██████████| 4886/4886 [00:00<00:00, 284235.36it/s] 100%|██████████| 47921/47921 [00:06<00:00, 7456.75it/s] Adjusting for C13 isotope components: 100%|██████████| 1439/1439 [00:00<00:00, 4384.48it/s]
Step 2: GLC Lipid Class Prediction¶
Next, we create the GLC model, which combines multiple feature-to-structure matches with the GGM network to predict lipid classes.
The glc.GLCModel is used to predict lipid subclasses, and we use the LMSD ontology to derive each feature’s predicted lipid main class.
As output, the model produces a dictionary mapping each feature’s peak_id to a ranked list of the top 10 predicted lipid classes.
model = glc.GLCModel(
ggm=ggm,
feat_dicts=feat_dicts,
node_ids=feat2lmids,
db_df=lm_df
)
subclass_predictions = model.predict_all()
mainclass_predictions = model.convert2mainclass(subclass_predictions)
Predicting feature classes: 100%|██████████| 4849/4849 [00:42<00:00, 113.31it/s]
# plot predictions for feature 173 which is known to be a fatty acyl carnitine from the ground-truth
subclasses, scores = zip(*subclass_predictions[173])
# Figure size
plt.figure(figsize=(10, 6))
# Horizontal barplot
plt.barh(subclasses, scores)
# Labeling
plt.xlabel("GLC Score")
plt.title(f"GLC Predictions for Feature 173: m/z={feat_dicts.mz[173]:.4f}, rt={feat_dicts.rt[173]:.1f}s")
plt.gca().invert_yaxis() # Highest value at the top
plt.tight_layout()
plt.show()
Now we will do multi-class classification against the ground-truth
# Turn off future warnings
import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)
eval_df = glc.test_annotations(predictions=subclass_predictions, annot_df=ground_truth)
evaluator = glc.PlotConfusionMatrices(eval_df)
metrics = evaluator.peformance_metrics(return_df=True)
# view subclass classification performance
metrics.tail(3)
| precision | recall | f1-score | support | |
|---|---|---|---|---|
| micro avg | 0.906977 | 0.852459 | 0.878873 | 183.0 |
| macro avg | 0.738854 | 0.700288 | 0.705480 | 183.0 |
| weighted avg | 0.885545 | 0.852459 | 0.860210 | 183.0 |
metrics = evaluator.peformance_metrics(return_df=True, level='main')
# view subclass classification performance
metrics.tail(3)
| precision | recall | f1-score | support | |
|---|---|---|---|---|
| micro avg | 0.911602 | 0.901639 | 0.906593 | 183.0 |
| macro avg | 0.895772 | 0.834462 | 0.837763 | 183.0 |
| weighted avg | 0.929794 | 0.901639 | 0.906699 | 183.0 |
# plot confusion matrix
evaluator.plot_annotation_confusion_matrix()
(<Figure size 1000x800 with 2 Axes>, <Axes: xlabel='Predicted sub class', ylabel='True sub class'>)
Compared to the quickstart tutorial, this workflow leads to improved prediction performance. Subclass-level accuracy (reported here as micro-average recall) increases by approximately 10%, while main-class accuracy improves by around 7% compared to just considering [M+H]+.
Finally, we compute quality scores and collate all results into a single DataFrame, which can be saved using:
results.to_csv('add_your_save_path/glc_results.csv')
qs_df = glc.QualityScorer(subclass_predictions, embedder_obj=umap_embedding_obj).df
qs_df.sample(5)
| feature | subclass | lsi_score | pcor_score | product_score | |
|---|---|---|---|---|---|
| 3080 | 3104 | 1-(1Z-alkenyl),2-acylglycerophosphocholines [G... | 0.722222 | 0.149005 | 0.274338 |
| 4366 | 4404 | Triacylglycerols [GL0301] | 1.000000 | 0.989432 | 0.992003 |
| 3596 | 3627 | 1-alkyl,2-acylglycerophosphocholines [GP0102] | 0.277778 | 0.498259 | 0.335611 |
| 2316 | 2334 | Diacylglycerophosphocholines [GP0101] | 0.722222 | 0.437505 | 0.567439 |
| 1244 | 1255 | N-acylsphingosines (ceramides) [SP0201] | 0.500000 | 0.012314 | 0.017413 |
results = glc.build_prediction_dataframe(
subclass_predictions=subclass_predictions,
mainclass_predictions=mainclass_predictions,
quality_score_df=qs_df,
feat_dicts=feat_dicts
)
results.sample(5)
| peak_id | mz | rt | subclass | mainclass | lsi_score | pcor_score | product_score | |
|---|---|---|---|---|---|---|---|---|
| 4413 | 4451 | 939.780671 | 677.527026 | Triacylglycerols [GL0301] | Triradylglycerols [GL03] | 0.722222 | 0.870890 | 0.768384 |
| 2228 | 2245 | 753.420270 | 175.654006 | Oxidized glycerophosphoinositols [GP2005] | Oxidized glycerophospholipids [GP20] | 0.722222 | 0.478336 | 0.585182 |
| 2393 | 2411 | 768.552819 | 448.804507 | Diacylglycerophosphocholines [GP0101] | Glycerophosphocholines [GP01] | 1.000000 | 0.184325 | 0.419957 |
| 4235 | 4272 | 914.818451 | 649.724004 | Triacylglycerols [GL0301] | Triradylglycerols [GL03] | 1.000000 | 0.762333 | 0.852049 |
| 680 | 684 | 524.902805 | 148.708005 | Monoacylglycerophosphocholines [GP0105] | Glycerophosphocholines [GP01] | 0.333333 | 0.444896 | 0.358335 |