Quality Scores

glc.QualityScorer

Compute quality scores for features based on prediction confidence and local neighborhood consistency in UMAP space.

The following scores are computed:

  • PCOR score: GLC score of the top-ranking prediction divided by the sum of the scores of the top 10 ranked predictions, normalized to [0, 1] via quantile transformation. It is called the PCOR score (partial correlation score) because GLC scoring is based on partial correlations. Higher GLC scores for a feature tend to result from a combination of (i) more database matches and (ii) stronger partial correlations. The PCOR score aims to capture how dominant the top prediction is compared to the other predictions for that feature.

  • LSI score: Local Simpson's Index measuring the diversity of subclass predictions among the k-nearest neighbors in UMAP space of the GGM. It rests on the assumption that GGM structure encodes lipid class, so neighboring features are expected to share the same lipid class.

  • Product score: Product of PCOR and LSI scores, quantile-scaled to [0, 1].

  • Interpretation: the higher the score, the higher the GLC prediction confidence for that feature.

  • Caution: these quality scores are dataset-specific and cannot be directly compared across datasets.
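
As a rough illustration of the two raw quantities described above, the sketch below computes the top-prediction ratio and a Simpson's index for a single feature in plain Python. The helper names, the example predictions, and the neighbor labels are hypothetical, and the quantile scaling applied by the class is left out.

```python
from collections import Counter


def top_prediction_dominance(preds, top_n=10):
    """Raw (pre-quantile) ratio: top score over the sum of the top_n scores."""
    top = preds[:top_n]
    total = sum(score for _, score in top)
    if total == 0:
        return float("nan")  # no usable scores for this feature
    return top[0][1] / total


def local_simpson_index(neighbor_labels):
    """Sum of squared label proportions within one neighborhood."""
    n = len(neighbor_labels)
    counts = Counter(neighbor_labels)
    return sum((count / n) ** 2 for count in counts.values())


# Hypothetical ranked predictions for one feature: (subclass, GLC score)
preds = [("PC", 6.0), ("PE", 3.0), ("SM", 1.0)]
dominance = top_prediction_dominance(preds)  # 6 / (6 + 3 + 1)

# Hypothetical labels of the feature itself plus its 5 nearest neighbors
labels = ["PC", "PC", "PC", "PC", "PE", "PE"]
lsi = local_simpson_index(labels)  # (4/6)^2 + (2/6)^2
```

A neighborhood in which every label agrees gives an index of 1, and the value falls as the labels become more mixed.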
Source code in src/glc/quality_scores.py
class QualityScorer:

    """
    Compute quality scores for features based on prediction confidence and
    local neighborhood consistency in UMAP space.

    The following scores are computed:

    - PCOR score:
        GLC score of the top ranking prediction over the sum of the scores of the top 10 ranking.
        Normalized to [0, 1] via quantile transformation.
        Called the PCOR score (partial correlation score) as GLC scoring is based on partial correlations.
        Higher GLC scores for a feature tend to result from a combination of (i) more database matches and (ii) stronger partial correlations.
        For the PCOR score, we are seeking to capture how dominant the top prediction is compared to other predictions for that feature.

    - LSI score:
        Local Simpson's Index measuring the diversity of subclass predictions
        among k-nearest neighbors in UMAP space of the GGM.
        Based on the assumption that GGM structure encodes lipid class. We expect that nearest features have the same lipid class. 

    - Product score:
        Product of PCOR and LSI scores, quantile-scaled to [0, 1].

    - Interpretation: the higher the score, the higher the GLC prediction confidence for that feature.
    - Caution: these quality scores are dataset-specific and cannot be directly compared across datasets.

    """

    def __init__(
        self,
        prediction_dict: Dict[int, List[Tuple[str, float]]],
        embedder_obj: UMAPEmbedder,
        k: int = 5,
    ) -> None:

        """
        Initialize the QualityScorer and compute all quality metrics.

        Upon initialization, the following steps are performed:
        - Encode subclass labels
        - Compute k-nearest neighbors in UMAP space
        - Calculate LSI, PCOR, and product quality scores
        - Assemble the results into a single DataFrame

        The primary output of this class is the `df` attribute, which contains the quality scores for each feature.

        Args:
            prediction_dict:
                GLC predictions output. A mapping from feature to subclass prediction.
            embedder_obj:
                Fitted UMAPEmbedder object of the GGM structure.
            k:
                Number of neighbors used for local diversity calculations.

        Attributes:
            df (pd.DataFrame):
                DataFrame with one row per feature and the following columns:
                - feature: Feature identifier
                - subclass: Top predicted subclass
                - lsi_score: Local Simpson's Index score
                - pcor_score: Quantile-scaled PCOR score
                - product_score: Combined quality score in [0, 1]
        """

        self.prediction_dict = {}
        for key, val in prediction_dict.items():
            if not val:  # without label propagation in the GLC model, a very small number of features may have no predictions
                self.prediction_dict[key] = [(np.nan, np.nan)]
            else:
                self.prediction_dict[key] = val


        self.emb = embedder_obj.umap_embedding
        self.k = k + 1 # +1, because the first one is the feature itself
        self.node_names = embedder_obj.node_names # node names are features/peak_id

        self.labels_encoded, self.label2code = self._encode_labels()
        self.distances, self.indices = self._knn()
        self.df = self._create_df()

    def _encode_labels(self) -> Tuple[np.ndarray, Dict[str, int]]:
        """
        Encode top predicted subclass labels as integers.

        Returns:
            Tuple containing:
            - Array of encoded subclass labels aligned with node order.
            - Mapping from subclass name to encoded code
        """

        labels = [self.prediction_dict[feat][0][0] for feat in self.node_names]
        label_encoder = LabelEncoder()
        labels_encoded = label_encoder.fit_transform(labels)
        label2code = {label: i for i, label in enumerate(label_encoder.classes_)}
        return labels_encoded, label2code

    def _knn(self) -> Tuple[np.ndarray, np.ndarray]:
        """
        Perform k-nearest neighbors search on UMAP embedding of the GGM. 

        Returns:
            Tuple[np.ndarray, np.ndarray]: Distances and indices of nearest neighbors.
        """
        scaled_data = StandardScaler().fit_transform(self.emb)
        knn = NearestNeighbors(n_neighbors=self.k, algorithm='auto').fit(scaled_data)
        distances, indices = knn.kneighbors(scaled_data)
        return distances, indices

    def _score_all_lsi(self) -> List[float]:
        """
        Compute Local Simpson's Index (LSI) for all features.

        LSI measures the concentration of subclass labels among local neighbors.
        Higher values indicate lower local diversity and higher confidence in prediction. 

        Returns:
            List of LSI scores, one per feature.
        """
        lsi_scores = []
        for neighbor_idx in self.indices:
            neighbor_labels = self.labels_encoded[neighbor_idx]
            label_counts = np.bincount(neighbor_labels)
            simpson_index = np.sum((label_counts / self.k) ** 2)
            lsi_scores.append(simpson_index)
        return lsi_scores

    def _get_weighted_pcor_scores(
        self, k: int
    ) -> Dict[int, List[Tuple[str, float]]]:
        """
        Normalize partial-correlation scores for the top-k predictions.

        Args:
            k:
                Number of top subclass predictions to consider per feature.

        Returns:
            Dictionary mapping feature identifier to a list of
            (subclass, normalized_score) tuples.
        """
        result = {}
        for peak, preds in self.prediction_dict.items():
            # Only keep the first k tuples
            preds_k = preds[:k]

            total = sum(score for _, score in preds_k)
            if total == 0:
                # If sum is zero, assign np.nan for all subclasses in the first k
                result[peak] = [(subclass, np.nan) for subclass, _ in preds_k]
            else:
                result[peak] = [(subclass, score / total) for subclass, score in preds_k]
        return result


    def _scale_pcor_scores(self) -> np.ndarray:
        """
        Extract and quantile-scale the PCOR score for each feature.

        Returns:
            Array of PCOR scores scaled to a uniform [0, 1] distribution.
        """
        pcor_score_dict = self._get_weighted_pcor_scores(k=10)
        scores = [pcor_score_dict[feat][0][1] for feat in self.node_names]
        scores = np.array(scores).reshape(-1, 1)
        quantile_scores = QuantileTransformer(output_distribution='uniform').fit_transform(scores)
        return quantile_scores.flatten()


    def _create_df(self) -> pd.DataFrame:
        """
        Assemble the final quality score DataFrame.

        Returns:
            DataFrame with columns:
            - feature
            - subclass
            - lsi_score
            - pcor_score
            - product_score
        """
        lsi_score = self._score_all_lsi()
        pcor_norm_score = self._scale_pcor_scores()
        product_score = np.array(lsi_score) * pcor_norm_score
        product_score = QuantileTransformer(output_distribution='uniform').fit_transform(product_score.reshape(-1, 1)).flatten()
        return pd.DataFrame(
            {
                'feature': self.node_names,
                'subclass': [self.prediction_dict[feat][0][0] for feat in self.node_names],
                'lsi_score': lsi_score,
                'pcor_score': pcor_norm_score,
                'product_score': product_score
            }
        )

__init__(prediction_dict, embedder_obj, k=5)

Initialize the QualityScorer and compute all quality metrics.

Upon initialization, the following steps are performed:

  • Encode subclass labels
  • Compute k-nearest neighbors in UMAP space
  • Calculate LSI, PCOR, and product quality scores
  • Assemble the results into a single DataFrame

The primary output of this class is the df attribute, which contains the quality scores for each feature.
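
The neighbor-search step can be sketched with NumPy alone. This is a simplified stand-in for the scikit-learn calls in the source, not the class's implementation: `knn_with_self`, the toy embedding, and the labels are hypothetical. It shows why the source searches for `k + 1` neighbors: each point is its own closest match.

```python
import numpy as np


def knn_with_self(embedding, k):
    """Return indices of the k + 1 closest rows per point (self included)."""
    # Standardize each axis using the population standard deviation
    scaled = (embedding - embedding.mean(axis=0)) / embedding.std(axis=0)
    # Pairwise Euclidean distances via broadcasting
    diffs = scaled[:, None, :] - scaled[None, :, :]
    dists = np.sqrt((diffs ** 2).sum(axis=-1))
    # Each point is at distance 0 from itself, so it sorts first
    return np.argsort(dists, axis=1, kind="stable")[:, : k + 1]


# Hypothetical 2-D embedding: two well-separated groups of three points
embedding = np.array([
    [0.0, 0.0], [0.1, 0.0], [0.0, 0.1],
    [5.0, 5.0], [5.1, 5.0], [5.0, 5.1],
])
labels = np.array([0, 0, 1, 1, 1, 1])  # hypothetical encoded subclasses

indices = knn_with_self(embedding, k=2)
neighbor_labels = labels[indices]  # shape (6, 3): own label plus 2 neighbors
```

Because the point itself is counted, a feature's own predicted subclass contributes to its neighborhood label counts.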

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| prediction_dict | Dict[int, List[Tuple[str, float]]] | GLC predictions output. A mapping from feature to subclass prediction. | required |
| embedder_obj | UMAPEmbedder | Fitted UMAPEmbedder object of the GGM structure. | required |
| k | int | Number of neighbors used for local diversity calculations. | 5 |

Attributes:

| Name | Type | Description |
|---|---|---|
| df | DataFrame | DataFrame with one row per feature and the columns listed below. |

  • feature: Feature identifier
  • subclass: Top predicted subclass
  • lsi_score: Local Simpson's Index score
  • pcor_score: Quantile-scaled PCOR score
  • product_score: Combined quality score in [0, 1]
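
One way to use the resulting table is to keep only higher-confidence features. The sketch below builds a stand-in DataFrame with the documented columns, since constructing a real `QualityScorer` needs a fitted embedder. The values are purely illustrative, and `MIN_PRODUCT_SCORE` is a hypothetical cutoff rather than a recommended one.

```python
import pandas as pd

# Hypothetical stand-in for QualityScorer(...).df; values are illustrative only
df = pd.DataFrame({
    "feature": [101, 102, 103, 104],
    "subclass": ["PC", "PE", "SM", "PC"],
    "lsi_score": [1.0, 0.56, 0.28, 0.72],
    "pcor_score": [0.9, 0.4, 0.1, 0.7],
    "product_score": [1.0, 0.33, 0.0, 0.67],
})

MIN_PRODUCT_SCORE = 0.5  # hypothetical cutoff, to be chosen per dataset

# Keep features at or above the cutoff, most confident first
confident = (
    df[df["product_score"] >= MIN_PRODUCT_SCORE]
    .sort_values("product_score", ascending=False)
    .reset_index(drop=True)
)
```

Since the scores are dataset-specific, any such cutoff would need to be chosen separately for each dataset.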

Source code in src/glc/quality_scores.py
def __init__(
    self,
    prediction_dict: Dict[int, List[Tuple[str, float]]],
    embedder_obj: UMAPEmbedder,
    k: int = 5,
) -> None:

    """
    Initialize the QualityScorer and compute all quality metrics.

    Upon initialization, the following steps are performed:
    - Encode subclass labels
    - Compute k-nearest neighbors in UMAP space
    - Calculate LSI, PCOR, and product quality scores
    - Assemble the results into a single DataFrame

    The primary output of this class is the `df` attribute, which contains the quality scores for each feature.

    Args:
        prediction_dict:
            GLC predictions output. A mapping from feature to subclass prediction.
        embedder_obj:
            Fitted UMAPEmbedder object of the GGM structure.
        k:
            Number of neighbors used for local diversity calculations.

    Attributes:
        df (pd.DataFrame):
            DataFrame with one row per feature and the following columns:
            - feature: Feature identifier
            - subclass: Top predicted subclass
            - lsi_score: Local Simpson's Index score
            - pcor_score: Quantile-scaled PCOR score
            - product_score: Combined quality score in [0, 1]
    """

    self.prediction_dict = {}
    for key, val in prediction_dict.items():
        if not val:  # without label propagation in the GLC model, a very small number of features may have no predictions
            self.prediction_dict[key] = [(np.nan, np.nan)]
        else:
            self.prediction_dict[key] = val


    self.emb = embedder_obj.umap_embedding
    self.k = k + 1 # +1, because the first one is the feature itself
    self.node_names = embedder_obj.node_names # node names are features/peak_id

    self.labels_encoded, self.label2code = self._encode_labels()
    self.distances, self.indices = self._knn()
    self.df = self._create_df()

glc.build_prediction_dataframe(subclass_predictions, mainclass_predictions, quality_score_df, feat_dicts)

Build a tidy DataFrame summarizing GLC subclass and main class predictions together with feature metadata and quality scores.

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| subclass_predictions | Dict[int, List[Tuple[str, float]]] | Dictionary mapping peak_id to a list of subclass predictions. Each entry is expected to be a ranked list, where the top prediction is accessed as subclass_predictions[peak_id][0][0]. | required |
| mainclass_predictions | Dict[int, List[Tuple[str, float]]] | Dictionary mapping peak_id to a list of main class predictions, structured analogously to subclass_predictions. | required |
| quality_score_df | DataFrame | DataFrame indexed by peak_id containing quality score columns 'lsi_score', 'pcor_score', and 'product_score'. | required |
| feat_dicts | FeatDicts | A glc.FeatDicts object providing dictionary-like access to feature m/z and retention time via feat_dicts.mz and feat_dicts.rt. | required |

Returns:

| Type | Description |
|---|---|
| DataFrame | A DataFrame with one row per peak_id containing feature metadata, predicted subclass and main class, and associated quality scores. |
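
The function looks scores up by index label, while the `QualityScorer` source listed earlier stores the identifier in a `feature` column. This suggests the score table may need re-indexing before it is passed in. That is an inference from the listed code, not documented behavior, and the frame below is a hypothetical stand-in with illustrative values.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for a quality score table; values are illustrative only
scores = pd.DataFrame({
    "feature": [101, 102],
    "lsi_score": [1.0, 0.5],
    "pcor_score": [0.8, 0.2],
    "product_score": [1.0, 0.0],
})

# Index by the feature identifier so label-based lookups use peak ids
by_peak = scores.set_index("feature")

hit = by_peak["lsi_score"].get(101, np.nan)   # peak present in the table
miss = by_peak["lsi_score"].get(999, np.nan)  # unknown peak falls back to NaN

# Without re-indexing, the lookup runs against the default 0..n-1 row index
unindexed = scores["lsi_score"].get(101, np.nan)
```

In this sketch the un-indexed lookup returns NaN instead of raising, so a missing `set_index` would surface as empty score columns rather than as an error.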

Source code in src/glc/glc_model.py
def build_prediction_dataframe(
    subclass_predictions: Dict[int, List[Tuple[str, float]]],
    mainclass_predictions: Dict[int, List[Tuple[str, float]]],
    quality_score_df: pd.DataFrame,
    feat_dicts: FeatDicts
) -> pd.DataFrame:
    """
    Build a tidy DataFrame summarizing GLC subclass and main class predictions
    together with feature metadata and quality scores.

    Args:
        subclass_predictions:
            Dictionary mapping peak_id to a list of subclass predictions.
            Each entry is expected to be a ranked list, where the top prediction
            is accessed as subclass_predictions[peak_id][0][0].
        mainclass_predictions:
            Dictionary mapping peak_id to a list of main class predictions,
            structured analogously to subclass_predictions.
        quality_score_df:
            DataFrame indexed by peak_id containing quality score columns
             'lsi_score', 'pcor_score', and 'product_score'.
        feat_dicts:
            A glc.FeatDicts object providing dictionary-like access to
            feature m/z and retention time via feat_dicts.mz and feat_dicts.rt.

    Returns:
        pd.DataFrame:
            A DataFrame with one row per peak_id containing feature metadata,
            predicted subclass and main class, and associated quality scores.
    """
    results = []

    for peak_id in subclass_predictions.keys():
        if not subclass_predictions[peak_id]:
            scl_p = np.nan
            mcl_p = np.nan
        else:
            scl_p = subclass_predictions[peak_id][0][0]
            mcl_p = mainclass_predictions[peak_id][0][0]

        results.append({
            'peak_id': peak_id,
            'mz': feat_dicts.mz[peak_id],
            'rt': feat_dicts.rt[peak_id],
            'subclass': scl_p,
            'mainclass': mcl_p,
            'lsi_score': quality_score_df['lsi_score'].get(peak_id, np.nan),
            'pcor_score': quality_score_df['pcor_score'].get(peak_id, np.nan),
            'product_score': quality_score_df['product_score'].get(peak_id, np.nan),
        })

    return pd.DataFrame(results)