SugarPy Run

class sugarpy.run.Run(monosaccharides={})

The SugarPy run class comprises the main functionality for glycopeptide matching. A typical workflow contains the following steps:
  1. parse_ident_file:
    Ursgal result files are parsed and peptide sequences, as well as their modifications (except monosaccharides that would be part of the glycan) and retention times (RTs), are extracted.
  2. build_combinations and add_glycans2peptide:
    For a set of monosaccharides (param: monosaccharides) and maximal glycan length (param: max_tree_length), all possible combinations of monosaccharides are calculated and the chemical compositions of the resulting theoretical glycans (taking into account glycans with the same mass) are added to the chemical compositions of the extracted set of peptides.
  3. quantify:
    pyQms is used to build isotope envelope libraries for the theoretical glycopeptides and to match them against all MS1 spectra within the given RT windows. Isotope envelopes consist of the theoretical m/z and relative intensity for all isotopic peaks. The quality of the resulting matches is therefore indicated by an mScore, which comprises the accuracies of the measured m/z and intensity.
  4. sort_results and validate_results:
    For each matched molecule, a score (VL) is calculated as the length of a vector for the mScore (ranging from 0 to 1) and intensity (normalized by the maximum intensity of matched glycopeptides within the run, therefore also ranging from 0 to 1). For each spectrum, all matched molecules are sorted by the glycan length (number of monosaccharides). Subsequently, starting with the longest glycan, each glycan composition of a given glycan length is checked for whether it is part of any glycan composition of the previous level (longer glycan). Glycan compositions that are true subsets of larger, matched glycans (subtrees of those) are considered fragment ions and are therefore merged with the larger, final glycopeptides. Note that glycan compositions can be subtrees of multiple final glycans. Fragmentation pathways are not taken into account. However, if Y1-ions are matched (peptide harboring one monosaccharide), the corresponding monosaccharide is noted as the reducing end. For all final glycopeptides within one spectrum, the subtree coverage is calculated. Finally, the SugarPy score is calculated for each glycopeptide as the sum of vector lengths from all corresponding subtrees (fragment ions Y0 to Yn) multiplied by the subtree coverage.
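The subset check and merging described in the last step can be illustrated with a minimal sketch. It is not SugarPy's implementation: the function names `is_subtree` and `merge_fragments` are hypothetical, and glycan compositions are assumed to be plain dictionaries mapping monosaccharide names to counts.

```python
from collections import Counter


def is_subtree(small, large):
    """Return True if `small` is a true (proper) sub-composition of `large`."""
    small, large = Counter(small), Counter(large)
    if small == large:
        return False
    # Every monosaccharide in `small` must occur at least as often in `large`.
    return all(large[mono] >= count for mono, count in small.items())


def merge_fragments(compositions):
    """Group compositions into final glycans and the subtrees merged into them.

    Compositions are visited from longest to shortest. A composition that is
    contained in one or more already accepted final glycans is attached to
    each of them; otherwise it becomes a final glycan itself.
    """
    by_length = sorted(compositions, key=lambda c: sum(c.values()), reverse=True)
    final = []  # list of (final_glycan, [merged_subtrees])
    for comp in by_length:
        parents = [entry for entry in final if is_subtree(comp, entry[0])]
        if parents:
            for entry in parents:
                entry[1].append(comp)
        else:
            final.append((comp, []))
    return final


matched = [
    {"HexNAc": 2, "Hex": 5},
    {"HexNAc": 2, "Hex": 3},
    {"HexNAc": 1},
    {"Fuc": 1},
]
for glycan, subtrees in merge_fragments(matched):
    print(glycan, "<-", subtrees)
```

Because a composition is attached to every final glycan that contains it, the sketch also reflects the note that one composition can be a subtree of multiple final glycans.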
add_glycans2peptide(peptide_list=[], max_tree_length=None, monosaccharides=None)

Adds chemical composition of glycans to a given list of peptides. Peptides need to be in unimod style (Peptide#Modifications). The chemical composition of the original peptidoform is returned as well.

Keyword Arguments:
peptide_list (list): List of peptides in unimod style
max_tree_length (int): maximum number of monosaccharides in one combination
monosaccharides (dict): dictionary containing name and chemical composition of monosaccharides
Returns:
dict: { ‘Sequence#Modifications’ : {‘glycan_hill_notation’: [‘Name’]}}
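As a rough illustration of the returned structure, the sketch below adds glycan compositions to peptide compositions and keys the result by a Hill-style formula. The helper names `hill_notation` and `sketch_add_glycans` are hypothetical, compositions are assumed to be dictionaries of element counts, and the exact formula string format used by SugarPy may differ.

```python
from collections import Counter


def hill_notation(composition):
    """Format element counts in Hill order (C, H, then alphabetical).

    Simplification: counts of 1 are written explicitly (e.g. 'N1').
    """
    if "C" in composition:
        first = [e for e in ("C", "H") if e in composition]
        rest = sorted(e for e in composition if e not in ("C", "H"))
        ordered = first + rest
    else:
        ordered = sorted(composition)
    return "".join(f"{e}{composition[e]}" for e in ordered)


def sketch_add_glycans(peptide_compositions, glycan_compositions):
    """Return {peptide: {formula: [names]}} for each peptide/glycan pair."""
    result = {}
    for peptide, peptide_comp in peptide_compositions.items():
        # The unglycosylated peptidoform is kept as well.
        formulas = {hill_notation(peptide_comp): [peptide]}
        for glycan_name, glycan_comp in glycan_compositions.items():
            total = Counter(peptide_comp)
            total.update(glycan_comp)  # element-wise sum
            name = f"{peptide}|{glycan_name}"
            formulas.setdefault(hill_notation(total), []).append(name)
        result[peptide] = formulas
    return result


# Illustrative input: the dipeptide GG (C4H8N2O3) and one hexose residue (C6H10O5).
peptides = {"GG#": {"C": 4, "H": 8, "N": 2, "O": 3}}
glycans = {"Hex(1)": {"C": 6, "H": 10, "O": 5}}
print(sketch_add_glycans(peptides, glycans))
```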
build_combinations(max_tree_length=None, monosaccharides=None, mode='replacement')

Builds and returns a dictionary containing chemical compositions of all combinations (with replacement, not ordered) of a given dict of monosaccharides and a maximal length of the tree.

Keyword arguments:
max_tree_length (int): Maximum number of monosaccharides in one combination
monosaccharides (dict): Dictionary containing name and chemical composition of monosaccharides
Returns:
dict: keys: chemical compositions of all combinations (with replacement, not ordered),
values: combination(s) of monosaccharide names corresponding to the chemical composition
ToDo: change monosaccharides to list and get compositions from ursgal.ChemicalComposition(),
keyword argument for calculate_formula?
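The idea of unordered combinations with replacement can be sketched with the standard library. This is only an illustration under stated assumptions: `sketch_combinations` is a hypothetical name, monosaccharide compositions are given as element-count dictionaries, and the result is keyed by a sorted tuple of element counts rather than by the formula string SugarPy uses.

```python
from collections import Counter, defaultdict
from itertools import combinations_with_replacement


def sketch_combinations(monosaccharides, max_tree_length):
    """Map each summed composition to the monosaccharide combinations producing it."""
    combos = defaultdict(list)
    names = sorted(monosaccharides)
    for length in range(1, max_tree_length + 1):
        # Unordered selections in which a monosaccharide may be reused.
        for combo in combinations_with_replacement(names, length):
            total = Counter()
            for name in combo:
                total.update(monosaccharides[name])
            key = tuple(sorted(total.items()))
            # Combinations with identical compositions share one key.
            combos[key].append(combo)
    return dict(combos)


# Residue compositions of a hexose and an N-acetylhexosamine.
monos = {
    "Hex": {"C": 6, "H": 10, "O": 5},
    "HexNAc": {"C": 8, "H": 13, "N": 1, "O": 5},
}
for composition, names in sketch_combinations(monos, 2).items():
    print(composition, names)
```

Grouping by composition is what lets glycans with the same mass be handled as a single entry with several candidate names.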
build_match_dict(spectrum_dict=None)

Uses a spec_collector (see sort_results) spectrum to extract the peptide and glycan composition and to sort glycans according to their length.

Keyword Arguments:
spectrum_dict (dict): dictionary for a spectrum from spec_collector
(see sort_results)
Returns:
dict: { n (tree length) : [{ formula: , glycan_comp: , vector_length: , }, … ] }
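A simplified sketch of grouping matches by tree length is given below. The input layout is an assumption made for illustration (one glycan composition dictionary and one vector length per formula), and `group_by_tree_length` is a hypothetical name; the real spec_collector entries hold lists, as described under sort_results.

```python
from collections import defaultdict


def group_by_tree_length(spectrum_dict):
    """Group matched formulas by the number of monosaccharides in their glycan."""
    match_dict = defaultdict(list)
    for formula, entry in spectrum_dict.items():
        glycan_comp = entry["glycan_comp"]
        tree_length = sum(glycan_comp.values())
        match_dict[tree_length].append(
            {
                "formula": formula,
                "glycan_comp": glycan_comp,
                "vector_length": entry["vector_length"],
            }
        )
    # Longest glycans first, matching the order in which they are validated.
    return dict(sorted(match_dict.items(), reverse=True))


# Placeholder formula keys; real keys would be chemical formulas.
spectrum = {
    "formula_A": {"glycan_comp": {"HexNAc": 1}, "vector_length": 0.9},
    "formula_B": {"glycan_comp": {"HexNAc": 2, "Hex": 5}, "vector_length": 1.2},
}
print(group_by_tree_length(spectrum))
```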

build_rt_lookup(mzml_file, ms_level)

Builds and returns a dictionary equivalent to the ursgal_lookup.pkl. It contains a dictionary (key is scan number, value is RT) for every mzML file name.

Arguments:
mzml_file: Path to the mzML file.
ms_level: MS level for which the lookup should be built
Returns:
dict
extract_pep_and_glycan_comp(trivial_name=None)

Use the trivial name to extract information about the peptide and glycan composition. This also works for just extracting the glycan composition from a glycan string.

Keyword Arguments:
trivial_name (str): trivial name of the glycan (‘HexNAc(2)Hex(5)’) or
glycopeptide (‘PEPTIDE|HexNAc(2)Hex(5)’)
Returns:
peptide (str): peptide sequence
glycan_comp (dict): glycan composition as dictionary with monosaccharides as keys
and their number as value
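The trivial-name format shown above can be parsed with a short regular expression. The sketch below is not the library's code: `sketch_extract` is a hypothetical name, and returning `None` as the peptide for a glycan-only string is an assumption.

```python
import re

# A monosaccharide name followed by its count in parentheses, e.g. 'HexNAc(2)'.
MONOSACCHARIDE_PATTERN = re.compile(r"([A-Za-z0-9]+)\((\d+)\)")


def sketch_extract(trivial_name):
    """Split 'PEPTIDE|Glycan' and parse the glycan part into a dict."""
    if "|" in trivial_name:
        peptide, glycan = trivial_name.split("|", 1)
    else:
        # Glycan-only input: no peptide part is available.
        peptide, glycan = None, trivial_name
    glycan_comp = {
        name: int(count)
        for name, count in MONOSACCHARIDE_PATTERN.findall(glycan)
    }
    return peptide, glycan_comp


print(sketch_extract("PEPTIDE|HexNAc(2)Hex(5)"))
print(sketch_extract("HexNAc(2)Hex(5)"))
```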
parse_ident_file(ident_file=None, unimod_glycans_incl_in_search=[])

Parses an Ursgal results .csv file and extracts identified peptides together with their retention times. Glycans that were included in the search as modifications are removed.

Keyword Arguments:
ident_file (str): Path to the Ursgal result .csv file.
This file should only include (potential) glycopeptides, i.e. it should be filtered.
unimod_glycans_incl_in_search (list): List of Unimod PSI-MS names
corresponding to glycans that were included in the database search as modification (will be removed from the peptide).
Returns:
dict: Lookup containing retention times and accuracies of all
PSMs for each identified peptidoform (Peptide#Unimod:Pos)
quantify(molecule_name_dict=None, rt_window=None, ms_level=1, charges=None, params=None, pkl_name='', mzml_file=None, spectra=None, return_all=False, collect_precursor=False, force=False)

Quantify a list of molecules in a given mzML file using pyQms. Quantification is done by default on MS1 level and can be specified for a retention time window.

Keyword Arguments:
molecule_name_dict (dict): contains the molecules that should be quantified,
with Hill notations as keys and lists of corresponding trivial names as values
rt_window (dict): optional argument to define a retention time window
in which the molecules are quantified (use ‘min’ and ‘max’ as keys in the dict)
ms_level: MS level for which quantification should be performed
charges (list): list of charge states that are quantified
params (dict): pyQms parameters (see pyQms manual for further information)
pkl_name (str): name of the result pickle containing the pyQms results
mzml_file (str): path to the mzML file used for the quantification
spectra (list): optional list of spectrum IDs that should be quantified
return_all (bool): if True, in addition to the results pkl, the IsotopologueLibrary
as well as the spectrum peaks are returned. This should only be used for a single spectrum.
Returns:
str: path to the results pickle
sort_results(results=None, min_spec_number=1)

Parses through the pyQms results and determines the vector length for each matched spectrum, with vectors being defined by the mScore and the normalized intensity (normalized scaling factor).

Keyword Arguments:
min_spec_number (int): defines the minimum number of spectra for one matched formula.
results (dict): pyQms results dictionary
Returns:
dict: spec_collector = { matched_spectrum : { formula : { ‘vector’ : [], ‘charge’ : [], ‘trivial_name’ : [], ‘glycan_comp’ : [], ‘glycan_trees’ : [] } } }
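The vector length can be sketched as follows, assuming it is the Euclidean length of the two-component vector (mScore, normalized intensity). That interpretation and the function name `vector_length` are assumptions for illustration, not confirmed details of the implementation.

```python
import math


def vector_length(mscore, intensity, max_intensity):
    """Length of the vector (mScore, intensity / max_intensity).

    Both components are expected to lie between 0 and 1, so under this
    assumption the result lies between 0 and sqrt(2).
    """
    normalized_intensity = intensity / max_intensity
    return math.hypot(mscore, normalized_intensity)


# A match with mScore 0.6 and 80 % of the maximum matched intensity.
print(vector_length(0.6, 80.0, 100.0))
```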

validate_results(pyqms_results_dict=None, min_spec_number=0, min_tree_length=0, monosaccharides=None)

Parses through the pyQms results list and validates the results, which includes the following:
  • sort_results: determines the vector length for each matched spectrum, with vectors being defined by the mScore and the normalized intensity (normalized scaling factor). Also filters for a minimum number of spectra (for each molecule) in the results
  • build_match_dict: extracts information about glycan compositions, sorts glycans by their length
  • starting with the longest glycans, for each level (glycan length) the corresponding glycans
    are determined (glycans that are subtrees of longer glycans) and merged
  • the quality of glycan assignments is assessed by calculating the SugarPy_score ((sum of vector lengths) * subtree coverage),
    the subtree_coverage (number of unique matched subtree lengths / total number of unique subtree lengths) and the number of matched subtrees
Keyword Arguments:
pyqms_results_dict (dict): dictionary containing the Peptides#Unimod (key) and corresponding pyqms result pkl (value)
min_spec_number (int): defines the minimum number of spectra for one matched formula.
min_tree_length (int): minimum number of monosaccharides per glycan
monosaccharides (dict): dictionary containing name and chemical composition of monosaccharides
Returns:
results class object (dict): class (dict) containing all scored_glycans as well as the spec_collector
for every peptide_unimod
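The two formulas given above (subtree coverage and SugarPy score) can be written out as a small sketch. The function name `sketch_sugarpy_score` and the input layout are hypothetical. How the total set of possible subtree lengths is counted is left to the caller here, since that detail is an assumption.

```python
def sketch_sugarpy_score(subtree_matches, possible_lengths):
    """Return (score, coverage, number of matched subtrees).

    subtree_matches: list of (tree_length, vector_length) tuples for the
        final glycopeptide and all subtrees merged into it.
    possible_lengths: iterable of all subtree lengths that could occur.
    """
    possible = set(possible_lengths)
    matched = {length for length, _ in subtree_matches} & possible
    # Unique matched subtree lengths / total unique subtree lengths.
    coverage = len(matched) / len(possible)
    vector_sum = sum(vector for _, vector in subtree_matches)
    return vector_sum * coverage, coverage, len(subtree_matches)


# A glycan of length 3 with matches at lengths 0, 1 and 3 (length 2 missing).
matches = [(0, 1.0), (1, 0.5), (3, 0.5)]
print(sketch_sugarpy_score(matches, range(0, 4)))
```

Multiplying by the coverage penalizes glycopeptides whose summed vector lengths come from only a few of the possible subtree lengths.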