The study of the proteome (the set of proteins expressed by a cell, tissue, or organism at a given time and under given conditions) generates a large amount of complex data. This data must be processed, stored, curated, and made easily available to researchers so that it can be studied to obtain biomedical knowledge. In this talk we present the Python tools we have used and developed to accomplish this.
At the LP-CSIC/UAB we use mass spectrometry to study the phospho-proteome of human T-lymphocytes; that is, the set of proteins that are phosphorylated (modified with phosphate groups) in human T cells during their activation and differentiation as part of the immune response. The experiments involved in this study generate a large amount of complex data:
- Experimental conditions and procedures metadata.
- The spectra data and metadata obtained from the mass spectrometers (usually in proprietary binary formats).
- Qualitative or identification data, consisting of the results of searching those spectra against human protein sequence databases (using multiple search engines, each with a different output file format). These results link the spectra with possible peptides (small protein sequence fragments) that may or may not contain phosphorylated amino acids.
- Semi-quantitative data about the abundance of each possible identified peptide.
- Missing values, identification scores, phosphorylation reassignments, and a lot of relationships and inter-linked data…
All this data has to be processed, stored, curated, and made easily accessible to researchers in our lab and worldwide, so they can study it to obtain biomedical knowledge about the phosphorylation changes in peptides and proteins involved in the signal transduction pathways of T cells after their activation during the specific immune response.
We have used several Python packages to develop the tools and applications that accomplish those objectives:
- The EasierMgf front-end application, to extract plain text spectra data from proprietary binary mass spectrometer files, was developed using the wxPython GUI toolkit.
- The ORM from the Django framework was used to design and interact with the LymPHOS2 MySQL database, which stores all the data and their inter-relationships.
- Also, different modules from the Python standard library (json, xml.etree.ElementTree, csv, zipfile) were used to read the different files containing the data and import them into the database, and to export database data into commonly used file formats for the researchers to work with.
- The PQuantifier group of tools was developed to do the statistical processing and analysis of the LymPHOS2 semi-quantitative data. It uses the SQLAlchemy ORM for fast database storage in local SQLite files, the NumPy N-dimensional array package, the SciPy scientific computing library for the statistics, the uncertainties package for calculations with error propagation, and the matplotlib 2D plotting library for some nice plots of data distributions.
- And the full Django framework itself was used to develop the LymPHOS2 web application, which also uses the matplotlib library to dynamically generate the mass spectra images.
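As an illustration of the import/export step with the standard library modules listed above, here is a minimal sketch that writes some records to CSV and JSON inside a ZIP archive and reads the CSV back. It is not the LymPHOS2 code: the record fields, the example values, the file names and the functions `export_zip` and `import_csv_from_zip` are all hypothetical and chosen only for the example.

```python
import csv
import io
import json
import zipfile

# Hypothetical records: field names and values are illustrative only,
# they are NOT the LymPHOS2 database schema or real data.
PEPTIDES = [
    {"sequence": "AAsPEK", "protein": "PROT_A", "psites": 1},
    {"sequence": "GtLsVR", "protein": "PROT_B", "psites": 2},
]


def export_zip(records, fieldnames):
    """Return the bytes of a ZIP archive holding the records as CSV and JSON."""
    # Write the CSV into an in-memory text buffer.
    csv_buffer = io.StringIO()
    writer = csv.DictWriter(csv_buffer, fieldnames=fieldnames)
    writer.writeheader()
    writer.writerows(records)

    # Pack both representations into an in-memory ZIP archive.
    zip_buffer = io.BytesIO()
    with zipfile.ZipFile(zip_buffer, "w", zipfile.ZIP_DEFLATED) as archive:
        archive.writestr("peptides.csv", csv_buffer.getvalue())
        archive.writestr("peptides.json", json.dumps(records, indent=2))
    return zip_buffer.getvalue()


def import_csv_from_zip(zip_bytes, member="peptides.csv"):
    """Read one CSV member of a ZIP archive back into a list of dicts."""
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as archive:
        with archive.open(member) as raw:
            # ZIP members are opened in binary mode; wrap them to get text.
            text = io.TextIOWrapper(raw, encoding="utf-8", newline="")
            return list(csv.DictReader(text))


if __name__ == "__main__":
    data = export_zip(PEPTIDES, ["sequence", "protein", "psites"])
    for row in import_csv_from_zip(data):
        print(row["sequence"], row["protein"], row["psites"])
```

Note that csv.DictReader returns every value as a string, so numeric columns such as the hypothetical `psites` would need an explicit conversion before being stored in a database.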
The final result is the LymPHOS2 web-oriented database, which currently (2017) contains 131,908 mass spectra and 15,566 phosphorylation sites from 8,273 unique phospho-peptides and 4,937 proteins (a 45-fold increase over the original LymPHOS database of 2009), as well as new quantitative data for 1,975 of the identified phospho-peptides, which was not present in the previous version of LymPHOS.
Repositories and Presentation Slides:
- Bitbucket code repositories: https://bitbucket.org/lp-csic-uab/
- Presentation Slides (.ODP, .PDF and .PPTX): https://bitbucket.org/lp-csic-uab/lymphosdocs/downloads/
The work presented here has been carried out at the LP-CSIC/UAB in Catalonia, part of the Spanish National Research Council (Consejo Superior de Investigaciones Científicas - CSIC) and of ProteoRed (Proteomics National NetWork Platform).
The people who have participated directly in the current work are:
- Data analysis, bioinformatics and informatics: Joaquin Abian and Óscar Gallardo.
- Mass Spectrometry, experimental design and implementation: Montserrat Carrascal, Nguyen Tien Dung and Oriol Vidal-Cortes.
- Sample preparation: Montserrat Carrascal, Nguyen Tien Dung, Oriol Vidal-Cortes and Vanessa Casas.
- Past collaborators: David Ovelleiro and Marina Gay.
- Direction: Joaquin Abian.