Authors: Blake Borgeson, Center for Systems and Synthetic Biology, University of Texas at Austin; Cuihong Wa
Although we know that the vast majority of life's processes at the cellular level are carried out by complexes of multiple proteins, a complete catalog of the complexes formed in a cell and their members remains a distant goal. We use a new approach, first applied to human cell lines by collaborators Havugimana and Hart, et al., that consists of 1) subjecting biological samples to many types of fractionation, each at many levels, 2) using mass spectrometry to quantify protein levels in each fraction, and 3) processing the data through a machine learning pipeline. This lets us search for complexes in a high-throughput, all-by-all manner. By incorporating additional functional genomic information into the learning process, we reconstruct maps of complexes that rival in quality, and far surpass in coverage, those generated with previously used and much more labor-intensive methods such as affinity purification followed by mass spectrometry (AP-MS). Here, using 6,000 mass spectrometry experiments from more than 60 fractionated biological samples from human, mouse, sea urchin, fly, and worm, we predict with high confidence hundreds (~500) of expected and putative novel conserved complexes. IPython, SciPy, and scikit-learn are the foundational tools used for data integration and machine learning, and an integrated Python environment has been critical to the speed of progress.
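To make the all-by-all idea concrete, here is a minimal sketch of one plausible building block: scoring every pair of proteins by how similarly they elute across fractions. It is not the authors' pipeline. The protein names, abundance values, the `pearson` and `coelution_scores` helpers, and the 0.9 cutoff are all hypothetical, and the fixed cutoff merely stands in for the trained classifier and the additional functional genomic evidence the abstract describes. The sketch uses only the Python standard library so that it runs as-is, whereas the actual work relies on SciPy and scikit-learn.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length profiles; 0.0 if either is flat."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / sqrt(var_x * var_y)


def coelution_scores(profiles):
    """Score every protein pair (all-by-all) by elution-profile correlation."""
    return {
        (p, q): pearson(profiles[p], profiles[q])
        for p, q in combinations(sorted(profiles), 2)
    }


# Hypothetical protein abundances across six fractions (made-up numbers).
profiles = {
    "protA": [0, 2, 10, 3, 0, 0],
    "protB": [0, 1, 5, 1.5, 0, 0],  # same elution shape as protA
    "protC": [4, 0, 0, 0, 1, 9],    # elutes in different fractions
}

scores = coelution_scores(profiles)

# Assumed cutoff; a real pipeline would feed such scores to a classifier.
candidates = [pair for pair, r in scores.items() if r > 0.9]
print(candidates)
```

In this toy data only the protA/protB pair passes the cutoff, because the two profiles peak in the same fraction. A correlation score like this would be just one feature among several in a learned model.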