Software Heritage is a non-profit initiative whose goal is to become the most comprehensive archive of publicly accessible source code in the world, together with its full development history. The archive already contains more than 4.5 billion source code files and more than 1 billion commits, coming from almost a hundred million software projects. It is a modern-day Great Library of Source Code, growing daily.
The Software Heritage stack is entirely written in Python and supports archiving Git repositories, Subversion repositories, Mercurial repositories, Debian source packages, as well as arbitrary archives (zip files, tarballs…) released by upstream authors. Everything is stored in a common, fully deduplicated data model, allowing unified access to all archived content, regardless of the original means of distribution. The archive front-end, built upon the Django framework, allows people to browse the contents of the archive and download snapshots of source code that may have disappeared upstream.
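To illustrate what "fully deduplicated" means in practice, here is a minimal sketch of a content-addressed store, in which identical file contents are kept only once regardless of where they came from. The `ContentStore` class and the choice of SHA-1 as the key are assumptions made for illustration; they are not taken from the Software Heritage code base, whose actual identifiers and schema are not described here.

```python
import hashlib


class ContentStore:
    """Toy content-addressed store (hypothetical, for illustration only)."""

    def __init__(self):
        # Maps a hex digest to the bytes it identifies.
        self._blobs = {}

    def add(self, data: bytes) -> str:
        """Store data if it is new and return its content-derived key."""
        key = hashlib.sha1(data).hexdigest()  # assumed hash, not the real scheme
        self._blobs.setdefault(key, data)     # no-op when the key already exists
        return key

    def __len__(self) -> int:
        return len(self._blobs)


store = ContentStore()

# The same license text arriving once from a Git repository and once from a tarball.
id_from_git = store.add(b"MIT License\n")
id_from_tarball = store.add(b"MIT License\n")

# A different file gets a different key and its own entry.
id_other = store.add(b"print('hello')\n")
```

Because the key is derived only from the content, the origin of a file (repository, source package, or release archive) plays no part in how it is stored, which is what makes unified access across distribution channels possible in a model of this kind.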
While initially focused on archiving collaborative development forges such as GitHub, Bitbucket, and GitLab, Software Heritage also supports archiving traditional software distributions, such as GNU/Linux distributions, and language-specific ecosystems. In acknowledgement of the importance of the Python community to us, we are proud to announce the archival of PyPI into Software Heritage. This presentation will give a brief overview of the Software Heritage project and then drill down into the technical details of the integration with PyPI.