The Software Heritage Graph Dataset




DOI: 10.5281/zenodo.2583978, Zenodo: 2583978, MaRDI QID: Q6716119, FDO: Q6716119

Dataset published in the Zenodo repository.

Stefano Zacchiroli, Antoine Pietri, Diomidis Spinellis

Publication date: 5 March 2019

Copyright license: Creative Commons Attribution 4.0 International



Software Heritage is the largest existing public archive of software source code and accompanying development history: it currently spans more than five billion unique source code files and one billion unique commits, coming from more than 80 million software projects.

This is the Software Heritage graph dataset: a fully deduplicated Merkle DAG representation of the Software Heritage archive. The dataset links together file content identifiers, source code directories, and Version Control System (VCS) commits tracking evolution over time, up to the full states of VCS repositories as observed by Software Heritage during periodic crawls. The dataset's contents come from major development forges (including GitHub and GitLab), FOSS distributions (e.g., Debian), and language-specific package managers (e.g., PyPI). Crawling information is also included, providing timestamps about when and where all archived source code artifacts have been observed in the wild.

The Software Heritage graph dataset is available in multiple formats, including downloadable CSV dumps and Apache Parquet files for local use, as well as a public instance on the Amazon Athena interactive query service for ready-to-use analytical processing.

By accessing the dataset, you agree with the Software Heritage Ethical Charter for using the archive data, and the terms of use for bulk access.

If you use this dataset for research purposes, please cite the following paper:

Antoine Pietri, Diomidis Spinellis, Stefano Zacchiroli. The Software Heritage Graph Dataset: Public software development under one roof. In Proceedings of MSR 2019: The 16th International Conference on Mining Software Repositories, May 2019, Montreal, Canada. Co-located with ICSE 2019.

You can also refer to the above paper for more information about the dataset and sample queries.
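To make the idea of a fully deduplicated Merkle DAG more concrete, the following is a minimal Python sketch of a content-addressed store with three node kinds (file contents, directories, and commit-like revisions). It is an illustration only: the class `MerkleDag`, the helper `node_id`, and the serialization format are hypothetical and are not the identifier scheme or schema actually used by Software Heritage. The point it shows is that a node's identifier is derived from its content and from the identifiers of its children, so identical files and directories are stored once regardless of how many projects contain them.

```python
from __future__ import annotations

import hashlib


def node_id(kind: str, payload: bytes) -> str:
    """Hypothetical identifier: SHA-1 over the node kind and its serialized payload."""
    return hashlib.sha1(kind.encode() + b"\0" + payload).hexdigest()


class MerkleDag:
    """Toy content-addressed store; an assumption for illustration, not the real schema."""

    def __init__(self) -> None:
        # Maps identifier -> (kind, payload); identical nodes share one entry.
        self.nodes: dict[str, tuple[str, bytes]] = {}

    def _put(self, kind: str, payload: bytes) -> str:
        ident = node_id(kind, payload)
        self.nodes.setdefault(ident, (kind, payload))
        return ident

    def add_content(self, data: bytes) -> str:
        return self._put("content", data)

    def add_directory(self, entries: dict[str, str]) -> str:
        # Sort entries by name so the same directory always serializes identically.
        lines = [f"{name} {child}" for name, child in sorted(entries.items())]
        return self._put("directory", "\n".join(lines).encode())

    def add_revision(self, directory: str, parents: list[str], message: str) -> str:
        lines = [f"directory {directory}"]
        lines += [f"parent {p}" for p in parents]
        lines += ["", message]
        return self._put("revision", "\n".join(lines).encode())


dag = MerkleDag()

# Two directories containing the same file collapse to the same nodes.
dir_a = dag.add_directory({"README": dag.add_content(b"hello\n")})
dir_b = dag.add_directory({"README": dag.add_content(b"hello\n")})

# Revisions differ because their parents and messages differ.
rev1 = dag.add_revision(dir_a, [], "initial import")
rev2 = dag.add_revision(dir_b, [rev1], "second commit, same tree")

print(dir_a == dir_b)   # the directory is shared
print(len(dag.nodes))   # one content, one directory, two revisions
```

Because each identifier covers the identifiers of the node's children, changing a single file changes the identifier of every directory and revision above it, while everything untouched keeps its identifier and is reused.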






