CodeXHug: A curated dataset of HuggingFace pre-trained models exploited in the GitHub ecosystem

DOI: 10.5281/zenodo.14267550
Zenodo: 14267550
MaRDI QID: Q6706844
FDO: Q6706844

Dataset published in the Zenodo repository.

Davide Di Ruscio, Claudio Di Sipio, Stefano Palombo, Juri Di Rocco

Publication date: 3 December 2024

Copyright license: Creative Commons Attribution 4.0 International

Pre-trained models (PTMs) are becoming increasingly popular in the software engineering community. Their usage is facilitated by model repositories, e.g., HuggingFace (HF), which collect, store, and maintain a wide range of PTMs. However, the actual adoption of these models in real-world projects is still an open question. In particular, many of them are used in toy projects or simply as a mirror for the HF repository. Thus, we see the need for a curated codebase related to PTMs to support developers and practitioners who are interested in using them in their projects.

This artifact contains CodeXHug, a curated dataset of HuggingFace PTMs exploited in the GitHub ecosystem. Starting from the latest HF dump, we first curate the data to collect PTMs that have a tag and a model card. We then query the GitHub platform to find actual usages of the identified PTMs, resulting in 7,325 different models and 372,063 Python files. We also present a statistical analysis of the dataset, highlighting the most popular PTMs and the most common tasks for which they are used. Finally, we discuss the research opportunities enabled by CodeXHug and the implications of our findings for the software engineering community.
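
The curation step described in the abstract (keeping only PTMs that have a tag and a model card) could be sketched roughly as follows. This is a minimal illustration and not the authors' pipeline: the `ModelRecord` type, its field names, and the sample records are hypothetical, and the actual layout of the HF dump may differ.

```python
from collections import Counter
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class ModelRecord:
    """Hypothetical shape of one entry in a HuggingFace dump."""
    model_id: str
    tag: Optional[str] = None         # task tag, e.g. "text-classification"
    model_card: Optional[str] = None  # model card text, if any


def curate(records: List[ModelRecord]) -> List[ModelRecord]:
    """Keep only models that have both a tag and a non-empty model card."""
    return [
        r for r in records
        if r.tag and r.model_card and r.model_card.strip()
    ]


def tasks_by_frequency(records: List[ModelRecord]) -> Counter:
    """Count how many curated models fall under each task tag."""
    return Counter(r.tag for r in records)


# Hypothetical sample data, for illustration only.
sample = [
    ModelRecord("org/model-a", "text-classification", "# Model A"),
    ModelRecord("org/model-b", None, "# Model B"),
    ModelRecord("org/model-c", "fill-mask", "   "),
    ModelRecord("org/model-d", "text-classification", "# Model D"),
]

curated = curate(sample)
task_counts = tasks_by_frequency(curated)
```

The identifiers of the curated models would then serve as search terms when looking for usages in GitHub repositories. That querying step is not shown here because the abstract does not specify how it was performed.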