CofeXHug: A curated dataset of HuggingFace pre-trained models exploited in the GitHub ecosystem (Q6706844)
Dataset published at Zenodo repository.
Statements
Pre-trained models (PTMs) are becoming increasingly popular in the software engineering community. Their usage is facilitated by model repositories, e.g., HuggingFace (HF), which collect, store, and maintain a wide range of PTMs. However, the actual adoption of these models in real-world projects is still an open question. In particular, many of them are used only in toy projects or simply as a mirror for the HF repository. Thus, we see the need for a curated codebase related to PTMs to support developers and practitioners who are interested in using them in their projects.

This artifact contains CodeXHug, a curated dataset of HuggingFace PTMs exploited in the GitHub ecosystem. Starting from the latest HF dump, we first perform data curation to collect PTMs that have a tag and a model card. We then query the GitHub platform to find actual usages of the identified PTMs, resulting in 7,325 different models and 372,063 Python files. We also present a statistical analysis of the dataset, highlighting the most popular PTMs and the most common tasks for which they are used. Finally, we discuss the research opportunities enabled by CodeXHug and the implications of our findings for the software engineering community.
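The curation criterion described above (keep only PTMs that have a tag and a model card) can be illustrated with a minimal Python sketch. This is not the authors' pipeline: the record layout, the field names `model_id`, `tags`, and `model_card`, and the sample entries are all assumptions made for illustration, since the description does not specify the format of the HF dump.

```python
from typing import Iterable


def curate_models(records: Iterable[dict]) -> list[str]:
    """Return ids of records that have at least one tag and a non-empty model card.

    The record schema (model_id / tags / model_card) is hypothetical.
    """
    kept = []
    for record in records:
        tags = record.get("tags") or []
        card = (record.get("model_card") or "").strip()
        # Keep the model only if both curation conditions hold.
        if tags and card:
            kept.append(record["model_id"])
    return kept


# Hypothetical dump entries, invented for this example.
sample_dump = [
    {"model_id": "org-a/with-both", "tags": ["text-classification"], "model_card": "# Model card"},
    {"model_id": "org-b/no-card", "tags": ["translation"], "model_card": ""},
    {"model_id": "org-c/no-tags", "tags": [], "model_card": "Some description"},
]

curated = curate_models(sample_dump)
print(curated)
```

Under these assumptions, only the first entry survives the filter. The subsequent step of searching GitHub for usages of each retained model is omitted here because the description does not say how the queries were issued.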
3 December 2024