CofeXHug: A curated dataset of HuggingFace pre-trained models exploited in the GitHub ecosystem (Q6706844)


Dataset published at Zenodo repository.

Language	Label	Description	Also known as
default for all languages	No label defined
English	CofeXHug: A curated dataset of HuggingFace pre-trained models exploited in the GitHub ecosystem	Dataset published at Zenodo repository.

Statements

instance of

data set

0 references

description

Pre-trained models (PTMs) are becoming increasingly popular in the software engineering community. Their use is facilitated by model repositories such as HuggingFace (HF), which collect, store, and maintain a wide range of PTMs. However, the extent to which these models are actually adopted in real-world projects is still an open question. In particular, many of them are used only in toy projects or simply as a mirror of the HF repository. Thus, we see the need for a curated codebase related to PTMs to support developers and practitioners who are interested in using them in their projects. This artifact contains CodeXHug, a curated dataset of HuggingFace PTMs exploited in the GitHub ecosystem. Starting from the latest HF dump, we first curated the data to retain only PTMs that have a tag and a model card. We then queried GitHub to find actual usages of the identified PTMs, resulting in 7,325 different models and 372,063 Python files. We also present a statistical analysis of the dataset, highlighting the most popular PTMs and the most common tasks for which they are used. Finally, we discuss the research opportunities enabled by CodeXHug and the implications of our findings for the software engineering community.

0 references

publication date

3 December 2024

0 references
