Replication Package for 'How do Machine Learning Models Change?'
DOI: 10.5281/zenodo.14160172 · Zenodo: 14160172 · MaRDI QID: Q6702290 · FDO: Q6702290
Dataset published in the Zenodo repository.
Rafael Cabañas, Silverio Martínez-Fernández, Joel Castaño Fernández, David Lo, Antonio Salmerón
Publication date: 14 November 2024
Copyright license: Creative Commons Attribution 4.0 International
## Overview

This replication package accompanies the paper "How Do Machine Learning Models Change?". In this study, we conducted a comprehensive analysis of over 200,000 commits and 1,200 releases across more than 50,000 models on the Hugging Face (HF) platform. Our goal was to understand how machine learning (ML) models evolve over time by classifying commit types based on an extended ML change taxonomy and by analyzing patterns in commit and release activities using Bayesian networks.

Our research addresses three main aspects:

- **Categorization of Commit Changes:** We classified over 200,000 commits on HF using an extended ML change taxonomy, providing a detailed breakdown of change types and their distribution across models.
- **Analysis of Commit Sequences:** We examined the sequence and dependencies of commit types using Bayesian networks to identify temporal patterns and common progression paths in model changes.
- **Release Analysis:** We investigated the distribution and evolution of release types, analyzing how model attributes and metadata change across successive releases.

This replication package contains all the necessary code, datasets, and documentation to reproduce the results presented in the paper.

## Data Collection and Preprocessing

### Data Collection

We collected data from the Hugging Face platform using the Hugging Face Hub API and the `HfApi` class. The data extraction was performed on November 6th, 2023. The collected data includes:

- **Model Information:** Details of over 380,000 models, including dataset sizes, training hardware, evaluation metrics, model file sizes, number of downloads and likes, tags, and the raw text of model cards.
- **Commit Histories:** Comprehensive commit details, including commit messages, dates, authors, and the list of files edited in each commit.
- **Release Information:** Information on model releases, which are marked by tags in their repositories.

To enrich the commit data with detailed file change information, we integrated the PyDriller framework within the HFCommunity dataset.

### Data Preprocessing

**Commit Diffs.** We computed the differences between commits for key files, specifically JSON configuration files (e.g., `config.json`). For each commit that modifies these files, we compared the changes with the previous commit affecting the same file to identify added, deleted, and updated keys.

**Commit Classification.** We classified each commit according to Bhatia et al.'s ML change taxonomy using the Gemini 1.5 Flash Large Language Model (LLM). This classification, which applies Bhatia et al.'s taxonomy to a large-scale ML repository using LLMs, is one of the main contributions of our paper. We ensured the correctness of the classification by reaching a Cohen's kappa coefficient of 0.9 through iterative validation. In addition, we classified commits into Swanson's categories using a simpler neural network approach, following methods from prior work. This classification plays a smaller role than the detailed classification based on Bhatia et al.'s taxonomy.

**Model Metadata.** We extracted detailed metadata from the model files of selected releases, focusing on attributes such as the number of parameters and tensor shapes. We also calculated the differences between the metadata of successive releases.

Illustrative sketches of these collection, preprocessing, and analysis steps are given below; the exact implementations are in the notebooks under `code/`.
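As a minimal sketch of the collection step, the snippet below uses the public `huggingface_hub` `HfApi` class to list models and retrieve their commit history and tags. The methods shown (`list_models`, `list_repo_commits`, `list_repo_refs`) are part of the public API, but the selected fields and the `limit` value are illustrative assumptions rather than the exact configuration used for the datasets in this package.

```python
# Sketch of the collection step via the Hugging Face Hub API.
# Field selection and the limit below are illustrative, not the exact setup used.
from huggingface_hub import HfApi

api = HfApi()

# Iterate over models on the Hub (full=True includes downloads, likes, tags, ...).
for model in api.list_models(full=True, limit=100):
    print(model.id, model.downloads, model.likes)

    # Commit history of the model repository: id, date, title/message, authors.
    for commit in api.list_repo_commits(model.id):
        print(" ", commit.commit_id[:8], commit.created_at, commit.title)

    # Releases are marked by git tags in the repository.
    refs = api.list_repo_refs(model.id)
    for tag in refs.tags:
        print("  tag:", tag.name, tag.target_commit)
```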
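The commit-diff computation described above boils down to a key-level comparison of two versions of a JSON configuration file. A minimal sketch of such a comparison follows; the function name `diff_json_config` and the return format are hypothetical choices for illustration, not the exact implementation in `HFCommitsPreprocessing.ipynb`.

```python
import json

def diff_json_config(old_text: str, new_text: str) -> dict:
    """Compare two versions of a JSON config file (e.g. config.json) and
    report added, deleted, and updated top-level keys."""
    old, new = json.loads(old_text), json.loads(new_text)
    added   = {k: new[k] for k in new.keys() - old.keys()}
    deleted = {k: old[k] for k in old.keys() - new.keys()}
    updated = {k: (old[k], new[k]) for k in old.keys() & new.keys() if old[k] != new[k]}
    return {"added": added, "deleted": deleted, "updated": updated}

# Example: a commit that changes the hidden size and adds a new field.
before = '{"hidden_size": 768, "num_attention_heads": 12}'
after  = '{"hidden_size": 1024, "num_attention_heads": 12, "torch_dtype": "float16"}'
print(diff_json_config(before, after))
# {'added': {'torch_dtype': 'float16'}, 'deleted': {},
#  'updated': {'hidden_size': (768, 1024)}}
```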
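For the taxonomy-based commit classification, the sketch below shows how a commit could be labeled with Gemini 1.5 Flash through the `google-generativeai` client. The prompt wording and the shortened category list are illustrative assumptions; the full prompt encoding Bhatia et al.'s taxonomy is in `HFCommitsPreprocessing.ipynb`.

```python
# Sketch of LLM-based commit classification, assuming the google-generativeai
# client and an API key in the GOOGLE_API_KEY environment variable.
# The prompt and the reduced category list are illustrative only.
import os
import google.generativeai as genai

genai.configure(api_key=os.environ["GOOGLE_API_KEY"])
model = genai.GenerativeModel("gemini-1.5-flash")

def classify_commit(message: str, files: list[str]) -> str:
    prompt = (
        "Classify the following Hugging Face commit into one category of an "
        "ML change taxonomy (e.g. training data, model structure, parameters, "
        "documentation, other). Answer with the category name only.\n"
        f"Commit message: {message}\n"
        f"Files changed: {', '.join(files)}"
    )
    response = model.generate_content(prompt)
    return response.text.strip()

print(classify_commit("Update config.json", ["config.json"]))
```

Agreement between such LLM labels and a manually annotated sample can then be measured with `sklearn.metrics.cohen_kappa_score`, which is one way to obtain a kappa value like the 0.9 reported above.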
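For the release metadata extraction, attributes such as tensor shapes and parameter counts can be read directly from the header of a `.safetensors` file, which starts with an 8-byte little-endian length followed by a JSON header. The sketch below illustrates this approach; it is one possible way to obtain such metadata and not necessarily the exact procedure used in the notebooks.

```python
import json
import struct
from math import prod

def safetensors_metadata(path: str) -> dict:
    """Read tensor shapes and a total parameter count from the JSON header
    of a .safetensors file, without loading any tensor data."""
    with open(path, "rb") as f:
        header_len = struct.unpack("<Q", f.read(8))[0]   # 8-byte LE header size
        header = json.loads(f.read(header_len))
    shapes = {name: entry["shape"]
              for name, entry in header.items() if name != "__metadata__"}
    n_params = sum(prod(shape) for shape in shapes.values())
    return {"num_parameters": n_params, "tensor_shapes": shapes}

# Comparing two successive releases then reduces to diffing these dictionaries.
```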
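Finally, the analysis of commit sequences and dependencies relies on Bayesian networks. The sketch below shows how such a network could be learned from a table of per-commit variables using the `pgmpy` library; the library choice, column names, and scoring method are assumptions for illustration (class names also differ across pgmpy versions), and may not match the actual analysis in `RQ2_Analysis.ipynb`.

```python
# Sketch of learning a Bayesian network over commit-level variables with pgmpy.
# Library, column names, and scoring method are illustrative assumptions.
import pandas as pd
from pgmpy.estimators import HillClimbSearch, BicScore, BayesianEstimator
from pgmpy.models import BayesianNetwork

# Hypothetical discretized commit-level variables.
data = pd.DataFrame({
    "change_type":      ["data", "model", "doc", "model", "doc", "data"],
    "prev_change_type": ["none", "data", "model", "doc", "model", "doc"],
    "commit_order":     ["early", "early", "mid", "mid", "late", "late"],
})

# Structure learning with hill climbing and a BIC score.
dag = HillClimbSearch(data).estimate(scoring_method=BicScore(data))

# Parameter learning on the learned structure.
bn = BayesianNetwork(dag.edges())
bn.add_nodes_from(data.columns)
bn.fit(data, estimator=BayesianEstimator, prior_type="BDeu")
print(bn.edges())
```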
## Folder Structure

The replication package is organized as follows:

- `code/`: Contains the Jupyter notebooks with the data extraction, preprocessing, analysis, and model training scripts.
  - `Collection/`: Contains two Jupyter notebooks for data collection:
    - `HFTotalExtraction.ipynb`: Collects data on the entire Hugging Face platform.
    - `HFReleasesExtraction.ipynb`: Collects data on models that contain releases.
  - `Preprocessing/`: Contains the preprocessing notebooks:
    - `HFTotalPreprocessing.ipynb`: Preprocesses the dataset obtained from `HFTotalExtraction.ipynb`.
    - `HFCommitsPreprocessing.ipynb`: Processes commit data, including:
      - Retrieval of diff information between commits.
      - Classification of commits following Bhatia et al.'s taxonomy using LLMs.
      - Extension and adaptation of the final commits dataset, including additional variables for the Bayesian network analysis.
    - `HFReleasesPreprocessing.ipynb`: Processes release data, including classification and preparation for analysis.
  - `Analysis/`: Contains three Jupyter notebooks, one per research question:
    - `RQ1_Analysis.ipynb`: Analysis for RQ1.
    - `RQ2_Analysis.ipynb`: Analysis for RQ2.
    - `RQ3_Analysis.ipynb`: Analysis for RQ3.
- `datasets/`: Contains the raw, processed, and manually curated datasets used for the analysis.
  - Main datasets:
    - `HFCommits_50K_RANDOM.csv`: Commits of 50,000 randomly sampled models from HF, with the classification based on Bhatia et al.'s taxonomy.
    - `HFCommits_MultipleCommits.csv`: Commits of 10,000 models with at least 10 commits, used for analyzing commit sequences.
    - `HFReleases.csv`: Over 1,200 releases from 127 models, classified using Bhatia et al.'s taxonomy.
    - `model_metadata_with_diff.csv`: Metadata of releases from 27 models, including differences between successive releases.
  - These datasets correspond to the following splits:
    - Over 200,000 commits from 50,000 models: Used for RQ1. Provides a broad overview of commit types and patterns across diverse models.
    - Over 200,000 commits from 10,000 models: Used for RQ2. Focuses on models with at least 10 commits for a detailed evolutionary study.
    - Over 1,200 releases from 127 models: Used for RQ3.1, RQ3.2, and RQ3.3. Facilitates the investigation of release patterns and their evolution.
    - Metadata of 173 releases from 27 models: Used for RQ3.4. Supports the analysis of how model parameters and configurations evolve.
  - Additional datasets:
    - `HF_Total_Raw.csv`: A snapshot of the entire Hugging Face platform with over 380,000 models, as obtained from `HFTotalExtraction.ipynb`.
    - `HF_Total_Preprocessed.csv`: The preprocessed version of the entire HF dataset, as obtained from `HFTotalPreprocessing.ipynb`. This dataset is needed for the commits preprocessing.
  - Auxiliary datasets generated during processing are also included, so that specific parts of the code can be reproduced without the time-consuming steps.
- `metadata/`: Contains the `tags_metadata.yaml` file used during preprocessing.
- `models/`: Contains the model trained to classify commit messages into corrective, perfective, and adaptive types based on Swanson's traditional software maintenance categories.
- `requirements.txt`: Lists the required Python packages to set up the environment and run the code.

## Setup and Execution

### Prerequisites

- Python 3.10.11 or later.
- Jupyter Notebook or JupyterLab.

### Installation

1. Download and extract the replication package.
2. Create a virtual environment (recommended):

   ```bash
   python -m venv venv
   source venv/bin/activate  # On Windows, use venv\Scripts\activate
   ```

3. Install the required packages:

   ```bash
   pip install -r requirements.txt
   ```

## Notes

- **LLM Usage:** The classification of commits using the Gemini 1.5 Flash LLM requires access to the model.
  Ensure you have the necessary permissions and API keys to use it.
- **Computational Resources:** Processing the large datasets and running the Bayesian network analyses may require significant computational resources. We recommend a machine with ample memory and processing power.
- **Reproducing Results:** The included auxiliary datasets can be used to reproduce specific parts of the code without re-running the entire data collection and preprocessing pipeline.

## Additional Information

Contact: If you have any questions or encounter issues, please contact the authors at joel.castano@upc.edu.

This README provides detailed instructions and information to reproduce and understand the analyses performed in the paper. If you find this package useful, please cite our work.