Worldwide Gender Differences in Public Code Contributions - Replication Package




DOI: 10.5281/zenodo.6020475 | Zenodo: 6020475 | MaRDI QID: Q6716126 | FDO: Q6716126

Dataset published at Zenodo repository.

Davide Rossi, Stefano Zacchiroli

Publication date: 9 February 2022

Copyright license: Creative Commons Attribution 4.0 International



This document describes how to replicate the findings of the paper:

Davide Rossi and Stefano Zacchiroli, 2022, Worldwide Gender Differences in Public Code Contributions. In Software Engineering in Society (ICSE-SEIS'22), May 21-29, 2022, Pittsburgh, PA, USA. ACM, New York, NY, USA, 12 pages. https://doi.org/10.1145/3510458.3513011

This document comes with the software needed to mine and analyze the data presented in the paper.

Prerequisites

These instructions assume the use of the bash shell, the Python programming language, the PostgreSQL DBMS (version 11 or later), the zstd compression utility and various usual *nix shell utilities (cat, pv, ...), all of which are available for multiple architectures and OSs.

It is advisable to create a Python virtual environment and install the following PyPI packages:

    click==8.0.3
    cycler==0.10.0
    gender-guesser==0.4.0
    kiwisolver==1.3.2
    matplotlib==3.4.3
    numpy==1.21.3
    pandas==1.3.4
    patsy==0.5.2
    Pillow==8.4.0
    pyparsing==2.4.7
    python-dateutil==2.8.2
    pytz==2021.3
    scipy==1.7.1
    six==1.16.0
    statsmodels==0.13.0

Initial data

- swh-replica: a PostgreSQL database containing a copy of Software Heritage data. The schema for the database is available at https://forge.softwareheritage.org/source/swh-storage/browse/master/swh/storage/sql/.

We retrieved these data from Software Heritage, in collaboration with the archive operators, taking an archive snapshot as of 2021-07-07. We cannot make these data available in full as part of the replication package, both because of their volume and because they contain personal information such as user email addresses. However, equivalent data (stripped of email addresses) can be obtained from the Software Heritage archive dataset, as documented in the article:

Antoine Pietri, Diomidis Spinellis, Stefano Zacchiroli, The Software Heritage Graph Dataset: Public software development under one roof.
In proceedings of MSR 2019: The 16th International Conference on Mining Software Repositories, May 2019, Montreal, Canada. Pages 138-142, IEEE 2019. http://dx.doi.org/10.1109/MSR.2019.00030

Once retrieved, the data can be loaded into PostgreSQL to populate swh-replica.

- names.tab: forenames and surnames per country, with their frequency
- zones.acc.tab: countries/territories, timezones, population and world zones
- c_c.tab: matches between ccTLD entities and world zones

Data preparation

Export data from the swh-replica database to create commits.csv.zst and authors.csv.zst:

    ./export.sh

Run the authors cleanup script to create authors--clean.csv.zst:

    ./cleanup.sh authors.csv.zst

Filter out implausible names and create authors--plausible.csv.zst:

    pv authors--clean.csv.zst | unzstd | ./filter_names.py 2> authors--plausible.csv.log | zstdmt > authors--plausible.csv.zst

Gender detection

Run the gender guessing script to create author-fullnames-gender.csv.zst:

    pv authors--plausible.csv.zst | unzstd | ./guess_gender.py --fullname --field 2 | zstdmt > author-fullnames-gender.csv.zst

Database creation and data ingestion

Create the PostgreSQL DB:

    createdb gender-commit

Note that from now on, commands prefixed with the psql> prompt are assumed to be executed in psql on the gender-commit database.

Import data into the PostgreSQL DB:

    ./import_data.sh

Zone detection

Extract commit data from the DB and create commits.tab, which is used as input for the world zone detection script:

    psql -f extract_commits.sql gender-commit

Run the world zone detection script to create commit_zones.tab.zst:

    pv commits.tab | ./assign_world_zone.py -a -n names.tab -p zones.acc.tab -x -w 8 | zstdmt > commit_zones.tab.zst

Use ./assign_world_zone.py --help if you are interested in changing the script parameters.
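Before loading the zone assignments into the database, it can be useful to check how many commits actually received a zone. The following is a minimal Python sketch, not part of the replication package. It assumes that the zone sits in the sixth tab-separated column of commit_zones.tab (as the `cut -f1,6` in the import step suggests) and that an empty sixth column means no zone was assigned. The function name `tally_zones` and the sample rows are hypothetical.

```python
import io
from collections import Counter


def tally_zones(lines, zone_field=6):
    """Count rows per world zone in tab-separated input.

    zone_field is 1-based, like `cut -f`. Rows whose zone field is
    missing or blank are counted under the key None.
    ASSUMPTION: the zone is stored in column 6 of commit_zones.tab.
    """
    counts = Counter()
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) >= zone_field:
            zone = fields[zone_field - 1].strip()
        else:
            zone = ""
        counts[zone or None] += 1
    return counts


# Made-up rows for illustration only; the zone labels are placeholders.
sample = io.StringIO(
    "c1\ta\tb\tc\td\tZoneA\n"
    "c2\ta\tb\tc\td\t\n"
    "c3\ta\tb\tc\td\tZoneB\n"
    "c4\ta\tb\tc\td\tZoneA\n"
)
counts = tally_zones(sample)
for zone, n in counts.most_common():
    print(zone, n)
```

To run the same tally on real data, one could pass `sys.stdin` instead of the sample and pipe in the output of `zstdcat commit_zones.tab.zst`.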
Read zone assignment data from the file into the DB:

    psql> \copy commit_culture from program 'zstdcat commit_zones.tab.zst | cut -f1,6 | grep -Ev ''\s$'''

Extraction and graphs

Run the script that executes the queries extracting the data to plot from the DB. This creates commits_tz.tab, authors_tz.tab, commits_zones.tab, authors_zones.tab, and authors_zones_1620.tab. Edit extract_data.sql if you wish to modify extraction parameters (start/end year, sampling, ...).

    ./extract_data.sh

Run the script to create the graphs from all the previously extracted tab files. This will generate commits_tzs.pdf, authors_tzs.pdf, commits_zones.pdf, authors_zones.pdf, and authors_zones_1620.pdf.

    ./create_charts.sh

Additional graphs

This package also includes some ready-made graphs:

- authors_zones_1.pdf: stacked graphs showing the ratio of female authors per world zone through the years, considering all authors with at least one commit per period
- authors_zones_2.pdf: ditto with at least two commits per period
- authors_zones_10.pdf: ditto with at least ten commits per period
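The three additional graphs differ only in the minimum number of commits an author needs in a period to be counted. The sketch below illustrates that idea in Python. It is not the code shipped in the package: the record layout, the gender labels, the zone names and the function name `female_author_ratio` are all hypothetical, and it assumes each author appears at most once per (zone, year) and that only authors with an assigned gender are present.

```python
from collections import defaultdict


def female_author_ratio(records, min_commits=1):
    """Return {(zone, year): share of female authors}.

    records: iterable of (author_id, gender, zone, year, n_commits).
    Authors with fewer than min_commits commits in the period are skipped.
    ASSUMPTION: this record layout is illustrative, not the package's schema.
    """
    totals = defaultdict(int)
    females = defaultdict(int)
    for author_id, gender, zone, year, n_commits in records:
        if n_commits < min_commits:
            continue
        key = (zone, year)
        totals[key] += 1
        if gender == "female":
            females[key] += 1
    return {key: females[key] / totals[key] for key in totals}


# Made-up records for illustration only.
records = [
    ("a1", "female", "ZoneA", 2019, 12),
    ("a2", "male", "ZoneA", 2019, 1),
    ("a3", "male", "ZoneA", 2019, 3),
    ("a4", "female", "ZoneB", 2019, 2),
]

ratio_1 = female_author_ratio(records, min_commits=1)
ratio_2 = female_author_ratio(records, min_commits=2)
ratio_10 = female_author_ratio(records, min_commits=10)
print(ratio_1, ratio_2, ratio_10, sep="\n")
```

Raising the threshold removes occasional contributors from both the numerator and the denominator, so the ratio can move in either direction, and a zone can drop out entirely for a period when no author reaches the threshold.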






