Supporting material for "Impact of gender on the formation and outcome of formal mentoring relationships in the life sciences"




DOI: 10.5281/zenodo.6897394 | Zenodo: 6897394 | MaRDI QID: Q6724410 | FDO: Q6724410

Dataset published at Zenodo repository.

Stephen V. David, Zachary P. Schwartz, Jean F. Liénard

Publication date: 24 July 2022

Copyright license: Creative Commons Attribution 4.0 International



This repository contains data and analysis code associated with the manuscript: Z. P. Schwartz, J. F. Liénard, S. V. David. (2022) Impact of gender on formation and outcome of formal mentoring relationships in the life sciences.

Figures and tables in the manuscript can be produced by running the make_figures.ipynb notebook. Figures are marked with headings indicating their position in the manuscript (Figure 1, Figure S1, etc.). The notebook also contains code to reproduce regression analyses that are cited in the text but not directly associated with a figure.

Data on mentoring relationships derives from Academic Family Tree (AFT, www.academictree.org) and public data sources on funding, publications, and awards. Inclusion criteria, public data sources, and procedures for linking across sources are described in the manuscript.

Personal identifiers for researchers have been anonymized but remain consistent across all data in the repository. In other words, the personal identifier 1 refers to the same person in all dataframes in the repository. However, that person is *not* the same researcher identified as 1 on the public AFT website.

Installation

Requires Python 3.x and Pandas. To load the required libraries using Anaconda, run:

`conda create --name aft -c conda-forge pandas numpy scipy ipython jupyterlab scikit-learn matplotlib statsmodels seaborn pytables`

Dataframes

Data is stored as a series of Pandas dataframes within HDF5 or CSV files:

* cng_tc: The primary dataset used in the analysis. The name is an acronym for connections (i.e., training relationships, cn), gender (g), and trainee count (tc). Each row contains data on the mentor and trainee in one training relationship. See manuscript for inclusion criteria.
* mentors: Data on mentors. Each row contains data on one mentor. See manuscript for inclusion criteria.
* mentors_grants, mentors_hindex, mentors_locs_ranked: Subsets of mentors with data available for funding (mentors_grants), citation (mentors_hindex), and institution rank (mentors_locs_ranked).
* mentors_nobel, mentors_hhmi, mentors_nas: Subsets of mentors that received a Nobel Prize (mentors_nobel), Howard Hughes Medical Institute grants (mentors_hhmi), or membership in the National Academy of Sciences (mentors_nas). See manuscript for details of data sources and linking procedures.
* cn, cng, first_names, gn, gn_all, locs: Partial data (connections only, inferred gender only, connections and gender only, location only, first names and inferred gender only) for more inclusive sets of researchers in AFT. They are generally not used for analysis, but are included here to calculate statistics on the total amount of data and to screen for data from U.S. locations.
* nsf_gender_phds, nsf_gender_pds: National Science Foundation survey data on gender and fraction of PhDs conferred per year (nsf_gender_phds) or fraction of postdocs employed per year (nsf_gender_pds). See manuscript for details of data source.
* photo: Data for validation of the gender inference method.
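As a minimal sketch of how one of these dataframes might be loaded: file names and HDF5 keys are not stated here, so the paths and key in the usage comments are hypothetical placeholders. `load_frame` is an illustrative helper, not part of the repository. It dispatches on the file extension and relies on `pandas.read_csv` and `pandas.read_hdf` (the latter requires the pytables package from the install command).

```python
from pathlib import Path

import pandas as pd


def load_frame(path, key=None):
    """Load one dataframe from a CSV or HDF5 file (illustrative helper)."""
    path = Path(path)
    suffix = path.suffix.lower()
    if suffix == ".csv":
        return pd.read_csv(path)
    if suffix in {".h5", ".hdf", ".hdf5"}:
        # read_hdf needs pytables; key selects one stored dataframe
        return pd.read_hdf(path, key=key)
    raise ValueError(f"Unsupported file type: {path.suffix}")


# Hypothetical usage; substitute the file names and keys found in the repository:
# cng_tc = load_frame("cng_tc.csv")
# mentors = load_frame("aft_data.h5", key="mentors")
```

If an HDF5 file holds several dataframes, the `key` argument is needed to pick one; for a file with a single stored object it can be left as `None`.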
Dataframe columns

* amount: Mentor's total funding
* amount_adj: Mentor's total funding (adjusted to 2020 dollars)
* broad_field: Mentor's general research area (e.g., life sciences, engineering; based on National Science Foundation classifications)
* continue: Whether trainee went on to become a mentor (i.e., has trainees listed in AFT)
* country: Country in which mentor's current institution is located
* firstname: First name of researcher (table of first names is not aligned with tables containing anonymized personal identifiers)
* first_grant_year: Year of mentor's first grant
* funding_rate: Mentor's annual funding rate (since first grant)
* funding_rate_adj: Mentor's annual funding rate (since first grant), adjusted to 2020 dollars
* hhmi: Whether mentor was granted HHMI funding
* hindex: Mentor's h-index
* location: Name of mentor's current institution
* locid: Identifier for mentor's institution
* locid_rank: Position of mentor's institution in 2015 Quacquarelli-Symonds rankings (lower numbers are better)
* locid_rank_rev: Reversed version of locid_rank (i.e., higher numbers are better)
* majorarea: Mentor's specific research area (e.g., neuroscience)
* male_mentor, male_trainee: Whether the probability that a researcher's first name is used by a person identifying as a man meets threshold (see manuscript for details on gender inference using first names)
* match_score: Score for string match between institution or name of awardee and researcher
* mentor_career_start: The date at which the mentor's academic career began
* mentor_continue_rate: Fraction of mentor's trainees that become mentors
* mentor_continue_rate_ft: Fraction of mentor's women trainees that become mentors
* mentor_continue_rate_mt: Fraction of mentor's men trainees that become mentors
* mentor_t_p_male0: Fraction of mentor's trainees that are men
* mentor_t_p_male0_gs: Fraction of mentor's trainees that are men (graduate students only)
* mentor_t_p_male0_pd: Fraction of mentor's trainees that are men (postdocs only)
* mentor_tcount0: Mentor's total number of trainees
* nas: Whether mentor is a member of the National Academy of Sciences
* nobel: Whether mentor is a Nobel laureate
* p_male_mentor, p_male_trainee: Probability that a researcher's first name is used by a person identifying as a man
* pid: Anonymized identifier of researcher
* pid_mentor: Anonymized identifier of mentor in training relationship
* pid_trainee: Anonymized identifier of trainee in training relationship
* pq: 1 if data on training relationship is drawn from the ProQuest database and has not been manually edited by a human AFT user
* relation: Type of training relationship (1: graduate student, 2: postdoc)
* scorer1, scorer2, scorer3: Results of photo validation of gender inference for each scorer
* start: Training start year
* stop: Training end year
* trainee_tcount: Total number of people that the trainee has trained
* triad: Whether trainee has participated in both a graduate-level and a postdoctoral training relationship

The cn dataframe follows slightly different naming conventions, but is not generally used in the analysis (pid1 = pid_trainee, pid2 = pid_mentor, startdate = start, stopdate = stop).
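The cn naming conventions and two of the per-mentor columns can be illustrated on a small synthetic table. The sketch below renames cn-style columns to the convention used by the other dataframes, then recomputes mentor_tcount0 and mentor_continue_rate with a groupby. The rows and the `continue` values are invented for illustration. The repository's own computation may differ (for example, through the manuscript's inclusion criteria), so the aggregation here is an assumption based only on the column descriptions.

```python
import pandas as pd

# Mapping from cn column names to the convention used by the other dataframes
CN_RENAME = {
    "pid1": "pid_trainee",
    "pid2": "pid_mentor",
    "startdate": "start",
    "stopdate": "stop",
}

# Synthetic rows, invented for illustration only (not real AFT data)
cn = pd.DataFrame({
    "pid1": [10, 11, 12, 13],
    "pid2": [1, 1, 2, 2],
    "startdate": [2001, 2003, 2005, 2006],
    "stopdate": [2006, 2008, 2010, 2009],
})
connections = cn.rename(columns=CN_RENAME)

# "continue" is a Python keyword, so use bracket access rather than attribute access
connections["continue"] = [True, False, True, True]

# One row per mentor: number of trainees and fraction that became mentors
per_mentor = connections.groupby("pid_mentor").agg(
    mentor_tcount0=("pid_trainee", "count"),
    mentor_continue_rate=("continue", "mean"),
)
print(per_mentor)
```

Because `continue` is a reserved word in Python, expressions such as `df.continue` raise a syntax error; `df["continue"]` works for any of the dataframes that carry this column.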






