The COUGHVID crowdsourcing dataset: A corpus for the study of large-scale cough analysis algorithms

DOI: 10.5281/zenodo.7024894, Zenodo: 7024894, MaRDI QID: Q6716661, FDO: Q6716661

Dataset published at Zenodo repository.

David Atienza, Lara Orlandic, Tomas Teijeiro

Publication date: 3 February 2021

Copyright license: Creative Commons Attribution 4.0 International

Overview

Cough audio signal classification has been successfully used to diagnose a variety of respiratory conditions, and there has been significant interest in leveraging Machine Learning (ML) to provide widespread COVID-19 screening. The COUGHVID dataset provides over 30,000 crowdsourced cough recordings representing a wide range of subject ages, genders, geographic locations, and COVID-19 statuses. Furthermore, experienced pulmonologists labeled more than 2,000 recordings to diagnose medical abnormalities present in the coughs, thereby contributing one of the largest expert-labeled cough datasets in existence, which can be used for a wide range of cough audio classification tasks. As a result, the COUGHVID dataset contributes a wealth of cough recordings for training ML models to address the world's most urgent health crises.

Private Set and Testing Protocol

Researchers interested in testing their models on the private test dataset should contact us at coughvid@epfl.ch, briefly explaining the type of validation they wish to perform and the results they obtained through cross-validation on the public data. Access to the unlabeled recordings will then be provided, and the researchers should send back the predictions of their models on these recordings. Finally, the performance metrics of the predictions will be sent to the researchers. The private testing data is not included in any file within our Zenodo record, and it can only be accessed by contacting the COUGHVID team at the aforementioned e-mail address.

New Semi-Supervised Labeling

The third version of the COUGHVID dataset contains thousands of additional recordings obtained through October 2021. Additionally, the recordings containing coughs were re-labeled according to a semi-supervised learning algorithm that combined the user labels with those of the expert physicians, which were modeled using ML and extended to the previously unlabeled data.
These labels can be found in the status_SSL column of the metadata_compiled.csv file.
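
As a rough illustration of how the semi-supervised labels might be inspected, the sketch below tallies the values in the status_SSL column using only the Python standard library. Only the file name metadata_compiled.csv and the column name status_SSL come from the description above. The function name `count_ssl_labels`, the `uuid` column, the sample rows, the label values, and the "unlabeled" bucket for empty cells are assumptions made for the example, not a description of the actual file contents.

```python
import csv
import io
from collections import Counter


def count_ssl_labels(csv_file):
    """Count how often each value appears in the status_SSL column.

    csv_file is any open text file (or file-like object) holding the
    metadata table. Rows with an empty status_SSL cell are counted
    under the key "unlabeled" (a convention chosen for this sketch).
    """
    reader = csv.DictReader(csv_file)
    if reader.fieldnames is None or "status_SSL" not in reader.fieldnames:
        raise KeyError("no status_SSL column found in the metadata file")
    counts = Counter()
    for row in reader:
        label = (row["status_SSL"] or "").strip()
        counts[label or "unlabeled"] += 1
    return counts


# Tiny in-memory stand-in for metadata_compiled.csv.
# The uuid column and all row values here are hypothetical.
SAMPLE = """uuid,status_SSL
rec-001,healthy
rec-002,COVID-19
rec-003,
rec-004,healthy
"""

sample_counts = count_ssl_labels(io.StringIO(SAMPLE))
print(sample_counts)
```

With the real file, the same function could be called as `count_ssl_labels(open("metadata_compiled.csv", newline="", encoding="utf-8"))`, assuming the file has been downloaded from the Zenodo record into the working directory.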
