Ontology Enrichment from Texts (OET): A Biomedical Dataset for Concept Discovery and Placement

DOI: 10.5281/zenodo.10432003 | Zenodo: 10432003 | MaRDI QID: Q6718563 | FDO: Q6718563

Dataset published in the Zenodo repository.

Ian Horrocks, Yuan He, Jiaoyan Chen, Hang Dong

Publication date: 26 December 2023

Copyright license: Creative Commons Attribution 4.0 International

A biomedical dataset supporting ontology enrichment from texts through concept discovery and placement. It adapts the MedMentions dataset (PubMed abstracts) using the 2014 and 2017 versions of SNOMED CT, under the Diseases (disorder) sub-category and the broader categories of Clinical finding, Procedure, and Pharmaceutical / biologic product (CPP).

The dataset is documented in the work, Ontology Enrichment from Texts: A Biomedical Dataset for Concept Discovery and Placement, on arXiv: https://arxiv.org/abs/2306.14704 (CIKM 2023). The companion code is available at https://github.com/KRR-Oxford/OET. Out-of-KB mention discovery (including the settings of the mention-level data) is further partly documented in the work, Reveal the Unknown: Out-of-Knowledge-Base Mention Discovery with Entity Linking, on arXiv: https://arxiv.org/abs/2302.07189 (CIKM 2023).

ver4: we made separate versions of the mention-level data for out-of-KB discovery and for concept placement: the former (for out-of-KB discovery) has out-of-KB mentions in the training data, while the latter (for concept placement) has out-of-KB mentions only in the evaluation data (validation and test), not in the training data. We also split the original "test-NIL.jsonl" (now "test-NIL-all.jsonl") into "valid-NIL.jsonl" and "test-NIL.jsonl" for better evaluation.

ver3: we revised and updated the mention-level data (syn_full, the synonym augmentation setting) and the folder structure, and also updated the edge catalogues with complex edges.

ver2: we revised the mention-level data by keeping only out-of-KB mentions (or "NIL" mentions) associated with one-hop edges (including leaf nodes, as leaf node, NULL) and two-hop edges in the ontology (SNOMED CT 20140901).
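The mention-level files named in the ver4 note carry a .jsonl extension, which suggests JSON Lines (one JSON object per line). The following is a minimal sketch for inspecting such a file under that assumption. The file path is hypothetical and should be adjusted to wherever the Zenodo archive was unpacked, and no field names are assumed: the script only reports which top-level keys occur.

```python
import json
from collections import Counter
from pathlib import Path


def read_jsonl(path):
    """Yield one parsed JSON object per non-empty line of a JSON Lines file."""
    with Path(path).open(encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:
                yield json.loads(line)


def summarise_keys(records):
    """Count how often each top-level key occurs across the records."""
    counts = Counter()
    for record in records:
        if isinstance(record, dict):
            counts.update(record.keys())
    return counts


if __name__ == "__main__":
    # Hypothetical location; the actual folder layout is not described here.
    path = Path("test-NIL.jsonl")
    if path.exists():
        records = list(read_jsonl(path))
        print(f"{len(records)} records")
        print(summarise_keys(records).most_common())
```

Listing the keys first is a cautious way to learn the record layout before writing any task-specific processing for out-of-KB discovery or concept placement.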
Acknowledgement of data sources and tools:
* SNOMED CT https://www.nlm.nih.gov/healthit/snomedct/archive.html (using snomed-owl-toolkit to form .owl files)
* UMLS https://www.nlm.nih.gov/research/umls/licensedcontent/umlsarchives04.html (mainly using MRCONSO for mapping UMLS to SNOMED CT)
* MedMentions https://github.com/chanzuckerberg/MedMentions (source of entity linking)
* Protégé http://protegeproject.github.io/protege/
* snomed-owl-toolkit https://github.com/IHTSDO/snomed-owl-toolkit
* DeepOnto https://github.com/KRR-Oxford/DeepOnto (based on OWLAPI https://owlapi.sourceforge.net/) for ontology processing and complex concept verbalisation
