Datasets for OntoClue Project

DOI: 10.5281/zenodo.14801641, Zenodo: 14801641, MaRDI QID: Q6693479, FDO: Q6693479

Dataset published at Zenodo repository.

Lukas Geist, Leyla Jael Castro, Dietrich Rebholz-Schuhmann, Rohitha Ravinder

Publication date: 10 February 2025

Copyright license: Creative Commons Attribution 4.0 International

Description

This release contains the datasets and files associated with the OntoClue project, which investigates various text embedding techniques for assessing document-to-document similarity in biomedical literature. The project primarily uses the RELISH Corpus [1], an expert-curated dataset that includes relevance annotations for document pairs based on their similarity. This release includes datasets for establishing ground truth, as well as retrieved titles and abstracts for all PMIDs in the RELISH database. The files also contain preprocessed tokens for use in text embedding neural network models, as well as annotated tokens based on the MeSH (Medical Subject Headings) [2] vocabulary.

Data Structure and Files

missing_pmids.tsv: List of PMIDs for which titles and abstracts could not be retrieved
relevance_matrix.tsv: Ground truth dataset file derived from the RELISH JSON file, containing 189,634 document pairs in three columns: PMID1 (reference article), PMID2 (assessed article), and relevance (relevance score between the two documents). Consists of 68,479 completely relevant pairs, 65,406 partially relevant pairs and 55,749 irrelevant pairs.
relish_documents.tsv: Retrieved RELISH documents, including PMID, title and abstract (163,189 articles)
relish_bert_input_text.zip: Preprocessed titles and abstracts for use with BERT-based models
relish_preprocessed_normal_tokens.zip: Document text preprocessed for use with all embedding approaches
relish_normal_split_datasets.zip: Preprocessed document text split into training, validation and test datasets
relish_xml_files.zip: RELISH articles retrieved as XML files
relish_annotated_xml_files.zip: Annotated XML files of RELISH articles (163,189 articles)
relish_preprocessed_annotated_tokens.zip: Document text preprocessed for use with all embedding approaches, with annotations
relish_annotated_split_datasets.zip: Preprocessed and annotated document text split into training, validation and test datasets
relish_ground_truth_split_datasets.zip: Ground truth dataset split into training, validation and test datasets

Data Collection

The RELISH dataset v1 was downloaded from the corresponding FigShare record [3] on January 24th, 2022. The dataset, in JSON format, contains PubMed IDs (PMIDs) along with relevance assessments for document pairs. Using the BioC API, we retrieved XML files containing the PMID, title, and abstract for each unique entry in the RELISH JSON file. Any PMIDs that could not be retrieved, or that lacked a title and abstract, were recorded as missing. In total, 163,189 XML files were successfully retrieved. These XML files were also converted into a TSV file with three columns: PMID, title, and abstract. The text from the titles and abstracts was further preprocessed for use in the various embedding approaches.

References

[1] Peter Brown, RELISH Consortium, Yaoqi Zhou. Large expert-curated database for benchmarking document similarity detection in biomedical literature search. Database, Volume 2019, 2019, baz085. https://doi.org/10.1093/database/baz085
[2] Lipscomb C. E. (2000). Medical Subject Headings (MeSH). Bulletin of the Medical Library Association, 88(3), 265–266.
[3] Brown, Peter (2019). RELISH_v1. figshare. Dataset. https://doi.org/10.6084/m9.figshare.7722905.v1
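
As an illustration of how the three-column relevance_matrix.tsv file described under Data Structure and Files might be read, the following minimal Python sketch tallies document pairs per relevance value. It makes several assumptions that the record does not confirm: that the file has a header row, that columns are tab-separated in the order PMID1, PMID2, relevance, and that relevance is encoded as a small set of labels (the values 0, 1 and 2 in the sample are hypothetical). The function name count_relevance and the in-memory sample rows are likewise invented for illustration.

```python
import csv
import io
from collections import Counter


def count_relevance(tsv_file, has_header=True):
    """Count document pairs per relevance label in a three-column TSV.

    Assumes the column order PMID1, PMID2, relevance; whether the real
    file has a header row is an assumption controlled by has_header.
    """
    reader = csv.reader(tsv_file, delimiter="\t")
    if has_header:
        next(reader, None)  # skip the (assumed) header row
    counts = Counter()
    for row in reader:
        if len(row) < 3:
            continue  # ignore blank or malformed lines
        relevance = row[2]
        counts[relevance] += 1
    return counts


# Hypothetical sample rows standing in for relevance_matrix.tsv;
# the PMIDs and the 0/1/2 encoding are made up for this sketch.
sample = io.StringIO(
    "PMID1\tPMID2\trelevance\n"
    "100\t200\t2\n"
    "100\t300\t1\n"
    "100\t400\t0\n"
    "500\t600\t2\n"
)
counts = count_relevance(sample)
total = sum(counts.values())
print(counts, total)
```

To run this against the downloaded file, the sample could be replaced with open("relevance_matrix.tsv", newline="", encoding="utf-8"). According to the description, the per-label counts for the full file should be 68,479, 65,406 and 55,749, which sum to 189,634 pairs.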
