External References of English Wikipedia (ref-wiki-en)

Dataset:6724711



DOI: 10.5281/zenodo.4001139; Zenodo: 4001139; MaRDI QID: Q6724711; FDO: Q6724711

Dataset published at Zenodo repository.

Aidan Hogan, Paolo Curotto

Publication date: 26 August 2020

Copyright license: Creative Commons Attribution 4.0 International



External References of English Wikipedia (ref-wiki-en) is a corpus of the plain-text content of 2,475,461 external webpages linked from the reference sections of articles in English Wikipedia. Specifically:

32,329,989 external reference URLs were extracted from a 2018 HTML dump of English Wikipedia.
Removing repeated and ill-formed URLs yielded 23,036,318 unique URLs.
These URLs were filtered to remove those whose file extensions indicate unsupported formats (videos, audio, etc.), yielding 17,781,974 downloadable URLs.
The URLs were loaded into Apache Nutch and continuously downloaded from August 2019 to December 2019, resulting in 2,475,461 successfully downloaded URLs. Not all URLs could be accessed. The order in which URLs were accessed was determined by Nutch, which partitions URLs by host and then randomly chooses amongst the URLs for each host.
The content of these webpages was indexed in Apache Solr by Nutch.
From Solr we extracted a JSON dump of the content.
Many URLs offer a redirect; unfortunately, Nutch does not index redirect information. This made it complicated to connect the Wikipedia article (with the pre-redirect link) to the downloaded webpage (at the post-redirect link). However, by inspecting the order of download in the Nutch log files, we managed to recover links for 2,058,896 documents (83%) from their original Wikipedia article(s).
We further managed to associate 3,899,953 unique Wikidata items with at least one external reference webpage in the corpus.

The ref-wiki-en corpus is incomplete, i.e., we did not attempt to download all reference URLs for English Wikipedia. We thus also collected a smaller, complete corpus for the external references of 5,000 Wikipedia articles (ref-wiki-en-5k). We sampled from 5 ranges of Wikidata items: Q1-10000, Q10001-100000, Q100001-1000000, Q1000001-10000000, and Q10000001-100000000. From each range we sampled 1,000 items.
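The range-based sampling described above can be sketched as follows. This is only an illustration: the description does not say how items were drawn within each range, nor how identifiers without an English Wikipedia article were handled, so the sketch assumes uniform sampling without replacement over the numeric identifiers. The names `RANGES` and `sample_items` and the `seed` parameter are hypothetical.

```python
import random

# Inclusive Q-number ranges as listed in the description.
RANGES = [
    (1, 10_000),
    (10_001, 100_000),
    (100_001, 1_000_000),
    (1_000_001, 10_000_000),
    (10_000_001, 100_000_000),
]


def sample_items(ranges=RANGES, per_range=1000, seed=0):
    """Draw `per_range` distinct Q identifiers from each inclusive range.

    Assumption: uniform sampling without replacement. Whether a sampled
    identifier exists or has an English Wikipedia article is not checked.
    """
    rng = random.Random(seed)
    sample = []
    for lo, hi in ranges:
        # random.sample accepts a range object, so the large ranges
        # are never materialised as lists.
        numbers = rng.sample(range(lo, hi + 1), per_range)
        sample.extend(f"Q{n}" for n in sorted(numbers))
    return sample
```

Calling `sample_items()` returns 5,000 identifiers, 1,000 per range, matching the size of the 5k dataset; a real pipeline would additionally need to keep only items that have a corresponding Wikipedia article.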
We then scraped the external reference URLs for the Wikipedia articles corresponding to these items and downloaded them. The resulting corpus contains 37,983 webpages.

Each line of the corpus (ref-wiki-en, ref-wiki-en-5k) encodes the webpage of an external reference in JSON format. Specifically, we provide:

tstamp: When the webpage was accessed.
host: The domain (FQDN, post-redirect) from which the webpage was retrieved.
title: The title (meta) of the document.
url: The URL (post-redirect) of the webpage.
Q: The Q-code identifiers of the Wikidata items whose corresponding Wikipedia article is confirmed to link to this webpage.
content: A plain-text encoding of the content of the webpage.

Below we provide an abbreviated example of a line from the corpus:

{"tstamp":"2019-09-26T01:22:43.621Z","host":"geology.isu.edu","title":"Digital Geology of Idaho - Basin And Range","url":"http://geology.isu.edu/Digital_Geology_Idaho/Module9/mod9.htm","Q":[810178],"content":"Digital Geology of Idaho - Basin And Range\n1 - Idaho Basement Rock\n2 - Belt Supergroup\n3 - Rifting Passive Margin\n4 - Accreted Terranes\n5 - Thrust Belt\n6 - Idaho Batholith\n7 - North Idaho Mining\n8 - Challis Volcanics\n9 - Basin and Range\n10 - Columbia River Basalts\n11 - SRP Yellowstone\n12 - Pleistocene Glaciation\n13 - Palouse Lake Missoula\n14 - Lake Bonneville Flood\n15 - Snake River Plain Aquifer\nBasin and Range Province - Teritiary Extension\nGeneral geology of the Basin and Range Province\nMechanisms of Basin and Range faulting\nIdaho Basin and Range south of the Snake River Plain\nIdaho Basin and Range north of the Snake River Plain\nLocal areas of active and recent Basin Range faulting: Borah Peak\nPDF Slideshows: North of SRP , South of SRP , Borah Earthquake\nFlythroughs: Teton Valley , Henry's Fork , Big Lost River , Blackfoot , Portneuf , Raft River Valley , Bear River , Salmon Falls Creek , Snake River , Big Wood River\nVocabulary Words\nthrust fault\nBasin and Range\nSnake River Plain\nhalf-graben\ntransfer zone\n \n \n \n \nFly-throughs\nGeneral geology of the Basin and Range Province\nThe Basin and Range Province generally includes most of eastern California, eastern Oregon, eastern Washington, Nevada, western Utah, southern and western Arizona, and southeastern Idaho. ..."},

A summary of the files we make available:

ref-wiki-en.json.gz: 2,475,461 external reference webpages (JSON format)
ref-wiki-en_urls.txt.gz: 23,036,318 unique raw links to external references (plain-text format)
ref-wiki-en-5k.json.gz: 37,983 external reference webpages (JSON format)
ref-wiki-en-5k_urls.json.gz: 70,375 unique raw links to external references (plain-text format)
ref-wiki-en-5k_Q.txt.gz: 5,000 Wikidata Q identifiers forming the 5k dataset (plain-text format)

Further details can be found in the publication:

Suggesting References for Wikidata Claims based on Wikipedia's External References. Paolo Curotto, Aidan Hogan. Wikidata Workshop @ISWC 2020.

Further material relating to this publication (including code for a proof-of-concept interface) is also available.
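As an illustration of the record format described above, here is a minimal Python sketch for streaming one of the gzipped corpus files and grouping webpage URLs by Wikidata item. It assumes each non-empty line holds a single JSON object with the fields listed above, and it strips a trailing comma defensively because the example line in the description ends with one. The function names `iter_records` and `pages_by_item` are hypothetical and not part of the dataset or its accompanying code.

```python
import gzip
import json
from collections import defaultdict


def iter_records(path):
    """Yield one dict per non-empty line of a gzipped file of JSON objects."""
    with gzip.open(path, "rt", encoding="utf-8") as fh:
        for line in fh:
            # Assumption: a line may end with a comma, as in the example.
            line = line.strip().rstrip(",")
            if line:
                yield json.loads(line)


def pages_by_item(records):
    """Map each Wikidata Q number to the URLs of its reference webpages."""
    index = defaultdict(list)
    for rec in records:
        # In the example line, "Q" is a list of integers without the "Q" prefix.
        for q in rec.get("Q", []):
            index[q].append(rec["url"])
    return index


# Example usage (the path assumes the file was downloaded from Zenodo):
# index = pages_by_item(iter_records("ref-wiki-en-5k.json.gz"))
```

Streaming line by line avoids loading the whole file into memory, which matters for the full ref-wiki-en.json.gz file with its 2,475,461 records.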






