AminerMag X Dataset

From MaRDI portal



DOI: 10.5281/zenodo.7893403 · Zenodo: 7893403 · MaRDI QID: Q6700965 · FDO: Q6700965

Dataset published at Zenodo repository.

Haesun Park, Richard Vuduc, Koby Hayashi, Benjamin Cobb, Grey Ballard, Srinivas Eswar, Ramakrishnan Kannan

Publication date: 21 June 2023

Copyright license: Creative Commons Attribution 4.0 International



This dataset is a subset of the Microsoft Open Academic Graph (OAG), which unifies the Microsoft Academic Graph (MAG) and ArnetMiner (AMiner) academic graphs, containing 166,192,182 and 154,771,162 papers, respectively. From OAG, a subset of 37,732,477 papers with available abstracts and citation information was selected. The abstracts were preprocessed with stop-word removal and stemming to form a vocabulary of 1,333 unique words. This vocabulary and the corpus of papers were used to form a sparse 1,333 × 37,732,477 term-document matrix with 1,295,114,641 nonzeros, in which each column represents a paper as a tf-idf vector. The resulting matrix was used as X in the real-world experiments. The symmetric graph Laplacian matrix S was then formed from the citation graph. Each of the 966,206,008 nonzeros of the resulting 37,732,477 × 37,732,477 matrix represents a citation between two papers.
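
To make the layout of the two matrices concrete, the following is a minimal sketch on toy data, not the authors' preprocessing pipeline. It assumes one common tf-idf variant (raw term count times log(N / document frequency)) and assumes that citations are symmetrized by storing both (i, j) and (j, i); the record does not specify either choice. The function names `tfidf_columns` and `symmetric_citations` are hypothetical.

```python
import math
from collections import Counter


def tfidf_columns(docs):
    """Build a sparse term-document matrix, one column per document.

    docs: list of token lists (already stop-word filtered and stemmed).
    Returns (vocab, columns), where columns[j] maps a term's row index
    to its tf-idf weight in document j. Zero weights are not stored.
    """
    vocab = sorted({term for doc in docs for term in doc})
    row = {term: i for i, term in enumerate(vocab)}
    n_docs = len(docs)
    # Document frequency: number of documents containing each term.
    df = Counter(term for doc in docs for term in set(doc))

    columns = []
    for doc in docs:
        tf = Counter(doc)
        col = {}
        for term, count in tf.items():
            # Assumed weighting: tf * log(N / df). Terms present in
            # every document get weight 0 and are skipped.
            weight = count * math.log(n_docs / df[term])
            if weight != 0.0:
                col[row[term]] = weight
        columns.append(col)
    return vocab, columns


def symmetric_citations(edges):
    """Return the nonzero positions of a symmetrized citation pattern.

    edges: iterable of (citing, cited) paper indices.
    Assumption: a citation in either direction yields both (i, j) and
    (j, i); self-citations are dropped.
    """
    nonzeros = set()
    for i, j in edges:
        if i != j:
            nonzeros.add((i, j))
            nonzeros.add((j, i))
    return nonzeros


if __name__ == "__main__":
    toy_docs = [
        ["graph", "matrix", "graph"],
        ["matrix", "tensor"],
        ["graph", "tensor", "tensor"],
    ]
    vocab, cols = tfidf_columns(toy_docs)
    print(len(vocab), "terms x", len(cols), "documents")
    print("stored nonzeros:", sum(len(c) for c in cols))
    print(sorted(symmetric_citations([(0, 1), (2, 1)])))
```

At the scale described above (1,333 rows by 37,732,477 columns), a compressed sparse format from a numerical library would be the practical choice; the dictionaries here only illustrate which entries are stored.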






