MLFMF: Data Sets for Machine Learning for Mathematical Formalization

DOI10.5281/zenodo.10041075ZenodoMaRDI QIDQ6702011FDO

Authors Andrej Bauer, Matej Petković, Ljupčo Todorovski

Publication date 26 October 2023

Copyright license Creative Commons Attribution 4.0 International

MLFMFMLFMF (Machine Learning for Mathematical Formalization) is a collection of data sets for benchmarking recommendation systems used to support formalization of mathematics with proof assistants. These systems help humans identify which previous entries (theorems, constructions, datatypes, and postulates) are relevant in proving a new theorem or carrying out a new construction.The MLFMF data sets provide solid benchmarking support for further investigation of the numerous machine learning approaches to formalized mathematics. With more than 250,000 entries in total, this is currently the largest collection of formalized mathematical knowledge in machine learnable format.In addition to benchmarking the recommendation systems, the data sets can also be used for benchmarking node classification and link prediction algorithms.The four data setsEach data set is derived from a library of formalized mathematics written in proof assistants Agda or Lean. The collection includes the largest Lean 4 library Mathlib,the three largest Agda libraries:the standard librarythe library of univalent mathematics Agda-unimath, andthe TypeTopology library.Each data set represents the corresponding library in two ways: as a heterogeneous network, and as a list of syntax trees of all the entries in the library. The network contains the (modular) structure of the library and the references between entries, while the syntax trees give complete and easily parsed information about each entry.The Lean library data set was obtained by converting .olean files into s-expressions (see the lean2sexp tool).The Agda data sets were obtained with an s-expression extension of the official Agda repository (use either master-sexp or release-2.6.3-sexp branch).For more details, see our arXiv copy of the paper.Directory structureFirst, the mlfmf.zip archive needs to be unzipped. It contains a separate directory for every library (for example, the standard library of Agda can be found in the stdlib directory) and some auxiliary files. Every library directory containsthe network file from which the heterogeneous network can be loaded,a zip of the entries directory that contains (many) files with abstract syntax trees. Each of those files describes a single entry of the library.In addition to the auxiliary files which are used for loading the data (and described below), the zipped sources of lean2sexp and Agda s-expression extension are present.Loading the dataIn addition to the data files, there is also a simple python script main.py for loading the data. To run it, you will have to install the packages listed in the file requirements.txt: tqdm and networkx. The easiest way to do so is calling pip install -r requirements.txt.When running main.py for the first time, the script will unzip the entry files into the directory named entries. After that, the script loads the syntax trees of the entries (see the Entry class) and the network (as networkx.MultiDiGraph object).Note. The entry files have extension .dag (directed acyclic graph), since Lean uses node sharing, which breaks the tree structure (a shared node has more than one parent node).More informationFor more information about the data collection process, detailed data (and data format) description, and baseline experiments that were already performed with these data, see our arXiv copy of the paper.For the code that was used to perform the experiments and data format description, visit our github repository https://github.com/ul-fmf/mlfmf-data.FundingSince not all the funders are available in the Zenodo's database, we list them here:This material is based upon work supported by the Air Force Office of Scientific Research under award number FA9550-21-1-0024.The authors also acknowledge the financial support of the Slovenian Research Agency via the research core funding No. P2-0103 and No. P1-0294.

This page was built for dataset: MLFMF: Data Sets for Machine Learning for Mathematical Formalization