Ancient Greek language models




DOI: 10.5281/zenodo.8369516 · Zenodo: 8369516 · MaRDI QID: Q6704409 · FDO: Q6704409

Dataset published in the Zenodo repository.

Barbara McGillivray, Saskia Peels-Matthey, Malvina Nissim, Nilo Pedrazzini, Silvia Stopponi

Publication date: 22 September 2023

Copyright license: Creative Commons Attribution 4.0 International



In this repository, we release a series of vector space models of Ancient Greek, trained with different architectures and different hyperparameter values. Below is a breakdown of all the models released, with an indication of the training method and hyperparameters. The models are split into Diachronica and ALP models, according to the published paper they are associated with.

[Diachronica:] Stopponi, Silvia, Nilo Pedrazzini, Saskia Peels-Matthey, Barbara McGillivray & Malvina Nissim. Forthcoming. Natural language processing for Ancient Greek: Design, advantages, and challenges of language models. Diachronica.

[ALP:] Stopponi, Silvia, Nilo Pedrazzini, Saskia Peels-Matthey, Barbara McGillivray & Malvina Nissim. 2023. Evaluation of distributional semantic models of Ancient Greek: Preliminary results and a road map for future work. In Proceedings of the Ancient Language Processing Workshop associated with the 14th International Conference on Recent Advances in Natural Language Processing (RANLP 2023), 49-58. Association for Computational Linguistics (ACL). https://doi.org/10.26615/978-954-452-087-8.2023_006

Diachronica models

Training data

Diorisis corpus (Vatri & McGillivray 2018). Separate models were trained for:
- the Classical subcorpus
- the Hellenistic subcorpus
- the whole corpus

Models are named according to the (sub)corpus they are trained on: hel_ or hellenistic is appended to the names of models trained on the Hellenistic subcorpus, clas_ or classical for the Classical subcorpus, and full_ for the whole corpus.

Models

Count-based. Software used: LSCDetection (Kaiser et al. 2021; https://github.com/Garrafao/LSCDetection).
a. With Positive Pointwise Mutual Information (PPMI) applied (folder "PPMI spaces"). For each model, a version trained on each subcorpus after removing stopwords is also included (_stopfilt is appended to the model names). Hyperparameter values: window=5, k=1, alpha=0.75.
b. With both Positive Pointwise Mutual Information and dimensionality reduction with Singular Value Decomposition (SVD) applied (folder "PPMI+SVD spaces"). For each model, a version trained on each subcorpus after removing stopwords is also included (_stopfilt is appended to the model names). Hyperparameter values: window=5, dimensions=300, gamma=0.0.

Word2Vec. Software used: CADE (Bianchi et al. 2020; https://github.com/vinid/cade).
a. Continuous bag of words (CBOW). Hyperparameter values: size=30, siter=5, diter=5, workers=4, sg=0, ns=20.
b. Skip-gram with negative sampling (SGNS). Hyperparameter values: size=30, siter=5, diter=5, workers=4, sg=1, ns=20.

Syntactic word embeddings

Syntactic word embeddings were also trained on the Ancient Greek subcorpus of the PROIEL treebank (Haug & Jøhndal 2008), the Gorman treebank (Gorman 2020), the PapyGreek treebank (Vierros & Henriksson 2021), the Pedalion treebank (Keersmaekers et al. 2019), and the Ancient Greek Dependency Treebank (Bamman & Crane 2011), largely following the SuperGraph method described in Al-Ghezi & Kurimo (2020) and the Node2Vec architecture (Grover & Leskovec 2016) (see https://github.com/npedrazzini/ancientgreek-syntactic-embeddings for more details). Hyperparameter values: window=1, min_count=1.

ALP models

Training data

The Archaic, Classical, and Hellenistic portions of the Diorisis corpus (Vatri & McGillivray 2018), merged, with stopwords removed according to the list made by Alessandro Vatri, available at https://figshare.com/articles/dataset/Ancient_Greek_stop_words/9724613.

Models

Count-based. Software used: LSCDetection (Kaiser et al. 2021; https://github.com/Garrafao/LSCDetection).
a. With Positive Pointwise Mutual Information applied (folder ppmi_alp). Hyperparameter values: window=5, k=1, alpha=0.75. Stopwords were removed from the training set.
b. With both Positive Pointwise Mutual Information and dimensionality reduction with Singular Value Decomposition applied (folder ppmi_svd_alp).
Hyperparameter values: window=5, dimensions=300, gamma=0.0. Stopwords were removed from the training set.

Word2Vec. Software used: Gensim library (Řehůřek & Sojka 2010).
a. Continuous bag of words (CBOW). Hyperparameter values: size=30, window=5, min_count=5, negative=20, sg=0. Stopwords were removed from the training set.
b. Skip-gram with negative sampling (SGNS). Hyperparameter values: size=30, window=5, min_count=5, negative=20, sg=1. Stopwords were removed from the training set.

References

Al-Ghezi, Ragheb & Mikko Kurimo. 2020. Graph-based syntactic word embeddings. In Ustalov, Dmitry, Swapna Somasundaran, Alexander Panchenko, Fragkiskos D. Malliaros, Ioana Hulpuș, Peter Jansen & Abhik Jana (eds.), Proceedings of the Graph-based Methods for Natural Language Processing (TextGraphs), 72-78.

Bamman, David & Gregory Crane. 2011. The Ancient Greek and Latin dependency treebanks. In Sporleder, Caroline, Antal van den Bosch & Kalliopi Zervanou (eds.), Language Technology for Cultural Heritage: Selected Papers from the LaTeCH Workshop Series (Theory and Applications of Natural Language Processing), 79-98. Berlin, Heidelberg: Springer.

Gorman, Vanessa B. 2020. Dependency treebanks of Ancient Greek prose. Journal of Open Humanities Data 6(1).

Grover, Aditya & Jure Leskovec. 2016. Node2vec: Scalable feature learning for networks. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD '16), 855-864.

Haug, Dag T. T. & Marius L. Jøhndal. 2008. Creating a parallel treebank of the Old Indo-European Bible translations. In Proceedings of the Second Workshop on Language Technology for Cultural Heritage Data (LaTeCH), 27-34.

Kaiser, Jens, Sinan Kurtyigit, Serge Kotchourko & Dominik Schlechtweg. 2021. Effects of pre- and post-processing on type-based embeddings in lexical semantic change detection. In Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics.

Keersmaekers, Alek, Wouter Mercelis, Colin Swaelens & Toon Van Hal. 2019. Creating, enriching and valorizing treebanks of Ancient Greek. In Candito, Marie, Kilian Evang, Stephan Oepen & Djamé Seddah (eds.), Proceedings of the 18th International Workshop on Treebanks and Linguistic Theories (TLT, SyntaxFest 2019), 109-117.

Schlechtweg, Dominik, Anna Hätty, Marco del Tredici & Sabine Schulte im Walde. 2019. A wind of change: Detecting and evaluating lexical semantic change across times and domains. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, 732-746. Florence, Italy: ACL.

Vatri, Alessandro & Barbara McGillivray. 2018. The Diorisis Ancient Greek corpus: Linguistics and literature. Research Data Journal for the Humanities and Social Sciences 3(1). 55-65. https://doi.org/10.1163/24523666-01000013

Vierros, Marja & Erik Henriksson. 2021. PapyGreek treebanks: A dataset of linguistically annotated Greek documentary papyri. Journal of Open Humanities Data 7.
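To illustrate what the count-based spaces compute, the NumPy sketch below builds a shifted, context-smoothed PPMI matrix (k=1, alpha=0.75, matching the hyperparameters listed above) and optionally reduces it with truncated SVD, where gamma=0.0 means the singular values are not used to re-weight the reduced vectors. This is a minimal illustration of the standard formulation, not the LSCDetection code itself; the toy co-occurrence matrix is invented.

```python
import numpy as np

def ppmi_svd(counts, k=1.0, alpha=0.75, dimensions=None, gamma=0.0):
    """Shifted, context-smoothed PPMI, optionally reduced with truncated SVD."""
    counts = np.asarray(counts, dtype=float)
    total = counts.sum()
    p_wc = counts / total                             # joint probabilities
    p_w = counts.sum(axis=1, keepdims=True) / total   # word marginals
    ctx = counts.sum(axis=0)
    p_c = ctx ** alpha / (ctx ** alpha).sum()         # alpha-smoothed context marginals
    with np.errstate(divide="ignore", invalid="ignore"):
        pmi = np.log(p_wc / (p_w * p_c)) - np.log(k)  # k=1 gives an unshifted PMI
    pmi[~np.isfinite(pmi)] = 0.0                      # zero counts -> log(0) -> drop
    ppmi = np.maximum(pmi, 0.0)                       # keep positive associations only
    if dimensions is None:
        return ppmi
    u, s, _ = np.linalg.svd(ppmi, full_matrices=False)
    d = min(dimensions, len(s))
    return u[:, :d] * s[:d] ** gamma                  # gamma=0.0: unweighted left vectors

# Toy 4x4 word-context co-occurrence matrix (invented for the sketch)
m = [[4, 1, 0, 1], [1, 3, 1, 0], [0, 1, 2, 1], [1, 0, 1, 3]]
space = ppmi_svd(m)                  # plain PPMI space
reduced = ppmi_svd(m, dimensions=2)  # PPMI + SVD space
```

In the released models, dimensions=300 is used; the toy example reduces to 2 dimensions only because the invented matrix is 4x4.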
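For the ALP Word2Vec models, a Gensim call with the hyperparameters listed above would look roughly as follows. This is a sketch rather than the original training script: the toy corpus is invented, and Gensim 4 renamed the size parameter to vector_size (the reported size=30 corresponds to 30-dimensional vectors).

```python
from gensim.models import Word2Vec

# Invented toy corpus of tokenised sentences; the actual models were trained
# on the Diorisis corpus with stopwords removed. Repetition ensures every
# token clears the min_count=5 threshold.
sentences = [
    ["λόγος", "θεός", "ἄνθρωπος"],
    ["θεός", "λόγος", "πόλις"],
] * 100

# CBOW with the ALP hyperparameters: size=30 (vector_size in gensim >= 4),
# window=5, min_count=5, negative=20, sg=0. Set sg=1 for the SGNS variant.
model = Word2Vec(
    sentences,
    vector_size=30,
    window=5,
    min_count=5,
    negative=20,
    sg=0,
)
print(model.wv["λόγος"].shape)  # 30-dimensional vector
```

The trained vectors are then accessible through model.wv, e.g. model.wv.most_similar("λόγος") for nearest neighbours.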







This page was built for dataset: Ancient Greek language models