IGN Synthetic Train Data for ICDAR'25 MapText Competition

DOI: 10.5281/zenodo.14704475 | Zenodo: 14704475 | MaRDI QID: Q6703749 | FDO: Q6703749

Dataset published at Zenodo repository.

Solenn Tual, Joseph Chazalon, Julien Perret, Nathalie Abadie, Bertrand Duménieu

Publication date: 20 January 2025

Copyright license: Creative Commons Attribution 4.0 International

Dataset of 2Kx2K synthetic image tiles for the ICDAR'25 Competition on Historical Map Text Detection, Recognition, and Linking. Annotations and images follow the format described at the competition website and can be evaluated using the script in the official evaluation repository.

This synthetic dataset supplements the dataset of real tiles, IGN Train and Validation Data for ICDAR'25 MapText Competition. It mimics the style (background and fonts) of the original maps, and leverages the actual, modern land use database from the French government to generate realistic geometries and names from similar geographic areas (both in terms of vocabulary and urban density). It is meant to be used as a supplementary training set, and is organized as such.

We also provide a sample for fast download and code testing, containing only the images and ground truth for the first 10 images of the dataset.

                   Synthetic Train              Sample (in sample.zip)
Annotations        ign25synth_train.json        (same)
Images             synthtrain.zip               (same)
Files              ign25synth/train/*.jpg       (same)
Tiles              18,073                       10
Map Sheets         a dozen different styles     1 style
Words              1,622,398                    114
Label Groups       1,489,072                    91
Illegible Words    33                           0
Truncated Words    79,972                       2
Valid Words        1,542,426                    112

All data used to generate this dataset is public domain.

Finally, a style_sample.zip file provides some examples for each rendering style. The images it contains are extracted from the main dataset and should not be added to it.

ℹ️ Version 3 includes the following improvements, which affect all ZIP files:
- Some text regions were rendered but not added to the ground truth; this is now fixed.
- Truncation detection was improved, but this should not change the actual content.
- Some images were generated with wrong shapes, which led to them being discarded from the final dataset. They are now exported correctly and included in the new version. As a result, the new dataset is larger.
- Sample images for each style were added in a new style_sample.zip file.

ℹ️ Version 2 fixed the ign25synth_train.json file by removing very small regions (1 square pixel) to mitigate evaluation issues. This results in a smaller number of total words and groups, but the number of valid words remains the same as in version 1. The sample in sample.zip and the images in ign25synth_train.zip were not changed and are identical to version 1.
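To check a downloaded copy against the statistics listed above, the annotation file can be summarized with a few lines of Python. The following is only a minimal sketch. It assumes the annotation JSON is a list of image entries, each holding a "groups" list whose items are lists of word records with boolean "illegible" and "truncated" flags. This layout is an assumption based on the competition format and is not spelled out on this page. The function names are hypothetical, and counting a word as valid when it is neither illegible nor truncated is also an assumption.

```python
import json


def summarize(annotations):
    """Count tiles, label groups, and words in MapText-style annotations.

    Assumed layout (not confirmed by this page): a list of image entries,
    each with a "groups" list; each group is a list of word dicts carrying
    boolean "illegible" and "truncated" flags.
    """
    stats = {
        "tiles": 0,
        "label_groups": 0,
        "words": 0,
        "illegible_words": 0,
        "truncated_words": 0,
        "valid_words": 0,
    }
    for entry in annotations:
        stats["tiles"] += 1
        for group in entry.get("groups", []):
            stats["label_groups"] += 1
            for word in group:
                illegible = bool(word.get("illegible", False))
                truncated = bool(word.get("truncated", False))
                stats["words"] += 1
                stats["illegible_words"] += int(illegible)
                stats["truncated_words"] += int(truncated)
                # Assumption: "valid" means neither illegible nor truncated.
                stats["valid_words"] += int(not (illegible or truncated))
    return stats


def summarize_file(path):
    """Load an annotation file (e.g. ign25synth_train.json) and summarize it."""
    with open(path, encoding="utf-8") as handle:
        return summarize(json.load(handle))
```

Calling summarize_file("ign25synth_train.json") on the sample archive's annotation file is a quick way to see whether the counts line up with the Sample column. A mismatch may simply mean that the assumed field names or the assumed definition of a valid word differ from the official format.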
