A Brazilian Portuguese Dataset for Offline Handwritten Text Recognition (BRESSAY)

DOI: 10.5281/zenodo.11637681 (Zenodo)
MaRDI QID: Q6726100

Authors Hugo J. F. Hazin, Pedro H. V. Rocha, MacIleide F. Oliveira, Sávio S. Araújo, Kléberson F. Alves, Wiliane M. A. S. Souza, Byron L. D. Bezerra, Alejandro H. Toselli, Arthur F. S. Neto, Samara V. S. Lins

Publication date 1 June 2024

Copyright license Creative Commons Attribution 4.0 International

Description

The BRESSAY dataset comprises images of handwritten essays in Brazilian Portuguese, which present a series of challenges to optical recognition models. The images were sourced from multiple online platforms, which limited our ability to standardize the capture process. Because of these varied sources and the lack of a uniform collection method, the dataset reflects real-world conditions. Each essay is unique: it is contributed by a different writer and addresses a specific topic. Furthermore, the constraints placed on the writers often lead to various handwriting scenarios, including hard-to-read words, connected words, noise, overwriting, and struck-through text.

Technical Details

The BRESSAY dataset covers a total of 1,000 pages, each contributed by a unique writer, resulting in 1,000 distinct handwriting styles. The pages contain 4,214 paragraphs, 30,090 lines, and 416,826 words in total, with 41,318 unique words and 107 unique characters.
Data Structure

The dataset is organized as follows:

data/: Main folder containing segmented essay images
    lines/: Images of individual lines
        PNG files: Line images
        TXT files: Transcriptions of lines
    pages/: Full page essay images
        PNG files: Page images
        TXT files: Transcriptions of pages
    paragraphs/: Images of paragraphs
        PNG files: Paragraph images
        TXT files: Transcriptions of paragraphs
    words/: Images of individual words
        PNG files: Word images
        TXT files: Transcriptions of words
sets/: Contains partition files
    test.txt: Names of images in the test set
    validation.txt: Names of images in the validation set
    training.txt: Names of images in the training set

Dataset Usage and Annotations

Each name in test.txt, validation.txt and training.txt is the name of a page. All content derived from that page (words, lines, paragraphs) must be placed in the same partition as the page.

Annotations used in the dataset:

##@@???@@##: Superscript text that has become unidentifiable and unreadable.
$$@@???@@$$: Subscript text that has become unidentifiable and unreadable.
@@???@@: Text that cannot be read or identified due to its illegibility.
##--xxx--##: Text that has been added as a superscript and subsequently crossed out, rendering it illegible.
$$--xxx--$$: Text that has been added as a subscript and subsequently crossed out, rendering it illegible.
--xxx--: Text that has been crossed out in a way that makes it unreadable.
##--text--##: Text that has been added as a superscript and subsequently crossed out, but remains legible.
$$--text--$$: Text that has been added as a subscript and subsequently crossed out, but remains legible.
##text##: Text added as a superscript in the line, typically as a correction or additional note.
$$text$$: Text added as a subscript in the line, typically as a correction or additional note.
--text--: Text that has been crossed out but remains readable.
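
As an illustration of how the partition files and annotation markers described above might be handled, the following is a minimal Python sketch and not an official loader for the dataset. The helper names read_partition and clean_transcription, the "<unk>" placeholder, and the choice to drop crossed-out text while keeping superscript and subscript insertions are all assumptions made for this example. It also assumes that each partition file lists one page name per line in UTF-8, and that a double hyphen never occurs in a transcription outside a crossed-out span.

```python
import re
from pathlib import Path

# Marker patterns taken from the annotation list above.
STRUCK = re.compile(r"--(.+?)--")        # --text-- or --xxx-- (crossed out)
ILLEGIBLE = re.compile(r"@@\?\?\?@@")    # @@???@@ (unreadable text)
SUPER = re.compile(r"##(.*?)##")         # ##text## (superscript insertion)
SUB = re.compile(r"\$\$(.*?)\$\$")       # $$text$$ (subscript insertion)


def read_partition(path):
    """Return the page names in a partition file.

    Assumption: one name per line; blank lines are ignored.
    """
    lines = Path(path).read_text(encoding="utf-8").splitlines()
    return [name.strip() for name in lines if name.strip()]


def clean_transcription(text, illegible_token="<unk>"):
    """Reduce an annotated transcription to plain text.

    Policy of this sketch (a design choice, not part of the dataset):
    crossed-out text is removed, illegible text becomes a placeholder,
    and superscript/subscript insertions are kept as ordinary text.
    """
    # Remove crossed-out spans first, so wrappers such as ##--text--##
    # collapse to an empty ####, which the SUPER pattern then removes.
    text = STRUCK.sub("", text)
    # A function replacement keeps the token literal (no backslash escapes).
    text = ILLEGIBLE.sub(lambda _: illegible_token, text)
    text = SUPER.sub(r"\1", text)
    text = SUB.sub(r"\1", text)
    # Collapse the whitespace left behind by removed spans.
    return " ".join(text.split())
```

With this policy, clean_transcription("O ##meu## --gato-- dorme @@???@@") returns "O meu dorme <unk>". A recognition task that is expected to reproduce struck-through or inserted text would need a different policy, for example one that maps each marker to a dedicated token instead of removing it.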
