piecemaker

Software:111109



CRAN: piecemaker; MaRDI QID: Q111109

Tools for Preparing Text for Tokenizers

Jon Harmon, Jonathan Bratt

Last update: 2 June 2023

Copyright license: Apache License

Software version identifiers: 1.0.0, 1.0.1, 1.0.2



Tokenizers break text into pieces that are more usable by machine learning models. Many tokenizers share some preparation steps. This package provides those shared steps, along with a simple tokenizer.
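
To make "shared preparation steps" concrete, the following is a minimal Python sketch of the kind of cleanup that commonly precedes tokenization, followed by a simple space-based tokenizer. It is not the package's API: piecemaker is an R package, and the function names `prepare_text_sketch` and `split_on_spaces`, as well as the particular choice of steps (Unicode normalization, control-character removal, spacing out punctuation, collapsing whitespace), are assumptions made here purely for illustration.

```python
import re
import unicodedata


def prepare_text_sketch(text):
    """Hypothetical cleanup pipeline; the steps are illustrative assumptions."""
    # Normalize to a canonical composed Unicode form (NFC).
    text = unicodedata.normalize("NFC", text)
    # Drop control characters (category Cc), but keep whitespace such as
    # newlines and tabs so it can be collapsed in the last step.
    text = "".join(
        ch for ch in text
        if ch.isspace() or unicodedata.category(ch) != "Cc"
    )
    # Surround every punctuation character (categories P*) with spaces so
    # that it becomes a separate piece after splitting.
    text = "".join(
        f" {ch} " if unicodedata.category(ch).startswith("P") else ch
        for ch in text
    )
    # Collapse runs of whitespace to single spaces and trim the ends.
    return re.sub(r"\s+", " ", text).strip()


def split_on_spaces(text):
    """Hypothetical simple tokenizer: prepare the text, then split on spaces."""
    prepared = prepare_text_sketch(text)
    return prepared.split(" ") if prepared else []


print(split_on_spaces("Hello,   world!\n"))
```

Because each step is a plain string-to-string transformation, steps of this kind can be reordered or dropped to suit a particular tokenizer.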