CAMIO transcription languages

Title
CAMIO transcription languages / Linguistic Data Consortium.
Publication
[Philadelphia, PA] : [Linguistic Data Consortium], [2023]
Physical Description
1 online resource
Local Notes
Access is available to the Yale community.
Notes
Authors: Michael Arrigo, Stephanie Strassel, Christopher Caruso.
Data source: web collection.
Data type: still image, text.
Applications: keyword spotting, language identification, OCR decoding, script identification, text localization.
LDC number: LDC2022T07.
In English, Arabic, Chinese, Persian, Hindi, Japanese, Kannada, Korean, Russian, Tamil, Thai, Urdu, Vietnamese.
Title from resource home page (LDC website, viewed February 27, 2023).
Access and use
Access restricted by licensing agreement.
Summary
"CAMIO Transcription Languages (LDC2022T07) was developed by the Linguistic Data Consortium and contains nearly 70,000 images of machine printed text with corresponding annotations and transcripts in the following 13 languages: Arabic, Chinese, English, Farsi, Hindi, Japanese, Kannada, Korean, Russian, Tamil, Thai, Urdu, and Vietnamese. This corpus is a subset of data created for a broader effort to support the development and evaluation of optical character recognition (OCR) and related technologies for 35 languages across 24 unique script types. The CAMIO (Corpus of Annotated Multilingual Images for OCR) collection was designed to address gaps in language and script coverage from existing corpora and to support future evaluation of OCR capabilities through a systematically constructed data set. Data: Most images were annotated for text localization, resulting in over 2.3M line-level bounding boxes. For the 13 languages represented in this release, 1250 images/language were also annotated with orthographic transcriptions of each line plus specification of reading order, yielding over 2.4M tokens of transcribed text. The resulting annotations are represented in a comprehensive XML output format defined for this corpus. The script for each language is indicated in parentheses: Arabic (Arabic), Chinese (Simplified), English (Latin), Farsi (Arabic), Hindi (Devanagari), Japanese (Japanese), Kannada (Kannada), Korean (Hangul), Russian (Cyrillic), Tamil (Tamil), Thai (Thai), Urdu (Arabic), and Vietnamese (Latin)." --LDC online catalog.
Variant and related titles
Corpus of Annotated Multilingual Images for OCR transcription languages
Format
Books / Data Sets / Online
Language
Multiple languages; English; Arabic; Chinese; Persian; Hindi; Japanese; Kannada; Korean; Russian; Tamil; Thai; Urdu; Vietnamese
Added to Catalog
February 27, 2023
Contents
data file (contains the corpus data subdivided by language and by file type)
docs file (contains additional documentation, guidelines, and a file table)
dtds file (contains the associated DTDs for the XML formats)
Genre/Form
Data sets.
Text corpora.
Also listed under
Arrigo, Michael, creator.
Linguistic Data Consortium, issuing body.