CAMIO transcription languages

Title

CAMIO transcription languages / Linguistic Data Consortium.

Publication

[Philadelphia, PA] : [Linguistic Data Consortium], [2023]

Physical Description

1 online resource

Local Notes

Access is available to the Yale community.

Notes

Authors: Michael Arrigo, Stephanie Strassel, Christopher Caruso.

Data source: web collection.

Data type: still image, text.

Applications: keyword spotting, language identification, OCR decoding, script identification, text localization.

LDC number: LDC2022T07.

In English, Arabic, Chinese, Persian, Hindi, Japanese, Kannada, Korean, Russian, Tamil, Thai, Urdu, Vietnamese.

Title from resource home page (LDC website, viewed February 27, 2023).

Access and use

Access restricted by licensing agreement.

Summary

"CAMIO Transcription Languages (LDC2022T07) was developed by the Linguistic Data Consortium and contains nearly 70,000 images of machine printed text with corresponding annotations and transcripts in the following 13 languages: Arabic, Chinese, English, Farsi, Hindi, Japanese, Kannada, Korean, Russian, Tamil, Thai, Urdu, and Vietnamese. This corpus is a subset of data created for a broader effort to support the development and evaluation of optical character recognition (OCR) and related technologies for 35 languages across 24 unique script types. The CAMIO (Corpus of Annotated Multilingual Images for OCR) collection was designed to address gaps in language and script coverage from existing corpora and to support future evaluation of OCR capabilities through a systematically constructed data set. Data: Most images were annotated for text localization, resulting in over 2.3M line-level bounding boxes. For the 13 languages represented in this release, 1250 images/language were also annotated with orthographic transcriptions of each line plus specification of reading order, yielding over 2.4M tokens of transcribed text. The resulting annotations are represented in a comprehensive XML output format defined for this corpus. The script for each language is indicated in parentheses: Arabic (Arabic), Chinese (Simplified), English (Latin), Farsi (Arabic), Hindi (Devanagari), Japanese (Japanese), Kannada (Kannada), Korean (Hangul), Russian (Cyrillic), Tamil (Tamil), Thai (Thai), Urdu (Arabic), and Vietnamese (Latin)." --LDC online catalog.

Variant and related titles

Corpus of Annotated Multilingual Images for OCR transcription languages

Format

Books / Data Sets / Online

Language

Multiple languages; English; Arabic; Chinese; Persian; Hindi; Japanese; Kannada; Korean; Russian; Tamil; Thai; Urdu; Vietnamese

Added to Catalog

February 27, 2023

Contents

data file (contains the corpus data subdivided by language and by file type)

docs file (contains additional documentation, guidelines, and a file table)

dtds file (contains the associated DTDs for the XML formats).