CIEMPIESS experimentation

Title

CIEMPIESS experimentation / Linguistic Data Consortium.

ISBN

1585638846

Publication

[Philadelphia, PA] : [Linguistic Data Consortium], [2019]

Physical Description

1 online resource

Local Notes

Access is available to the Yale community.

Notes

Applications: speech recognition.

Authors: Daniel Hernández Mena.

Data source: microphone speech, broadcast conversation.

LDC number: LDC2019S07.

In Spanish.

Title from resource home page (LDC website, viewed September 28, 2020).

Access and use

Access restricted by licensing agreement.

Summary

"CIEMPIESS (Corpus de Investigación en Español de México del Posgrado de Ingeniería Eléctrica y Servicio Social) Experimentation was developed by the social service program 'Desarrollo de Tecnologías del Habla' of the 'Facultad de Ingeniería' (FI) at the National Autonomous University of Mexico (UNAM) and consists of approximately 22 hours of Mexican Spanish broadcast and read speech with associated transcripts. The goal of this work was to create acoustic models for automatic speech recognition. For more information and documentation see the CIEMPIESS-UNAM Project website. CIEMPIESS Experimentation is a set of three different data sets, specifically Complementary, Fem and Test. Complementary is a phonetically-balanced corpus of isolated Spanish words spoken in Central Mexico. Fem contains broadcast speech from 21 female speakers, collected to balance by gender the number of recordings from male speakers in other CIEMPIESS collections. Test consists of 10 hours of broadcast speech and transcripts and is intended for use as a standard test data set alongside other CIEMPIESS corpora. See the included documentation for more details on each corpus. The majority of the speech recordings in Fem and Test were collected from Radio-IUS, a UNAM radio station. Other recordings were taken from IUS Canal Multimedia and Centro Universitario de Estudios Jurídicos (CUEJ UNAM). Those two channels feature videos with speech around legal issues and topics related to UNAM. The Complementary recordings consist of read speech collected for that corpus. Complementary includes specifications for creating transcripts using the phonetic alphabet Mexbet and for converting Mexbet output to the International Phonetic Alphabet and X-SAMPA. An automatic phonetizer for Mexbet, written in Python 2.7, to create pronouncing dictionaries is provided as well. The audio files are presented as 16 kHz, 16-bit PCM flac format for this release. Transcripts are presented as UTF-8 encoded plain text."
--LDC online catalog.

Format

Audio / Data Sets / Online

Language

Spanish

Added to Catalog

September 28, 2020

Subjects

Broadcast journalism > Language > Mexico.

Spanish language > Spoken Spanish > Mexico.

Genre/Form

Data sets.

Sound recordings.