LEADER 04739cim a2200709 i 4500
001 16996695
005 20240219103149.0
006 m o h
007 cr||na||||||||
007 sr||||||||||||
008 240219p2024 paunnn o nn kur d
024 8 7123024736192 |qISLRN
040 CtY |beng |erda |cCtY
041 0 kur
050 4 PK6901
090 yuldset
090 yuldsetsnd
090 yuldsetxt
245 00 KASET, Kurmanji and Sorani Kurdish speech and transcripts / |cLinguistic Data Consortium.
246 3 Kurmanji and Sorani Kurdish speech and transcripts
264 1 [Philadelphia, PA] : |b[Linguistic Data Consortium], |c[2024]
300 1 online resource
336 computer dataset |bcod |2rdacontent
336 spoken word |bspw |2rdacontent
336 computer program |bcop |2rdacontent
337 computer |bc |2rdamedia
338 online resource |bcr |2rdacarrier
347 audio file |2rdaft
347 |bFLAC
588 Title from resource home page (LDC website, viewed February, 2024).
506 Access restricted by licensing agreement.
590 Access is available to the Yale community.
500 Authors: Dana Delgado, Kevin Walker, Stephanie Strassel, David Graff, Christopher Caruso
500 Data source: broadcast news, telephone conversations.
500 Data type: Sound, Text
500 Applications: language identification, speech recognition.
500 LDC number: LDC2024S01.
546 Audio in Central Kurdish, Northern Kurdish and Kurdish.
500 "Data: Twenty-one native Kurdish speakers living in the continental United States each made a minimum of ten phone calls, lasting up to ten minutes, to a family member or friend living in North America on a topic of their choice. LDC also collected multiple streaming radio and television broadcast programs (narrowband and wideband audio), many of which contained a mix of Kurmanji and Sorani Kurdish. Telephone recordings were collected via LDC's telephone speech collection system as two-channel 8-bit μ-law, 8-KHz sample rate. Broadcast audio was captured as single channel recordings at a sample rate of 16-KHz. All audio is stored in flac-compressed format. Native speaker auditors identified a 5-10 minute span from each broadcast recording for transcription. Full telephone recordings that passed the native speaker audit were transcribed. The four main fields on each line of the transcript files (start_offset, end_offset, speaker_label, transcript_text) are separated by tabs. Each contains a list of time-stamped segments in order according to their start_offset values, with no blank lines. The transcripts are presented as plain-text, tab-delimited files with UTF-8 character encoding. This release includes speaker information, such as gender, year of birth, and language."--LDC online catalog.
505 0 data file (contains the audio files) - docs/README.txt file (contains general information for the corpus; describes the contents of other documentation in the folder)
520 "KASET - Kurmanji and Sorani Kurdish Speech and Transcripts (LDC2024S01) was developed by the Linguistic Data Consortium (LDC) and consists of approximately 147 hours of telephone conversations (289 recordings) and broadcast news (410 recordings) in two Kurdish dialects: Kurmanji Kurdish and Sorani Kurdish. Corresponding transcripts covering approximately 60 hours of the recordings are included in this release. Kurdish is spoken primarily in Turkey, Iran, Iraq, and Syria. Sorani and Kurmanji are the two widely-spoken dialects of the Kurdish language."--LDC online catalog.
650 0 Kurdish language |xData processing.
650 0 Kurdish language |xSpoken Kurdish |xData processing.
650 0 Kurdish language |xTranscription.
650 0 Conversation |xData processing.
650 0 Automatic speech recognition.
650 0 Audio data mining.
655 7 Data sets. |2lcgft
655 7 Speech corpora. |2lcgft
655 7 Sound recordings. |2lcgft
655 7 Excerpts. |2lcgft
700 1 Delgado, Dana, |ecreator.
710 2 Linguistic Data Consortium, |eissuing body.
787 08 |iRelated work: |tIARPA Babel Kurmanji Kurdish Language Pack IARPA-babel205b-v1.0a
852 80 |zOnline resource
856 40 |yOnline dataset |uhttps://ssrs.yale.edu/data/ssda/ldc/LDC2024S01/
856 42 |3Documentation |uhttps://catalog.ldc.upenn.edu/docs/LDC2024S01/
901 PK6901
902 Yale Internet Resource |bYale Internet Resource >> None|DELIM|16877254
905 online resource
907 2024-02-19T09:52:35.000Z
946 DO NOT EXPORT.
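
Usage sketch (not part of the catalog record): the data note in this record describes the transcripts as UTF-8, tab-delimited files whose four main fields are start_offset, end_offset, speaker_label and transcript_text, ordered by start_offset, with no blank lines. A minimal Python reader for that layout might look like the following. It assumes the offsets are decimal numbers of seconds and that any fields after the fourth can be ignored; the record states neither. The names `Segment`, `parse_transcript` and `SAMPLE`, and the sample rows themselves, are hypothetical and not taken from the corpus.

```python
from typing import List, NamedTuple


class Segment(NamedTuple):
    start_offset: float   # assumed to be seconds
    end_offset: float     # assumed to be seconds
    speaker_label: str
    transcript_text: str


def parse_transcript(text: str) -> List[Segment]:
    """Parse tab-delimited transcript text into a list of Segment rows."""
    segments = []
    for line_no, raw in enumerate(text.split("\n"), start=1):
        line = raw.rstrip("\r")
        if not line:
            # Tolerate a trailing newline at end of file.
            continue
        fields = line.split("\t")
        if len(fields) < 4:
            raise ValueError(f"line {line_no}: expected 4 tab-separated fields")
        # Assumption: only the first four fields matter here.
        start, end, speaker, transcript = fields[:4]
        segments.append(Segment(float(start), float(end), speaker, transcript))
    return segments


# Made-up sample rows for illustration only; not taken from the corpus.
SAMPLE = "0.50\t2.75\tA\tsilav\n2.75\t4.00\tB\tçawa yî\n"
segments = parse_transcript(SAMPLE)
```

To read an actual file, open it with `encoding="utf-8"` and pass its contents to `parse_transcript`; the file names and directory layout are described in the corpus documentation, not in this record.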