The University of Toronto is a subscriber to the Linguistic Data Consortium (LDC), which licenses language corpora and other language resources. For more information about the LDC, please visit their website.
The following is a list of corpora that U of T has licensed from the LDC over the years. These may be downloaded by U of T students, staff, and faculty. After clicking one of the links you must review the terms of use before accessing the data. A few corpora are too large for download; please contact us to access these datasets.
This list does not include all corpora available from the LDC, so we encourage you to also browse the full list of corpora on the LDC website. If the LDC offers a corpus you need that is not listed on this page, please get in touch with us, as we may be able to obtain it on your behalf.
2024
- LDC2024S01 - KASET - Kurmanji and Sorani Kurdish Speech and Transcripts - Description - Contact us for data access
- LDC2024S03 - RATS Low Speech Density - Description - Contact us for data access
- LDC2024S05 - Call My Net 1 - Description - Contact us for data access
- LDC2024S06 - Diaspora Tibetan Speech - Description - Download
- LDC2024S08 - Dialogs Re-Enacted Across Languages - Description - Download
- LDC2024T01 - LORELEI Farsi Representative Language Pack - Description - Contact us for data access
- LDC2024T02 - AIDA Scenario 1 Practice Topic Annotation - Description - Download
- LDC2024T03 - LoReHLT Hausa Representative Language Pack - Description - Download
- LDC2024T05 - Automatic Content Extraction for Portuguese - Description - Download (note: contains corrected files, updated July 15, 2024)
- LDC2024T06 - AIDA Scenario 2 Practice Topic Annotation - Description - Download
- LDC2024T07 - LORELEI Uyghur Incident Language Pack - Description - Download
- LDC2024T08 - RST Continuity Corpus - Description - Download
- LDC2024T09 - MultiTACRED - Description - Download
- LDC2024T10 - LORELEI Yoruba Representative Language Pack - Description - Download
- LDC2024T11 - Abstract Meaning Representation 3.0 - Machine Translations - Description - Download
2023
- LDC2023S01 - AIDA Ukrainian Broadcast and Telephone Speech Audio and Transcripts - Description - Contact us for data access
- LDC2023S02 - Mixer 3 Speech - Description - Documents - Contact us for data access
- LDC2023S03 - 2019 NIST Speaker Recognition Evaluation Test Set -- CTS Challenge - Description - Contact us for data access
- LDC2023S04 - Mixer 7 Spanish Speech - Description - Contact us for data access
- LDC2023S06 - 2019 OpenSAT Public Safety Communications Simulation - Description - Contact us for data access
- LDC2023S08 - CALLFRIEND Russian Speech - Description - Download
- LDC2023S09 - REMIX Telephone Collection - Description - Contact us for data access
- LDC2023S10 - Kasdi-Merbah (University) Emotional Database in Arabic Speech - Description - Download
- LDC2023T01 - LORELEI Swahili Representative Language Pack - Description - Download
- LDC2023T02 - LORELEI Tagalog Representative Language Pack - Description - Download
- LDC2023T03 - LORELEI Tamil Representative Language Pack - Description - Download
- LDC2023T04 - DEFT English Light and Rich ERE Annotation - Description - Download
- LDC2023T05 - Penn Korean Universal Dependency Treebank - Description - Download
- LDC2023T06 - LORELEI Zulu Representative Language Pack - Description - Download
- LDC2023T07 - LORELEI Indonesian Representative Language Pack - Description - Download
- LDC2023T08 - LORELEI Thai Representative Language Pack - Description - Contact us for data access
- LDC2023T09 - CALLFRIEND Russian Text - Description - Download
- LDC2023T10 - AIDA Scenario 1 and 2 Reference Knowledge Base - Description - Download
- LDC2023T11 - AIDA Scenario 1 Practice Topic Source Data - Description - Contact us for data access
- LDC2023T13 - TAC KBP Belief and Sentiment - Comprehensive Training and Evaluation Data 2016-2017 - Description - Download
- LDC2023V01 - 2019 NIST Speaker Recognition Evaluation Test Set -- Audio-Visual - Description - Contact us for data access
2022
- LDC2022L01 - Rime-Cantonese: A Normalized Cantonese Jyutping Lexicon - Description - Download
- LDC2022S01 - 2017 NIST OpenSAT Pilot - SSSF - Description - Download
- LDC2022S02 - The Child Subglottal Resonances Database - Description - Download
- LDC2022S04 - NUBUC - Description - Download
- LDC2022S06 - Second DIHARD Challenge Evaluation - Eleven Sources - Description - Download
- LDC2022S09 - Xi'an Guanzhong Object Naming - Description - Download
- LDC2022S10 - 2017 NIST Language Recognition Evaluation Training and Development Sets - Description - Contact us for data access
- LDC2022S12 - Third DIHARD Challenge Development - Description - Download
- LDC2022S13 - Global TIMIT Thai - Description - Download
- LDC2022S14 - Third DIHARD Challenge Evaluation - Description - Download
- LDC2022T01 - LORELEI Kinyarwanda Incident Language Pack - Description - Download
- LDC2022T02 - AttImam - Description - Download
- LDC2022T03 - LORELEI Wolof Representative Language Pack - Description - Download
- LDC2022T04 - Qatari Corpus of Argumentative Writing - Description - Download
- LDC2022T05 - LORELEI Bengali Representative Language Pack - Description - Contact us for data access
- LDC2022T06 - BOLT English Translation Treebank - Egyptian Arabic SMS/Chat - Description - Download
- LDC2022T07 - CAMIO Transcription Languages - Description - Contact us for data access
- LDC2022V01 - HAVIC MED Novel 1 Test -- Videos, Metadata and Annotation - Description - Contact us for data access
- LDC2022V02 - HAVIC MED Novel 2 Test -- Videos, Metadata and Annotation - Description - Contact us for data access
2021
- LDC2021L01 - Classical Arabic Dictionary - Description - Download
- (Special agreement) LDC2021S02 - Columbia Games Corpus - Description - Download
- LDC2021S03 - Global TIMIT Mandarin Chinese - Description - Download
- (Special agreement) LDC2021S05 - MyST Children's Conversational Speech - Description - Contact us for data access
- (Special agreement) LDC2021S06 - Ethnobotanical Research and Language Documentation of Nahuatl - Description - Contact us for data access
- LDC2021S07 - Wikipedia Spanish Speech and Transcripts - Description - Download
- LDC2021S08 - RATS Speaker Identification - Description - Contact us for data access
- LDC2021S09 - UCLA Speaker Variability Database - Description - Download
- LDC2021S10 - Second DIHARD Challenge Development - Eleven Sources - Description - Download
- LDC2021T02 - LORELEI Akan Representative Language Pack - Description - Download
- LDC2021T03 - BOLT English Treebank - SMS/Chat - Description - Download
- LDC2021T04 - ATIS - Seven Languages - Description - Download
- LDC2021T05 - Penn Discourse Treebank Version 2.0 - German Translation - Description - Download
- LDC2021T06 - TAC KBP English Surprise Slot Filling -- Comprehensive Training and Evaluation Data 2010 - Description - Download
- LDC2021T07 - BOLT Chinese Co-reference -- Discussion Forum, SMS/Chat, and Conversational Telephone Speech - Description - Download
- LDC2021T08 - TAC KBP English Sentiment Slot Filling -- Comprehensive Training and Evaluation Data 2013-2014 - Description - Download
- LDC2021T09 - X-SRL: Parallel Cross-lingual Semantic Role Labeling - Description - Download
- LDC2021T10 - ESPADA - Description - Download
- LDC2021T11 - BOLT Chinese SMS/Chat Parallel Training Data - Description - Download
- LDC2021T12 - BOLT Egyptian Arabic Treebank - Conversational Telephone Speech - Description - Download
- LDC2021T13 - Chinese Abstract Meaning Representation 2.0 - Description - Download
- LDC2021T14 - BOLT Egyptian Arabic Co-reference -- Discussion Forum, SMS/Chat, and Conversational Telephone Speech - Description - Download
- LDC2021T15 - BOLT Egyptian Arabic SMS/Chat Parallel Training Data - Description - Download
- LDC2021T16 - DiscAlign for Penn and RST Discourse Treebanks - Description - Download
- LDC2021T17 - BOLT Egyptian Arabic Treebank - SMS/Chat - Description - Download
- LDC2021T18 - BOLT Egyptian Arabic PropBank and Sense -- Discussion Forum, SMS/Chat, and Conversational Telephone Speech - Description - Download
- LDC2021T19 - BOLT English Translation Treebank - Chinese SMS/Chat - Description - Download
- LDC2021V01 - HAVIC MED Training Data -- Videos, Metadata and Annotation - Description - Contact us for data access
2020
- Special COVID-19 data release - complete your own application on the LDC website to access (no cost)
- LDC2020L02 - Chinese Lexical Resources for Gender, Number, Animacy - Description - Download
- LDC2020S01 - LibriVox Spanish - Description - Contact us for data access
- LDC2020S04 - 2018 NIST Speaker Recognition Evaluation Test Set - Description - Contact us for data access
- LDC2020S05 - Multi-Language Conversational Telephone Speech 2011 - Mandarin Chinese - Description - Download
- LDC2020S06 - CALLFRIEND Mandarin Chinese-Taiwan Dialect Second Edition - Description - Download
- LDC2020S08 - CALLFRIEND American English-Southern Dialect Second Edition - Description - Download
- LDC2020S09 - Global TIMIT Learner Treebank English - Description - Download
- LDC2020S11 - Global TIMIT Learner Simple English - Description - Download
- LDC2020S12 - Global TIMIT Mandarin Chinese-Guanzhong Dialect - Description - Download
- LDC2020S13 - Phonemes of Arabic - Description - Download
- LDC2020T01 - Chinese CogBank - Description - Download
- LDC2020T02 - Abstract Meaning Representation (AMR) Annotation Release 3.0 - Description - Download
- LDC2020T03 - TAC KBP English Event Argument - Training and Evaluation Data 2014-2015 - Description - Download
- LDC2020T04 - Machine Reading Phase 1 IC Training Data - Description - Download
- LDC2020T05 - BOLT Egyptian Arabic-English Word Alignment -- Conversational Telephone Speech Training - Description - Download
- LDC2020T07 - Abstract Meaning Representation 2.0 - Four Translations - Description - Download
- LDC2020T08 - TAC KBP English Temporal Slot Filling - Comprehensive Training and Evaluation Data 2011 and 2013 - Description - Download
- LDC2020T09 - BOLT English Translation Treebank - Chinese Discussion Forum - Description - Download
- LDC2020T10 - LORELEI Entity Detection and Linking Knowledge Base - Description - Download
- LDC2020T11 - LORELEI Oromo Incident Language Pack - Description - Download
- LDC2020T13 - TAC KBP English Event Nugget Detection and Coreference - Comprehensive Training and Evaluation Data 2014-15 - Description - Download
- LDC2020T14 - Speech Sentiment Annotations - Description - Download
- LDC2020T15 - BOLT Chinese-English Word Alignment and Tagging - Conversational Telephone Speech Training - Description - Download
- LDC2020T16 - Penn Parsed Corpora of Historical English - Description - Download
- LDC2020T17 - LORELEI Vietnamese Representative Language Pack - Description - Contact us for data access
- LDC2020T18 - TAC KBP Event Argument - Comprehensive Training and Evaluation Data 2016-2017 - Description - Download
- LDC2020T19 - DEFT Chinese Light and Rich ERE Annotation - Description - Download
- LDC2020T20 - BOLT English Co-reference -- Discussion Forum, SMS/Chat, and Conversational Telephone Speech - Description - Download
- LDC2020T21 - BOLT English PropBank and Sense -- Discussion Forum, SMS/Chat, and Conversational Telephone Speech - Description - Download
- LDC2020T22 - LORELEI Tigrinya Incident Language Pack - Description - Download
- LDC2020T24 - LORELEI Ukrainian Representative Language Pack - Description - Contact us for data access
2019
- LDC2019S02 - Multi-Language Conversational Telephone Speech 2011 -- Arabic Group - Description - Contact us for data access
- LDC2019S04 - CALLFRIEND Egyptian Arabic Second Edition - Description - Download
- LDC2019S05 - VAST Chinese Speech and Transcripts - Description - Download
- LDC2019S06 - Multi-Language Conversational Telephone Speech 2011 -- English Group - Description - Download
- LDC2019S07 - CIEMPIESS Experimentation - Description - Download
- LDC2019S09 - First DIHARD Challenge Development - Eight Sources - Description - Download
- LDC2019S12 - First DIHARD Challenge Evaluation - Nine Sources - Description - Download
- LDC2019S14 - The DKU-JNU-EMA Electromagnetic Articulography Database - Description - Download
- LDC2019S15 - Multi-Language Conversational Telephone Speech 2011 -- East Asian - Description - Download
- LDC2019S18 - CALLFRIEND Canadian French Second Edition - Description - Download
- LDC2019S19 - Polish Speech Database - Description - Contact us for data access
- LDC2019S20 - 2016 NIST Speaker Recognition Evaluation Test Set - Description - Contact us for data access
- LDC2019S21 - CALLFRIEND American English-Non-Southern Dialect Second Edition - Description - Download
- LDC2019S23 - Magic Data Chinese Mandarin Conversational Speech - Description - Contact us for data access
- LDC2019T01 - BOLT Arabic Discussion Forum Parallel Training Data - Description - Download
- LDC2019T02 - TAC KBP Entity Discovery and Linking - Comprehensive Training and Evaluation Data 2014-2015 - Description - Contact us for data access
- LDC2019T03 - DEFT Chinese Committed Belief Annotation - Description - Download
- LDC2019T04 - Multilingual ATIS - Description - Download
- LDC2019T05 - Penn Discourse Treebank Version 3.0 - Description - Download
- LDC2019T06 - BOLT Egyptian-English Word Alignment -- Discussion Forum Training - Description - Download
- LDC2019T07 - Chinese Abstract Meaning Representation 1.0 - Description - Download
- LDC2019T08 - TAC KBP Chinese Regular Slot Filling - Comprehensive Training and Evaluation Data 2014 - Description - Download
- LDC2019T09 - DEFT Spanish Committed Belief Annotation - Description - Download
- LDC2019T10 - Phrase Detectives Corpus Version 2 - Description - Download
- LDC2019T11 - Corpus of Conversational Persian Transcripts - Description - Download
- LDC2019T12 - TAC KBP Evaluation Source Corpora 2016-2017 - Description - Download
- LDC2019T13 - BOLT Chinese-English Word Alignment and Tagging -- SMS/Chat Training - Description - Download
- LDC2019T14 - Machine Reading Phase 1 NFL Scoring Training Data - Description - Download
- LDC2019T15 - BOLT English Treebank - Discussion Forum - Description - Download
- LDC2019T16 - DEFT English Committed Belief Annotation - Description - Download
- LDC2019T17 - TAC KBP Cold Start - Comprehensive Evaluation Data 2012-2017 - Description - Download
- LDC2019T18 - BOLT Egyptian Arabic-English Word Alignment -- SMS/Chat Training - Description - Download
- LDC2019T19 - TAC KBP Entity Discovery and Linking - Comprehensive Evaluation Data 2016-2017 - Description - Download
- LDC2019V01 - HAVIC MED Progress Test -- Videos, Metadata and Annotation - Description - Contact us for data access
2018
- LDC2018S03 - Multi-Language Conversational Telephone Speech 2011 -- Central Asian - Description - Download
- LDC2018S04 - Rhythm and Pitch - Description - Download
- LDC2018S05 - GALE Phase 4 Arabic Broadcast News Speech - Description - Download
- LDC2018S06 - 2011 NIST Language Recognition Evaluation Test Set - Description - Download
- LDC2018S08 - Multi-Language Conversational Telephone Speech 2011 -- Central European - Description - Download
- LDC2018S09 - CALLFRIEND Mandarin Chinese-Mainland Dialect Second Edition - Description - Download
- LDC2018S10 - RATS Language Identification - Description - Contact us for data access
- LDC2018S11 - CIEMPIESS Balance - Description - Download
- LDC2018S12 - Multi-Language Conversational Telephone Speech 2011 -- Spanish - Description - Download
- LDC2018S14 - AISHELL-1 - Description - Contact us for data access
- LDC2018S15 - Avatar Education Portuguese - Description - Download
- LDC2018S18 - HUB5 Mandarin Telephone Speech and Transcripts Second Edition - Description - Download
- LDC2018T01 - DEFT Spanish Treebank - Description - Download
- LDC2018T03 - TAC KBP Comprehensive English Source Corpora 2009-2014 - Description - Contact us for data access
- LDC2018T04 - LORELEI Amharic Representative Language Pack - Monolingual and Parallel Text - Description - Download
- LDC2018T06 - 2007 CoNLL Shared Task - Basque, Catalan, Czech & Turkish - Description - Download
- LDC2018T08 - 2007 CoNLL Shared Task - Arabic & English - Description - Download
- LDC2018T09 - SPADE - Description - Download
- LDC2018T10 - BOLT Arabic Discussion Forums - Description - Contact us for data access
- LDC2018T11 - LORELEI Somali Representative Language Pack - Monolingual and Parallel Text - Description - Download
- (Special agreement) LDC2018T12 - Concretely Annotated New York Times - Description - Contact us for data access
- LDC2018T14 - GALE Phase 4 Arabic Broadcast News Transcripts - Description - Download
- LDC2018T15 - BOLT Chinese SMS/Chat - Description - Download
- LDC2018T16 - TAC KBP English Entity Linking - Comprehensive Training and Evaluation Data 2009-2013 - Description - Download
- LDC2018T18 - BOLT Information Retrieval Comprehensive Training and Evaluation - Description - Download
- LDC2018T19 - BOLT English SMS/Chat - Description - Download
- LDC2018T20 - Concretely Annotated English Gigaword - Description - Contact us for data access
- LDC2018T22 - TAC KBP English Regular Slot Filling - Comprehensive Training and Evaluation Data 2009-2014 - Description - Download
- LDC2018T23 - BOLT Egyptian Arabic Treebank - Discussion Forum - Description - Download
- LDC2018T24 - TAC Relation Extraction Dataset - Description - Download
- LDC2018V01 - HAVIC MED Event E051-E060 -- Videos, Metadata and Annotation - Description - Contact us for data access
2017
- (Special agreement) LDC2017S01 - IARPA Babel Vietnamese Language Pack IARPA-babel107b-v.0.7 - Description - Contact us for data access
- (Special agreement) LDC2017S03 - IARPA Babel Haitian Creole Language Pack IARPA-babel201b-v0.2b - Description - Contact us for data access
- (Non-member agreement) LDC2017S24 - CHiME3 - Description - Contact us for data access
- (Non-member agreement) LDC2017T14 - Ancient Chinese Corpus - Description - Download
2016
- LDC2016T13 - Chinese Treebank 9.0 - Description - Download
2015
- (Non-member agreement) LDC2015E21 - CoNLL-2015 Shared Task on Shallow Discourse Parsing - Training and Development Data - Description - Download
- (Non-member agreement) LDC2015T08 - Coordination Annotation for the Penn Treebank - Description - Download - note: this is the revised data for LDC99T42
- (Non-member agreement) LDC2015T13 - English News Text Treebank: Penn Treebank Revised - Description - Download
2014
- (Non-member agreement) LDC2014T21 - Chinese Discourse Treebank 0.5 - Description - Download
2013
- LDC2013S03 - Mixer 6 Speech - Description - Contact us for data access
- LDC2013S05 - Greybeard - Description - Contact us for data access
- (Special agreement) LDC2013S09 - CSC Deceptive Speech - Description - Download
- LDC2013T14 - GALE Arabic-English Parallel Aligned Treebank -- Broadcast News Part 1 - Description - Download
- LDC2013T19 - OntoNotes Release 5.0 - Description - Download
- LDC2013T22 - The ARRAU Corpus of Anaphoric Information - Description - Download
2012
- (Non-member agreement) LDC2012T13 - English Web Treebank - Description - Download
- (Non-member agreement) LDC2012T21 - Annotated English Gigaword - Description - Contact us for data access
2011
- LDC2011S01 - 2005 NIST Speaker Recognition Evaluation Training Data - Description - Contact us for data access
- LDC2011S04 - 2005 NIST Speaker Recognition Evaluation Test Data - Description - Contact us for data access
- LDC2011S05 - 2008 NIST Speaker Recognition Evaluation Training Set Part 1 - Description - Contact us for data access
- LDC2011S06 - 2005 Spring NIST Rich Transcription (RT-05S) Evaluation Set - Description - Contact us for data access
- LDC2011S08 - 2008 NIST Speaker Recognition Evaluation Test Set - Description - Contact us for data access
- LDC2011S09 - 2006 NIST Speaker Recognition Evaluation Training Set - Description - Contact us for data access
- LDC2011S10 - 2006 NIST Speaker Recognition Evaluation Test Set Part 1 - Description - Contact us for data access
- LDC2011T01 - SemEval-2010 Task 1 OntoNotes English: Coreference Resolution in Multiple Languages - Description - Download
- LDC2011T03 - OntoNotes Release 4.0 - Description - Contact us for data access
- LDC2011T06 - Broadcast News Lattices - Description - Download
- LDC2011T07 - English Gigaword Fifth Edition - Description - Contact us for data access
- LDC2011T09 - Arabic Treebank: Part 2 v 3.1 - Description - Download
2010
- LDC2010T07 - Chinese Treebank 7.0 - Description - Download
2009
- (Special agreement) LDC2009S01 - CSLU: Numbers Version 1.3 - Description - Download
- (Special agreement) LDC2009S03 - CSLU: S4X Release 1.2 - Description - Download
- (Special agreement) LDC2009T26 - NXT Switchboard Annotations - Description - Contact us for data access
2008
- (Special agreement) LDC2008S01 - CSLU: Portland Cellular Telephone Speech Version 1.3 - Description - Download
- (Special agreement) LDC2008S02 - CSLU: National Cellular Telephone Speech Release 2.3 - Description - Download
- (Non-member agreement) LDC2008S04 - West Point Brazilian Portuguese Speech - Description - Download
- (Special agreement) LDC2008S06 - CSLU: Alphadigit Version 1.3 - Description - Download
- (Special agreement) LDC2008S07 - CSLU: ISOLET Spoken Letter Database Version 1.3 - Description - Download
- (Non-member agreement) LDC2008T05 - Penn Discourse Treebank Version 2.0 - Description - Download
- (Special agreement) LDC2008T19 - The New York Times Annotated Corpus - Description - Contact us for data access
- (Non-member agreement) LDC2008T23 - NomBank v 1.0 - Description - Download
- (Non-member agreement) LDC2008T24 - COMNOM v 1.0 - Description - Download
2007
- (Special agreement) LDC2007S05 - CSLU: Yes/No Version 1.2 - Description - Download
- (Special agreement) LDC2007S08 - CSLU: Foreign Accented English Release 1.2 - Description - Download
- (Non-member agreement) LDC2007S10 - 2003 NIST Rich Transcription Evaluation Data - Description - Download
- (Special agreement) LDC2007S13 - CSLU: Apple Words and Phrases - Description - Download
- (Special agreement) LDC2007S18 - CSLU: Kids' Speech Version 1.1 - Description - Contact us for data access
- (Non-member agreement) LDC2007T36 - Chinese Treebank 6.0 - Description - Download
2006
- (Special agreement) LDC2006S13 - N4 NATO Native and Non-Native Speech - Description - Contact us for data access
- LDC2006S26 - CSLU: Speaker Recognition Version 1.1 - Description - Download
- LDC2006S44 - 2004 NIST Speaker Recognition Evaluation - Description - Contact us for data access
- LDC2006T06 - ACE 2005 Multilingual Training Corpus - Description - Download
- LDC2006T08 - TimeBank 1.2 - Description - Download
- LDC2006T10 - English-Arabic Treebank v 1.0 - Description - Download
- (Special agreement) LDC2006T13 - Web 1T 5-gram Version 1 - Description - Contact us for data access
2005
- LDC2005S11 - TDT4 Multilingual Broadcast News Speech Corpus - Description - Download
- LDC2005S13 - Fisher English Training Part 2, Speech - Description - Contact us for data access
- LDC2005S15 - HKUST Mandarin Telephone Speech, Part 1 - Description - Contact us for data access
- LDC2005S25 - Santa Barbara Corpus of Spoken American English Part IV - Description - Download
- LDC2005T01 - Chinese Treebank 5.0 - Description - Download 5.0 - Download 5.1
- LDC2005T06 - Chinese News Translation Text Part 1 - Description - Download
- LDC2005T09 - ACE 2004 Multilingual Training Corpus - Description - Download
- LDC2005T10 - Chinese English News Magazine Parallel Text - Description - Download
- LDC2005T12 - English Gigaword Second Edition - Description - Contact us for data access
- LDC2005T13 - CCGbank - Description - Download
- LDC2005T16 - TDT4 Multilingual Text and Annotations - Description - Download
- LDC2005T19 - Fisher English Training Part 2, Transcripts - Description - Download
- LDC2005T20 - Arabic Treebank: Part 3 (full corpus) v 2.0 (MPG + Syntactic Analysis) - Description - Download
- LDC2005T30 - Arabic Treebank: Part 4 v 1.0 (MPG Annotation) - Description - Download
- LDC2005T32 - HKUST Mandarin Telephone Transcript Data, Part 1 - Description - Download
- LDC2005T33 - BBN Pronoun Coreference and Entity Type Corpus - Description - Download
- (Special agreement - 2 licenses to show, one for open portion, one for restricted portion of data) LDC2005T35 - American National Corpus (ANC) Second Release - Description - Download
2004
- LDC2004S02 - ICSI Meeting Speech - Description - Contact us for data access
- LDC2004S07 - Switchboard Cellular Part 2 Audio - Description - Contact us for data access
- LDC2004S10 - Santa Barbara Corpus of Spoken American English Part III - Description - Download
- LDC2004S11 - 2002 Rich Transcription Broadcast News and Conversational Telephone Speech - Description - Download
- LDC2004S13 - Fisher English Training Speech Part 1 Speech - Description - Contact us for data access
- LDC2004T04 - ICSI Meeting Transcripts - Description - Download
- LDC2004T08 - Hong Kong Parallel Text - Description - Download
- LDC2004T11 - Arabic Treebank: Part 3 v 1.0 - Description - Download
- LDC2004T14 - Proposition Bank I - Description - Download
- LDC2004T15 - 2000 Communicator Dialogue Act Tagged - Description - Download
- LDC2004T16 - 2001 Communicator Dialogue Act Tagged - Description - Download
- LDC2004T18 - Arabic English Parallel News Part 1 - Description - Download
- LDC2004T19 - Fisher English Training Speech Part 1 Transcripts - Description - Download
2003
- LDC2003S01 - 2001 Communicator Evaluation - Description - Contact us for data access
- LDC2003S06 - Santa Barbara Corpus of Spoken American English Part II - Description - Download
- LDC2003T06 - Arabic Treebank: Part 1 v 2.0 - Description - Download
- LDC2003T15 - SLX Corpus of Classic Sociolinguistic Interviews - Description - Download
- LDC2003T17 - Multiple-Translation Chinese (MTC) Part 2 - Description - Download
2002
- LDC2002L27 - Chinese-English Translation Lexicon Version 3.0 - Description - Download
- (Special agreement) LDC2002L49 - Buckwalter Arabic Morphological Analyzer Version 1.0 - Description - Contact us for data access
- LDC2002S04 - Translanguage English Database (TED) Speech - Description - Contact us for data access
- LDC2002S06 - Switchboard-2 Phase III Audio - Description - Contact us for data access
- LDC2002S09 - 2000 HUB5 English Evaluation Speech - Description - Download
- (Special agreement) LDC2002S11 - 1997 HUB4 English Evaluation Speech and Transcripts - Description - Contact us for data access
- LDC2002S28 - Emotional Prosody Speech and Transcripts - Description - Download
- LDC2002S56 - 2000 Communicator Evaluation - Description - Contact us for data access
- LDC2002T03 - Translanguage English Database (TED) Transcripts - Description - Download
- LDC2002T07 - RST Discourse Treebank - Description - Download
- LDC2002T38 - CALLHOME Egyptian Arabic Transcripts Supplement - Description - Download
- LDC2002T43 - 2000 HUB5 English Evaluation Transcripts - Description - Download
2001
- LDC2001S13 - Switchboard Cellular Part 1 Audio - Description - Contact us for data access
- LDC2001S97 - 2000 NIST Speaker Recognition Evaluation - Description - Download
- LDC2001T02 - Message Understanding Conference (MUC) 7 - Description - Download
- (Special agreement) LDC2001T10 - Prague Dependency Treebank 1.0 - Description - Contact us for data access
2000
- LDC2000S85 - Santa Barbara Corpus of Spoken American English Part I - Description - Download
- (Special agreement) LDC2000S86 - 1998 HUB4 Broadcast News Evaluation English Test Material - Description - Download
- LDC2000S87 - Speech in Noisy Environments (SPINE) Training Audio - Description - Download
- LDC2000S88 - 1999 HUB4 Broadcast News Evaluation English Test Material - Description - Download
- (Special agreement) LDC2000T43 - BLLIP 1987-89 WSJ Corpus Release 1 - Description - Download
- LDC2000T46 - Hong Kong News Parallel Text - Description - Download
- LDC2000T49 - Speech in Noisy Environments (SPINE) Training Transcripts - Description - Download
- LDC2000T50 - Hong Kong Hansards Parallel Text - Description - Download
1999
- (Non-member agreement) LDC99S78 - SUSAS - Description - Download
- (Non-member agreement) LDC99T42 - Treebank-3 - Description - Download - note: please see LDC2015T08 above for revised data
1998
- (Special agreement) LDC98L21 - COMLEX English Syntax Lexicon - Description - Download
- (Non-member agreement) LDC98S71 - 1997 English Broadcast News Speech (HUB4) - Description - Contact us for data access
- (Non-member agreement) LDC98T28 - 1997 English Broadcast News Transcripts (HUB4) - Description - Download
- (Special agreement) LDC98T31 - 1996 CSR HUB4 Language Model - Description - Download
1997
- LDC97S62 - Switchboard-1 Release 2 - Description - Contact us for data access
- (Special agreement) LDC97T22 - 1996 English Broadcast News Transcripts (HUB4) - Description - Download
1996
- (Special agreement) LDC96L14 - CELEX2 - Description - Contact us for data access
- (Non-member agreement) LDC96S60 - CALLFRIEND Vietnamese - Description - Download
- (Special agreement) LDC96T10 - Message Understanding Conference (MUC) 6 Additional News Text - Description - Contact us for data access
- (Special agreement) LDC96T11 - COMLEX Syntax Text Corpus Version 2.0 - Description - Contact us for data access
1995
- (Non-member agreement) LDC95S26 - ATIS3 Test Data - Description - Download
- (Non-member agreement) LDC95T6 - CSR-III Text - Description - Download
- (Special agreement) LDC95T13 - Mandarin Chinese News Text - Description - Download
- (Non-member agreement) LDC95T7 - Treebank-2 - Description - Download
1994
- (Non-member agreement) LDC94S13B - CSR-II (WSJ1) Sennheiser - Description - Contact us for data access
- (Non-member agreement) LDC94S17 - OGI Multilanguage Corpus - Description - Download
- (Non-member agreement) LDC94S19 - ATIS3 Training Data - Description - Download
- (Non-member agreement) LDC94T5 - ECI Multilingual Text - Description - Download
1993
- (Non-member agreement) LDC93S1 - TIMIT Acoustic-Phonetic Continuous Speech Corpus - Description - Download
- (Non-member agreement) LDC93S10 - TIDIGITS - Description - Download flac file - Download comp file
- LDC93S3A - Resource Management Complete Set 2.0 - Description - Contact us for data access
- LDC93S5 - ATIS2 - Description - Contact us for data access
- LDC93S6B - CSR-I (WSJ0) Sennheiser - Description - Contact us for data access
- (Non-member agreement) LDC93S9 - TI 46-Word - Description - Download