Linguistic Data Consortium

The University of Toronto is a subscriber to the Linguistic Data Consortium (LDC), which licenses language corpora and other language resources. For more information about the LDC, please visit their website.

The following is a list of corpora that U of T has licensed from the LDC over the years. These may be downloaded by U of T students, staff, and faculty. After clicking one of the links, you must review the terms of use before accessing the data. A few corpora are too large for download; please contact us to access these datasets.

This list does not include all corpora available from the LDC, so we encourage you to also browse the full list of corpora on the LDC website. If the LDC offers a corpus that you need but that is not listed on this page, please get in touch with us, as we may be able to obtain it on your behalf.

2022

2021

  • LDC2021L01 - Classical Arabic Dictionary - Description - Download
  • (Special agreement) LDC2021S02 - Columbia Games Corpus - Description - Download
  • LDC2021S03 - Global TIMIT Mandarin Chinese - Description - Download
  • (Special agreement) LDC2021S05 - MyST Children's Conversational Speech - Description - Contact us for data access
  • (Special agreement) LDC2021S06 - Ethnobotanical Research and Language Documentation of Nahuatl - Description - Contact us for data access
  • LDC2021S07 - Wikipedia Spanish Speech and Transcripts - Description - Download
  • LDC2021S08 - RATS Speaker Identification - Description - Contact us for data access
  • LDC2021S09 - UCLA Speaker Variability Database - Description - Download
  • LDC2021S10 - Second DIHARD Challenge Development - Eleven Sources - Description - Download
  • LDC2021T02 - LORELEI Akan Representative Language Pack - Description - Download
  • LDC2021T03 - BOLT English Treebank - SMS/Chat - Description - Download
  • LDC2021T04 - ATIS - Seven Languages - Description - Download
  • LDC2021T05 - Penn Discourse Treebank Version 2.0 - German Translation - Description - Download
  • LDC2021T06 - TAC KBP English Surprise Slot Filling -- Comprehensive Training and Evaluation Data 2010 - Description - Download
  • LDC2021T07 - BOLT Chinese Co-reference -- Discussion Forum, SMS/Chat, and Conversational Telephone Speech - Description - Download
  • LDC2021T08 - TAC KBP English Sentiment Slot Filling -- Comprehensive Training and Evaluation Data 2013-2014 - Description - Download
  • LDC2021T09 - X-SRL: Parallel Cross-lingual Semantic Role Labeling - Description - Download
  • LDC2021T10 - ESPADA - Description - Download
  • LDC2021T11 - BOLT Chinese SMS/Chat Parallel Training Data - Description - Download
  • LDC2021T12 - BOLT Egyptian Arabic Treebank - Conversational Telephone Speech - Description - Download
  • LDC2021T13 - Chinese Abstract Meaning Representation 2.0 - Description - Download
  • LDC2021T14 - BOLT Egyptian Arabic Co-reference -- Discussion Forum, SMS/Chat, and Conversational Telephone Speech - Description - Download
  • LDC2021T15 - BOLT Egyptian Arabic SMS/Chat Parallel Training Data - Description - Download
  • LDC2021T17 - BOLT Egyptian Arabic Treebank - SMS/Chat - Description - Download
  • LDC2021T18 - BOLT Egyptian Arabic PropBank and Sense -- Discussion Forum, SMS/Chat, and Conversational Telephone Speech - Description - Download
  • LDC2021T19 - BOLT English Translation Treebank - Chinese SMS/Chat - Description - Download
  • LDC2021V01 - HAVIC MED Training Data -- Videos, Metadata and Annotation - Description - Contact us for data access

2020

2019

2018

2017

  • (Special agreement) LDC2017S03 - IARPA Babel Haitian Creole Language Pack IARPA-babel201b-v0.2b - Description - Contact us for data access
  • (Non-member agreement) LDC2017S24 - CHiME3 - Description - Contact us for data access
  • (Non-member agreement) LDC2017T14 - Ancient Chinese Corpus - Description - Download

2015

  • (Non-member agreement) LDC2015E21 - CoNLL-2015 Shared Task on Shallow Discourse Parsing - Training and Development Data - Description - Download
  • (Non-member agreement) LDC2015T08 - Coordination Annotation for the Penn Treebank - Description - Download - note: this is the revised data for LDC99T42
  • (Non-member agreement) LDC2015T13 - English News Text Treebank: Penn Treebank Revised - Description - Download

2014

2013

2012

2011

2009

2008

  • (Special agreement) LDC2008S01 - CSLU: Portland Cellular Telephone Speech Version 1.3 - Description - Download
  • (Special agreement) LDC2008S02 - CSLU: National Cellular Telephone Speech Release 2.3 - Description - Download
  • (Non-member agreement) LDC2008S04 - West Point Brazilian Portuguese Speech - Description - Download
  • (Special agreement) LDC2008S06 - CSLU: Alphadigit Version 1.3 - Description - Download
  • (Special agreement) LDC2008S07 - CSLU: ISOLET Spoken Letter Database Version 1.3 - Description - Download
  • (Non-member agreement) LDC2008T05 - Penn Discourse Treebank Version 2.0 - Description - Download
  • (Special agreement) LDC2008T19 - The New York Times Annotated Corpus - Description - Contact us for data access
  • (Non-member agreement) LDC2008T23 - NomBank v 1.0 - Description - Download
  • (Non-member agreement) LDC2008T24 - COMNOM v 1.0 - Description - Download

2007

  • (Special agreement) LDC2007S05 - CSLU: Yes/No Version 1.2 - Description - Download
  • (Special agreement) LDC2007S08 - CSLU: Foreign Accented English Release 1.2 - Description - Download
  • (Non-member agreement) LDC2007S10 - 2003 NIST Rich Transcription Evaluation Data - Description - Download
  • (Special agreement) LDC2007S13 - CSLU: Apple Words and Phrases - Description - Download
  • (Special agreement) LDC2007S18 - CSLU: Kids' Speech Version 1.1 - Description - Contact us for data access
  • (Non-member agreement) LDC2007T36 - Chinese Treebank 6.0 - Description - Download

2006

2005

2004

2003

2002

2001

2000

1999

  • (Non-member agreement) LDC99S78 - SUSAS - Description - Download
  • (Non-member agreement) LDC99T42 - Treebank-3 - Description - Download - note: please see LDC2015T08 above for revised data

1998

  • (Special agreement) LDC98L21 - COMLEX English Syntax Lexicon - Description - Download
  • (Non-member agreement) LDC98S71 - 1997 English Broadcast News Speech (HUB4) - Description - Contact us for data access
  • (Non-member agreement) LDC98T28 - 1997 English Broadcast News Transcripts (HUB4) - Description - Download
  • (Special agreement) LDC98T31 - 1996 CSR HUB4 Language Model - Description - Download

1997

1996

  • (Special agreement) LDC96L14 - CELEX2 - Description - Contact us for data access
  • (Non-member agreement) LDC96S60 - CALLFRIEND Vietnamese - Description - Download
  • (Special agreement) LDC96T10 - Message Understanding Conference (MUC) 6 Additional News Text - Description - Contact us for data access
  • (Special agreement) LDC96T11 - COMLEX Syntax Text Corpus Version 2.0 - Description - Contact us for data access

1995

1994

1993