Linguistic Data Consortium

The University of Toronto is a subscriber to the Linguistic Data Consortium (LDC), which licenses language corpora and other language resources. For more information about the LDC, please visit their website.

The following is a list of corpora that U of T has licensed from the LDC over the years. These may be downloaded by U of T students, staff, and faculty. After clicking one of the links, you must review the terms of use before accessing the data. A few corpora are too large to download; please contact us to access these datasets.

This list does not include all corpora available from the LDC, so we encourage you to also browse the full list of corpora on the LDC website. If the LDC offers a corpus you need that is not listed on this page, please get in touch with us, as we may be able to obtain it on your behalf.

2024

2023

2022

2021

2020

2019

2018

2017

  • (Special agreement) LDC2017S01 - IARPA Babel Vietnamese Language Pack IARPA-babel107b-v.0.7 - Description - Contact us for data access
  • (Special agreement) LDC2017S03 - IARPA Babel Haitian Creole Language Pack IARPA-babel201b-v0.2b - Description - Contact us for data access
  • (Non-member agreement) LDC2017S24 - CHiME3 - Description - Contact us for data access
  • (Non-member agreement) LDC2017T14 - Ancient Chinese Corpus - Description - Download

2016

2015

  • (Non-member agreement) LDC2015E21 - CoNLL-2015 Shared Task on Shallow Discourse Parsing - Training and Development Data - Description - Download
  • (Non-member agreement) LDC2015T08 - Coordination Annotation for the Penn Treebank - Description - Download - note: this is the revised data for LDC99T42
  • (Non-member agreement) LDC2015T13 - English News Text Treebank: Penn Treebank Revised - Description - Download

2014

2013

2012

2011

2010

2009

2008

  • (Special agreement) LDC2008S01 - CSLU: Portland Cellular Telephone Speech Version 1.3 - Description - Download
  • (Special agreement) LDC2008S02 - CSLU: National Cellular Telephone Speech Release 2.3 - Description - Download
  • (Non-member agreement) LDC2008S04 - West Point Brazilian Portuguese Speech - Description - Download
  • (Special agreement) LDC2008S06 - CSLU: Alphadigit Version 1.3 - Description - Download
  • (Special agreement) LDC2008S07 - CSLU: ISOLET Spoken Letter Database Version 1.3 - Description - Download
  • (Non-member agreement) LDC2008T05 - Penn Discourse Treebank Version 2.0 - Description - Download
  • (Special agreement) LDC2008T19 - The New York Times Annotated Corpus - Description - Contact us for data access
  • (Non-member agreement) LDC2008T23 - NomBank v 1.0 - Description - Download
  • (Non-member agreement) LDC2008T24 - COMNOM v 1.0 - Description - Download

2007

  • (Special agreement) LDC2007S05 - CSLU: Yes/No Version 1.2 - Description - Download
  • (Special agreement) LDC2007S08 - CSLU: Foreign Accented English Release 1.2 - Description - Download
  • (Non-member agreement) LDC2007S10 - 2003 NIST Rich Transcription Evaluation Data - Description - Download
  • (Special agreement) LDC2007S13 - CSLU: Apple Words and Phrases - Description - Download
  • (Special agreement) LDC2007S18 - CSLU: Kids' Speech Version 1.1 - Description - Contact us for data access
  • (Non-member agreement) LDC2007T36 - Chinese Treebank 6.0 - Description - Download

2006

2005

2004

2003

2002

2001

2000

1999

  • (Non-member agreement) LDC99S78 - SUSAS - Description - Download
  • (Non-member agreement) LDC99T42 - Treebank-3 - Description - Download - note: please see LDC2015T08 above for revised data

1998

  • (Special agreement) LDC98L21 - COMLEX English Syntax Lexicon - Description - Download
  • (Non-member agreement) LDC98S71 - 1997 English Broadcast News Speech (HUB4) - Description - Contact us for data access
  • (Non-member agreement) LDC98T28 - 1997 English Broadcast News Transcripts (HUB4) - Description - Download
  • (Special agreement) LDC98T31 - 1996 CSR HUB4 Language Model - Description - Download

1997

1996

  • (Special agreement) LDC96L14 - CELEX2 - Description - Contact us for data access
  • (Non-member agreement) LDC96S60 - CALLFRIEND Vietnamese - Description - Download
  • (Special agreement) LDC96T10 - Message Understanding Conference (MUC) 6 Additional News Text - Description - Contact us for data access
  • (Special agreement) LDC96T11 - COMLEX Syntax Text Corpus Version 2.0 - Description - Contact us for data access

1995

1994

1993