Linguistic Data Consortium

The University of Toronto is a subscriber to the Linguistic Data Consortium (LDC), which licenses language corpora and other language resources. For more information about the LDC, please visit their website.

The following is a list of corpora that U of T has licensed from the LDC over the years. These may be downloaded by U of T students, staff, and faculty. After clicking one of the links, you must review the terms of use before accessing the data. A few corpora are too large for download; please contact us to access these datasets.

This list does not include all corpora available from the LDC, so we encourage you to also browse the full list of corpora on the LDC website. If the LDC offers a corpus you need that is not listed on this page, please get in touch with us, as we may be able to obtain it on your behalf.

2021

2020

2019

2018

2017

2015

  • (Non-member agreement) LDC2015E21 - CoNLL-2015 Shared Task on Shallow Discourse Parsing - Training and Development Data - Description - Download
  • (Non-member agreement) LDC2015T08 - Coordination Annotation for the Penn Treebank - Description - Download - note: this is the revised data for LDC99T42
  • (Non-member agreement) LDC2015T13 - English News Text Treebank: Penn Treebank Revised - Description - Download

2014

2013

2012

2011

2009

  • (Special agreement) LDC2009T26 - NXT Switchboard Annotations - Description - Contact us for data access

2008

2007

  • (Non-member agreement) LDC2007S10 - 2003 NIST Rich Transcription Evaluation Data - Description - Download
  • (Non-member agreement) LDC2007T36 - Chinese Treebank 6.0 - Description - Download

2006

2005

2004

2003

2002

2001

2000

1999

  • (Non-member agreement) LDC99S78 - SUSAS - Description - Download
  • (Non-member agreement) LDC99T42 - Treebank-3 - Description - Download - note: please see LDC2015T08 above for revised data

1998

  • (Special agreement) LDC98L21 - COMLEX English Syntax Lexicon - Description - Download
  • (Non-member agreement) LDC98S71 - 1997 English Broadcast News Speech (HUB4) - Description - Contact us for data access
  • (Non-member agreement) LDC98T28 - 1997 English Broadcast News Transcripts (HUB4) - Description - Download
  • (Special agreement) LDC98T31 - 1996 CSR HUB4 Language Model - Description - Download

1997

1996

  • (Special agreement) LDC96L14 - CELEX2 - Description - Contact us for data access
  • (Non-member agreement) LDC96S60 - CALLFRIEND Vietnamese - Description - Download
  • (Special agreement) LDC96T10 - Message Understanding Conference (MUC) 6 Additional News Text - Description - Contact us for data access
  • (Special agreement) LDC96T11 - COMLEX Syntax Text Corpus Version 2.0 - Description - Contact us for data access

1995

1994

1993