Constellate

What is Constellate?

Constellate is a browser-based tool for creating datasets from collections such as JSTOR, and for teaching and facilitating text analysis on those datasets.

A number of collections can be analyzed (including your own content), with more being added in the future. See our tutorial on how to build a dataset from these sources in Constellate.

Datasets are analyzed using Python code run in Jupyter notebooks. Tutorials and sample code that you can modify help you get up and running quickly, and four short tutorials introduce Python if it is new to you.

Beyond teaching yourself text analysis with Python, you can use Constellate's How-To Guides (many aimed at teachers). Constellate encourages you to use its materials to teach text analysis in your own classes.

Some parts of the Constellate site are available to everyone. Subscribers get extra perks: they can build larger datasets (up to 50,000 items), use Constellate's Jupyter Lab, and take advantage of more computational power to run analyses.

How Do I Access Constellate?

See this access tutorial to log in to Constellate with full University of Toronto institutional permissions.

What is Text Analysis?

Text analysis, using a tool such as Constellate, allows researchers to quickly analyze a large number of documents (more than a human could read, and faster). Questions that text analysis can help answer include:

  • What are these texts about?
  • How are these texts connected?
  • What emotions (or affects) are found within these texts?
  • What names are used in these texts?
  • Which of these texts are most similar?

These questions can be addressed using techniques such as finding word frequencies, performing topic modeling, or using sentiment analysis. Constellate tutorials can help you learn more about performing these tasks and answering these questions.
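As a rough illustration of the simplest of these techniques, here is a minimal word-frequency sketch using only the Python standard library. The function name `word_frequencies`, the sample sentence, and the tiny stopword set are made up for this example; the Constellate notebooks may approach the task differently.

```python
import re
from collections import Counter

def word_frequencies(text, stopwords=frozenset()):
    """Count how often each word appears, ignoring case and any stopwords."""
    # Lowercase the text and pull out runs of letters and apostrophes as tokens.
    tokens = re.findall(r"[a-z']+", text.lower())
    return Counter(t for t in tokens if t not in stopwords)

# Hypothetical sample text; in practice this would be the full text of a document.
sample = "The whale surfaced. The crew watched the whale dive again."
freqs = word_frequencies(sample, stopwords={"the"})
print(freqs.most_common(3))
```

A high count for a word is only a starting point for the question "What are these texts about?"; a fuller stopword list and more careful tokenization usually matter a great deal.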

How Can I Receive Training?

Synchronous (Live) Training

Constellate periodically offers synchronous remote training. You can browse and register for Constellate classes from their website. 

Upcoming virtual training

If you are not yet ready to code, but want to learn about text analysis topics, we are excited to introduce you to our soon-to-be-released pre-code tutorials on text analysis. 

If you are feeling ready to start learning to code in order to advance your text analysis research beyond the data visuals in the Constellate dataset builder:

  • Python Basics is a gentle introduction to Python, a skill set that will enable you to write your own text analysis code.
  • Python Intermediate will expand your skills.

If you have an interest in Large Language Models (advanced artificial intelligence systems that use deep learning techniques to understand and generate human-like text based on vast amounts of training data):

Recordings

FAQs

How do I upload my own texts, such as PDFs and other text formats, to Constellate?

Constellate provides guidance on how to get your text files into the format that Constellate needs. It is only guidance, however, so you will need an intermediate or higher level of Python knowledge to do this.

This advice works only for text files. If you have a set of PDFs, you will first need to OCR them (if that has not already been done) and then convert them to text files. Tools such as Adobe Acrobat Pro DC can help you batch OCR PDFs and export them to text:

  • Batch OCR
  • Convert PDF to text

How do I cite a dataset and tool?

Constellate has recently provided some advice on how to cite a dataset.

How do I analyze collocation of terms? Is there a tool for identifying whether a noun is in the subject or object position?

Response from Constellate: 

It seems like what you are interested in here is part-of-speech tagging. A good place to start with this might be the spaCy library. The documentation is really strong and the software is very current, with a new version released recently.

Python’s nltk library also can do this type of work and it’s kind of the tried and true library for text analysis in Python. I’ve used it to identify noun phrases in text.

There are other libraries for POS tagging and the like in Python (textblob, pattern), too, so we would be interested in hearing about how any of these work out for you. 

Any of these libraries will require the full text of the document to perform the linguistic analysis, so you will want to limit your Constellate dataset using the “Full text only” filter on the search page.
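For the collocation half of the question, a simple window-based count can be sketched with the Python standard library alone. This is an illustration under stated assumptions, not Constellate's method: the function name `collocates`, the sample text, and the window size are hypothetical. It does not identify subject or object position; that requires a parser such as the ones in the libraries mentioned above.

```python
import re
from collections import Counter

def collocates(text, term, window=2):
    """Count the words that appear within `window` tokens of `term`."""
    tokens = re.findall(r"[a-z']+", text.lower())
    counts = Counter()
    for i, tok in enumerate(tokens):
        if tok == term:
            start = max(0, i - window)
            # Neighbours on both sides of the term, excluding the term itself.
            # Note: sentence boundaries are ignored in this simple version.
            counts.update(tokens[start:i] + tokens[i + 1:i + 1 + window])
    return counts

# Hypothetical sample text standing in for a document's full text.
sample = "The white whale rose. The white whale sank. A grey whale passed."
near_whale = collocates(sample, "whale", window=1)
print(near_whale.most_common(3))
```

Raw counts favour common words, so in practice you would likely remove stopwords or use an association measure before interpreting the results.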

How do I remove certain sections, such as a Table of Contents, from the text before analysis?

Response from Constellate: 

This will be highly dependent on the content of each text. One strategy that may work is to infer which parts of the text are a table of contents. With Python, this could mean writing a function that identifies a list of chapter titles, each followed by a number and a newline, near the beginning of a text.
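A minimal sketch of such a function follows. It assumes that a table-of-contents entry is a line of text ending in a page number (optionally with dot leaders) and that the entries sit at the very start of the document. The names `strip_leading_toc`, `TOC_LINE`, and `HEADING`, and the `min_entries` threshold, are hypothetical; real documents will need adjusted patterns.

```python
import re

# Assumed shape of an entry: some text, then spaces or dot leaders, then a number.
TOC_LINE = re.compile(r"^.+?[\s.]+\d+$")
# An optional heading line that may introduce the list.
HEADING = re.compile(r"^(table of )?contents$", re.IGNORECASE)

def strip_leading_toc(text, min_entries=3):
    """Remove a run of TOC-like lines from the start of `text`, if one is found."""
    lines = text.split("\n")
    entries = 0
    end = 0  # index of the first line after the suspected table of contents
    for i, line in enumerate(lines):
        stripped = line.strip()
        if not stripped or HEADING.match(stripped):
            continue  # skip blank lines and the heading itself
        if TOC_LINE.match(stripped):
            entries += 1
            end = i + 1
        else:
            break  # first line of ordinary prose: stop scanning
    if entries < min_entries:
        return text  # too few matches to be confident; leave the text alone
    return "\n".join(lines[end:]).lstrip("\n")

doc = "\n".join([
    "Table of Contents",
    "Introduction ........ 1",
    "Chapter 1 ........ 5",
    "Chapter 2 ........ 20",
    "",
    "Call me Ishmael. Some years ago I went to sea.",
])
cleaned = strip_leading_toc(doc)
print(cleaned)
```

The `min_entries` threshold is a guard against deleting ordinary opening lines that happen to end in a number; it is worth spot-checking the output on a sample of your own texts.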

You might also approach this by pre-processing your dataset so that you throw out any article with the title "Front Matter", "Back Matter", or "Table of Contents". The Constellate "Exploring Metadata and Pre-Processing" notebook does some of this. I've also been personally exploring downloading the metadata CSV to my computer and doing some sorting, deletions, and clean-up in Excel. I can then save it as a CSV and upload it back into the data directory of the Constellate analytics lab. When I work with my dataset in the Lab, I'll be analyzing only the citations that remain in the CSV, not the original full contents of my dataset.
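The same title-based filtering can be sketched in Python instead of Excel. This example assumes the metadata CSV has a column named `title`; that column name, the `id` column, and the sample rows are assumptions for illustration, so check the header of your own downloaded file. It is not the code from the Constellate notebook.

```python
import csv
import io

UNWANTED_TITLES = {"front matter", "back matter", "table of contents"}

def filter_metadata(rows, title_field="title"):
    """Keep only the rows whose title is not one of the unwanted titles."""
    return [
        row for row in rows
        if (row.get(title_field) or "").strip().lower() not in UNWANTED_TITLES
    ]

# Hypothetical stand-in for a downloaded metadata CSV file.
sample_csv = io.StringIO(
    "id,title\n"
    "a1,Front Matter\n"
    "a2,Whaling in the Nineteenth Century\n"
    "a3,Table of Contents\n"
)
rows = list(csv.DictReader(sample_csv))
kept = filter_metadata(rows)
print([row["id"] for row in kept])
```

With a real file you would open it with `open(path, newline="")` in place of `io.StringIO`, and write the kept rows back out with `csv.DictWriter` before uploading.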

Constellate covers some ways to do pre-processing in their July/August Intro to Text Analytics class.

Additional Resources

Alternative to Constellate

If you prefer to work with a point-and-click interface without coding, consider getting started with the Gale Digital Scholar Lab. Alternatively, if you enjoy working with Python but want to access ProQuest's collections, including but not limited to the New York Times and ProQuest Dissertations and Theses, see this guide on getting started with ProQuest TDM Studio.

Help

If you require help, either accessing Constellate or with its contents, please feel free to contact us at the Map & Data Library. You can either email mdl@library.utoronto.ca or reach out with the MDL contact form.

Constellate also offers office hours for Beta participants (including University of Toronto users). Here is the schedule of Constellate office hours.

Related Resources

There are a number of resources available at the University of Toronto for learning and working with tools for text analysis, as well as the broader field of text and data mining. If you would like to learn more, see this link.