Table of Contents
- What is Constellate?
- How Do I Access Constellate?
- What is Text Analysis?
- How Can I Receive Training?
- Additional Resources
What is Constellate?
Constellate is a browser-based tool for creating datasets from collections such as JSTOR. It also teaches and facilitates text analysis on those datasets.
A number of collections can be analyzed (including your own content), with more being added in the future. See our tutorial on how to build a dataset from these sources in Constellate.
Datasets are analyzed using Python code run in Jupyter notebooks. Tutorials and sample code that you can modify help you get up and running quickly, and four short tutorials introduce Python if it is new to you.
You can teach yourself text analysis using Python, and Constellate also provides How-To Guides (many aimed at teachers) and encourages you to use its materials to teach text analysis in your classes.
Some parts of the Constellate site are available to everyone. Subscribers get extra perks: they can build larger datasets (up to 50,000 items), use Constellate's Jupyter Lab, and take advantage of more computational power to run analyses.
How Do I Access Constellate?
See this access tutorial to log into Constellate using full University of Toronto institutional permissions.
What is Text Analysis?
Text analysis, using a tool such as Constellate, allows researchers to quickly analyze far more documents than a human could read. Some questions that text analysis can help answer are:
- What are these texts about?
- How are these texts connected?
- What emotions (or affects) are found within these texts?
- What names are used in these texts?
- Which of these texts are most similar?
These questions can be addressed using techniques such as finding word frequencies, performing topic modeling, or using sentiment analysis. Constellate tutorials can help you learn more about performing these tasks and answering these questions.
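As an illustration of the simplest of these techniques, here is a minimal word-frequency sketch using only Python's standard library. The function name, the sample sentence, and the one-word stopword list are all invented for this example; Constellate's own tutorials may take a different approach.

```python
import re
from collections import Counter


def word_frequencies(text, stopwords=frozenset()):
    """Count how often each word appears in text, ignoring case."""
    # Lowercase the text, then pull out runs of letters
    # (apostrophes are allowed inside a word, as in "don't").
    words = re.findall(r"[a-z]+(?:'[a-z]+)*", text.lower())
    # Skip any word listed in stopwords (very common words such as "the").
    return Counter(w for w in words if w not in stopwords)


# Hypothetical sample text, not drawn from any Constellate dataset.
sample = "The whale surfaced. The crew watched the whale dive again."
freqs = word_frequencies(sample, stopwords={"the"})
print(freqs.most_common(3))
```

Sorting the counts with `most_common` gives a first, rough answer to "What are these texts about?"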
How Can I Receive Training?
Synchronous (Live) Training
Constellate periodically offers synchronous remote training. You can browse and register for Constellate classes from their website. When new classes are announced, we will update this page with that information.
- Workshop: What Can You Do With Word Counts?
February 8, 12:00pm EST
If you have heard that text analysis involves counting words and are curious about what you can do with word counts, we invite you to come join this webinar where we will explore the basics of counting words and research techniques that leverage word counts. You'll discover the abundant research opportunities that unfold just with word counts. Register to attend in person or access the recording asynchronously.
- Workshop: How to Build a Good Dataset
February 15, 12:00pm EST
Constellate offers a most excellent dataset builder, but a good dataset for use in text analysis requires thought and consideration beyond tweaking filters in an application. In this webinar, we discuss what you should consider when building a dataset for text analysis. Register to attend in person or access the recording asynchronously.
- Constellate Class - Python Basics
Python is a commonly used, easy-to-learn programming language, and the Constellate Python Basics class helps students, faculty, and staff get started writing Python code. It will introduce you to Jupyter Notebooks and explore operators, expressions, data types, variables, functions, flow control, lists, and dictionaries. You do not need any previous coding experience, nor do you need any software installed on your computer.
This class is taught by Nathan Kelber and Zhuo Chen and runs on Monday, Wednesday and Friday the week of February 20 and the week of February 27. Each session runs at 10 am and 3:30 pm Eastern -- the two sessions on the same day are identical.
- Constellate: A New Platform for Text Analysis (Nov. 30, 2021): Recording - 55:30 & Setup Instructions (please note that Constellate login instructions have changed slightly since this video was recorded)
- Constellate provides a video introduction at the start of each of their Beginner-level tutorials, walking you through the content.
- In addition, you can access recordings of past training sessions.
- There is also a video on Exploring Text Analysis for Research.
How do I upload my own texts, such as PDFs and other text formats, to Constellate?
Constellate provides guidance on how to get your text files into the format that Constellate needs. It is only guidance, however, so you will need an intermediate or higher level of Python knowledge to do this.
This advice only works for text files. If you have a collection of PDFs, you will first need to OCR them (if that has not already been done) and then convert them to text files. Tools such as Adobe Acrobat Pro DC can batch OCR PDFs and export them to text.
- Batch OCR
- Convert PDF to text
How do I cite a dataset and tool?
Constellate has recently provided some advice on how to cite a dataset.
How do I analyze collocation of terms? Is there a tool for identifying whether a noun is in the subject or object position?
Response from Constellate:
It seems like what you are interested in here is part-of-speech (POS) tagging. A good place to start with this might be the spaCy library. The documentation is really strong and the software is very current, with a new version released recently.
Python’s nltk library also can do this type of work and it’s kind of the tried and true library for text analysis in Python. I’ve used it to identify noun phrases in text.
There are other libraries for POS tagging and the like in Python (textblob, pattern), too, so we would be interested in hearing about how any of these work out for you.
Any of these libraries will require the full text of the document to do the linguistic analysis, so you will want to limit your Constellate dataset using the “Full text only” filter on the search page.
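For the collocation half of the question, a rough first pass does not require any of these libraries. The sketch below is only an illustration and is not part of Constellate's response: it counts adjacent word pairs using Python's standard library. The function name and sample text are invented, and a raw count like this is a starting point rather than a statistical measure of association.

```python
import re
from collections import Counter


def adjacent_pairs(text):
    """Count pairs of words that appear next to each other in text."""
    words = re.findall(r"[a-z]+", text.lower())
    # zip(words, words[1:]) pairs each word with the word that follows it.
    # Note: this simple version also pairs words across sentence boundaries.
    return Counter(zip(words, words[1:]))


# Hypothetical sample text.
sample = "Text analysis is fun. Text analysis scales to many documents."
pairs = adjacent_pairs(sample)
print(pairs.most_common(1))
```

Pairs that recur far more often than the rest are candidates for closer reading.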
How do I remove certain sections, such as a Table of Contents, from the text before analysis?
Response from Constellate:
This will be highly dependent on the content of each text. One strategy that may work is inferring which parts of the text are a table of contents. With Python, this may look like writing a function that identifies a list of chapters followed by numbers with a newline at the end, all of which may exist at the beginning of some text.
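A minimal sketch of such a function, using only Python's standard library, might look like the following. It assumes that each contents entry is a line of text ending in a page number (optionally with dot leaders) and that the table sits at the very start of the text. The function name, patterns, and `min_entries` threshold are illustrative choices and will need adjusting for real documents.

```python
import re

# Assumed shape of a contents entry: some text, then a page number at the
# end of the line, optionally joined by dot leaders ("Introduction ..... 3").
TOC_ENTRY = re.compile(r"^\s*\S.*?[\s.]+\d+\s*$")
TOC_HEADING = re.compile(r"^\s*(table of contents|contents)\s*$", re.IGNORECASE)


def strip_leading_toc(text, min_entries=3):
    """Remove a table of contents from the start of text, if one is found."""
    lines = text.splitlines()
    i = 0
    # Skip blank lines at the top, then an optional "Contents" heading.
    while i < len(lines) and not lines[i].strip():
        i += 1
    if i < len(lines) and TOC_HEADING.match(lines[i]):
        i += 1
    # Consume the run of entry-like (or blank) lines that follows.
    entries = 0
    while i < len(lines) and (TOC_ENTRY.match(lines[i]) or not lines[i].strip()):
        if lines[i].strip():
            entries += 1
        i += 1
    # Too few entries: probably not a table of contents, so change nothing.
    if entries < min_entries:
        return text
    return "\n".join(lines[i:])


# Hypothetical sample document.
sample = "\n".join([
    "Table of Contents",
    "Introduction ..... 1",
    "Methods ..... 5",
    "Results ..... 12",
    "",
    "Introduction",
    "Text analysis begins with a question.",
])
cleaned = strip_leading_toc(sample)
print(cleaned)
```

The `min_entries` guard is there so that a body that merely opens with a line ending in a number is not mistaken for a table of contents.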
You might also want to go about this by doing some pre-processing of your dataset -- so you throw out any article with a title "Front Matter", "Back Matter" or "Table of Contents". The Constellate "Exploring Metadata and Pre-Processing" notebook does some of this. I've also been personally exploring downloading the metadata CSV to my computer and doing some sorting, deletions, and clean-up in Excel. I can then save it as CSV and upload that CSV back into the Constellate analytics lab into the data directory and when I work with my dataset in the Lab, I'll just be doing analysis on those citations that remain in the CSV, not the original full contents of my dataset.
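The same clean-up can be scripted instead of done by hand in Excel. The sketch below is an illustration only: it assumes the metadata CSV has a column named `title`, which may not match the actual column names in a Constellate export, and it uses a small in-memory CSV in place of a real file.

```python
import csv
import io

# Titles to discard, compared in lowercase.
UNWANTED_TITLES = {"front matter", "back matter", "table of contents"}


def drop_unwanted_rows(rows, title_field="title"):
    """Keep only rows whose title is not in UNWANTED_TITLES."""
    return [
        row for row in rows
        if (row.get(title_field) or "").strip().lower() not in UNWANTED_TITLES
    ]


# Hypothetical metadata; a real export would be opened with open(path, newline="").
sample_csv = io.StringIO(
    "id,title\n"
    "a1,Front Matter\n"
    "a2,Whaling in the Nineteenth Century\n"
    "a3,Table of Contents\n"
    "a4,Back Matter\n"
)
rows = list(csv.DictReader(sample_csv))
kept = drop_unwanted_rows(rows)

# Write the remaining rows back out as CSV, ready to upload again.
out = io.StringIO()
writer = csv.DictWriter(out, fieldnames=["id", "title"])
writer.writeheader()
writer.writerows(kept)
print(out.getvalue())
```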
Constellate covers some ways to do pre-processing in their July/August Intro to Text Analytics class.
Alternatives to Constellate
If you prefer to work with a point-and-click interface without coding, consider getting started with the Gale Digital Scholar Lab. Alternatively, if you enjoy working with Python but want to access ProQuest's collections, including but not limited to the New York Times and ProQuest Dissertations and Theses, see this guide on getting started with ProQuest TDM Studio.
If you require help, either accessing Constellate or with its contents, please feel free to contact us at the Map & Data Library. You can either email firstname.lastname@example.org or reach out with the MDL contact form.
Constellate also offers office hours for Beta participants (including University of Toronto users). Here is the schedule of Constellate office hours.
There are a number of resources available at the University of Toronto for learning and working with tools for text analysis, as well as the broader field of text and data mining. If you would like to learn more, see this link.