Table of Contents
- What is Constellate?
- How Do I Access Constellate?
- What is Text Analysis?
- How Can I Receive Training?
- Additional Resources
Constellate is a browser-based tool for creating datasets from collections, such as JSTOR, and for teaching and facilitating text analysis on those datasets.
Datasets are analyzed using Python code run in Jupyter notebooks. Tutorials and sample code (that you can modify) help you get up and running quickly, and four short tutorials introduce Python if it is new to you.
Besides letting you teach yourself text analysis using Python, Constellate provides How-To Guides (many aimed at teachers) and encourages you to use its materials to teach text analysis in your classes.
Some parts of the Constellate site are available to everyone. Subscribers get extra perks: they can build larger datasets (up to 50,000 items), use Constellate's Jupyter Lab, and take advantage of more computational power to run analyses.
See this access tutorial to log into Constellate using full University of Toronto institutional permissions.
Text analysis, using a tool such as Constellate, allows researchers to quickly analyze a large number of documents (more than a human could read, and faster). Questions that text analysis can help answer include:
- What are these texts about?
- How are these texts connected?
- What emotions (or affects) are found within these texts?
- What names are used in these texts?
- Which of these texts are most similar?
These questions can be addressed using techniques such as finding word frequencies, performing topic modeling, or using sentiment analysis. Constellate tutorials can help you learn more about performing these tasks and answering these questions.
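As a rough illustration of the simplest of these techniques, word frequencies, here is a minimal Python sketch. It is not taken from the Constellate tutorials; the function name `word_frequencies` and the sample sentence are made up for this example, and the tokenization is deliberately naive (no stop-word removal or stemming).

```python
import re
from collections import Counter


def word_frequencies(text, top_n=5):
    """Return the top_n most common lowercase words in text."""
    # Naive tokenization: lowercase, then keep runs of letters and apostrophes.
    words = re.findall(r"[a-z']+", text.lower())
    return Counter(words).most_common(top_n)


# Made-up sample text, used only to show the call.
sample = "The whale surfaced. The crew watched the whale and the sea."
print(word_frequencies(sample, top_n=2))
```

On a real dataset you would usually remove very common words ("the", "and") before counting, since they tend to dominate the results.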
Synchronous (Live) Training
Constellate periodically offers synchronous remote training. You can browse and register for Constellate classes from their website.
Upcoming virtual training
If you are not yet ready to code, but want to learn about text analysis topics, we are excited to introduce you to our soon-to-be-released pre-code tutorials on text analysis.
- Pre-code tutorials on text analysis webinar
- March 18 at 10 am and 3:30 pm EST.
- Register for Pre-code tutorials on text analysis webinar
If you are feeling ready to start learning code in order to advance your text analysis research beyond the data visuals in the Constellate dataset builder:
- Python Basics is a gentle introduction to Python, a skill set that will enable you to write your own text analysis code.
- February 12-16 at 10 am and 3:30 pm EST
- Register for Python Basics
- Python Intermediate will expand your skills.
- February 19-23 at 10 am and 3:30 pm EST
- Register for Python Intermediate
If you have an interest in Large Language Models (advanced artificial intelligence systems that use deep learning techniques to understand and generate human-like text based on vast amounts of training data):
- Gender bias and stereotypes in Large Language Models is a webinar on research that shows how four recently published LLMs express biased assumptions about men's and women's occupations.
- March 15, noon-1pm EST
- Register for the Gender bias and stereotypes in Large Language Models webinar
- Introduction to Language Models (also known as "How does ChatGPT work?") is a webinar series providing a detailed explanation of the neural networks of large language models.
- March 20-March 29 at 10 am and 3:30 pm EST
- Register for Introduction to Language Models -- also-known-as, how does ChatGPT work? series
- Constellate: A New Platform for Text Analysis (Nov. 30, 2021): Recording - 55:30 & Setup Instructions (please note that Constellate login instructions have changed slightly since this video was recorded)
- Constellate provides a video introduction at the start of each of their Beginner-level tutorials, walking you through the content.
- In addition, you can access recordings of past training sessions.
- There is also a video on Exploring Text Analysis for Research
How do I upload my own texts, such as PDFs and other text formats, to Constellate?
Constellate provides guidance on how to get your text files into the format that Constellate needs. It is only guidance, however, so you will need an intermediate or higher level of Python knowledge to do this.
This advice only works for text files. If you have a collection of PDFs, you will first need to OCR them (if that has not already been done) and then convert them to text files. Tools such as Adobe Acrobat Pro DC can batch OCR PDFs and export them to text.
- Batch OCR
- Convert PDF to text
How do I cite a dataset and tool?
Constellate has recently provided some advice on how to cite a dataset.
How do I analyze collocation of terms? Is there a tool for identifying whether a noun is in the subject or object position?
Response from Constellate:
It seems like what you are interested in here is part-of-speech tagging. A good place to start with this might be the spaCy library. The documentation is really strong and the software is very current, with a new version having been released recently.
There are other libraries for POS tagging and the like in Python (textblob, pattern), too, so we would be interested in hearing about how any of these work out for you.
Any of these libraries will require the full text of the document to do the linguistic analysis, so you will want to limit your Constellate dataset using the "Full text only" filter on the search page.
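For the collocation half of the question, a library-free starting point is to count adjacent word pairs (bigrams). The sketch below is an illustration only, not Constellate's method and not part-of-speech tagging; the function name `bigram_counts` and the sample text are invented for this example. Determining whether a noun is a subject or an object would still require a library such as spaCy.

```python
import re
from collections import Counter


def bigram_counts(text):
    """Count adjacent word pairs in text (a crude measure of collocation)."""
    # Naive tokenization; note that pairs can span sentence boundaries.
    words = re.findall(r"[a-z']+", text.lower())
    # zip pairs each word with the word that follows it.
    return Counter(zip(words, words[1:]))


# Made-up sample text, used only to show the call.
sample = "Text analysis is fun. Text analysis scales to many documents."
counts = bigram_counts(sample)
print(counts[("text", "analysis")])
```

Raw counts favour frequent words, so collocation studies often go on to apply an association measure on top of counts like these.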
How do I remove certain sections, such as a Table of Contents, from the text before analysis?
Response from Constellate:
This will be highly dependent on the content of each text. Some strategies that may work could be inferring what parts of the text classify as table of contents. With Python, this may look like writing a function that identifies a list of chapters followed by numbers with a newline at the end, all of which may exist at the beginning of some text.
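A minimal sketch of that idea follows. It is an assumption-laden illustration, not code from Constellate: the patterns `HEADING` and `TOC_LINE` and the function `strip_leading_toc` are hypothetical, and they only recognize entries of the form "title, dots or spaces, page number" at the very start of a text. Real documents will likely need different patterns.

```python
import re

# Hypothetical patterns, chosen for illustration only.
HEADING = re.compile(r"^\s*(table of )?contents\s*$", re.IGNORECASE)
# A title, then dot leaders or spaces, then a page number at the end of the line.
TOC_LINE = re.compile(r"^\s*\S.*?[.\s]+\d+\s*$")


def strip_leading_toc(text):
    """Remove a table of contents from the start of text, if one is detected."""
    lines = text.splitlines()
    index = 0
    # Skip blank lines and an optional "Contents" heading at the very start.
    while index < len(lines) and (
        not lines[index].strip() or HEADING.match(lines[index])
    ):
        index += 1
    start = index
    # Consume consecutive lines that look like table-of-contents entries.
    while index < len(lines) and TOC_LINE.match(lines[index]):
        index += 1
    # Require at least two entries, to limit false positives.
    if index - start < 2:
        return text
    return "\n".join(lines[index:]).lstrip("\n")


sample = "Table of Contents\nIntroduction ..... 1\nMethods ..... 7\n\nIt was a dark night."
print(strip_leading_toc(sample))
```

Texts in which no leading table of contents is detected are returned unchanged, so the function can be applied to every document in a dataset.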
You might also want to go about this by doing some pre-processing of your dataset -- so you throw out any article with a title "Front Matter", "Back Matter" or "Table of Contents". The Constellate "Exploring Metadata and Pre-Processing" notebook does some of this. I've also been personally exploring downloading the metadata CSV to my computer and doing some sorting, deletions, and clean-up in Excel. I can then save it as CSV and upload that CSV back into the Constellate analytics lab into the data directory and when I work with my dataset in the Lab, I'll just be doing analysis on those citations that remain in the CSV, not the original full contents of my dataset.
Constellate covers some ways to do pre-processing in their July/August Intro to Text Analytics class.
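The title-based filtering described above can also be done in Python instead of Excel. The sketch below is a hedged illustration: it assumes the metadata CSV has a column named `title`, which may not match the actual Constellate export, and it uses a small in-memory CSV rather than a real file. The names `UNWANTED_TITLES` and `filter_rows` are hypothetical.

```python
import csv
import io

# Titles to drop, compared case-insensitively.
UNWANTED_TITLES = {"front matter", "back matter", "table of contents"}


def filter_rows(rows, title_field="title"):
    """Keep only rows whose title is not in UNWANTED_TITLES.

    title_field is an assumed column name; check your own CSV header.
    """
    return [
        row
        for row in rows
        if (row.get(title_field) or "").strip().lower() not in UNWANTED_TITLES
    ]


# Made-up stand-in for a downloaded metadata CSV.
sample_csv = "id,title\n1,Front Matter\n2,A Study of Whales\n3,Back Matter\n"
rows = list(csv.DictReader(io.StringIO(sample_csv)))
kept = filter_rows(rows)
print([row["title"] for row in kept])
```

With a real file, you would open it with `open(path, newline="")` in place of `io.StringIO` and write the kept rows back out with `csv.DictWriter`.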
Alternative to Constellate
If you prefer to work with a point-and-click interface without coding, consider getting started with the Gale Digital Scholar Lab. Alternatively, if you enjoy working with Python but want to access ProQuest's collections, including but not limited to the New York Times and ProQuest Dissertations and Theses, see this guide on getting started with ProQuest TDM Studio.
If you require help, either accessing Constellate or with its contents, please feel free to contact us at the Map & Data Library. You can either email firstname.lastname@example.org or reach out with the MDL contact form.
Constellate also offers office hours for Beta participants (including University of Toronto users). Here is the schedule of Constellate office hours.
There are a number of resources available at the University of Toronto for learning and working with tools for text analysis, as well as the broader field of text and data mining. If you would like to learn more, see this link.