Table of Contents
- What is Constellate?
- How Do I Access Constellate?
- What is Text Analysis?
- How Can I Receive Training?
- Additional Resources
Constellate is a browser-based tool for creating datasets from collections, such as JSTOR, and for teaching and facilitating text analysis on those datasets.
Datasets are analyzed using Python code run in Jupyter notebooks. Tutorials and sample code that you can modify help you get up and running quickly, and four short tutorials introduce Python if it is new to you.
Besides letting you teach yourself text analysis with Python, Constellate provides How-To Guides (many aimed at teachers) and encourages you to use its materials to teach text analysis in your classes.
While some parts of the Constellate site are available to everyone, the University of Toronto is participating in a special Beta Evaluation Period, which gives UofT users additional perks, such as the ability to build larger datasets (up to 50,000 items) and access to more computational power for running analyses. During this period, Constellate is working to improve its offerings and is soliciting feedback. If you use Constellate in a teaching or research setting, please contact us with feedback, which we can pass on anonymously.
See this access tutorial to log into Constellate using full University of Toronto institutional permissions.
Text analysis, using a tool such as Constellate, allows researchers to quickly analyze far more documents than a person could read. Some questions that text analysis can help answer:
- What are these texts about?
- How are these texts connected?
- What emotions (or affects) are found within these texts?
- What names are used in these texts?
- Which of these texts are most similar?
These questions can be addressed using techniques such as finding word frequencies, performing topic modeling, or using sentiment analysis. Constellate tutorials can help you learn more about performing these tasks and answering these questions.
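As an illustration of the simplest of these techniques, the following is a minimal sketch of a word-frequency count using only the Python standard library. The sample documents, the `tokenize` and `word_frequencies` helpers, and the stop-word list are all hypothetical; the Constellate notebooks may approach this differently.

```python
import re
from collections import Counter

# Hypothetical sample documents; a real dataset would supply the texts.
documents = [
    "The whale surfaced near the ship.",
    "The ship followed the whale for days.",
]


def tokenize(text):
    """Lowercase the text and return its runs of letters and apostrophes."""
    return re.findall(r"[a-z']+", text.lower())


def word_frequencies(texts, stop_words=frozenset()):
    """Count how often each token appears across all texts, skipping stop words."""
    counts = Counter()
    for text in texts:
        counts.update(t for t in tokenize(text) if t not in stop_words)
    return counts


# The stop-word set here is an assumption chosen for the sample sentences.
frequencies = word_frequencies(documents, stop_words={"the", "for"})
print(frequencies.most_common(3))
```

Removing very common words such as "the" before counting usually makes the most frequent terms a better hint of what the texts are about.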
Synchronous (Live) Training
Constellate periodically offers synchronous remote training. When new classes are announced, we will update this page with that information.
- New upcoming class for May 2022: Working with Strings and Regular Expressions (Registration open now)
Constellate will be teaching two classes in April.
- Python Basics is a four-day, one-week class running the week of April 4 to help you get started writing Python code. (This is quite similar to the Python Basics class we have taught in the past, though we are splitting it over four days.)
- Tokenize your own Texts is a three-day, one-week class running the week of April 11, 2022 to introduce you to processes and methods for tokenizing your texts and creating a dataset that is compatible with existing Constellate Notebooks.
Constellate provides a video introduction at the start of each of their Beginner-level tutorials, walking you through the content.
In addition, you can access recordings of past training sessions.
How do I upload my own texts, such as PDFs and other text formats, to Constellate?
Constellate provides guidance on how to get your text files into the format that Constellate needs; however, it is only guidance, so an intermediate or higher level of Python knowledge will be required to do this.
This advice applies only to text files. If you have a set of PDFs, you will first need to OCR them (if that has not already been done) and then convert them to text files. Tools such as Adobe Acrobat Pro DC can batch OCR PDFs and export them to text.
• Batch OCR
• Convert PDF to text
How do I cite a dataset and tool?
Constellate has recently provided some advice on how to cite a dataset.
How do I analyze collocation of terms? Is there a tool for identifying whether a noun is in the subject or object position?
Response from Constellate:
It seems like what you are interested in here is part-of-speech tagging. A good place to start with this might be the spaCy library. The documentation is really strong and the software is very current, with a new version released recently.
There are other libraries for POS tagging and the like in Python (textblob, pattern), too, so we would be interested in hearing about how any of these work out for you.
Any of these libraries will require the full text of the document to do the linguistic analysis, so you will want to limit your Constellate dataset using the “Full text only” filter on the search page.
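For the collocation half of the question, a first pass does not need a tagging library. The sketch below counts adjacent word pairs (bigrams) with the Python standard library. The `bigram_counts` helper and the sample text are hypothetical, and this simple count does not identify subjects or objects, which would need a parser such as the one in spaCy.

```python
import re
from collections import Counter


def bigram_counts(text):
    """Count adjacent word pairs (bigrams) in a text.

    Sentence boundaries are ignored, so a pair can span two sentences.
    """
    tokens = re.findall(r"[a-z]+", text.lower())
    return Counter(zip(tokens, tokens[1:]))


# Hypothetical sample text; full-text documents would be used in practice.
sample = "Text analysis is fun. Text analysis scales well."
pairs = bigram_counts(sample)
print(pairs.most_common(1))
```

Pairs that occur far more often than their individual words would suggest are candidates for collocations; raw counts like these are only a starting point.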
How do I remove certain sections, such as a Table of Contents, from the text before analysis?
Response from Constellate:
This will be highly dependent on the content of each text. One strategy that may work is inferring which parts of the text qualify as a table of contents. With Python, this may look like writing a function that identifies a list of chapters followed by numbers with a newline at the end, all of which may exist at the beginning of some text.
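A minimal sketch of such a function follows. It assumes that a table-of-contents entry is a line of text ending in a page number, optionally with dot leaders, and that the table sits at the very start of the text. The `TOC_ENTRY` pattern, the `TOC_HEADINGS` set, and the `strip_leading_toc` name are all hypothetical and would need adjusting for real documents.

```python
import re

# Assumed entry shape: some text, then dots or spaces, then a page number.
TOC_ENTRY = re.compile(r"^\s*\S.*?[.\s]+\d+\s*$")
# Assumed headings that may introduce the table of contents.
TOC_HEADINGS = {"contents", "table of contents"}


def strip_leading_toc(text):
    """Remove a run of table-of-contents-like lines from the start of a text."""
    lines = text.split("\n")
    index = 0
    # Skip blank lines and an optional heading.
    while index < len(lines) and (
        not lines[index].strip() or lines[index].strip().lower() in TOC_HEADINGS
    ):
        index += 1
    start = index
    # Consume consecutive entry lines, allowing blank lines between them.
    while index < len(lines) and (
        TOC_ENTRY.match(lines[index]) or not lines[index].strip()
    ):
        index += 1
    entries = sum(1 for line in lines[start:index] if TOC_ENTRY.match(line))
    # Require at least two entries so a lone line ending in a number is kept.
    if entries < 2:
        return text
    return "\n".join(lines[index:])


sample = (
    "Table of Contents\n"
    "Chapter 1 .... 3\n"
    "Chapter 2 .... 17\n"
    "\n"
    "It was a dark night.\n"
    "The end."
)
cleaned = strip_leading_toc(sample)
print(cleaned)
```

Stopping at the first line that does not look like an entry keeps the function from deleting body text that merely ends in a number, at the cost of missing tables of contents that appear later in a document.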
You might also want to go about this by doing some pre-processing of your dataset -- so you throw out any article with a title "Front Matter", "Back Matter" or "Table of Contents". The Constellate "Exploring Metadata and Pre-Processing" notebook does some of this. I've also been personally exploring downloading the metadata CSV to my computer and doing some sorting, deletions, and clean-up in Excel. I can then save it as CSV and upload that CSV back into the Constellate analytics lab into the data directory and when I work with my dataset in the Lab, I'll just be doing analysis on those citations that remain in the CSV, not the original full contents of my dataset.
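The same title-based clean-up can be scripted instead of done by hand in Excel. The following is a rough sketch using Python's `csv` module. The `id` and `title` column names, the sample rows, and the `filter_metadata` helper are assumptions; check the header of your own metadata CSV before adapting it.

```python
import csv
import io

# Titles to discard, compared case-insensitively.
EXCLUDED_TITLES = {"front matter", "back matter", "table of contents"}


def filter_metadata(rows, title_field="title"):
    """Keep only the rows whose title is not in EXCLUDED_TITLES."""
    return [
        row
        for row in rows
        if (row.get(title_field) or "").strip().lower() not in EXCLUDED_TITLES
    ]


# Hypothetical metadata held in memory so the sketch runs without a file;
# with a real file, pass open(path, newline="") to csv.DictReader instead.
sample_csv = io.StringIO(
    "id,title\n"
    "1,Front Matter\n"
    "2,A Study of Whales\n"
    "3,Table of Contents\n"
)
rows = list(csv.DictReader(sample_csv))
kept = filter_metadata(rows)

# Write the surviving rows back out as CSV text.
output = io.StringIO()
writer = csv.DictWriter(output, fieldnames=["id", "title"])
writer.writeheader()
writer.writerows(kept)
print(output.getvalue())
```

Scripting the filter makes the clean-up repeatable if you rebuild the dataset later.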
Constellate covers some ways to do pre-processing in their July/August Intro to Text Analytics class.
Alternative to Constellate
If you prefer to work with a point-and-click interface without coding, please consider Gale’s Digital Scholar Lab. Like Constellate, it provides a number of sample collections, cleaning options, and text analysis tools. You can get started with our brief guide on accessing the Digital Scholar Lab, then either use the videos in the platform or follow our Digital Scholar Lab tutorial.
If you require help, either accessing Constellate or with its contents, please feel free to contact us at the Map & Data Library. You can either email firstname.lastname@example.org or reach out with the MDL contact form.
Constellate also offers office hours for Beta participants (including University of Toronto users). Here is the schedule of Constellate office hours.
There are a number of resources available at the University of Toronto for learning and working with tools for text analysis, as well as the broader field of text and data mining. If you would like to learn more, see this link.