Digital Humanities Tools: Digital Scholar Lab

This tutorial introduces Gale's Digital Scholar Lab (DSL), a digital humanities tool. In this tutorial, you will learn how to:

  • Build a collection of texts
  • Clean texts 
  • Run analytical tools on texts and visualize the results
  • Download the data, graphs, and other visualizations produced through this tool 
  • Download the scanned texts in your collection, so that you can use them in other programs

Note: Gale consistently updates the Digital Scholar Lab, so some features of this tutorial might not always match the latest interface. This tutorial was last updated in August 2020.

Table of Contents

  1. Overview
  2. Access
  3. Collection
  4. Cleaning
  5. Tools
    1. Document Clustering
    2. Named Entity Recognition
    3. Ngram
    4. Parts of Speech Tagger
    5. Sentiment Analysis
    6. Topic Modeling
  6. Export
  7. Additional Training
    1. See Which Texts are Available
    2. Learning Center
    3. Sample Projects
    4. Upload Texts
  8. Contact

Overview

(back to Table of Contents)

What is Gale Digital Scholar Lab?

The Digital Scholar Lab (DSL) is an online tool for analyzing texts, visualizing the results, and exporting data, graphs, and texts from the platform. It runs in your web browser and does not need any additional software.

The DSL has six analysis tools: (1) Document Clustering, (2) Named Entity Recognition, (3) Ngram, (4) Parts of Speech Tagger, (5) Sentiment Analysis, and (6) Topic Modeling. The DSL makes it easier to learn and understand how these tools work by providing user-friendly graphical user interfaces, documentation, and demonstration videos. External links to the open source code for each tool are also made available should you wish to run the tool on your own computer and use its more advanced features.

What collections does it have?

When you use the DSL through your University of Toronto connection, you can use any of the Gale primary source collections that the University has licensed, including hundreds of thousands of documents in multiple languages with broad historical and geographical coverage. (Once you are logged in, see these instructions to view all accessible collections.) Extensive coverage, however, should not be confused with universal coverage; many perspectives are not represented in these text collections. For example, most of the colonial-era documents included in these collections were produced and collected by colonizing people, organizations, or institutions, rather than by colonized peoples. It is up to you as a critical scholar to decide on which questions can and cannot be answered by these collections.

Digitization

The texts available in the DSL have gone through several steps: (1) various institutions like libraries and archives collected the texts; (2) Gale scanned the texts; (3) through a process called Optical Character Recognition (OCR), these scans, which are essentially photographs of texts, were converted into readable, searchable text.

OCR uses image-recognition algorithms to identify characters and create a text file based on the image. OCR is powerful, but it is also prone to errors such as misidentifying characters (e.g. reading a zero as the letter 'O') or adding or removing spaces. There are additional challenges for scanning older English texts, such as those that use the long 's' ('ſ'), which resembles a lowercase 'f'. We will cover this more below in the section on Cleaning, but for now it is sufficient to know that this process can often leave errors in the text files produced through OCR.
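
To make these error types concrete, here is a minimal Python sketch of how a few predictable OCR mistakes could be patched after the fact. It is not part of the DSL and is not Gale's method; the function name `normalize_ocr` and the three rules are illustrative assumptions.

```python
import re

def normalize_ocr(text):
    """Patch a few predictable OCR mistakes (illustrative rules only)."""
    # The long s (U+017F) is an archaic form of 's', so map it directly.
    text = text.replace("\u017f", "s")
    # A capital letter O sandwiched between digits is almost certainly a zero.
    text = re.sub(r"(?<=\d)O(?=\d)", "0", text)
    # Collapse runs of spaces that OCR sometimes inserts.
    text = re.sub(r" {2,}", " ", text)
    return text

print(normalize_ocr("In 19O7 the  expedition cro\u017f\u017fed the pa\u017fs."))
```

Rules like these only catch errors that leave a recognisable trace. If the OCR engine has already output a long s as an ordinary 'f', a simple replacement cannot tell it apart from a genuine 'f', which is one reason the cleaning step later in this tutorial still matters.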

Let's get started!

Access

(back to Table of Contents)

  1. To access the DSL through our U of T institutional connection, go to https://uoft.me/gale.
    (You can also access the DSL through the library catalog.)  

  2. The website will first prompt you for your University of Toronto login.
    my.access page
    Enter your UTORid.
    UTORid login page
    Depending on your settings and location, you may also have to log in with your University of Toronto Library barcode (located on your T-Card) and PIN, as you would to use other library services (e.g. renewing your loans).

  3. You will arrive at the DSL homepage, but you will need to log in before you can do anything. Currently, the only way to log in here is with a Google account. The DSL uses the account solely to track session information (connecting you to the content and analyses you create), so if you have any reservations about using your personal Gmail account, feel free to make a new account just for DSL. Gale’s privacy policy states:

    When logging in to the Gale Digital Scholar Lab App using either Google or Microsoft, the App accesses the user’s Google Drive or Microsoft OneDrive through an anonymous access token that is created when users first log in. This anonymous token is generated in order to connect users to the content and analysis they create in the Digital Scholar Lab. The App does not collect, read, access, or store any of the data from a user’s Google Drive or Microsoft OneDrive account(s), nor does it access any open documents. In addition, the App does not access or share personal information as part of this process.

    You can also email privacy@cengage.com for more details.

    If this is acceptable to you, click on Log In / Create Account,
    Digital Scholar Lab login page

    choose Google, and then log in with a Google account.
    Log in popup with Sign in with Google highlighted

  4. When you return to the Digital Scholar Lab homepage, you should notice your name in the top-right corner, under Signed in. The central bar has changed from a login prompt to a search bar. You're in!
    Homepage after being logged in

Collection

(back to Table of Contents)

The DSL has access to Gale's extensive digital collections of newspapers, books, and other archival material. In this tutorial, we will focus on texts by and about an early twentieth-century scholar named Sir Aurel Stein, who travelled throughout South and Central Asia and China as an agent of the British Empire. Many materials related to him (works by him, newspaper articles about his expeditions, etc.) are available in the DSL, mostly ranging from the 1920s to the 1940s, drawn from a variety of newspapers (The London Times, Illustrated London News, The Daily Mail, The New York Times) and other digitised holdings from libraries and archives (the British Library, the Smithsonian, the American Antiquarian Society, the National Library of China).

  1. Now that you’re logged in, you will need to create your corpus (the collection of texts on which you will run statistics and create visualizations). Although you can use the toolbar in the centre of the homepage to get started with a basic search, let’s use the advanced search for a little more precision. Click on Advanced search under the search toolbar.
    Digital Scholar Lab homepage with advanced search highlighted

  2. You will arrive at the Advanced Search page.
    Advanced search screen
    It is structured like a library catalogue search page. You can use Boolean operators like AND, OR, and NOT to create complex search queries. Gale also automatically suggests metadata (e.g. authorship, publication date, language) from their collection as you type.

  3. Note: this is the first screen where you can view DSL's learning videos. Feel free to watch these as you complete these steps to get an introduction to the features available on that page.

  4. Let’s search for Aurel Stein as an Author. Begin by clicking on the dropdown menu on the right of the first field.
    location of the dropdown menu

  5. Change it from Keyword to Author.
    Dropping down to select author

  6. Then, click in the search bar and type Aurel Stein.
    Advanced search with search bar selected

  7. Once you start typing his name, if you have “Author” selected as the search field, you will see at least four different variations of his name appear.

    Note: DSL automatically suggests metadata like author names from its collections, but only if the correct field (e.g. author, keyword) is selected first. Because different databases have entered Stein's name differently, there are four variations of his name (and thus four different "authors" when searching). Although this is not ideal, it is very common when working with data aggregated from many places. Let’s say that we want everything written by Stein, regardless of how his name appears in the various databases.

  8. Select the first option, “Aurel Stein K. C. I. E.”. 
    Typing "aurel stein" results in several automatic recommendations

  9. Because there are four variations of his name and we want texts that are attributed to any of them, we will use an OR operator. Click on the left dropdown menu and select OR.
    Changing AND to OR
    Then, repeat the above steps, first changing Keyword to Author, then typing Aurel Stein, and selecting the next variation on his name.

  10. Since there are four variations of his name, we need more than three search terms. Click on Add a Row at the bottom to add a fourth row.
    Add a row button

  11. Once you are done, there should be four lines joined with OR. Click Search.
    Final result, with four lines of authors

  12. You now see the search results page.
    Search results page

  13. For each result, you will see its title, its author, its OCR Confidence Percentage (see the note below), and a preview of the text.
    basic information for the first result highlighted

  14. Additionally, there is some metadata to the right, including the year of publication and the archive, source, and type of the text.
    Metadata highlighted

    Note: OCR refers to Optical Character Recognition. It is a process whereby a program attempts to produce machine-readable text from a scanned document. The OCR Confidence percentage is an overall score that represents Gale’s confidence in the OCR quality of a specific text.

    One factor in their confidence level is the specific OCR algorithm used, since newer OCR algorithms typically perform more accurately than older ones. According to their documentation, over the nearly 20 years Gale has been collecting and scanning documents they have used a variety of OCR algorithms. Gale DSL currently uses Adobe Acrobat with ABBYY5 to OCR scanned documents.

    The OCR Confidence percentage additionally relies on other factors, such as the condition of the original document, the quality of the scan, what kind of text is featured in the document, and whether or not there are images in a document.

    A caution: the confidence level is useful but not perfect, as some documents can have a lower confidence than they deserve (if they feature lots of images), and some documents with high confidence can still include OCR mistakes. In other words, the confidence percentage is not a replacement for human eyes. Some of the oldest scans won’t have a confidence percentage at all, since they predate that system.

  15. Let’s use all of these texts. Click Select All. 
    Search results with Select All highlighted

  16. Then click Add to Content Set on the top right corner.
    All results selected, Add to Content Set button is highlighted

  17. Select New Content Set.
    New Content Set button

  18. Name it Stein and click Create. Close the notification that follows.
    Create Content Set popup

  19. Now, let's add some additional texts to this corpus. On the main toolbar, click on Build.
    Build button highlighted on search results page

  20. Then, click on Advanced Search.
    Build Page with Advanced Search highlighted

  21. The first time you used Advanced Search, you found texts by Aurel Stein. Now, let's look for texts about him, by searching for his name in the Keyword field. Just like with steps 4-9 above, type Aurel Stein into the search bar, with Keyword selected this time, select one of the variations that appears, and repeat with the next line.
    Selecting "aurel stein" as keyword in advanced search

  22. Like before, add a fourth line to accommodate all four variations of his name.
    Advanced search, selecting OR and adding a fourth row

  23. Be sure to use OR to separate all rows.
    All four lines are added, connected by OR

  24. Finally, click Search.
    All four lines are added, now Search is highlighted

  25. Now there are many more texts. The left sidebar menu offers ways of filtering this dataset.
    Search results, with filters on the left

  26. Just to keep our English-focused analysis consistent, scroll down, and under Publication Languages, click on English.
    Sidebar menu with English highlighted

  27. Let's add the remaining documents to our collection. Scroll to the top of the page, click Select All. 
    Results with Select All highlighted

  28. Then, click Select All (295) Results.
    Popup with Select All (295) results highlighted
    Note: Gale constantly adds new items to its collections, so your number of results may differ from those shown here.

  29. Add this to the content set you just created.
    Add to Content Set button highlighted

  30. Note that the DSL automatically avoids adding duplicates, notifying you that 282 (of 295) documents were added. Again, your number may be different. Click Close.
    Notification that 282 documents were added to the content set

  31. To view more information on a specific item, click it, and you will be taken to the Doc Explorer view. Click the first item, "Sir Aurel Stein and Central Asia."
    Clicking on the first item in the list

  32. You’ll see the document in its original context on the left, with the text highlighted and the keywords selected. You’ll also get the complete text file on the right.
    Doc Explorer view

  33. By scrolling down, you can see the scanned text on the left, with its text highlighted in light blue, and the specific keywords ("Aurel Stein") highlighted in green.
    Scanned newspaper, with the selected text highlighted

  34. Above, you have options to cite, download, email, and print the text.
    Toolbar with options to remove, cite, export, download, email, and print the text
    If you click on “Learn how this text was created,” you’ll receive basic information on Gale’s OCR process, and a link to further documentation.
    Highlighted link: "Learn how this text was created"

  35. Let’s take a look at our collection. Click on Manage, located at the top right of the screen. You’ll get an overview of where the texts in this corpus come from.
    Highlighted "Manage" button under Active Content Set

  36. To return to the list of texts, click on the Documents button.
    Highlighted document button on the collection overview
    By clicking on Documents, you can view and manage the collection, removing texts if you wish. Having a collection means that you don’t need to rerun searches every time you log into the DSL; you can edit and refine your collections and rerun past searches.

You have now run two advanced searches, built a corpus, and learned how to examine individual documents as well as the collection overview. Now let's move on to cleaning, where you prepare your texts for analysis.
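
If it helps to see the logic of those two searches spelled out, here is a minimal Python sketch of what joining rows with OR and then merging results into one content set amounts to. The records, IDs, field names, and the second name variant are all invented for illustration; this is not how the DSL is implemented.

```python
# Hypothetical records standing in for DSL documents (all values invented).
documents = [
    {"id": "doc1", "author": "Aurel Stein K. C. I. E.", "text": "Aurel Stein on the ruins of Khotan"},
    {"id": "doc2", "author": "Stein, Aurel", "text": "Notes from Kashgar"},
    {"id": "doc3", "author": "A Correspondent", "text": "Sir Aurel Stein returns"},
    {"id": "doc4", "author": "A Correspondent", "text": "Weather report"},
]

def search(docs, field, variants):
    """Return the IDs of docs whose field contains ANY variant (a Boolean OR)."""
    return {
        d["id"]
        for d in docs
        if any(v.lower() in d[field].lower() for v in variants)
    }

# First search: texts BY Stein, with the name variants joined by OR.
by_stein = search(documents, "author", ["Aurel Stein", "Stein, Aurel"])

# Second search: texts ABOUT Stein, matching his name as a keyword.
about_stein = search(documents, "text", ["Aurel Stein"])

# Adding the second result set to the content set skips duplicates,
# because a set holds each document ID only once.
content_set = set(by_stein)
before = len(content_set)
content_set |= about_stein
added = len(content_set) - before

print(sorted(content_set))
print(f"{added} of {len(about_stein)} documents were added")
```

Here `doc1` matches both searches, so only one of the two keyword results is new. This is the same reason the DSL reported adding fewer documents than the second search returned.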

Note: as of June 2020, DSL has a beta feature allowing you to upload your own texts. While not necessary to complete this tutorial, you might wish to learn how to do this for your own research. Link to guide on uploading texts to DSL

Cleaning

(back to Table of Contents)

  1. Click on the Clean tab in the toolbar.
    Highlighting the Clean button on the main toolbar

  2. This is the Cleaning Configuration page, specifically the default configuration.
    Cleaning configuration page

  3. Cleaning configurations produce higher quality analysis and visualizations by removing errors and extraneous data. The default cleaning configuration is a good start for most projects, but based on previous testing, it leaves in a lot of junk data with our current collection. For better results, let's make our own cleaning configuration. Under Clean, click on "+ New Configuration."
    New Configuration button highlighted

  4. Name it Stein and click Submit.
    Create Configuration popup, with Stein input as name and Submit button highlighted

  5. To make our corpus a little more useful, under Cleaning Configuration, check “Remove all extended ASCII characters”, “Remove all number characters”, “Remove all special characters”, and “Remove all punctuation”. Leave the other settings at their defaults.
    Cleaning configuration with boxes for removing ASCII characters, numbers, special characters and punctuation highlighted
    Note: Extended ASCII characters are characters used in languages other than English (e.g. accented letters like é), as well as for some typesetting and mathematical uses. Since we’re exclusively working with modern English sources in this collection, extended ASCII characters will only appear as errors of the OCR process, and therefore excluding them will give us more meaningful results. Similarly, by excluding punctuation, numbers, and special characters, we will prevent the DSL from treating individual numbers or punctuation as words.

    If we leave punctuation in, the Ngram tool reveals that the most popular "word" is the period (full stop), which is not very useful:

    Terrible word cloud dominated by numbers and punctuation marks
    Don't be like this; be sure to exclude punctuation, numbers, and special characters!

  6. Also, when you create a new configuration, you need to set a list of stop words. Stop words are common words, like “a” and “you,” that we filter out before running analyses on our corpus. If we don’t exclude them, then it turns out that the most common word in almost every English corpus is “the.” Under Stop Words, click Choose a Starter List.
    Choose a Starter List link highlighted

  7. Select English, then click Select starter lists.
    Choose a Starter List popup, English checked, Select starter lists highlighted
    Note: you can select multiple languages, if you are working with a corpus that includes texts in multiple languages. We'll stick with just English for this collection.

  8. Your Stop Words list is now populated with the most common English words.
    Stop Words list now contains a list beginning with a, about, above

  9. You can add words to your stop word list by typing them in. You must separate each word with paragraph returns (i.e. by hitting the Enter key). Add the following words to the stop word list by copying and pasting them into the list above the first word in the English stop word list ("a"):
    b
    c
    d
    e
    f
    g
    h
    j
    k
    l
    m
    n
    o

    p
    q
    r
    s
    t
    u
    v
    w
    x
    y
    z
    th
    pp
    Sec
    ii
    iii
    iv
    vi
    vii
    viii
    ix
    xi
    xii
    xiii
    xiv
    xv
    xvi
    xvii
    xviii
    xix
    xx
    xxi
    xxii
    xxiii
    xxiv
    xxv
    xxvi
    xxvii
    xxviii
    xxix
    xxx

    No.
    Sec.

    Stop Word list with both the new list and the preexisting English list
    Be sure to avoid erasing the English stop word list.

    Note: by adding these "words" to the list, which are a mix of abbreviations (like "Sec." for "section"), Roman numerals, and individual letters, we do two things: (1) we remove commonly used words that are related to the structure rather than the contents of the text, like "Sec." or the Roman numerals; and (2) we account for some OCR errors. Since the OCR process sometimes misreads a word, inserting a space where there should be none (e.g. misreading "apple" as "a pple"), single or paired letters appear as common "words" in some text collections. This isn't true for every collection, but it is true for this particular collection.

  10. Cleaning configurations can also replace words. One use is to treat variations of a key word or phrase as a single word. To prevent the various tools from treating "Stein," "Aurel Stein," "Sir Aurel Stein," etc. as different words, scroll down to the Replacements section.
    Replacements section

  11. In the first row, under "Replace this...", type Sir Aurel Stein. On the same row, under "With this...", type Aurel Stein.
    Replacing Sir Aurel Stein with Aurel Stein

  12. On the next row, replace Aurel with Aurel Stein. Repeat this process on the next row, replacing Sir Stein with Aurel Stein.
    Replacing both Aurel and Sir Stein with Aurel Stein
    Now all of these variations on his name will show up as the same version, "Aurel Stein," in our future analysis and visualizations.

  13. Finally, click Save.
    Save button highlighted
    Note: when you work with your own projects, you might need to adjust your cleaning configuration several times in order to remove errors specific to your texts.
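
For readers curious about what a cleaning configuration does to a text, the following minimal Python sketch mimics the options chosen above: replacements, removal of extended characters, numbers, and punctuation, then stop word filtering. The order of the steps, the shortened stop word list, and the function name `clean` are assumptions made for illustration; the DSL's internal processing may differ.

```python
import re
import string

# A tiny stand-in for the stop word list; the DSL's English starter list
# is much longer.
STOP_WORDS = {"a", "the", "of", "in", "and", "b", "pp", "sec", "ii"}

# Replacement pairs, applied in the order listed.
REPLACEMENTS = [
    ("Sir Aurel Stein", "Aurel Stein"),
    ("Sir Stein", "Aurel Stein"),
]

def clean(text):
    """Return the words left after applying the cleaning steps (assumed order)."""
    for old, new in REPLACEMENTS:
        text = text.replace(old, new)
    # Drop every character outside basic ASCII (the "extended" characters).
    text = text.encode("ascii", errors="ignore").decode("ascii")
    # Remove digits, then punctuation and other special characters.
    text = re.sub(r"\d", "", text)
    text = text.translate(str.maketrans("", "", string.punctuation))
    # Filter out stop words, ignoring case.
    return [word for word in text.split() if word.lower() not in STOP_WORDS]

# "\u00a9" is a stray copyright sign, standing in for an OCR error.
sample = "Sir Stein visited Kashgar\u00a9 in 1907 (Sec. ii, pp. 12-14)."
print(clean(sample))
```

Only "Aurel", "Stein", "visited", and "Kashgar" survive: the structural leftovers ("Sec", "ii", "pp") are caught by the stop word list once the punctuation and numbers around them are gone.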

Tools

(back to Table of Contents)

  1. Now that we have our text collection and our cleaning configuration, let’s try some tools. To get started, click on the Analyze button in the toolbar.
    Toolbar with Analyze highlighted

  2. Under the Content Set dropdown menu, click on Select Content Set.
    Analyze page with the Select Content Set dropdown menu highlighted
    Select Stein.
    Selecting Stein from the dropdown menu
    Now, click Add Tool.
    Add Tool button highlighted

  3. Here you can pick from Gale’s six tools. For this particular project, add all six tools: Document Clustering, Named Entity Recognition, Ngram, Parts of Speech Tagger, Sentiment Analysis, and Topic Modeling. This selection includes a mix of qualitative and quantitative tools.
    Adding Document Clustering and Named Entity Recognition
    Adding Ngram
    Parts of Speech Tagger and Sentiment Analysis are added
    Adding Topic Modeling

  4. Finally, click Done.
    Bottom of Add Tool screen, with Done highlighted

  5. The Analyze page now lists your chosen tools.
    Note: once you run the various tools, you can return to this page to view your analyses and visualizations.
    Analyze page with four tools now available

Document Clustering

(back to Table of Contents)

  1. On the Analyze page, under the Document Clustering tool, click View.
    Analyze page, under Document Clustering tool, with View highlighted
    Note: while there is an option to run multiple (or all) tools simultaneously, without prior setup, this option would use the default cleaning configuration. Since we made a custom cleaning configuration, and so that you can see some of the options for each tool, we’ll run each tool individually.

  2. Just like with the cleaning configurations, you can customise your settings for each tool. Name the tool setup Stein Cluster, set the Cleaning Configuration to Stein, and set the number of clusters to 3.
    Tool Setup page, with the name, cleaning configuration, and number of cluster fields highlighted

  3. Finally, click Run.
    Document Clustering tool with Run highlighted

  4. The tool will then begin processing.
    Document Clustering tool with large Processing sign

  5. This analysis job is processed on Gale’s servers. Let’s wait a minute.
    Note: every tool has an About icon, which gives you a popup with some basic information and a link to fuller documentation. All of the tools in the Digital Scholar Lab are free tools you can download and run on your own computer. Some run on their own, while others, like Document Clustering, rely on a programming language like Python.
    Click on the About button for information about how this tool works.
    Document Clustering tool with About button highlighted
    Note: the Document Clustering tool uses the Python programming language to run a Python library called scikit-learn. It uses machine learning, specifically the k-means clustering algorithm, to group the documents based on their similarity in terms of vocabulary. If you want additional information, or if you want to try downloading and running the tool on your own computer, click on Learn More.
    About Document Clustering window with links highlighted
    Click OK to dismiss the popup. Then refresh the page (Ctrl+R or F5 on Windows, Command+R on a Mac).

  6. Once the Run Status says Completed, click on the words Scatter Plot under the scatter plot icon. If, after refreshing, it still says Processing, wait a minute and refresh again.
    Tool Setup page with Run Status Completed

  7. This process should create a graph similar to this one:
    scatterplot with hundreds of dots clustered into three colour-coded arrangements
    There are hundreds of dots on a scatter plot, in a roughly triangular arrangement, clustered into three colours: black, orange, and green. Each dot is an individual text, and the more similar two texts are (in terms of their vocabulary), the closer their corresponding dots are. Each cluster is colour-coded and uses a different symbol for each point.
    Note: Although there are scales on the X and Y axes, they have no meaningful numeric significance. Instead, they provide a reference for measuring distance or closeness between individual points. For more information on the mathematics of how this algorithm works, see this video on the k-means algorithm.
    Mouse over one of the dots; a popup appears with the title of the text you chose. Clicking will bring you to the Doc Explorer view of that text, but don't do that, as we want to look at the corpus as a whole.
    Same scatterplot, with the mouse cursor over one dot. The text's title, "Buddhist Paintings at the Festival of Empire", appears in a popup.
    Now, a word of caution: what looks similar to the algorithm might not be meaningful to humans. In our case, however, there are some meaningful differences that look to me a little bit like different genres. Mouse over several of the dots and see if a pattern emerges. It looks roughly like Cluster 1 consists of short notices and bulletins, including biographical details (Stein’s publications and obituary). Cluster 2 contains brief reports of his expeditions and discoveries. Cluster 3 seems to be reviews of his published books. Note also that a number of book advertisements with similar names (MacMillan and Co) are clustered together, showing that the clustering tool recognises that they are similar in terms of their vocabulary.

  8. For this scatter plot, you can download both the graph itself and the data used to generate it. On the toolbar, click Download.
    Document Clustering page with Download button highlighted

  9. On the Download Options window that appears, under Visualization download formats, select PNG and then click Download.
    Download options window
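
To show the idea behind the clustering described above, here is a minimal sketch of k-means on word counts. The DSL runs scikit-learn; this sketch instead uses only the Python standard library so that each step is visible, and it is not the DSL's implementation. The six miniature "documents", the function names, and the deterministic choice of starting centroids are all assumptions made for illustration.

```python
import math
from collections import Counter

# Invented miniature "documents"; real DSL texts are far longer.
docs = {
    "Notice A": "lecture society meeting notice lecture",
    "Notice B": "society meeting notice members",
    "Report A": "expedition desert ruins discovered expedition",
    "Report B": "expedition ruins desert caravan",
    "Review A": "book review volume chapters author",
    "Review B": "book volume review illustrations",
}

vocab = sorted({word for text in docs.values() for word in text.split()})

def vectorize(text):
    """Turn a text into a unit-length vector of word counts."""
    counts = Counter(text.split())
    vec = [counts[word] for word in vocab]
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

def distance(a, b):
    """Euclidean distance between two vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def kmeans(points, k, iterations=10):
    # Deterministic start (an assumption, for reproducibility): take the
    # first point, then keep adding the point farthest from the centroids
    # chosen so far.
    centroids = [points[0]]
    while len(centroids) < k:
        centroids.append(
            max(points, key=lambda p: min(distance(p, c) for c in centroids))
        )
    labels = []
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        labels = [
            min(range(k), key=lambda i: distance(p, centroids[i]))
            for p in points
        ]
        # Update step: each centroid moves to the mean of its members.
        for i in range(k):
            members = [p for p, label in zip(points, labels) if label == i]
            if members:
                centroids[i] = [sum(col) / len(members) for col in zip(*members)]
    return labels

titles = list(docs)
labels = kmeans([vectorize(docs[t]) for t in titles], k=3)
for title, label in zip(titles, labels):
    print(label, title)
```

The two notices, the two reports, and the two reviews each end up sharing a cluster label, because texts with overlapping vocabulary sit closer together. The label numbers themselves are arbitrary, just as the cluster colours in the DSL scatter plot carry no meaning of their own.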

Named Entity Recognition

(back to Table of Contents)

  1. Let’s try our second tool. Click on Analyze. 
    Analyze button on toolbar highlighted
    Then under Named Entity Recognition, click View.
    Analyze page, with View under Named Entity Recognition highlighted

  2. The Named Entity Recognition tool will extract common and proper nouns from our texts, and then attempt to classify them as people, titles, places, political entities, and so on. Name our tool Stein NER, set the Cleaning Configuration to Stein, and click Run.
    Named Entity Recognition tool setup with name and cleaning configuration chosen

  3. While you wait, feel free to click on the About button for information about how this tool works.
    Note: Named Entity Recognition uses a tool called spaCy, which runs in the Python programming language. It breaks up sentences into their component words, and then classifies the nouns. By default, it uses an existing vocabulary, and then tries to predict from context what classes new nouns fall into. It then tags and colour-codes these categories, and tells us where these terms can be found in our corpus.
    Click refresh and then click on Entities Found.
    Named Entity Recognition completed, with the results button highlighted

  4. The Named Entity Recognition tool displays the set of nouns with the highest count across your corpus. The algorithm has classified, or in some cases guessed, into which category each word falls. Looking over the first few entries, it looks fairly good: it has correctly identified that "Chinese", "British", "Indian", and "Soviet" refer to cultural groups, that Aurel Stein is a person, and that "today" is a date. There are some issues, likely due to the unfamiliar terms: “Kashgar” is actually a city, not a person, but most entries are correct. I doubt that Kashgar, a Central Asian city, appears in the tool's default dictionaries, so in this case the tool guessed that Kashgar was a person based on its usage in various sentences.
    Results of the Named Entity Recognition tool
    You can click on any entity for more information about where it appears in our corpus, and what other categories it also falls into. Let's try this.
    Click on the first entry, Chinese.
    results, with first entry Chinese highlighted
    Let's examine the results for this particular named entity.
    Popup for the term Chinese. Identified as Cultural group, secondarily as a language and kind of person. Below, a list of documents.
    At the top, "Chinese" is identified primarily as a Cultural Group, which means that in your corpus, "Chinese" most frequently refers to groups of Chinese people or officials. Below, under "Term also identified as...", "Chinese" is also identified less commonly as a language - the Chinese language - and as a person, or rather, as people described as Chinese. Because Stein was primarily travelling through western China, often dealing with Chinese officials and manuscripts in Chinese, it makes sense for it to be the most common entity and also for these three meanings to dominate.
    In the second section of the popup window, there is a list of all documents in the corpus in which the term "Chinese" appears.
    Click on "A Chinese Expedition Across the Pamirs and Hindukush A.D. 747." 
    results popup with "A Chinese Expedition..." highlighted in the list of documents

    Gale DSL will then present you with a view of the document, marked up with tags showing how it has identified each entity within the document.
    Document view with the cleaned text, with tags like DATE after February, GEOGRAPHY after Asia
    Note that this version of the text has been filtered through the cleaning configuration, so many common words (e.g. "the") and punctuation are missing.
    Now that you have seen how Named Entity Recognition works on a particular document, close this window by clicking on the X.
    popup with X highlighted

  5. You will now return to the popup for the term Chinese. Another feature available in the Digital Scholar Lab is to use the Named Entity Recognition tool to identify documents related to a single topic, and then create a new collection with some or all of the documents with that term. Let's say that you are interested specifically in Stein's relationships with the Chinese and want to create a sub-collection focusing on Chinese art and archaeology. Check the boxes next to "A Chinese Expedition Across the Pamirs and Hindukush A.D. 747," "Exploration in Central Asia," "Buddhist Paintings at the Festival of Empire," and "The Treasures of Asia." Then click on Add to Content Set.

    You can either add these documents to an existing content set or create a new one. Select "New Content Set".
    Popup with New Content Set highlighted
    Name it "Chinese art and archaeology", then click Create.

    Create Content Set popup with Name added and Create highlighted

  6. You have now created a new collection. Click on Add (next to the content set "Chinese art and archaeology") to add the selected documents to this collection. Now, close the notification and then close the window.
    Adding selected texts to the newly created collection
    Note: you can check to see the new collection by clicking on My Content Sets. Named Entity Recognition provides an additional way of creating content sets outside of searches. Be sure to return to the Named Entity Recognition tool to continue with the next step.

  7. There are many categories that aren’t helpful right now, so let’s filter so that we just have categories we’re interested in. First, uncheck Entity Categories.
    entity categories highlighted
    Then, check Geo-Political Entity, Person, and Cultural Group.

    All entities are unchecked, and the geo-political entity, person, and cultural group are highlighted
    You will now have a list consisting of just those three categories.
    entity list with just three categories displayed
    Now we can begin to answer questions like, “Which cultural groups are mentioned most often in association with Aurel Stein?” or “Who associated with him? Who does he write about?” As with the Clustering tool, you can download the complete data for further analysis by clicking on the Download button.
    download button highlighted
    Although there are no data visualizations, like graphs, you can download the data as either CSV (a spreadsheet) or JSON. Click on CSV and then Download.
    Download options with CSV and Download highlighted
    Note: this dataset might need some cleaning, as you saw with the city of Kashgar being incorrectly interpreted as a person.
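Once the CSV is downloaded, a few lines of Python can reproduce the category filter from step 7 and rank the entities in each category. The following is a minimal sketch only: the column names (entity, category, count) and the sample rows are assumptions made for illustration, not the actual layout or contents of the DSL export, so check the header row of your own file and adjust the names accordingly.

```python
import csv
import io
from collections import defaultdict

# Hypothetical sample standing in for the downloaded file. The column names
# and the counts are invented for illustration; the real export may differ.
SAMPLE_CSV = """entity,category,count
Chinese,Cultural Group,120
Chinese,Language,15
Kashgar,Person,40
Kashgar,Geo-Political Entity,95
Stein,Person,210
"""

# The three categories kept in step 7.
KEEP = {"Geo-Political Entity", "Person", "Cultural Group"}

def top_entities(csv_text, keep=KEEP):
    """Return {category: [(entity, count), ...]} sorted by descending count."""
    by_category = defaultdict(list)
    for row in csv.DictReader(io.StringIO(csv_text)):
        if row["category"] in keep:
            by_category[row["category"]].append((row["entity"], int(row["count"])))
    for rows in by_category.values():
        rows.sort(key=lambda pair: pair[1], reverse=True)
    return dict(by_category)

result = top_entities(SAMPLE_CSV)
for category, rows in result.items():
    print(category, rows)
```

A listing like this makes misclassifications such as Kashgar appearing under Person easy to spot before you analyze the data further.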

Ngram

(back to Table of Contents)

  1. One popular tool is Ngram (you might know of it through Google Ngram Viewer), which counts the frequency of specific words in a corpus. Click on Analyze and then click View under the Ngram tool.
    Analyze page, with View under Ngram highlighted

  2. You’ll see that the Ngram tool has a few settings. Change the Name to Stein Ngram and the cleaning configuration to Stein; leave the Ngram sizes at their defaults (min 1, max 4); set Ngrams Occurrence Threshold to 5; and set Number of Ngrams Returned to 75. Finally, click Run.

  3. The Ngram tool counts the number of times a term or a phrase occurs in our corpus and then graphs it. The Ngram size option refers to the length of the term: 1 means a single word, whereas 3 means that it looks for three-word phrases. By default, the tool looks for single words or phrases up to four words long. Ngrams Occurrence Threshold is the minimum number of times a word or phrase must occur in the corpus to be part of the Ngram. By setting it to 5, we cut out words or phrases that occur fewer than five times in our thousands of documents. Number of Ngrams Returned determines the number of terms that appear in our visualizations. The default is 1000 but we've reduced that to 75 for visualization purposes. Although you might want to know the top 1000 terms for your research, that many terms can look cluttered in a word cloud. One key point is that this tool ignores any terms that are composed entirely of words from our stop word list. Once a minute has passed, refresh the page.

  4. Now the word cloud and bar chart are both available. Click on the word cloud.
    Completed Ngram tool, with Word Cloud highlighted
    This image visualizes the most common words in the corpus, while accounting for our specific cleaning configurations and tool settings.
    Colourful word cloud with 75 key terms, among them Chinese, Russian, Stein, British, road, river, etc.
    If you mouse over a term, you can see how many times it appeared.
    Interpretation: the more frequently an ngram appears in the corpus, the larger it appears in this word cloud. As we saw with the Named Entity Recognition tool, "Chinese" (referring to Chinese culture, Chinese people, and the Chinese language) is a major topic in writing by or about Stein. In addition, British, Majestys (i.e. "In His Majesty's Service"), Subject(s), and other terms referring to British imperialism from 1900-1940 appear. There are several terms related to specific places (e.g. Kashgar) and official positions (e.g. Taoyin, Amban) in China or the Chinese government. Since Stein recorded his travels, many words related to geography (route, river, stream, valley, pass, road, ground, bank, etc.) occur. "Russian" and "Soviet" are also major terms because the British Empire was competing at this time with Russia (both imperial and then later Soviet) for control over the territories between British India and Russian Central Asia. (For more information, look up The Great Game (in English) or the Tournament of Shadows (in Russian).)

  5. Word clouds are popular and it’s useful to be able to generate them. Click on Download.
    Ngram result page with Download highlighted
    Here you can download either the Word Cloud image or the underlying data from the Ngram tool (which lists the most frequent words and phrases in the corpus). Let's download the image as a PNG. Click on Term Frequency PNG and then Download.
    Download options popup with PNG highlighted
    Note: unfortunately, there is no way to control the position or colour of specific words - they appear to be randomly generated. By resizing your browser window, you resize (and thus randomly reposition) some of the words in the word cloud.

  6. Let’s switch to the bar chart. On the left sidebar menu, click on Bar Chart.
    Ngram results page with Bar Chart highlighted

    This provides another way of looking at word frequency. Here, we can see at a glance which words were most popular. 
    Horizontal bar chart displaying terms in descending order. Top words are Chinese, Kashgar, British, Stein, and road.
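The settings described in step 3 can be sketched in a few lines of Python. This is an illustrative approximation only, not the code that DSL runs: the function name count_ngrams and its parameters are hypothetical, and the sample sentence is made up. It shows the minimum and maximum Ngram sizes, the occurrence threshold, the number of Ngrams returned, and the rule that terms made up entirely of stop words are ignored.

```python
from collections import Counter

def count_ngrams(tokens, stop_words, min_n=1, max_n=4, threshold=5, top_k=75):
    """Count n-grams of min_n..max_n words, skipping any made up only of stop words.

    Returns up to top_k (ngram, count) pairs whose count is at least threshold.
    """
    counts = Counter()
    for n in range(min_n, max_n + 1):
        for i in range(len(tokens) - n + 1):
            gram = tuple(tokens[i:i + n])
            if all(word in stop_words for word in gram):
                continue  # composed entirely of stop words, so ignored
            counts[" ".join(gram)] += 1
    frequent = {gram: c for gram, c in counts.items() if c >= threshold}
    return Counter(frequent).most_common(top_k)

# Made-up sample text; a lower threshold is used because the sample is tiny.
tokens = "the river road to kashgar follows the river road".split()
result = count_ngrams(tokens, stop_words={"the", "to"}, max_n=2, threshold=2)
print(result)
```

Note that "the river" is still counted here, because only terms consisting entirely of stop words are dropped; a phrase that mixes stop words with other words survives.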

Parts of Speech Tagger

(back to Table of Contents)

  1. The Parts of Speech Tagger compares the writing styles of different authors by counting their use of different parts of sentences, such as proper nouns and adjectives. Begin by clicking on Analyze in the top menu. If necessary, select Stein from the dropdown menu.
  2. Under the Parts of Speech Tagger tool, click View.
    Clicking on View under Parts of Speech Tagger
  3. The cleaning configuration we have created removes many parts of speech that this tool counts. In order to show how cleaning configurations affect this tool, and to better understand the various authors' styles, we'll run this tool twice, once without a cleaning configuration, and once with. Begin by naming this tool setup "Stein no cleaning". Change the cleaning configuration to None. Then finally, click Run.
    Filling in the fields for the Tool Setup
  4. Wait for a minute, then refresh. Once the analysis is complete, click Series View.
    Analysis complete
  5. With the tool complete, DSL will produce a graph comparing the styles of several authors. Each tick on the X-axis represents a different part of speech - pronouns, proper nouns, adjectives, etc. - and the Y-axis represents their frequency. Differently coloured lines, with different symbols for each point, represent the various authors.
    Line graph with 10 authors overlaid in a jumble of 10 coloured lines
    This is a useful start, but it is too cluttered. There are far too many authors to meaningfully parse; moreover, Aurel Stein doesn't appear on this graph!
  6. On the left sidebar is a list of authors. Click on the coloured symbol next to the first ten authors to deactivate them.
    Author Filter sidebar with the icons for the first 10 authors highlighted
    All of the symbols and names should now be greyed out.
    Author filter sidebar with all authors deselected
  7. In the Author Filter, type "stein" (without quotation marks). Select "Aurel Stein.", which should be the first option. Leave off the others.
    Author Filter with Stein in the filter bar and three variations of his name appearing in the search
    Note: Stein's name appears several times here because his name has been entered with variations in the documents in DSL. This tool treats each variation of his name as a different author. The variation we chose above is associated with the most texts.
  8. After adding Aurel Stein, type "giles" in the Author Filter. Add "Lionel Giles" and "Lionel Giles;". Both authors will be added to the accompanying graph.
    Giles in the author filter; Stein and Giles appear on the graph
    Note: Lionel Giles was a contemporary of Stein's, and was a scholar who worked at the British Library. His name appears twice because in one text, "Lionel Giles" and "Lionel Giles;" (note the added semicolon) are both added as authors. This was probably done so that either version would appear in searches, but it causes some confusion for this tool.
  9. Type "mirsky" in the Author Filter and add Jeannette Mirsky.
    Searching for Mirsky in the sidebar
    Note: Jeannette Mirsky wrote a biography of Stein in 1977. Although she writes about much the same content as he does, she wrote much closer to the present day, so any differences revealed by the Parts of Speech Tagger are likely due to their different styles rather than content.
  10. Type "lattimore" in the Author Filter, and add Owen Lattimore.
    Searching for Lattimore in the sidebar. Both Stein and Lattimore are symbolised by pale blue triangles
    Note: Owen Lattimore was a scholar of Asia, but his writing style differs markedly from Stein's.
    Both Aurel Stein and Owen Lattimore are represented by the same symbol (a downward-pointing triangle) and the same shade of blue. Unfortunately, there is no way to adjust symbolization in the DSL, so this is an issue you should watch for.
  11. Finally, delete all text from the Author Filter to see the complete legend.
    Removing all text from the author filter
    The final version of this graph reveals some major differences in writing styles: Lattimore uses a much higher proportion of proper nouns than anyone else and Mirsky uses the highest proportion of adjectives.
    Second version of the graph, with just four authors.
  12. Now, let's try rebuilding this chart, but with our existing cleaning configuration. In the top right corner, click on Tool Setup.
    Page with the results of the tool, with Tool Setup highlighted
    Above the existing tool setup, click on New tool setup.
    Tool setup, with New tool setup highlighted
  13. Name this setup "Stein with stopwords". Change the cleaning configuration to Stein. Finally, click Run.
    Filling out fields for the new tool setup; cleaning configuration is now Stein
  14. Wait a minute, then refresh the page. When the run status is completed, click Series View.
    Tool result, with two entries in the Run History, and the Series View button highlighted
  15. Just as before, use the Author Filter to add Aurel Stein, Lionel Giles, Jeannette Mirsky, and Owen Lattimore. Deselect all other authors.
    Highlighting the Author Filter bar
    Now there are practically no conjunctions, particles, adpositions, or pronouns. As a result, some of the other parts of speech take on proportionally more weight. It is up to you as a researcher to determine which cleaning configuration - or none at all - works best for your aims. When in doubt, using no cleaning configuration is best for the Parts of Speech Tagger tool.
    Third version of the graph, with some parts of speech removed by the cleaning configuration

That's it! You have learned how to use the Parts of Speech Tagger to begin comparing the styles of different authors, and you have learned what kind of impact the cleaning configuration has on your data.
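To see why removing stop words makes the remaining parts of speech "take on proportionally more weight," consider the following minimal sketch. It does not use the DSL's tagger: the sentence is hand-tagged, the tag names are assumptions modelled on common part-of-speech labels, and the function pos_proportions is hypothetical.

```python
from collections import Counter

def pos_proportions(tagged_tokens, stop_words=frozenset()):
    """Return each tag's share of the tokens that survive stop-word removal."""
    kept = [tag for word, tag in tagged_tokens if word.lower() not in stop_words]
    total = len(kept)
    if total == 0:
        return {}
    return {tag: count / total for tag, count in Counter(kept).items()}

# Hand-tagged example sentence (hypothetical, not DSL output).
tagged = [
    ("Stein", "PROPN"), ("crossed", "VERB"), ("the", "DET"), ("river", "NOUN"),
    ("and", "CCONJ"), ("he", "PRON"), ("camped", "VERB"), ("near", "ADP"),
    ("Kashgar", "PROPN"),
]

raw = pos_proportions(tagged)
cleaned = pos_proportions(tagged, stop_words={"the", "and", "he", "near"})
print(raw)
print(cleaned)
```

In this toy example, proper nouns make up 2 of 9 tokens before cleaning but 2 of 5 afterwards, even though the author's use of proper nouns has not changed. This is the same effect you see when comparing the two graphs above.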

Sentiment Analysis

(back to Table of Contents)

  1. Sentiment Analysis is a powerful tool for estimating the overall positive or negative emotional feelings of thousands of texts very quickly. At the top toolbar, click on Analyze. Then, scroll down to the Sentiment Analysis tool and click View.
    Adding Sentiment Analysis in the Analysis page

  2. In the Tool Setup for Sentiment Analysis, call the tool setting "Stein Sentiment", change the cleaning configuration to Stein, and finally, click Run.
    Tool Setup for the Sentiment Analysis tool

  3. As always, you can click on the About button to get more information.

    Note: Sentiment Analysis works by analyzing the words in a sentence, and looking them up in a dictionary that has a positive or negative score for many words. For example, the word "good" has a score of (positive) 3, and the word "unhappy" has a score of -2. The tool sums all scored words in a document and DSL then groups documents by year and averages their scores to produce yearly scores. You can download the full word list and their associated scores from AFINN.

    Warning: the sentiment analysis tool does not understand context or meaning. It cannot distinguish sarcastic statements from sincere ones and it will not recognize words not on its list. Furthermore, without additional coding, it does not recognize negations, e.g. that "not impressed" means roughly the same thing as "unimpressed." It also embeds certain cultural assumptions and values: one of the example phrases in the Python code used to run this tool is "Rainy day but still in a good mood," where "good" is +3 and "rainy" is -1, for a sum of 2 for this sentence. The speaker might actually enjoy the rain, but this tool cannot account for that. These criticisms do not mean that the tool is useless, but that it is most effective when dealing with a large number of relatively straightforward texts. Like all tools in the DSL, it can be powerful (giving a rough estimate of the sentiment of thousands of texts within a minute is beyond human ability) but you need to understand how it works (and where and when it does not work).

    Refresh the page. Once the Run Status is Completed, click on Time Series under Results.
    Tool Setup page with run status: completed

  4. Let's take a look at the results:
    line plot with points representing each year. mostly positive except for documents after 1980
    The x-axis represents years, and the y-axis represents the average sentiment score of all texts for a given year. This means that each point is the average of the document scores for all of the texts from that year.

  5. You can get more information about specific years by clicking on the associated point. Find the point for 1920 and click on it.
    same line plot with the point for 1920 highlighted

  6. A popup will list all documents associated with a specific year. Some years have many associated documents, whereas others have only one. You can click on document titles to go to the page for that document. Once you are done, click Close.
    popup listing three texts

  7. Some years have extreme scores, either very high or very low. Click on the point for 2011, which is the rightmost point.
    rightmost point of line graph, with the point for 2011 at -2 score (very low)

  8. There is only a single document for 2011. In general, since DSL averages the scores of all texts for each year, the most extreme points (the highest and the lowest) often have only one text. Let's investigate further. Click on the document title, "Turkey." DSL will open the doc explorer view. This text actually has almost nothing to do with Stein - it mentions him once in passing - and instead is a rather negative review of a book. Return to the Sentiment Analysis tool and click Close.
    Popup for 2011 has only a single text

  9. We can have more useful results by removing documents like these. Let's create a new collection that is a subset of this one. On the top toolbar, click My Content Sets.
    Line graph with my Content Sets highlighted

  10. Click on New Content Set.
    Content Sets page with New Content Set button highlighted

  11. Name it Stein sentiment.
    Popup for naming content set

  12. You should now see it in your list of content sets. Do not click it yet, though.
    List of all three content sets

  13. On the top toolbar, click Build. Then, on the Build screen, near Search, click Advanced Search.
    Build screen with Advanced Search highlighted

  14. Under Advanced Search, click Add a Row.
    Advanced search, with Add a Row highlighted

  15. Keep all four rows set to the Keyword field. Change the operators to OR. For each of the four fields, type Aurel Stein (as with the first content set) and pick one of the four versions of his name for each row.
    Advanced search, with four variations on Stein's names
    All four variations on Stein's name added
    However, once you are done, do not click Search yet. We are going to add more qualifications to remove problematic documents.

  16. Scroll down. Under More Options, and under Publication Year, pick Between.
    Publication Year(s) option, with Between highlighted
    For the first dropdown menu, type "1890" and press Enter. For the second year, type "1950" and press Enter.
    Years between 1890 and 1950

  17. Under publication language, pick English. Then, click Search.
    English and Search highlighted

  18. On the search results screen, underneath All Content, check Select All.
    Content page, with 234 results, and Select All highlighted
    It will then say that "All 100 results on this page are selected." Click on "Select All (234) results."
    In the popup box, specifying not just 100 results on this page, but all 234 results

  19. Now it should say, "234 results are selected." Make sure that the Active Content Set is set to "Stein sentiment".
    Results page with Stein sentiment in the Active Content Set and "234 results are selected" in the selection results

  20. Then, click the "Add to content set" button in the top.
    With all 234 results selected, highlighting Add to Content Set
    There should be a popup that says "Added 234 document(s) to Stein sentiment." Click View Content Set.
    On confirmation popup, highlighting View Content Set

  21. Since advertisements often include very positive language for marketing reasons, and since several of the advertisements in this collection only mention Stein briefly, we will improve our sentiment analysis by removing them. 
    Once you are in the Stein sentiment content set, click Documents.
    Stein sentiment collection, with Documents (234) highlighted
    On the left menu, under Document Type, click on Advertisement.
    Advertisement (28) highlighted

  22. Once you are on a screen with only the advertisements, under Documents, check Select all on page.
    Search results, with Document Type: Advertisements. "Select all on page" is highlighted
    Then, on the top right of the page, click on Remove from Content Set.

    "Remove from Content Set" is highlighted
    There will be a popup that notifies you that these documents have been removed from Stein sentiment. Click Close.
    Closing confirmation popup

  23. You've trimmed your content set down to something more focused. On the top toolbar, click on Analyze.
    Analyze button highlighted
    Change the content set to Stein sentiment if you are not there already. Then, click Add Tool.

    Remember that since this is a new content set, you have to add tools again, but we'll only add Sentiment Analysis for this content set. We'll return to the previous content set for the next tool.
    Selecting Stein sentiment on Analyze page, and highlighting Add tool
    Scroll down, and add just the Sentiment Analysis tool.
    Adding Sentiment Analysis
    At the bottom of the screen, click Done.
    After adding just Sentiment Analysis, highlighting Done button

  24. Under Sentiment Analysis, click View.
    View highlighted under Sentiment Analysis Tool Setup

  25. First, name this tool setup "Stein sentiment 1890-1950 no ads". Second, set the cleaning configuration to Stein. Third, click Run.
    Tool Setup with name, cleaning configuration, and run filled and highlighted
    Wait, refresh, and when the analysis is complete, click Time Series.

    Completed analysis

  26. On average, there are now more documents per year. After removing ads and texts from considerably after Stein's life, the remaining texts are more representative. Explore by clicking on the various points.
    Line graph of sentiment, mostly positive, with minor dips around 1931, 1939, and 1946
    One year in particular stands out for having a large number of texts with a largely negative sentiment. Click on the point for the year 1931.
    Zoom in with the point representing 1931 highlighted
    Stein had three well-regarded expeditions to western China and Central Asia. In 1931, though, his tentative fourth expedition was cancelled and he was expelled from China. Here you can see a large number of newspaper articles reporting on this event. Click on Close once you are done.
    Long list of documents for 1931, including "Sir Aurel Stein Order to Leave"
    That's it for Sentiment Analysis! You learned how to use it, interpret the results, and how to use an advanced search to create a subset of your collection to improve the tool's results.
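The scoring approach described in the note and warning under step 3 can be sketched as follows. This is a simplified illustration, not the tool's actual code: the lexicon contains only the three scores quoted above (the real AFINN list is far longer), the function names are hypothetical, the sample documents are made up, and any further processing that DSL may apply is not modelled.

```python
import re
from collections import defaultdict

# Tiny stand-in lexicon using only the three scores quoted in step 3.
LEXICON = {"good": 3, "unhappy": -2, "rainy": -1}

def document_score(text, lexicon=LEXICON):
    """Sum the scores of every word found in the lexicon; unknown words count as 0."""
    words = re.findall(r"[a-z']+", text.lower())
    return sum(lexicon.get(word, 0) for word in words)

def yearly_averages(documents, lexicon=LEXICON):
    """documents: iterable of (year, text). Returns {year: mean document score}."""
    scores = defaultdict(list)
    for year, text in documents:
        scores[year].append(document_score(text, lexicon))
    return {year: sum(vals) / len(vals) for year, vals in scores.items()}

print(document_score("Rainy day but still in a good mood"))  # 2
print(document_score("not good"))  # 3, because negation is ignored

# Made-up documents, for illustration only.
docs = [(1920, "a good season"), (1920, "an unhappy camp"), (1931, "unhappy officials")]
averages = yearly_averages(docs)
print(averages)
```

The "not good" example shows the limitation mentioned in the warning: a simple word lookup scores the phrase as positive. The yearly average also shows why a year with a single document can produce an extreme point on the graph.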

Topic Modeling

(back to Table of Contents)

  1. For the last tool we’ll look at today, let’s try Topic Modeling. Before we use this tool, we need to modify our Cleaning Configuration, because Topic Modeling is case sensitive. This means that it will consider capitalized words to be different from uncapitalized ones. At the top toolbar, click on Clean.
    Analyze page with Clean button highlighted

  2. Click on Stein under Your Configurations.
    Cleaning Configuration page with Stein configuration highlighted

  3. Under Text Correction, check the box for All lower case. Then, click on Save As on the upper toolbar.
    Cleaning Configuration page, with All lower case and Save As highlighted
    Name the new cleaning configuration "Stein lower case", then click Submit.
    Save As popup, name is Stein lower case

  4. Now, let's use this new cleaning configuration in the Topic Modeling tool. Click on Analyze (top toolbar) to get back to the analysis menu, and change the Content Set back to Stein (if it isn't already).
    Analyze dropdown menu, changing content set back to Stein

  5. Then, under Topic Modeling, click on View to set up this tool.
    Topic Modeling tool on Analyze page with View highlighted

  6. Name this tool "Stein Topics", set the Cleaning Configuration to Stein lower case, and increase the Number of Topics to 20. Leave Words per Topic and Iterations at their defaults. Click Run.
    Topic Modeling tool setup, with name, cleaning configuration, number of topics, and Run highlighted

  7. Once a minute has passed, refresh the page.
    Note: The Topic Modeling tool uses a free program called MALLET, which stands for Machine Learning for Language Toolkit, which itself uses an algorithm called Latent Dirichlet Allocation, or LDA for short. Basically, it looks for words that occur together often in the corpus, and then brings them together as a “topic.” This tool uses a certain degree of randomness, which is offset by running the tool many times in the background - this is what the Number of Iterations refers to. If you would like to install MALLET on your own computer, use this MALLET installation guide and tutorial.

  8. Once the analysis is done, click on Topic overview.
    Tool Setup for Topic Modeling, run status is Completed, and the Topics button is highlighted

  9. You will have twenty topics, numbered from 0 to 19. Each lists the words that appear together most frequently along with some summary statistics about each word (its count, probability of appearing, and number of documents in which it appears).
    Topic overview page, listing twenty topics, with metadata and top terms
    If you scroll through the list, the words within most topics should appear to share a theme. There’s a bit of randomization in this tool, so your topics and their numbering will vary from those in these examples, but the large number of iterations ensures that most of the time, the tool produces fairly similar topics from one run to the next. Here, my topic 12's words are river, route, road, pass, valley, small, bank, water, track, and camp. They appear to have a strong focus on geography and travel. Knowing in which documents a topic appears can be helpful for naming a topic, so for any topic, under Identified In, click on the number of documents.
    Under my topic 12, with words like river and route, 45 documents is highlighted
    You will then have a list of all documents in which this topic appears, which is helpful for determining the genres and content of this topic. As you might have suspected, my topic 12 has a number of documents related to both Stein's travels and those of others (e.g. Alexander the Great), along with texts like "Ancient Ways in Iran" and "Routes in Sinkiang" (i.e. modern Xinjiang Province in western China).
    Popup showing Documents associated with my Topic 12
    Click on the X to close this popup.

  10. Now that we have confirmed this topic's strong focus on geographic texts, let's name it Geography. Click on the topic's title and rename it "Geography".
    Clicking on the topic name
    Then, click Save.

    Renaming the topic Geography, with Save button highlighted

  11. Go through the list and give titles to each of the topics. You should be able to guess at the general theme of each topic, especially if you click on Identified In... X Documents and consider which documents appear in each topic. 
    Another example is my topic 9, which includes words like "Soviet," "bureau," and "intelligence." These suggest the British government's involvement in the Great Game, particularly intelligence-gathering, so I named this topic "Intelligence Gathering."
    Note: it is very likely that one or two of your topics resemble my topic 6, with non-English words like "gyappa" and "nangwa" included alongside the word Tibetan. If you click on "Identified in: 2 Documents," you will likely see two English-language textbooks for learning Tibetan. These topics thus include a number of Tibetan words, grammatical terms, and words like "tea" that occur frequently in the example sentences in these books. Feel free to name this topic Tibetan.
    If there are any topics which are completely unclear, feel free to leave them untitled.
    Note: clicking on the Download button here allows you to download the data for each of these topics. If you are in another view, such as the Topic Comparison or Topic Proportion views described below, you will have access to different download options.

  12. Now that we’ve seen roughly what words and themes our topics cover, let’s look at how common these themes are in our corpus. On the left sidebar, under Views, click on Topic Comparison.
    Topic Comparison radio button highlighted on left sidebar

  13. You now can access a number of data visualizations, each one describing the relationship between the documents, topics, and words in your corpus.
    Topic comparison, with a line graph displaying tokens in each topic
    The available topic measures are Tokens, Document Entropy, Average Word Length, Coherence, Uniform Distance, Corpus Distance, and Exclusivity. The tool defaults to Tokens.
    Tokens measures how often words from specific topics appear in the entire corpus.
    Document Entropy measures the probability that each topic appears in a randomly selected text.
    Average Word Length measures word length, with longer words suggesting more specific (and therefore meaningful) topics.
    Coherence measures the likelihood that words within a topic appear next to each other.
    Uniform Distance suggests which topics are the most specific.
    Corpus Distance measures the distance from words in a topic from the corpus as a whole, suggesting which topics are most distinct from the rest of the corpus.
    Exclusivity measures how often the top words in a topic co-occur with top words in other topics. 

  14. Switch to Exclusivity by clicking on the dropdown menu under Topic Comparison By, and select Exclusivity.
    Topic Comparison dropdown menu highlighted
    Dropdown menu with Exclusivity highlighted
    This is the resulting graph:
    line graph of Topic Comparison by Exclusivity. The result for Tibetan Grammar is highlighted.
    The highest point here, i.e. the topic that is the most exclusive, or has the least in common with other topics, is for Tibetan Grammar, at 0.831 (out of a possible total of 1.0). Since the other topics tend to be about geography, politics, and archaeology, it makes sense that a topic about Tibetan grammar is distinct.
    Note: if you click on the Download button when in the Topic Comparison view, you can either download the data about the measurement you are currently viewing or the graph of the results.

  15. If you find results like this useful, you should also know that there are some shortcuts: under Views, click on Topic overview. 
    Topic overview button highlighted
    Under each topic, it lists the various measurements described above. Each of these is a hyperlink, so if you click on Average Word Length, you will immediately be taken to the graph and data for Average Word Length.
    Topic view, with the Topic Measures highlighted

  16. In addition to these bird’s eye views of our entire corpus, we can also see the topic breakdown for each text. At the top of the screen, click on Results, and change Topics to Topic Proportion.
    Menu in Topic Comparison view, with the Results dropdown menu highlighted
    Dropdown menu with Topic Proportion selected

  17. Now you’ll see a colour-coded proportional bar graph for each text, showing what percent falls under each topic. One advantage of this viewer is that it displays, at a glance, whether, and how much, a given topic appears in select documents of interest. 
    Topic Proportions, with coloured horizontal bars for each texts representing the proportion of each topic in each text
    You can click on individual topics to show just them. Try hovering over the first section in the first text. (For my corpus, that is the purple "Stein (general)" topic section in the text called Recent Literature.)
    hovering over the first coloured segment in the first text: topic Stein (general) makes up 26% of the text Recent Literature
    A popup appears showing you the percentage of that text composed of that topic. Click on the section.
    Topic proportions with only the topic Stein (general) being shown for each text.
    The viewer will then only display that topic in all of the listed texts.

  18. Although this tool by default shows only the first fifteen documents in your corpus, you can also specify which documents you want to view. On the left sidebar, under Documents Displayed, click on Select Documents.
    Topic legend, with Select Documents highlighted
    A popup will appear listing the documents currently displayed. Click on the search bar.
    Select Documents to Display popup, with the document title search bar highlighted

    Let's look for texts related to Kashgar, a city in western China that Stein visited, which was also tied to local governance and travelers. Type Kashgar. You will see two texts appear. Check the boxes next to each text to include them, then click Done.
    search bar showing Kashgar, with two texts listed

    When you return to the Topic Proportion screen, both texts related to Kashgar are now at the bottom of the list. By hovering the mouse over their various sections, the tool will reveal which topics compose these texts. 
    Topic proportions for both texts on Kashgar. Hovering over Kashgar: Monthly Diaries (1912-1920) shows that it is 53% about the Great Game
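The percentages shown in the Topic Proportion view can be understood through a simplified sketch. It assumes that each word in a document has been assigned to exactly one topic and that a topic's proportion is simply its share of those assignments; the exact calculation used by MALLET and DSL may differ (for example, by applying smoothing). The function name topic_proportions and the sample assignments are hypothetical.

```python
from collections import Counter

def topic_proportions(token_topics):
    """Given the topic assigned to each token in one document,
    return each topic's share of the document as a percentage."""
    total = len(token_topics)
    if total == 0:
        return {}
    counts = Counter(token_topics)
    return {topic: 100 * count / total for topic, count in counts.items()}

# Hypothetical topic assignments for a ten-word document.
assignments = ["Geography"] * 5 + ["Great Game"] * 3 + ["Tibetan"] * 2
proportions = topic_proportions(assignments)
print(proportions)
```

Under these assumptions, a document in which half of the words belong to the Geography topic would show a Geography segment covering 50% of its bar.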

Export

(back to Table of Contents)

  1. Let's conclude by exporting texts from Gale Digital Scholar Lab. You can use these texts for your own research. If you have downloaded any of the tools used above (e.g. MALLET for topic modeling), or are trying another library such as spaCy in Python, you can load these texts into those tools. Begin by clicking on My Content Sets on the top toolbar.
    topic proportion page with My Content Sets highlighted

  2. Then, from the list of Content Sets, click on Stein.
    Content Set list with Chinese art and archaeology and Stein

  3. On the Stein Content Set overview, click on Download Content Set.
    Stein content set, with Download Content Set highlighted 

  4. In the popup that appears, leave Cleaning Configuration at its default, None, and click Generate Download.
    Download Content Set popup
    Note: you can download a maximum of 5000 documents at one time. The maximum size of a content set is 10 000 documents.

  5. Back on the Stein Content Set overview, the Download Content Set button has changed to Generating Download. This is because Gale's servers are preparing the texts for you. Wait a minute or two, then refresh the page.
    Stein content set, with the button below now reading Generating Download

  6. Once the Generating Download button has changed to Content set download ready, click it.
    Stein content set with button below now reading Content set download ready
    Then, on the Download Content Set popup, click Download.
    Download Content Set popup with Download button highlighted

  7. Your texts will be bundled together in a file called download.zip.
    If you are using a Windows computer, use an unzipping program like 7-zip (download here) to access the files. Right click on download.zip and select "Extract here".
    If you are using a Mac, double click on download.zip.
    When the folder has been unzipped, open it.

  8. You will now have a folder with a README.txt file and a folder called "original." Open the "original" folder.
     Window (in Mac) with original folder highlighted

    Inside there is one file per text in your collection.
    List of text files with _Sikandar__the_Great highlighted
    Open the text file titled _Sikandar__the_Great_FP1800972885.txt (with the underscore at the beginning; it should be the third text, alphabetically).
    Text file of  "Sikandar" the Great
    This text has some OCR errors throughout but is mostly legible.
    You now have access to OCRed copies of all of the texts in your corpus.

  9. Now that you have access to the original texts, let's also download texts that have been cleaned using our cleaning configuration. Click on Content set download ready again, but this time, click on the dropdown menu under Cleaning Configuration.
    Download Content Set window with the Cleaning Configuration dropdown highlighted
    From the list, choose Stein lower case.
    Download Content Set window with the dropdown menu under Cleaning Configuration open
    Then, at the bottom of the window, click on Regenerate Download.

    Download Content Set window with Regenerate Download highlighted
    Wait a minute or two again, refresh, and download the new set. 
    Unzip the folder, open the folder called Clean, and then open the "Sikandar" the Great text file again.
    The text of "SIKANDAR" the Great with stop words removed
    This is the same text, after being cleaned through the cleaning configuration. All stop words (e.g. "the") have been removed, as have punctuation, numbers, and special characters, and all words are in lower case. It is no longer a readable text for humans but it helps immensely when running tools like Ngrams or Topic Modeling (via MALLET) on your own computer.

  10. You can also download the metadata for your records. Click the Download Metadata button. 
    Stein collection overview, with Download Metadata button highlighted
    Then, click on the Download button.
    Download Content Set Metadata window with Download highlighted

    You can download the metadata for up to 10 000 documents in your collection. You will receive the data as a .CSV file (which can be opened as a spreadsheet, e.g. in Excel).
    Metadata from the Stein collection, opened as a spreadsheet in Excel
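
If you prefer to inspect the metadata file from step 10 in Python rather than Excel, the sketch below shows one way to do it with the standard library. The filename `metadata.csv` is a placeholder for wherever you saved the file, and because I am not assuming any particular column headers, the script prints the headers it finds and lets you pick which column to count.

```python
import csv
from collections import Counter
from pathlib import Path


def load_metadata(csv_path):
    """Read a metadata CSV and return (column_names, rows as dictionaries)."""
    # utf-8-sig also copes with the byte-order mark some spreadsheet programs add.
    with open(csv_path, newline="", encoding="utf-8-sig") as handle:
        reader = csv.DictReader(handle)
        rows = list(reader)
        return list(reader.fieldnames or []), rows


def count_by_column(rows, column):
    """Count how often each value appears in one column (blank if missing)."""
    return Counter((row.get(column) or "").strip() for row in rows)


if __name__ == "__main__":
    path = Path("metadata.csv")  # placeholder: point this at your own download
    if path.exists():
        columns, rows = load_metadata(path)
        print(f"{len(rows)} records with columns: {columns}")
        if columns:
            # Count values in the first column; swap in any header printed above.
            for value, count in count_by_column(rows, columns[0]).most_common(5):
                print(count, value)
    else:
        print(f"{path} not found; download the metadata first.")
```

Counting the values in an author or date column is a quick way to spot the spelling variations and gaps that can affect DSL's tools.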

That's it! You have now used Gale DSL to assemble a corpus of texts, create a unique cleaning configuration, use four of its tools, and export the resulting data, visualizations, and texts. If you have any questions about text mining or the DSL specifically, please feel free to contact Digital Scholarship Services.
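
If you want to keep working with the exported texts on your own computer, here is a minimal Python sketch. It reads every .txt file in an unzipped export folder, applies a rough imitation of the cleaning described in step 9 (lower case, no punctuation or numbers, stop words removed), and counts the most common word pairs. The folder path and the short stop word list are assumptions for illustration; DSL's own cleaning configuration and stop word list are not reproduced here.

```python
import re
from collections import Counter
from pathlib import Path

# A tiny illustrative stop word list, far shorter than a real one.
STOP_WORDS = {"the", "a", "an", "and", "of", "to", "in", "is", "it", "that"}


def read_texts(folder):
    """Return {filename: text} for every .txt file in an unzipped export folder."""
    return {
        path.name: path.read_text(encoding="utf-8", errors="replace")
        for path in sorted(Path(folder).glob("*.txt"))
    }


def rough_clean(text, stop_words=STOP_WORDS):
    """Lowercase the text, keep runs of letters only, and drop stop words."""
    # [^\W\d_]+ matches letters (including accented ones) but not digits,
    # underscores, or punctuation, so "Stein's" becomes "stein" and "s".
    words = re.findall(r"[^\W\d_]+", text.lower())
    return [word for word in words if word not in stop_words]


def top_ngrams(tokens, n=2, k=5):
    """Return the k most common n-word sequences as (phrase, count) pairs."""
    grams = zip(*(tokens[i:] for i in range(n)))
    return Counter(" ".join(gram) for gram in grams).most_common(k)


if __name__ == "__main__":
    folder = Path("download/original")  # assumed location of the unzipped texts
    if folder.is_dir():
        for name, text in read_texts(folder).items():
            print(name, top_ngrams(rough_clean(text)))
    else:
        print(f"{folder} not found; unzip download.zip first.")
```

If you downloaded the set with a cleaning configuration applied, you can point `read_texts` at the Clean folder instead and skip `rough_clean`.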

Additional Training

(back to Table of Contents)

You have now completed all steps that this tutorial covers, but here are some ways to get additional training:

See Which Texts are Available

(back to Table of Contents)

To see which collections of primary sources are available for your research, follow these steps:

  1. Return to the DSL main page by clicking on the Digital Scholar Lab logo.
    Logo highlighted on main page
  2. Scroll down until you see the "Learn More About the Lab" menu.
    menu under learn more about the lab
  3. Under "What Texts are Available?", click View the Archives.
    Learn more about the lab with View the Archives highlighted
  4. You will be taken to a screen with a complete listing of all collections licensed to the University of Toronto. You can then filter specific items or search within specific collections.
    Available texts page listing a number of collections

Learning Center

(back to Table of Contents)

To access the Learning Center, which includes additional documentation and videos, follow these steps:

  1. Return to the DSL main page by clicking on the Digital Scholar Lab logo.
  2. Scroll down until you see the "Learn More About the Lab" menu.
  3. Under "Dive Deeper Into How the Lab Works", click on Visit the Learning Center.
    Learn About the Lab menu with Visit the Learning Center highlighted
  4. You will be taken to a new page with a sidebar menu. In addition to being able to read documentation and watch videos on any step of using the Digital Scholar Lab, you will also be able to access the Frequently Asked Questions (FAQ), Glossary, User Guidelines, and the Privacy Policy.
    Learning center with sidebar menu

Sample Projects

(back to Table of Contents)

Gale's staff, which include a number of professors engaged in digital humanities research projects, created three sample projects for you to use. They provide pre-built collections and pre-run tools centred on specific themes. They also provide extensive documentation on how they constructed their projects, how they fine-tuned their cleaning configurations, and how to interpret their results, which can in turn help you develop your own projects.

  1. Return to the main page of the DSL by clicking on the Digital Scholar Lab logo.
  2. Scroll down until you see the "Learn More About the Lab" menu.
  3. Under "Try Out Sample Projects", click View Sample Projects.
    Learn About the Lab menu with View Sample Projects highlighted
  4. There are currently three projects. (Scroll down if you can't see all of them.) Click on the title of one that interests you.
    Sample Project page with Project 1 highlighted
  5. Once you are on the page for a specific project, you have three options:
    1. Click on "Get a copy" to copy the whole project (collections, cleaning configurations, and pre-run tools) to your account.
      Project 1 with Get a Copy highlighted
    2. Scroll down to see an overview of how Gale's staff created this project, step by step.
      Project 1 Synopsis and Core Research Question
    3. Click on the "Thinking Critically Supplement PDF" to read the report providing greater detail on how the project was created, how to take it further, and what obstacles the staff ran into.
      Project 1 with Thinking Critically Supplement PDF highlighted

Uploading Texts (Feature in Beta)

(back to Table of Contents)

  1. Gale is currently testing a feature whereby you can upload texts of your own and analyse them in the DSL. Although this tool is still under construction (in Beta), it is too useful to pass up. Let's try it! Begin by clicking on Build in the top toolbar.
    DSL main page with Build highlighted
  2. On the right side, there is the Upload box. There are two ways to upload texts: by inputting text directly into DSL or by uploading multiple files simultaneously. We'll try both, but we'll begin with the simpler method. At the top of the Upload box, click on Text Entry.
    Build page, with Text Entry in the Upload Box highlighted
  3. The Text Entry page has a number of fields. Only the Title and Text fields are necessary, but the other fields are useful, both because they help keep your texts organised and because some of the metadata, such as year and date, are necessary for certain tools. Begin by pasting the following text into the Title and Text fields:
    Title:
    Preface
    Text: 
    In the introductory remarks prefixed to this Memoir I have endeavoured to indicate briefly the objects and methods which guided me in the surveys of my three Central-Asian journeys and in the preparation of the maps which contain their final cartographical record.
    It only remains for me to acknowledge with gratitude my manifold obligations for the effective help which alone rendered possible the topographical tasks bound up with my explorations.
    That I was able to plan and carry out those tasks was due to the fact that the Survey of India, accustomed ever since its inception to serve the interests of Help of^nney of geographical research, not only within the vast area forming its own sphere of activity but also beyond the borders of India, supported from the start my aims with the means best suited for them. In Chapter 1, dealing with the history of our surveys, I have had occasion fully to note the services rendered by the experienced Indians whom the various Surveyor Generals deputed with me, and the extent of the help which I received by the provision of instruments, equipment arul funds to meet the cost of their employment. To the Survey of India was due also the compilation and publication of the results brought back by our joint efforts from each successive journey.
    The topographical results thus secured have not only helped me to make my journeys directly profitable for geographical study, they have also greatly facilitated my archaeological explorations in regions which, though largely desolate today in their physical aspects, have yet played a very important part in the history of Asia and its ancient civilizations. But apart from the gratitude I owe for this furtherance of my researches, the fact of my having been able to work in direct contact with the oldest of the scientific departments of India will always be remembered by me with deep satisfaction.
    Ever since in 1899 the proposals for my first Central-Asian journey had received the Government of India's sanction, successive Surveyor Generals did .Surveyor^euorals tlieir best to facilitate the survey tasks of my expeditions. 1 still
    think back gratefully to the very helpful advice and instruction by which the late Colonel St. Georg Gore, R.E., while at Calcutta during the cold weather of 1899-1900, showed his personal interest in the enterprise. His successor as Surveyor General, Colonel F. B. Long, R.E., was equally ready to meet my requests concerning the plans !. had formed for my second and much more extensive expedition of 1906-08.
    But my heaviest debt of gratitude is due to Colonel Sir Sidney Burrard, RE.,
    K.C.S.I., F.R.S., who as Superintendent of the Trigonometrical Survey SidneyCBnrrnrd since 1899 had direct charge of all arrangements for the survey work of my first and second expeditions, and who during his succeeding long term of office as Surveyor General was equally ready to extend to me unfailing support and guidance with regard to the third. Moreover quite as great a stimulus was the thought of his own lifelong devotion to the study of the geographical problems connected with innermost Asia and the great mountain systems which enclose it. I feel hence very grateful for being allowed to dedicate this record of our labours to Sir Sidney Burrard not merely as a most helpful friend and guide but also as a living embodiment of that spirit of scientific research which has never ceased to pervade the Survey of India since the days of Rennell, Lambton, and Everest.


    Note: you have probably noticed that there are some misspellings and other errors in the text above. This is because this text was made by using an OCR algorithm on scanned images of pages. These OCR errors are, unfortunately, a common byproduct of this process, requiring considerable time to manually check for even a single text, let alone the hundreds or thousands of texts in a collection. (I removed a small number of illegal characters, including variations on the apostrophe, to make this text compatible with DSL's requirements.) This is also a fairly error-free example. I took this text from Stein's 1923 Memoir on Maps of Chinese Turkistan and Kansu, specifically this copy of Stein's text from the Internet Archive.
  4. Now let's begin adding metadata.
    Text Entry page with Title and Text filled in, and blank metadata fields highlighted
    For each of the following fields, add the following metadata:
    Author: Aurel Stein K. C. I. E.
    Publication Title: Memoir on Maps of Chinese Turkistan and Kansu
    Publisher: Trigonometrical Survey Office
    Document Type: Book Chapter
    Language: English
    Subject: Geography
  5. Click on the Publication Date field.
    Publication Date field
    When the calendar window pops up, click on the year. Type 1923 and select that year.

    Publication Date popup with Year selected
    Next, from the month dropdown menu, select January. In the month view, click on the 1st. 
    Publication Date popup with Month and Day selected

    Note: we only have the year of publication for this text, but DSL requires a full date, with day and month, for its data. For this reason, we added January 1st, even though in practice you can ignore the month and day.
  6. This text should now have all of its metadata fields complete.
    Complete text entry with Create Document highlighted
    Click on the Create Document button at the bottom to upload the text.
    A banner stating "Document Saved" should briefly appear.
    Note: metadata is "data about data," or more specifically, it is information about a dataset (whether that dataset is a book, article, film, image, etc.) such as the author and publication year. It is used by several of DSL's tools (e.g. the Parts of Speech tagger compares the styles of different authors, so it needs texts to have their Author fields filled out). We used "Aurel Stein K. C. I. E." because spelling his name differently, even if it would be recognizable as the same person to a human, will be technically a different author for any computational tools. This particular formulation is one of the existing variations in DSL and is how Stein wrote his name on the title page, so it works best.
  7. Now that this text has been uploaded, let's add it to our collection for further analysis. Click on Add to Content Set.
    Add to Content Set button highlighted
    From the menu, select Stein.
    Content Set listing with Stein highlighted
    Once DSL confirms that this text has been added to the Stein collection, click Close.
    Window with Updated Content Set, confirming text is added to Stein
  8. Let's see where we can review these uploaded texts. Click on Manage Uploads.
    Text Entry page, Manage Uploads button highlighted
    You will now be presented with a list of all texts you have uploaded. So far, there is only one, but we will add more soon. You can also use this page to add texts to Content Sets, to Apply Metadata after uploading them, or to Delete them. Click on Build to explore another method of uploading texts.
    Manage Uploads page with Build highlighted
  9. First, download these two .txt files by right clicking on the links below and saving them each to your computer (selecting Save Target As or Save Link As):
    download Introductory material
    download Chapter 1, Section 1
    In the Upload box, under Browse for Files to Upload, click Choose Files.

    Navigate to where you downloaded these files, select both of them (hold down CTRL (Windows) or CMD (Mac) to select more than one item at a time), and click Open.
    Window with two text files selected, Introductory and Section1
    Note: the example above comes from a Mac computer. Your interface may look different.
  10. DSL will inform you that two texts have been successfully uploaded. Check the box next to Successfully Uploaded (2). Then click on the Add Metadata button.
    Upload box with 2 texts successfully uploaded
  11. In the Apply Metadata window, leave it set to the default radio button, Quick.
    Apply Metadata window with Quick mode selected
    Then, add the metadata as before: 
    Author: Aurel Stein K. C. I. E.
    Publication Date: 01/01/1923
    Publication Title: Memoir on Maps of Chinese Turkistan and Kansu
    Publisher: Trigonometrical Survey Office
    Document Type: Book Chapter
    Language: English
    Subject: Geography
    Apply Metadata window with metadata fields highlighted
    Finally, click Apply Metadata.
  12. The metadata have now been applied, but the new texts have not yet been added to a collection. Check the box next to Successfully Uploaded (2), then click on Add to Content Set.
    Upload box with 2 texts uploaded and Add to Content Set highlighted

    As with step 7 (above), select Stein.
    Add to Content Set popup with Stein selected
    Once both texts have been added, click Close.

    Updated Content Set popup with notification that 2 documents were added to Stein
  13. One final method to apply metadata is to use a spreadsheet. This method is best if you are uploading multiple texts with different metadata. Begin by saving these texts to your computer:
    download khotan.txt
    download panoramas_introductory_note.txt
    Then, in the Upload box, under Browse for Files to Upload, click Choose Files.
    Upload box with Choose Files highlighted
    Select both these files and click Open.

    Window with two text files selected, khotan and panoramas_introductory_note
  14. As before, click on Successfully Uploaded (2) and then on Add Metadata.
    Upload box with both texts uploaded
  15. In the Apply Metadata popup, under "Applying metadata to 2 file(s)," click on the Bulk radio button.
    Apply Metadata popup with Bulk highlighted
  16. Under step 1, click on Download Form.
    Apply Metadata popup, Bulk mode
  17. Depending on your internet browser settings, you might automatically download a file called metadata-template.csv. If you are prompted to download it, choose where you would like to download it and then click save.
    Saving the metadata-template.csv file
  18. Open up the file in a program like Excel, Numbers, or OpenOffice Calc.
    metadata-template opened in Excel
  19. By using a spreadsheet, you can upload metadata for multiple texts quickly, and unlike the Quick mode, you can give different metadata for each text. This method is especially useful if you already have bibliographic information for many documents from another source, such as an export from bibliographic software like Zotero, Mendeley, or RefWorks.
    Input the following data into the spreadsheet.
    for khotan.txt
    Title: Sand Buried Ruins of Khotan - Chapter IV
    Author: Aurel Stein K. C. I. E.
    Publication Date: 01/01/1904
    Publication Title: Sand Buried Ruins of Khotan
    Publisher: Hurst and Blackett
    Document Type: Book Chapter
    Language: English
    Subject: Archaeology

    for panoramas_introductory_note.txt
    Title: Mountain Panoramas from the Pamirs and Kwen Lun - Introductory Note
    Author: Aurel Stein K. C. I. E.
    Publication Date: 01/01/1908
    Publication Title: Mountain Panoramas from the Pamirs and Kwen Lun
    Publisher: Royal Geographical Society
    Document Type: Book Chapter
    Language: English
    Subject: Geography
    Excel with metadata-template.csv open, and all metadata entered
    Important notes:
    Do not change any information in the Document ID column, or DSL will be unable to process the data. The Document ID number is different every time you upload a file, even if you upload the same file more than once.

    The Publication Date column only works if the date is in the format DD/MM/YYYY, like 01/01/1904 for the first of January, 1904. As above, you will have to add days and months for the upload to work, even if the original publication only had a year of publication.
    The texts might be in a different order for you than I have listed them. Be sure to match the data to the right row or you'll end up with texts that have the wrong title and year.

  20. Save the spreadsheet. Be sure to save it in .CSV format (which is the format it comes in). Some programs like Excel will ask you to save in another format, but you must save it as a .CSV for DSL to recognise it.
  21. In DSL, in the Apply Metadata window, in step 2, click on Browse.
    Apply Metadata window with Choose File highlighted
  22. Navigate to where you have saved metadata-template.csv, select it, and click Open.
    Window displaying metadata-template.csv in my download folder
  23. Now that step 2 has changed from "No File Chosen" to "metadata-template.csv", click on Apply Metadata.
    Apply Metadata window with file loaded and Apply Metadata highlighted
  24. You will then return to the Build page. Click on Manage Uploads.
    Upload box with Manage Uploads highlighted
    The two newly added texts, with their metadata, will now be there. Check the boxes next to "Sand Buried Ruins of Khotan..." and "Mountain Panoramas..." and click on Add to Content Set.
    Manage uploads page with two texts checked
    Select Stein. Close the popup that appears.
     You're done! You've now uploaded several texts and added metadata in a variety of ways.
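
If you are preparing metadata for many texts, it can help to check it in Python before pasting it into the spreadsheet. The sketch below formats dates as DD/MM/YYYY (defaulting to the 1st of January when only the year is known), checks that each date is valid, and lists the distinct author spellings so that you can catch accidental variations. The dictionary keys (`file`, `author`, `date`) are my own labels and not the column headers of metadata-template.csv.

```python
import re
from datetime import datetime


def dsl_date(year, month=1, day=1):
    """Format a date as DD/MM/YYYY, using 1 January when only the year is known."""
    parsed = datetime(year, month, day)  # raises ValueError for impossible dates
    return f"{parsed.day:02d}/{parsed.month:02d}/{parsed.year:04d}"


def is_dsl_date(value):
    """Check that a string is a real calendar date written as DD/MM/YYYY."""
    if not re.fullmatch(r"\d{2}/\d{2}/\d{4}", value):
        return False
    try:
        datetime.strptime(value, "%d/%m/%Y")
    except ValueError:
        return False
    return True


def author_variants(records):
    """Return the distinct author spellings used across all records."""
    return sorted({record["author"].strip() for record in records})


def date_problems(records):
    """Return the file names whose date is not a valid DD/MM/YYYY string."""
    return [record["file"] for record in records if not is_dsl_date(record["date"])]


# The two texts from step 19; the keys are illustrative labels only.
records = [
    {"file": "khotan.txt", "author": "Aurel Stein K. C. I. E.",
     "date": dsl_date(1904)},
    {"file": "panoramas_introductory_note.txt", "author": "Aurel Stein K. C. I. E.",
     "date": dsl_date(1908)},
]

print("Author spellings:", author_variants(records))
print("Files with bad dates:", date_problems(records))
```

More than one spelling for the same person in the first list is a sign that the tools will treat that person as several different authors.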

Contact

(back to Table of Contents)

Remember that if you would like help or want to take any of the DSL's tools further in your own analysis, you can always contact Digital Scholarship Services.