This tutorial introduces Gale's Digital Scholar Lab (DSL), a digital humanities tool. You will learn how to:
- Build a collection of texts
- Clean texts
- Run analytical tools on texts and visualize the results
- Download the data, graphs, and other visualizations produced through this tool
- Download the scanned texts in your collection, so that you can use them in other programs
Note: Gale consistently updates the Digital Scholar Lab, so some features of this tutorial might not always match the latest interface. This tutorial was last updated in August 2020.
Table of Contents
Overview
What is Gale Digital Scholar Lab?
The Digital Scholar Lab (DSL) is an online tool for analyzing texts, visualizing the results, and exporting data, graphs, and texts from the platform. It runs in your Internet browser and does not need any additional software.
The DSL has six analysis tools: (1) Document Clustering, (2) Named Entity Recognition, (3) Ngram, (4) Parts of Speech Tagger, (5) Sentiment Analysis, and (6) Topic Modeling. The DSL makes it easier to learn and understand how these tools work by providing user-friendly graphical user interfaces, documentation, and demonstration videos. External links to the open source code for each tool are also made available should you wish to run the tool on your own computer and use its more advanced features.
What collections does it have?
When you use the DSL through your University of Toronto connection, you can use any of the Gale primary source collections that the University has licensed, including hundreds of thousands of documents in multiple languages with broad historical and geographical coverage. (Once you are logged in, see these instructions to view all accessible collections.) Extensive coverage, however, should not be confused with universal coverage; many perspectives are not represented in these text collections. For example, most of the colonial-era documents included in these collections were produced and collected by colonizing people, organizations, or institutions, rather than by colonized peoples. It is up to you as a critical scholar to decide on which questions can and cannot be answered by these collections.
Digitization
The texts available in the DSL have gone through several steps: (1) various institutions, such as libraries and archives, collected the texts; (2) Gale scanned the texts; and (3) the scans—which are essentially photographs of texts—were converted into readable, searchable text through a process called Optical Character Recognition (OCR).
OCR uses image-recognition algorithms to identify characters and create a text file based on the image. OCR is powerful, but it is also prone to errors such as misidentifying characters (e.g. reading a zero as the letter 'O') or adding or removing spaces. There are additional challenges for scanning older English texts, such as those that use the long 's' ('ſ'), which resembles a lowercase 'f'. We will cover this more below in the section on Cleaning, but for now it is sufficient to know that this process can often leave errors in the text files produced through OCR.
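If you are curious what this step looks like in code, here is a minimal OCR sketch using the open-source Tesseract engine via the pytesseract package. This is purely illustrative: Gale's own OCR pipeline is different, and the file name below is a hypothetical example.

```python
# Minimal OCR sketch with the open-source Tesseract engine (illustrative only;
# Gale's production OCR pipeline is different).
from PIL import Image      # pip install pillow
import pytesseract         # pip install pytesseract (requires the Tesseract binary)

# "page_scan.png" is a hypothetical scanned page image.
image = Image.open("page_scan.png")
text = pytesseract.image_to_string(image)

# The output is plain text and may contain the kinds of errors described above,
# e.g. a zero read as the letter 'O' or a long 's' read as an 'f'.
print(text)
```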
Let's get started!
Access
-
To access the DSL through our U of T institutional connection, go to https://uoft.me/gale.
(You can also access the DSL through the library catalog.) -
The website will first prompt you for your University of Toronto login.
Enter your UTORid.
Depending on your settings and location, you may also have to log in with your University of Toronto Library barcode (located on your T-Card) and PIN, as you would to use other library services (e.g. renewing your loans). -
You will arrive at the DSL homepage, but you will need to log in before you can do anything. Currently, this service can only authenticate Google accounts. The DSL does not access the contents of your Google account; it uses the login solely to track session information. So, if you have any reservations about using your personal Gmail account, feel free to make a new account just for the DSL. Gale’s privacy policy states:
When logging in to the Gale Digital Scholar Lab App using either Google or Microsoft, the App accesses the user’s Google Drive or Microsoft OneDrive through an anonymous access token that is created when users first log in. This anonymous token is generated in order to connect users to the content and analysis they create in the Digital Scholar Lab. The App does not collect, read, access, or store any of the data from a user’s Google Drive or Microsoft OneDrive account(s), nor does it access any open documents. In addition, the App does not access or share personal information as part of this process.
You can also email privacy@cengage.com for more details.
If this is acceptable to you, click on Log In / Create Account,
choose Google, and then log in with a Google account.
-
When you return to the Digital Scholar Lab homepage, you should notice your name in the top-right corner, under Signed in. The central bar has changed from a login prompt to a search bar. You're in!
Collection
The DSL has access to Gale's extensive digital collections of newspapers, books, and other archival material. In this tutorial, we will focus on texts by and about an early twentieth-century scholar named Sir Aurel Stein, who travelled throughout South and Central Asia and China as an agent of the British Empire. There are many materials related to him - works by him, newspaper articles about his expeditions, etc. - available in the DSL, mostly ranging from the 1920s to the 1940s, from a variety of newspapers (The London Times, Illustrated London News, The Daily Mail, The New York Times) and other digitised holdings from libraries and archives (the British Library, the Smithsonian, the American Antiquarian Society, the National Library of China).
-
Now that you’re logged in, you will need to create your corpus (the collection of texts on which you will run statistics and create visualizations). Although you can use the toolbar in the centre of the homepage to get started with a basic search, let’s use the advanced search for a little more precision. Click on Advanced search under the search toolbar.
-
You will arrive at the Advanced Search page.
It is structured like a library catalogue search page. You can use Boolean operators like AND, OR, and NOT to create complex search queries. Gale also automatically suggests metadata (e.g. authorship, publication date, language) from their collection as you type. -
Note: this is the first screen where you can view DSL's learning videos. Feel free to watch these as you complete these steps to get an introduction to the features available on that page.
-
Let’s search for Aurel Stein as an Author. Begin by clicking on the dropdown menu on the right of the first field.
-
Change it from Keyword to Author.
-
Then, click in the search bar and type Aurel Stein.
-
Once you start typing his name, if you have “Author” selected as the search field, you will see at least four different variations of his name appear.
Note: DSL automatically suggests metadata like author names from its collections, but only if the correct field (e.g. author, keyword) is selected first. Because different databases have input Stein's name differently, there are four variations of his name (and thus four different "authors" when searching). Although this is not ideal, it is very common when working with data aggregated from many places. Let’s say that we want everything written by Stein, regardless of how his name appears in the various databases.
-
Select the first option, “Aurel Stein K. C. I. E.”.
-
Because there are four variations of his name and we want texts that are attributed to any of them, we will use an OR operator. Click on the left dropdown menu and select OR.
Then, repeat the above steps, first changing Keyword to Author, then typing Aurel Stein, and selecting the next variation on his name. -
Since there are four variations of his name, we need a fourth search row. Click on Add a Row at the bottom to add it.
-
Once you are done, there should be four lines joined with OR. Click Search.
-
You now see the search results page.
-
For each result, you will see its title, its author, its OCR Confidence Percentage (see the note below), and a preview of the text.
-
Additionally, there is some metadata to the right, including the year of publication and the archive, source, and type of the text.
Note: OCR refers to Optical Character Recognition. It is a process whereby a program attempts to produce machine-readable text from a scanned document. The OCR Confidence percentage is an overall score that represents Gale’s confidence in the OCR quality of a specific text.
One factor in their confidence level is the specific OCR algorithm used, since newer OCR algorithms typically perform more accurately than older ones. According to their documentation, over the nearly 20 years Gale has been collecting and scanning documents they have used a variety of OCR algorithms. Gale DSL currently uses Adobe Acrobat with ABBYY5 to OCR scanned documents.
The OCR Confidence percentage additionally relies on other factors, such as the condition of the original document, the quality of the scan, what kind of text is featured in the document, and whether or not there are images in a document.
A caution: the confidence level is useful but not perfect, as some documents can have a lower confidence than they deserve (if they feature lots of images), and some documents with high confidence can still include OCR mistakes. In other words, the confidence percentage is not a replacement for human eyes. Some of the oldest scans won’t have a confidence percentage at all, since they predate that system.
-
Let’s use all of these texts. Click Select All.
-
Then click Add to Content Set on the top right corner.
-
Select New Content Set.
-
Name it Stein and click Create. Close the notification that follows.
-
Now, let's add some additional texts to this corpus. On the main toolbar, click on Build.
-
Then, click on Advanced Search.
-
The first time you used Advanced Search, you found texts by Aurel Stein. Now, let's look for texts about him, by searching for his name in the Keyword. Just like with steps 3-7 above, type Aurel Stein into the search bar with Keyword selected this time, select one of the variations that occurs, and repeat with the next line.
-
Like before, add a fourth line to accommodate all four variations of his name.
-
Be sure to use OR to separate all rows.
-
Finally, click Search.
-
Now there are many more texts. The left sidebar menu offers ways of filtering this dataset.
-
Just to keep our English-focused analysis consistent, scroll down, and under Publication Languages, click on English.
-
Let's add the remaining documents to our collection. Scroll to the top of the page, click Select All.
-
Then, click Select All (295) Results.
Note: Gale constantly adds new items to its collections, so your number of results may differ from those shown here. -
Add this to the content set you just created.
-
Note that the DSL automatically avoids adding duplicates, notifying you that 282 (of 295) documents were added. Again, your number may be different. Click Close.
-
To view more information on a specific item, click it, and you will be taken to the Doc Explorer view. Click the first item, "Sir Aurel Stein and Central Asia."
-
You’ll see the document in its original context on the left, with the text highlighted and the keywords selected. You’ll also get the complete text file on the right.
-
By scrolling down, you can see the scanned text on the left, with its text highlighted in light blue, and the specific keywords ("Aurel Stein") highlighted in green.
-
Above, you have options to cite, download, email, and print the text.
If you click on “Learn how this text was created,” you’ll receive basic information on Gale’s OCR process, and a link to further documentation. -
Let’s take a look at our collection. Click on Manage, located at the top right of the screen. You’ll get an overview of where the texts in this corpus come from.
-
To return to the list of texts, click on the Documents button.
By clicking on Documents, you can view and manage the collection, removing texts if you wish. Having a collection means that you don’t need to rerun searches every time you log into the DSL; you can edit and refine your collections and rerun past searches.
You have now run two advanced searches, built a corpus, and learned how to examine individual documents as well as the collection overview. Now let's move on to cleaning, where you prepare your texts for analysis.
Note: as of June 2020, DSL has a beta feature allowing you to upload your own texts. While not necessary to complete this tutorial, you might wish to learn how to do this for your own research; see the Uploading Texts section at the end of this tutorial.
Cleaning
-
Click on the Clean tab in the toolbar.
-
This is the Cleaning Configuration page, specifically the default configuration.
-
Cleaning configurations produce higher quality analysis and visualizations by removing errors and extraneous data. The default cleaning configuration is a good start for most projects, but based on previous testing, it leaves in a lot of junk data with our current collection. For better results, let's make our own cleaning configuration. Under Clean, click on "+ New Configuration."
-
Name it Stein and click Submit.
-
To make our corpus a little more useful, under Cleaning Configuration, check “Remove all extended ASCII characters”, “Remove all number characters”, “Remove all special characters”, and “Remove all punctuation”. Leave the other settings at their defaults.
Note: Extended ASCII characters are characters used in languages other than English (e.g. accented letters like é), as well as for some typesetting and mathematical uses. Since we’re exclusively working with modern English sources in this collection, extended ASCII characters will only appear as errors of the OCR process, and therefore excluding them will give us more meaningful results. Similarly, by excluding punctuation, numbers, and special characters, we will prevent the DSL from treating individual numbers or punctuation marks as words. If we leave punctuation in, the Ngram tool reveals that the most popular "word" is the period/full stop, which is not very useful:
Don't be like this; be sure to exclude punctuation, numbers, and special characters! -
Also, when you create a new configuration, you need to set a list of stop words. Stop words are common words, like “a” and “you,” that we filter out before running analyses on our corpus. If we don’t exclude them, then it turns out that the most common word in almost every English corpus is “the.” Under Stop Words, click Choose a Starter List.
-
Select English, then click Select starter lists.
Note: you can select multiple languages, if you are working with a corpus that includes texts in multiple languages. We'll stick with just English for this collection. -
Your Stop Words list is now populated with the most common English words.
-
You can add words to your stop word list by typing them in. You must separate each word with paragraph returns (i.e. by hitting the Enter key). Add the following words to the stop word list by copying and pasting them into the list above the first word in the English stop word list ("a"):
b
c
d
e
f
g
h
j
k
l
m
n
o
p
q
r
s
t
u
v
w
x
y
z
th
pp
Sec
ii
iii
iv
vi
vii
viii
ix
xi
xii
xiii
xiv
xv
xvi
xvii
xviii
xix
xx
xxi
xxii
xxiii
xxiv
xxv
xxvi
xxvii
xxviii
xxix
xxx
No.
Sec.
Be sure to avoid erasing the English stop word list.
Note: by adding these "words" to the list, which are a mix of abbreviations (like "Sec." for "section"), Roman numerals, and individual letters, we do two things: (1) we remove commonly used words that are related to the structure rather than the contents of the text, like "Sec." or the Roman numerals; and (2) we account for some OCR errors. Since the OCR process sometimes misreads a word, inserting a space where there should be none (e.g. misreading "apple" as "a pple"), single or paired letters appear as common "words" in some text collections. This isn't true for every collection, but it is true for this particular collection.
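If you ever want to reproduce this kind of cleaning outside the DSL, the core ideas fit in a few lines of Python. The sketch below illustrates the same steps (strip non-ASCII characters, numbers, punctuation, and special characters, then filter out stop words); the tiny stop word list is a placeholder, not Gale's starter list.

```python
import re

# Placeholder stop word list: a few common English words plus some of the
# single letters added above to absorb OCR artefacts. The DSL's starter list
# is far longer.
stop_words = {"a", "an", "and", "the", "of", "to", "was", "s", "th"}

def clean(text):
    text = text.encode("ascii", errors="ignore").decode()  # drop extended/non-ASCII characters
    text = re.sub(r"[^A-Za-z\s]", " ", text)                # drop numbers, punctuation, special characters
    tokens = text.lower().split()
    return [token for token in tokens if token not in stop_words]

print(clean("Sir Aurel Stein's 4th expedition (1930–31) was cancelled."))
# ['sir', 'aurel', 'stein', 'expedition', 'cancelled']
```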
-
Cleaning configurations can also replace words. One use is to treat variations of a key word or phrase as a single word. To prevent the various tools from treating "Stein," "Aurel Stein," "Sir Aurel Stein," etc. as different words, scroll down to the Replacements section.
-
In the first row, under "Replace this...", type Sir Aurel Stein. On the same row, under "With this...", type Aurel Stein.
-
On the next row, replace Aurel with Aurel Stein. Repeat this process on the next row, replacing Sir Stein with Aurel Stein.
Now all of these variations on his name will show up as the same version, "Aurel Stein," in our future analysis and visualizations. -
Finally, click Save.
Note: when you work with your own projects, you might need to adjust your cleaning configuration several times in order to remove errors specific to your texts.
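The Replacements step also has a simple analogue outside the DSL: ordinary string substitution. Here is a rough sketch of normalising the name variants (not Gale's implementation; the extra variant in the list is hypothetical).

```python
# Normalise variant forms of a name to one canonical form.
# Order matters: replace longer variants first so a shorter pattern does not
# mangle a phrase that has already been normalised.
replacements = [
    ("Sir Aurel Stein", "Aurel Stein"),
    ("Sir Stein", "Aurel Stein"),
    ("M. Aurel Stein", "Aurel Stein"),  # hypothetical extra variant
]

def apply_replacements(text):
    for old, new in replacements:
        text = text.replace(old, new)
    return text

print(apply_replacements("Sir Aurel Stein and Sir Stein are the same person."))
# Aurel Stein and Aurel Stein are the same person.
```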
Tools
-
Now that we have our text collection and our cleaning configuration, let’s try some tools. To get started, click on the Analyze button in the toolbar.
-
Under the Content Set dropdown menu, click on Select Content Set.
Select Stein.
Now, click Add Tool. -
Here you can pick from Gale’s six tools. For this particular project, add all six tools: Document Clustering, Named Entity Recognition, Ngram, Parts of Speech Tagger, Sentiment Analysis, and Topic Modeling. This selection includes a mix of qualitative and quantitative tools.
-
Finally, click Done.
-
The Analyze page now lists your chosen tools.
Note: once you run the various tools, you can return to this page to view your analyses and visualizations.
Document Clustering
-
On the Analyze page, under the Document Clustering tool, click View.
Note: while there is an option to run multiple (or all) tools simultaneously, without prior setup, this option would use the default cleaning configuration. Since we made a custom cleaning configuration, and so that you can see some of the options for each tool, we’ll run each tool individually. -
Just like with the cleaning configurations, you can customise your settings for each tool. Name the tool setup Stein Cluster, set the Cleaning Configuration to Stein, and set the number of clusters to 3.
-
Finally, click Run.
-
The tool will then begin processing.
-
This analysis job is processed on Gale’s servers. Let’s wait a minute.
Note: every tool has an About icon, which gives you a popup with some basic information and a link to fuller documentation. All of the tools in the Digital Scholar Lab are based on free software that you can download and run on your own computer (some run as standalone programs, while others, like Document Clustering, rely on a programming language such as Python).
Click on the About button for information about how this tool works.
Note: the Document Clustering tool uses the Python programming language to run a Python library called SciKit-learn. It uses machine learning, specifically the k-means clustering algorithm, to group the documents based on their similarity in terms of vocabulary. If you want additional information, or if you want to try downloading and running the tool on your own computer, click on Learn More.
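If you would like to experiment with the same idea on your own computer, here is a minimal scikit-learn sketch: each document becomes a TF-IDF vector and k-means groups the vectors. It illustrates the general approach only; the DSL's exact vectorisation and parameters are not shown in the interface, and the four short "documents" below are made up.

```python
# Minimal document clustering sketch with scikit-learn (pip install scikit-learn).
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

documents = [
    "Review of Sir Aurel Stein's new book on Central Asia",
    "Brief notice: the expedition reaches Kashgar",
    "Macmillan and Co advertise recent archaeology titles",
    "Report on discoveries along the ancient desert routes",
]

# Convert documents to TF-IDF vectors, then group them into three clusters.
vectors = TfidfVectorizer(stop_words="english").fit_transform(documents)
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(vectors)

print(labels)  # e.g. [0 2 1 2]: documents with similar vocabulary share a label
```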
Click OK to dismiss the popup. Then, refresh the page (Ctrl+R or F5 on Windows, Command+R on Macs). -
Once the Run Status says Completed, click on the words Scatter Plot under the scatter plot icon. If, after refreshing, it still says Processing, wait a minute and refresh again.
-
This process should create a graph similar to this one:
There are hundreds of dots on a scatter plot, in a roughly triangular arrangement, clustered into three colours: black, orange, and green. Each dot is an individual text, and the more similar two texts are (in terms of their vocabulary), the closer their corresponding dots are. Each cluster is colour-coded and uses a different symbol for each point.
Note: Although there are scales on the X and Y axes, they have no meaningful numeric significance. Instead, they provide a reference for measuring distance or closeness between individual points. For more information on the mathematics of how this algorithm works, see this video on the k-means algorithm.
Mouse over one of the dots: a popup appears with the title of the text you chose. Clicking will bring you to the Doc Explorer view of that text, but don't do that, as we want to look at the corpus as a whole.
Now, a word of caution: what looks similar to the algorithm might not be meaningful to humans. In our case, however, there are some meaningful differences that look to me a little bit like different genres. Mouse over several of the dots and see if a pattern emerges. It looks roughly like Cluster 1 consists of short notices and bulletins, including biographical details (Stein’s publications and obituary). Cluster 2 contains brief reports of his expeditions and discoveries. Cluster 3 seems to be reviews of his published books. Note also that a number of book advertisements with similar names (MacMillan and Co) are clustered together, showing that the clustering tool recognises that they are similar in terms of their vocabulary. -
For this scatter plot, you can download both the graph itself and the data used to generate it. On the toolbar, click Download.
-
On the Download Options window that appears, under Visualization download formats, select PNG and then click Download.
Named Entity Recognition
-
Let’s try our second tool. Click on Analyze.
Then under Named Entity Recognition, click View. -
The Named Entity Recognition tool will extract common and proper nouns from our texts, and then attempt to classify them as people, titles, places, political entities, and so on. Name our tool Stein NER, set the Cleaning Configuration to Stein, and click Run.
-
While you wait, feel free to click on the About button for information about how this tool works.
Note: Named Entity Recognition uses a tool called spaCy, which runs in the Python programming language. It breaks up sentences into their component words, and then classifies the nouns. By default, it uses an existing vocabulary, and then tries to predict from context what classes new nouns fall into. It then tags and colour-codes these categories, and tells us where these terms can be found in our corpus.
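If you install spaCy yourself, the core of what this tool does fits in a few lines. This is a sketch of the general approach; the specific model and any post-processing the DSL applies may differ.

```python
# Minimal named entity recognition sketch with spaCy.
# Setup: pip install spacy && python -m spacy download en_core_web_sm
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Sir Aurel Stein reached Kashgar in 1906 with support from the British Museum.")

for ent in doc.ents:
    print(ent.text, ent.label_)
# Typical output (labels vary by model version):
#   Aurel Stein PERSON
#   Kashgar GPE
#   1906 DATE
#   the British Museum ORG
```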
Click refresh and then click on Entities Found. -
The Named Entity Recognition tool displays the set of nouns with the highest count across your corpus. The algorithm has classified, or in some cases guessed, into which category each word falls. Looking over the first few entries, it looks fairly good: it has correctly identified that "Chinese", "British", "Indian", and "Soviet" refer to cultural groups, that Aurel Stein is a person, and that "today" is a date. There are some issues, likely due to the unfamiliar terms: “Kashgar” is actually a city, not a person, but most entries are correct. I doubt that Kashgar, a Central Asian city, appears in the tool's default dictionaries, so in this case the tool guessed that Kashgar was a person based on its usage in various sentences.
You can click on any entity for more information about where it appears in our corpus, and what other categories it also falls into. Let's try this.
Click on the first entry, Chinese.
Let's examine the results for this particular named entity.
At the top, "Chinese" is identified primarily as a Cultural Group, which means that in your corpus, "Chinese" most frequently refers to groups of Chinese people or officials. Below, under "Term also identified as...", "Chinese" is also identified less commonly as a language - the Chinese language - and as a person, or rather, as people described as Chinese. Because Stein was primarily travelling through western China, often dealing with Chinese officials and manuscripts in Chinese, it makes sense for it to be the most common entity and also for these three meanings to dominate.
In the second section of the popup window, there is a list of all documents in the corpus in which the term "Chinese" appears.
Click on "A Chinese Expedition Across the Pamirs and Hindukush A.D. 747."
Gale DSL will then present you with a view of the document, marked up with tags showing how it has identified each entity within the document.
Note that this version of the text has been filtered through the cleaning configuration, so many common words (e.g. "the") and punctuation are missing.
Now that you have seen how Named Entity Recognition works on a particular document, close this window by clicking on the X. -
You will now return to the popup for the term Chinese. Another feature available in the Digital Scholar Lab is to use the Named Entity Recognition tool to identify documents related to a single topic, and then create a new collection with some or all of the documents with that term. Let's say that you are interested specifically in Stein's relationships with the Chinese and want to create a sub-collection focusing on Chinese art and archaeology. Check the boxes next to "A Chinese Expedition Across the Pamirs and Hindukush A.D. 747," "Exploration in Central Asia," "Buddhist Paintings at the Festival of Empire," and "The Treasures of Asia." Then click on Add to Content Set.
You can either add these documents to an existing content set or create a new one. Select "New Content Set";
name it "Chinese art and archaeology". -
You have now created a new collection. Click on Add (next to the content set "Chinese art and archaeology") to add the selected documents to this collection. Now, close the notification and then close the window.
Note: you can check to see the new collection by clicking on My Content Sets. Named Entity Recognition provides an additional way of creating content sets outside of searches. Be sure to return to the Named Entity Recognition tool to continue with the next step. -
There are many categories that aren’t helpful right now, so let’s filter so that we just have categories we’re interested in. First, uncheck Entity Categories.
Then, check Geo-Political Entity, Person, and Cultural Group.
You will now have a list consisting of just those three categories.
Now we can begin to answer questions like, “Which cultural groups are mentioned most often in association with Aurel Stein?” or “Who associated with him? Who does he write about?” As with the Clustering tool, you can download the complete data for further analysis by clicking on the Download button.
Although there are no data visualizations, like graphs, you can download the data as either CSV (a spreadsheet) or JSON. Click on CSV and then Download.
Note: this dataset might need some cleaning, as you saw with the city of Kashgar being incorrectly interpreted as a person.
Ngram
-
One popular tool is Ngram (you might know of it through Google Ngram Viewer), which counts the frequency of specific words in a corpus. Click on Analyze and then click View under the Ngram tool.
-
You’ll see that the Ngram tool has a few settings. Name the tool setup Stein Ngram; set the cleaning configuration to Stein; leave the Ngram sizes at their defaults (min 1, max 4); set the Ngrams Occurrence Threshold to 5; and set the number of ngrams returned to 75. Finally, click Run.
-
The Ngram tool counts the number of times a term or a phrase occurs in our corpus and then graphs it. The Ngram size option refers to the length of the term: 1 means a single word, whereas 3 means that it looks for three-word phrases. By default, the tool looks for single words or phrases up to four words long. Ngrams Occurrence Threshold is the minimum number of times a word or phrase must occur in the corpus to be part of the Ngram. By setting it to 5, we cut out words or phrases that occur fewer than five times across our thousands of documents. Number of Ngrams Returned determines the number of terms that appear in our visualizations. The default is 1000 but we've reduced that to 75 for visualization purposes. Although you might want to know the top 1000 terms for your research, it can look a bit cluttered in a wordcloud. One key point is that this tool ignores any terms that are composed entirely of words from our stop word list. Once a minute has passed, refresh the page.
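Conceptually, the Ngram tool is just counting sequences of words. The sketch below shows the idea in plain Python (not Gale's code): build unigrams and bigrams, count them, and apply a minimum occurrence threshold.

```python
from collections import Counter

def ngrams(tokens, n):
    # Return every run of n consecutive tokens as a single string.
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "aurel stein central asia aurel stein expedition central asia".split()

counts = Counter()
for n in (1, 2):       # unigrams and bigrams; the DSL goes up to 4-word phrases
    counts.update(ngrams(tokens, n))

threshold = 2          # analogue of the Ngrams Occurrence Threshold
print([(term, c) for term, c in counts.most_common() if c >= threshold])
# e.g. [('aurel', 2), ('stein', 2), ('central', 2), ('asia', 2), ('aurel stein', 2), ('central asia', 2)]
```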
-
Now the word cloud and bar chart are both available. Click on the word cloud.
This image visualizes the most common words in the corpus, while accounting for our specific cleaning configurations and tool settings.
If you mouse over a term, you can see how many times it appeared.
Interpretation: the more frequently an ngram appears in the corpus, the larger it appears in this word cloud. As we saw with the Named Entity Recognition tool, Chinese terms (referring to Chinese culture, Chinese people, and the Chinese language) are major topics in writing by or about Stein. In addition, British, Majestys (i.e. "In His Majesty's Service"), Subject(s), and other terms referring to British imperialism from 1900-1940 appear. There are several terms related to specific places (e.g. Kashgar) and official positions (e.g. Taoyin, Amban) in China or the Chinese government. Since Stein recorded his travels, many words related to geography (route, river, stream, valley, pass, road, ground, bank, etc.) occur. "Russian" and "Soviet" are also major terms because the British Empire was competing at this time with Russia (both imperial and later Soviet) for control over the territories between British India and Russian Central Asia. (For more information, look up The Great Game (in English) or the Tournament of Shadows (in Russian).) -
Word clouds are popular and it’s useful to be able to generate them. Click on Download.
Here you can download either the Word Cloud image or the underlying data from the Ngram tool (which lists the most frequent words and phrases in the corpus). Let's download the image as a PNG. Click on Term Frequency PNG and then Download.
Note: unfortunately, there is no way to control the position or colour of specific words - their placement appears to be random. By resizing your browser window, you resize (and thus randomly reposition) some of the words in the word cloud. -
Let’s switch to the bar chart. On the left sidebar menu, click on Bar Chart.
This provides another way of looking at word frequency. Here, we can see at a glance which words were most popular.
Parts of Speech Tagger
- The Parts of Speech Tagger compares the writing styles of different authors by counting their use of different parts of sentences, such as proper nouns and adjectives. Begin by clicking on Analyze in the top menu. If necessary, select Stein from the dropdown menu.
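Under the hood, a part-of-speech tagger assigns a grammatical label to every word, and the tool then tallies those labels for each author. As a rough illustration (using spaCy, which may or may not be what the DSL uses for this particular tool):

```python
# Count parts of speech in a sentence with spaCy (illustrative only).
# Setup: pip install spacy && python -m spacy download en_core_web_sm
from collections import Counter
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("The expedition crossed the high mountain passes and surveyed the ancient routes.")

pos_counts = Counter(token.pos_ for token in doc if not token.is_punct)
print(pos_counts)
# e.g. Counter({'NOUN': 4, 'DET': 3, 'VERB': 2, 'ADJ': 2, 'CCONJ': 1})
```

Aggregating counts like these per author, and normalising them to proportions, gives the kind of comparison the DSL graphs in the steps below.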
- Under the Parts of Speech Tagger tool, click View.
- The cleaning configuration we have created removes many parts of speech that this tool counts. In order to show how cleaning configurations affect this tool, and to better understand the various authors' styles, we'll run this tool twice, once without a cleaning configuration, and once with. Begin by naming this tool setup "Stein no cleaning". Change the cleaning configuration to None. Then finally, click Run.
- Wait for a minute, then refresh. Once the analysis is complete, click Series View.
- With the tool complete, DSL will produce a graph comparing the styles of several authors. Each tick on the X-axis represents a different part of speech - pronouns, proper nouns, adjectives, etc. - and the Y-axis represents their frequency. Differently coloured lines, with different symbols for each point, represent the various authors.
This is a useful start, but it is too cluttered. There are far too many authors to meaningfully parse; moreover, Aurel Stein doesn't appear on this graph! - On the left sidebar is a list of authors. Click on the coloured symbol next to the first ten authors to deactivate them.
All of the symbols and names should now be greyed out. - In the Author Filter, type "stein" (without quotation marks). Select "Aurel Stein.", which should be the first option. Leave off the others.
Note: Stein's name appears several times here because his name has been entered with variations in the documents in DSL. This tool treats each variation of his name as a different author. The variation we chose above is associated with the most texts. - After adding Aurel Stein, type "giles" in the Author Filter. Add both "Lionel Giles" and "Lionel Giles;". Both authors will be added to the accompanying graph.
Note: Lionel Giles was a contemporary of Stein's, and was a scholar who worked at the British Library. His name appears twice because in one text, "Lionel Giles" and "Lionel Giles;" (note the added semicolon) are both added as authors. This was probably done so that either version would appear in searches, but it causes some confusion for this tool. - Type "mirsky" in the Author Filter and add Jeannette Mirsky.
Note: Jeannette Mirsky wrote a biography of Stein in 1977. Although she writes about much the same content as he did, she wrote much closer to the present day, so any differences revealed by the Parts of Speech Tagger are likely due to their different styles rather than content. - Type "lattimore" in the Author Filter, and add Owen Lattimore.
Note: Owen Lattimore was a scholar of Asia, but his writing style differs markedly from Stein's.
Both Aurel Stein and Owen Lattimore are represented by the same symbol (a downward-pointing triangle) and the same shade of blue. Unfortunately, there is no way to adjust these symbols in the DSL, so this is an issue to watch for. - Finally, delete all text from the Author Filter to see the complete legend.
The final version of this graph reveals some major differences in writing styles: Lattimore uses a much higher proportion of proper nouns than anyone else and Mirsky uses the highest proportion of adjectives. - Now, let's try rebuilding this chart, but with our existing cleaning configuration. In the top right corner, click on Tool Setup.
Above the existing tool setup, click on New tool setup. - Name this setup "Stein with stopwords". Change the cleaning configuration to Stein. Finally, click Run.
- Wait a minute, then refresh the page. When the run status is completed, click Series View.
- Just as before, use the Author Filter to add Aurel Stein, Lionel Giles, Jeannette Mirsky, and Owen Lattimore. Deselect all other authors.
Now there are practically no conjunctions, particles, adpositions, or pronouns. As a result, some of the other parts of speech take on proportionally more weight. It is up to you as a researcher to determine which cleaning configuration - or none at all - works best for your aims. When in doubt, using no cleaning configuration is best for the Parts of Speech Tagger tool.
That's it! You have learned how to use the Parts of Speech Tagger to begin comparing the styles of different authors, and you have learned what kind of impact the cleaning configuration has on your data.
Sentiment Analysis
-
Sentiment Analysis is a powerful tool for estimating the overall positive or negative emotional feelings of thousands of texts very quickly. At the top toolbar, click on Analyze. Then, scroll down to the Sentiment Analysis tool and click View.
-
In the Tool Setup for Sentiment Analysis, call the tool setting "Stein Sentiment", change the cleaning configuration to Stein, and finally, click Run.
-
As always, you can click on the About button to get more information.
Note: Sentiment Analysis works by analyzing the words in a sentence, and looking them up in a dictionary that has a positive or negative score for many words. For example, the word "good" has a score of (positive) 3, and the word "unhappy" has a score of -2. The tool sums all scored words in a document and DSL then groups documents by year and averages their scores to produce yearly scores. You can download the full word list and their associated scores from AFINN.
Warning: the sentiment analysis tool does not understand context or meaning. It cannot distinguish sarcastic statements from sincere ones and it will not recognize words not on its list. Furthermore, without additional coding, it does not recognize negations, e.g. that "not impressed" means roughly the same thing as "unimpressed." It also embeds certain cultural assumptions and values: one of the example phrases in the Python code used to run this tool is "Rainy day but still in a good mood," where "good" is +3 and "rainy" is -1, for a sum of 2 for this sentence. The speaker might actually enjoy the rain, but this tool cannot account for that. These criticisms do not mean that the tool is useless, but that it is most effective when dealing with a large number of relatively straightforward texts. Like all tools in the DSL, it can be powerful (giving a rough estimate of the sentiment of thousands of texts within a minute is beyond human ability) but you need to understand how it works (and where and when it does not work).
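To make the arithmetic concrete, here is a toy version of the same scoring scheme. The three word scores below are the ones quoted above; the real AFINN list contains thousands of scored words, and this is not Gale's code.

```python
# Toy AFINN-style sentiment scoring (illustrative only).
afinn_scores = {"good": 3, "unhappy": -2, "rainy": -1}

def sentiment(text):
    words = text.lower().split()
    return sum(afinn_scores.get(word, 0) for word in words)  # unknown words score 0

print(sentiment("Rainy day but still in a good mood"))   # -1 + 3 = 2
print(sentiment("Not impressed by the unhappy review"))  # -2: the negation is invisible to the tool
```

The DSL then averages document scores by year to produce the time series you are about to view.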
Refresh the page. Once the Run Status is Completed, click on Time Series under Results.
-
Let's take a look at the results:
The x-axis represents years, and the y-axis represents the average sentiment score of all texts for a given year. This means that each point is an average of all words in all of the texts for a given year. -
You can get more information about specific years by clicking on the associated point. Find the point for 1920 and click on it.
-
A popup will list all documents associated with a specific year. Some years have many associated documents, whereas others have only one. You can click on document titles to go to the page for that document. Once you are done, click Close.
-
Some years have extreme scores, either very high or very low. Click on the point for 2011, which is the rightmost point.
-
There is only a single document for 2011. In general, since DSL averages the scores of all texts for each year, the most extreme points (the highest and the lowest) often have only one text. Let's investigate further. Click on the document title, "Turkey." DSL will open the doc explorer view. This text actually has almost nothing to do with Stein - it mentions him once in passing - and instead is a rather negative review of a book. Return to the Sentiment Analysis tool and click Close.
-
We can have more useful results by removing documents like these. Let's create a new collection that is a subset of this one. On the top toolbar, click My Content Sets.
-
Click on New Content Set.
-
Name it Stein sentiment.
-
You should now see it in your list of content sets. Do not click it yet, though.
-
On the top toolbar, click Build. Then, on the Build screen, near Search, click Advanced Search.
-
Under Advanced Search, click Add a Row.
-
Keep all four rows set to the Keyword field. Change the operators to OR. For each of the four fields, type Aurel Stein (as with the first content set) and pick one of the four versions of his name for each row.
However, once you are done, do not click search. We are going to add more qualifications to remove problematic documents. -
Scroll down. Under More Options, and under Publication Year, pick Between.
For the first dropdown menu, type "1890" and press Enter. For the second year, type "1950" and press Enter. -
Under publication language, pick English. Then, click Search.
-
On the search results screen, underneath All Content, check Select All.
It will then say that "All 100 results on this page are selected." Click on "Select All (234) results." -
Now it should say, "234 results are selected." Make sure that the Active Content Set is set to "Stein sentiment".
-
Then, click the "Add to content set" button in the top.
There should be a popup that says "Added 234 document(s) to Stein sentiment." Click View Content Set. -
Since advertisements often include very positive language for marketing reasons, and since several of the advertisements in this collection only mention Stein briefly, we will improve our sentiment analysis by removing them.
Once you are in the Stein sentiment content set, click Documents.
On the left menu, under Document Type, click on Advertisement. -
Once you are on a screen with only the advertisements, under Documents, check Select all on page.
Then, on the top right of the page, click on Remove from Content Set.
There will be a popup that notifies you that these documents have been removed from Stein sentiment. Click Close. -
You've trimmed your content set down to something more focused. On the top toolbar, click on Analyze.
Change the content set to Stein sentiment if you are not there already. Then, click Add Tool.
Remember that since this is a new content set, you have to add tools again, but we'll only add Sentiment Analysis for this content set. We'll return to the previous content set for the next tool.
Scroll down, and add just the Sentiment Analysis tool.
At the bottom of the screen, click Done. -
Under Sentiment Analysis, click View.
-
First, name this tool setup "Stein sentiment 1890-1950 no ads". Second, set the cleaning configuration to Stein. Third, click Run.
Wait, refresh, and when the analysis is complete, click Time Series. -
On average, there are now more documents per year. After removing ads and texts from considerably after Stein's life, the remaining texts are more representative. Explore by clicking on the various points.
One year in particular stands out for having a large number of texts with a largely negative sentiment. Click on the point for the year 1931.
Stein had three well-regarded expeditions to western China and Central Asia. In 1931, though, his tentative fourth expedition was cancelled and he was expelled from China. Here you can see a large number of newspaper articles reporting on this event. Click on Close once you are done.
That's it for Sentiment Analysis! You learned how to use it, interpret the results, and how to use an advanced search to create a subset of your collection to improve the tool's results.
Topic Modeling
-
For the last tool we’ll look at today, let’s try Topic Modeling. Before we use this tool, we need to modify our Cleaning Configuration, because Topic Modeling is case sensitive. This means that it will consider capitalized words to be different from uncapitalized ones. At the top toolbar, click on Clean.
-
Click on Stein under Your Configurations.
-
Under Text Correction, check the box for All lower case. Then, click on Save As on the upper toolbar.
Name the new cleaning configuration "Stein lower case", then click Submit. -
Now, let's use this new cleaning configuration in the Topic Modeling tool. Click on Analyze (top toolbar) to get back to the analysis menu, and change the Content Set back to Stein (if it isn't already).
-
Then, under Topic Modeling, click on View to set up this tool.
-
Name this tool "Stein Topics", set the Cleaning Configuration to Stein lower case, and increase the Number of Topics to 20. Leave Words per Topic and Iterations at their defaults. Click run.
-
Once a minute has passed, refresh the page.
Note: The Topic Modeling tool uses a free program called MALLET, which stands for Machine Learning for Language Toolkit, which itself uses an algorithm called Latent Dirichlet Allocation, or LDA for short. Basically, it looks for words that occur together often in the corpus, and then brings them together as a “topic.” This tool uses a certain degree of randomness, which is offset by running the tool many times in the background - this is what the Number of Iterations refers to. If you would like to install MALLET on your own computer, use this MALLET installation guide and tutorial. -
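If you would rather experiment in Python than install MALLET, scikit-learn ships an LDA implementation that illustrates the same idea. This is a sketch only: MALLET's sampler, the DSL's parameters, and real corpus sizes all differ, and the toy "documents" below are made up.

```python
# Minimal topic modelling sketch with scikit-learn's LDA (pip install scikit-learn).
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

documents = [
    "river route valley pass camp water track",
    "soviet bureau intelligence government frontier report",
    "buddhist paintings manuscripts shrine temple art",
    "route river bank valley stream road ground",
]

vectorizer = CountVectorizer()
counts = vectorizer.fit_transform(documents)

lda = LatentDirichletAllocation(n_components=2, max_iter=50, random_state=0)
lda.fit(counts)

# Print the top five words for each topic.
words = vectorizer.get_feature_names_out()
for i, topic in enumerate(lda.components_):
    top = [words[j] for j in topic.argsort()[::-1][:5]]
    print(f"Topic {i}: {', '.join(top)}")
# e.g. Topic 0: river, route, valley, bank, water
#      Topic 1: soviet, intelligence, bureau, paintings, art
```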
Once the analysis is done, click on Topic overview.
-
You will have twenty topics, numbered from 0 to 19. Each lists the words that appear together most frequently along with some summary statistics about each word (its count, probability of appearing, and number of documents in which it appears).
If you scroll through the list, the words within most topics should look like they belong to a common theme. There’s a bit of randomization in this tool, so your topics and their numbering will differ from those in these examples, but the large number of iterations ensures that, most of the time, the tool produces broadly similar topics from run to run. Here, my topic 12's words are river, route, road, pass, valley, small, bank, water, track, and camp, which suggest a strong focus on geography and travel. Knowing in which documents a topic appears can be helpful for naming it, so for any topic, under Identified In, click on the number of documents.
You will then have a list of all documents in which this topic appears, which is helpful for determining the genres and content of this topic. As you might have suspected, my topic 12 has a number of documents related to both Stein's travels and those of others (e.g. Alexander the Great), along with texts like "Ancient Ways in Iran" and "Routes in Sinkiang" (i.e. modern Xinjiang Province in western China).
Click on the X to close this popup. -
Now that we have confirmed this topic's strong focus on geographic texts, let's name it Geography. Click on the topic's title and rename it "Geography".
Then, click Save. -
Go through the list and give titles to each of the topics. You should be able to guess at the general theme of each topic, especially if you click on Identified In... X Documents and consider which documents appear in each topic.
Another example is my topic 9, which includes words like "Soviet," "bureau," and "intelligence." These suggest the British government's involvement in the Great Game, particularly intelligence-gathering, so I named this topic "Intelligence Gathering."
Note: it is very likely that one or two of your topics resemble my topic 6, with non-English words like "gyappa" and "nangwa" included alongside the word Tibetan. If you click on "Identified in: 2 Documents," you will likely see two English-language textbooks for learning Tibetan. These topics thus include a number of Tibetan words, grammatical terms, and words like "tea" that occur frequently in the example sentences in these books. Feel free to name this topic Tibetan.
If there are any topics which are completely unclear, feel free to leave them untitled.
Note: clicking on the Download button here allows you to download the data for each of these topics. If you are in another view, such as the Topic Comparison or Topic Proportion views described below, you will have access to different download options. -
Now that we’ve seen roughly what words and themes our topics cover, let’s look at how common these themes are in our corpus. On the left sidebar, under Views, click on Topic Comparison.
-
You now can access a number of data visualizations, each one describing the relationship between the documents, topics, and words in your corpus.
The available topic measures are Tokens, Document Entropy, Average Word Length, Coherence, Uniform Distance, Corpus Distance, and Exclusivity. The tool defaults to Tokens.
Tokens measures how often words from specific topics appear in the entire corpus.
Document Entropy measures the probability that each topic appears in a randomly selected text.
Average Word Length measures word length, with longer words suggesting more specific (and therefore meaningful) topics.
Coherence measures the likelihood that words within a topic appear next to each other.
Uniform Distance suggests which topics are the most specific.
Corpus Distance measures the distance from words in a topic from the corpus as a whole, suggesting which topics are most distinct from the rest of the corpus.
Exclusivity measures how often the top words in a topic co-occur with top words in other topics. -
Switch to Exclusivity by clicking on the dropdown menu under Topic Comparison By, and select Exclusivity.
This is the resulting graph:
The Topic Modeling tool produces a graph. The highest point here, i.e. the topic that is the most exclusive, or has least in common with other topics, is for Tibetan Grammar, at 0.831 (out of a possible total of 1.0). Since the other topics tend to be about geography, politics, and archaeology, it makes sense that a topic about Tibetan grammar is distinct.
Note: if you click on the Download button when in the Topic Comparison view, you can either download the data about the measurement you are currently viewing or the graph of the results. -
If you find results like this useful, you should also know that there are some shortcuts: under Views, click on Topic overview.
Under each topic, it lists the various measurements described above. Each of these is a hyperlink, so if you click on Average Word Length, you will immediately be taken to the graph and data for Average Word Length. -
In addition to these bird’s eye views of our entire corpus, we can also see the topic breakdown for each text. At the top of the screen, click on Results, and change Topics to Topic Proportion.
-
Now you’ll see a colour-coded proportional bar graph for each text, showing what percent falls under each topic. One advantage of this viewer is that it displays, at a glance, whether, and how much, a given topic appears in select documents of interest.
You can click on individual topics to show just them. Try hovering over the first section in the first text. (For my corpus, that is the purple "Stein (general)" topic section in the text called Recent Literature.)
A popup appears showing you the percentage of that text composed of that topic. Click on the section.
The viewer will then only display that topic in all of the listed texts. -
Although this tool by default shows only the first fifteen documents in your corpus, you can also specify which documents you want to view. On the left sidebar, under Documents Displayed, click on Select Documents.
A popup will appear listing the documents currently displayed. Click on the search bar.
Let's look for texts related to Kashgar, a city in western China that Stein visited, which was also tied to local governance and travelers. Type Kashgar. You will see two texts appear. Check the boxes next to each text to include them, then click Done.
When you return to the Topic Proportion screen, both texts related to Kashgar are now at the bottom of the list. By hovering the mouse over their various sections, the tool will reveal which topics compose these texts.
Export
-
Let's conclude by exporting texts from Gale Digital Scholar Lab. You can use these texts for your own research and if you have downloaded any of the tools used above (e.g. MALLET for topic modeling or perhaps you are trying spaCy in Python), you can use these texts in those tools. Begin by clicking on My Content Sets on the top toolbar.
-
Then, from the list of Content Sets, click on Stein.
-
On the Stein Content Set overview, click on Download Content Set.
-
In the popup that appears, leave Cleaning Configuration at its default, None, and click Generate Download.
Note: you can download a maximum of 5000 documents at one time. The maximum size of a content set is 10 000 documents. -
Back on the Stein Content Set overview, the Download Content Set button has changed to Generating Download. This is because Gale's servers are preparing the texts for you. Wait a minute or two, then refresh the page.
-
Once the Generating Download button has changed to Content set download ready, click it.
Then, on the Download Content Set popup, click Download. -
Your texts will be bundled together in a file called download.zip.
If you are using a Windows computer, use an unzipping program like 7-zip (download here) to access the files. Right click on download.zip and select "Extract here".
If you are using a Mac, double click on download.zip.
When the folder has been unzipped, open it. -
You will now have a folder with a README.txt file and a folder called "original." Open the "original" folder.
Inside there is one file per text in your collection.
Open the text file titled _Sikandar__the_Great_FP1800972885.txt (with the underscore at the beginning; it should be the third text, alphabetically).
This text has some OCR errors throughout but is mostly legible.
You now have access to OCRed copies of all of the texts in your corpus. -
Now that you have access to the original texts, let's also download texts that have been cleaned using our cleaning configuration. Click on Content set download ready again, but this time, click on the dropdown menu under Cleaning Configuration.
From the list, choose Stein lower case.
Then, at the bottom of the window, click on Regenerate Download.
Wait a minute or two again, refresh, and download the new set.
Unzip the folder, open the folder called Clean, and then open the same text as before (_Sikandar__the_Great_FP1800972885.txt).
This is the same text, after being cleaned through the cleaning configuration. All stop words (e.g. "the") have been removed, as have punctuation, numbers, and special characters, and all words are in lower case. It is no longer a readable text for humans but it helps immensely when running tools like Ngrams or Topic Modeling (via MALLET) on your own computer. -
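Once the text files are on disk, loading them into Python for use with any of the tools sketched earlier takes only a few lines. The folder names below follow the downloads described above; adjust the path to wherever you extracted the zip file.

```python
# Read every downloaded text file into a list of (filename, text) pairs.
# "Clean" is the folder from the cleaned download; use "original" for the raw OCR texts.
from pathlib import Path

corpus = []
for path in sorted(Path("download/Clean").glob("*.txt")):
    corpus.append((path.name, path.read_text(encoding="utf-8", errors="ignore")))

print(len(corpus), "documents loaded")
```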
You can also download the metadata for your records. Click the Download Metadata button.
Then, click on the Download button.
You can download the metadata for up to 10 000 documents in your collection. You will receive the data as a .CSV file (which can be opened as a spreadsheet, e.g. in Excel).
That's it! You have now used Gale DSL to assemble a corpus of texts, create a custom cleaning configuration, run all six of its tools, and export the resulting data, visualizations, and texts. If you have any questions about text mining or the DSL specifically, please feel free to contact Digital Scholarship Services.
Additional Training
You have now completed all steps that this tutorial covers, but here are some ways to get additional training:
See Which Texts are Available
To see which collections of primary sources are available for your research, follow these steps:
- Return to the DSL main page by clicking on the Digital Scholar Lab logo.
- Scroll down until you see the "Learn More About the Lab" menu.
- Under "What Texts are Available?", click View the Archives.
- You will be taken to a screen with a complete listing of all collections licensed to the University of Toronto. You can then filter specific items or search within specific collections.
Learning Center
To access the Learning Center, which includes additional documentation and videos, follow these steps:
- Return to the DSL main page by clicking on the Digital Scholar Lab logo.
- Scroll down until you see the "Learn More About the Lab" menu.
- Under "Dive Deeper Into How the Lab Works", click on Visit the Learning Center.
- You will be taken to a new page with a sidebar menu. In addition to being able to read documentation and watch videos on any step of using the Digital Scholar Lab, you will also be able to access the Frequently Asked Questions (FAQ), Glossary, User Guidelines, and the Privacy Policy.
Sample Projects
Gale's staff, who include a number of professors engaged in digital humanities research projects, created three sample projects for you to use. They provide pre-built collections and pre-run tools centred on specific themes. They also provide extensive documentation on how they constructed their projects, how they fine-tuned their cleaning configurations, and how to interpret their results, which can in turn help you develop your own projects.
- Return to the main page of the DSL by clicking on the Digital Scholar Lab logo.
- Scroll down until you see the "Learn More About the Lab" menu.
- Under "Try Out Sample Projects", click View Sample Projects.
- There are currently three projects. (Scroll down if you can't see all of them.) Click on the title of one that interests you.
- Once you are on the page for a specific project, you have three options:
- Click on "Get a copy" to copy the whole project, collections, cleaning configurations, and pre-run tools, to your account.
- Scroll down to see an overview of how Gale's staff created this project, step by step.
- Click on the "Thinking Critically Supplement PDF" to read the report providing greater detail on how the project was created, how to take it further, and what obstacles the staff ran into.
- Click on "Get a copy" to copy the whole project, collections, cleaning configurations, and pre-run tools, to your account.
Uploading Texts (Feature in Beta)
- Gale is currently testing a feature whereby you can upload texts of your own and analyse them in the DSL. Although this tool is still under construction (in Beta), it is too useful to pass up. Let's try it! Begin by clicking on Build in the top toolbar.
- On the right side, there is the Upload box. There are two ways to upload texts: by inputting text directly into DSL or by uploading multiple files simultaneously. We'll try both, but we'll begin with the simpler method. At the top of the Upload box, click on Text Entry.
- The Text Entry page has a number of fields. Only the Title and Text fields are required, but the other fields are useful, both because they help keep your texts organised and because some of the metadata, such as the publication date, are necessary for certain tools. Begin by pasting the following text into the Title and Text fields:
Title: Preface
Text: In the introductory remarks prefixed to this Memoir I have endeavoured to indicate briefly the objects and methods which guided me in the surveys of my three Central-Asian journeys and in the preparation of the maps which contain their final cartographical record.
It only remains for me to acknowledge with gratitude my manifold obligations for the effective help which alone rendered possible the topographical tasks bound up with my explorations.
That I was able to plan and carry out those tasks was due to the fact that the Survey of India, accustomed ever since its inception to serve the interests of Help of^nney of geographical research, not only within the vast area forming its own sphere of activity but also beyond the borders of India, supported from the start my aims with the means best suited for them. In Chapter 1, dealing with the history of our surveys, I have had occasion fully to note the services rendered by the experienced Indians whom the various Surveyor Generals deputed with me, and the extent of the help which I received by the provision of instruments, equipment arul funds to meet the cost of their employment. To the Survey of India was due also the compilation and publication of the results brought back by our joint efforts from each successive journey.
The topographical results thus secured have not only helped me to make my journeys directly profitable for geographical study, they have also greatly facilitated my archaeological explorations in regions which, though largely desolate today in their physical aspects, have yet played a very important part in the history of Asia and its ancient civilizations. But apart from the gratitude I owe for this furtherance of my researches, the fact of my having been able to work in direct contact with the oldest of the scientific departments of India will always be remembered by me with deep satisfaction.
Ever since in 1899 the proposals for my first Central-Asian journey had received the Government of India's sanction, successive Surveyor Generals did .Surveyor^euorals tlieir best to facilitate the survey tasks of my expeditions. 1 still
think back gratefully to the very helpful advice and instruction by which the late Colonel St. Georg Gore, R.E., while at Calcutta during the cold weather of 1899-1900, showed his personal interest in the enterprise. His successor as Surveyor General, Colonel F. B. Long, R.E., was equally ready to meet my requests concerning the plans !. had formed for my second and much more extensive expedition of 1906-08.
But my heaviest debt of gratitude is due to Colonel Sir Sidney Burrard, RE.,
K.C.S.I., F.R.S., who as Superintendent of the Trigonometrical Survey SidneyCBnrrnrd since 1899 had direct charge of all arrangements for the survey work of my first and second expeditions, and who during his succeeding long term of office as Surveyor General was equally ready to extend to me unfailing support and guidance with regard to the third. Moreover quite as great a stimulus was the thought of his own lifelong devotion to the study of the geographical problems connected with innermost Asia and the great mountain systems which enclose it. I feel hence very grateful for being allowed to dedicate this record of our labours to Sir Sidney Burrard not merely as a most helpful friend and guide but also as a living embodiment of that spirit of scientific research which has never ceased to pervade the Survey of India since the days of Rennell, Lambton, and Everest.
Note: you have probably noticed some misspellings and other errors in the text above. That is because this text was produced by running an OCR algorithm on scanned images of pages. These OCR errors are, unfortunately, a common byproduct of the process, and checking for them manually takes considerable time even for a single text, let alone for the hundreds or thousands of texts in a collection. (I removed a small number of illegal characters, including variations on the apostrophe, to make this text compatible with DSL's requirements; if you want to clean such characters in bulk, see the sketch at the end of this section.) This is also a relatively clean example. I took this text from Stein's 1923 Memoir on Maps of Chinese Turkistan and Kansu, specifically this copy of Stein's text from the Internet Archive.
- Now let's begin adding metadata.
Enter the following metadata in the corresponding fields:
Author: Aurel Stein K. C. I. E.
Publication Title: Memoir on Maps of Chinese Turkistan and Kansu
Publisher: Trigonometrical Survey Office
Document Type: Book Chapter
Language: English
Subject: Geography
- Click on the Publication Date field.
When the calendar window pops up, click on the year. Type 1923 and select that year.
Next, from the month dropdown menu, select January. In the month view, click on the 1st.
Note: we only have the year of publication for this text, but DSL requires a full date, including day and month. For this reason, we entered January 1st, even though in practice you can ignore the month and day.
- This text should now have all of its metadata fields complete.
Click on the Create Document button at the bottom to upload the text.
A banner stating "Document Saved" should briefly appear.
Note: metadata is "data about data," or more specifically, information about a dataset (whether that dataset is a book, article, film, image, etc.), such as its author and publication year. It is used by several of DSL's tools (e.g. the Parts of Speech tagger compares the styles of different authors, so it needs texts to have their Author fields filled out). We used "Aurel Stein K. C. I. E." because a differently spelled name, even if a human would recognize it as the same person, counts as a different author for computational tools. This particular form already appears in the DSL and is how Stein wrote his name on the title page, so it is the best choice here.
- Now that this text has been uploaded, let's add it to our collection for further analysis. Click on Add to Content Set.
From the menu, select Stein.
Once DSL confirms that this text has been added to the Stein collection, click Close.
- Let's see where we can review these uploaded texts. Click on Manage Uploads.
You will now be presented with a list of all texts you have uploaded. So far, there is only one, but we will add more soon. You can also use this page to add texts to Content Sets, to Apply Metadata after uploading them, or to Delete them. Click on Build to explore another method of uploading texts.
- First, download these two .txt files by right-clicking on the links below and saving them to your computer (selecting Save Target As or Save Link As):
download Introductory material
download Chapter 1, Section 1
In the Upload box, under Browse for Files to Upload, click Choose Files.
Navigate to where you downloaded these files, select both of them (hold down CTRL (Windows) or CMD (Mac) to select more than one item at a time), and click Open.
Note: the example above comes from a Mac computer. Your interface may look different.
- DSL will inform you that two texts have been successfully uploaded. Check the box next to Successfully Uploaded (2). Then click on the Add Metadata button.
- In the Apply Metadata window, leave the default radio button, Quick, selected.
Then, add the metadata as before:
Author: Aurel Stein K. C. I. E.
Publication Date: 01/01/1923
Publication Title: Memoir on Maps of Chinese Turkistan and Kansu
Publisher: Trigonometrical Survey Office
Document Type: Book Chapter
Language: English
Subject: Geography
Finally, click Apply Metadata.
- The metadata have now been applied, but the new texts have not yet been added to a collection. Check the box next to Successfully Uploaded (2), then click on Add to Content Set.
As before, select Stein.
Once both texts have been added, click Close.
- One final method to apply metadata is to use a spreadsheet. This method is best if you are uploading multiple texts with different metadata. Begin by saving these texts to your computer:
download khotan.txt
download panoramas_introductory_note.txt
Then, in the Upload box, under Browse for Files to Upload, click Choose Files.
Select both of these files and click Open.
- As before, click on Successfully Uploaded (2) and then on Add Metadata.
- In the Apply Metadata popup, under "Applying metadata to 2 file(s)," click on the Bulk radio button.
- Under step 1, click on Download Form.
- Depending on your browser settings, a file called metadata-template.csv might download automatically. If you are prompted instead, choose where you would like to save it and click Save.
- Open the file in a spreadsheet program like Excel, Numbers, or OpenOffice Calc.
- By using a spreadsheet, you can enter metadata for multiple texts quickly, and unlike the Quick mode, you can give different metadata for each text. This method is especially useful if you already have bibliographic information for many documents from another source, such as an export from reference-management software like Zotero, Mendeley, or RefWorks. (If you prefer to fill in the template programmatically, see the sketch after the save step below.)
Input the following data into the spreadsheet.
For khotan.txt:
Title: Sand Buried Ruins of Khotan - Chapter IV
Author: Aurel Stein K. C. I. E.
Publication Date: 01/01/1904
Publication Title: Sand Buried Ruins of Khotan
Publisher: Hurst and Blackett
Document Type: Book Chapter
Language: English
Subject: Archaeology
For panoramas_introductory_note.txt:
Title: Mountain Panoramas from the Pamirs and Kwen Lun - Introductory Note
Author: Aurel Stein K. C. I. E.
Publication Date: 01/01/1908
Publication Title: Mountain Panoramas from the Pamirs and Kwen Lun
Publisher: Royal Geographical Society
Document Type: Book Chapter
Language: English
Subject: Geography
Important notes:
Do not change any information in the Document ID column, or DSL will be unable to process the data. The Document ID number is different every time you upload a file, even if you upload the same file more than once.
The Publication Date column only works if the date is in the format DD/MM/YYYY, like 01/01/1904 for the first of January, 1904. As above, you will have to add a day and month for the upload to work, even if the original publication lists only a year.
The texts might be in a different order for you than listed here. Be sure to match the data to the right row, or you'll end up with texts that have the wrong title and year.
- Save the spreadsheet. Be sure to save it in .CSV format (the format it comes in). Some programs, like Excel, may prompt you to save in another format, but you must save it as a .CSV for DSL to recognise it.
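If you are preparing metadata for many files, you may prefer to fill in the downloaded template programmatically rather than by hand. The Python sketch below is one possible way to do that; it assumes the template's column headers match the field names shown above (e.g. "Publication Date"), which may not be exactly what your metadata-template.csv uses, so inspect the file first, adjust the names to match, and leave the Document ID column untouched.

```python
import csv

# Metadata for the two uploads, listed in the order the rows appear in your
# copy of metadata-template.csv. Check the row order first: it can differ
# from the order in which you uploaded the files.
ROWS_METADATA = [
    {   # khotan.txt
        "Title": "Sand Buried Ruins of Khotan - Chapter IV",
        "Author": "Aurel Stein K. C. I. E.",
        "Publication Date": "01/01/1904",
        "Publication Title": "Sand Buried Ruins of Khotan",
        "Publisher": "Hurst and Blackett",
        "Document Type": "Book Chapter",
        "Language": "English",
        "Subject": "Archaeology",
    },
    {   # panoramas_introductory_note.txt
        "Title": "Mountain Panoramas from the Pamirs and Kwen Lun - Introductory Note",
        "Author": "Aurel Stein K. C. I. E.",
        "Publication Date": "01/01/1908",
        "Publication Title": "Mountain Panoramas from the Pamirs and Kwen Lun",
        "Publisher": "Royal Geographical Society",
        "Document Type": "Book Chapter",
        "Language": "English",
        "Subject": "Geography",
    },
]

# Read the template DSL generated, keeping its column order intact.
with open("metadata-template.csv", newline="", encoding="utf-8") as f:
    reader = csv.DictReader(f)
    fieldnames = reader.fieldnames
    rows = list(reader)

# Fill in only the columns that actually exist in the template; the
# Document ID column is never modified because it is not in ROWS_METADATA.
for row, metadata in zip(rows, ROWS_METADATA):
    for column, value in metadata.items():
        if column in row:
            row[column] = value

# Write the completed spreadsheet back out as a .CSV file for upload.
with open("metadata-template.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=fieldnames)
    writer.writeheader()
    writer.writerows(rows)
```

Whether you complete the template by hand or with a script like this, the file you upload in the next step must still be a .CSV.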
- In DSL, in the Apply Metadata window, in step 2, click on Browse.
- Navigate to where you have saved metadata-template.csv, select it, and click Open.
- Now that step 2 has changed from "No File Chosen" to "metadata-template.csv", click on Apply Metadata.
- You will then return to the Build page. Click on Manage Uploads.
The two newly added texts, with their metadata, will now be there. Check the boxes next to "Sand Buried Ruins of Khotan" and "Mountain Panoramas..." and click on Add to Content Set.
Select Stein. Close the popup that appears. You're done! You've now uploaded several texts and added metadata in a variety of ways.
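A final note on preparing your own texts for upload: as mentioned in the Text Entry example above, I had to remove a few characters (such as apostrophe variants) before the DSL would accept the text. If you plan to upload many of your own OCR'd files, you may want to normalize such characters in bulk first. The sketch below is one way to do this in Python; the replacement table and the folder names (texts_to_upload and cleaned) are illustrative assumptions, not an official list of the DSL's requirements.

```python
import unicodedata
from pathlib import Path

# Illustrative substitutions: curly quotes and long dashes are replaced with
# plain ASCII equivalents. This is a guess at what characters may need
# cleaning, not an official list of the DSL's requirements.
REPLACEMENTS = {
    "\u2018": "'", "\u2019": "'",    # curly single quotes -> straight apostrophe
    "\u201c": '"', "\u201d": '"',    # curly double quotes -> straight quotes
    "\u2013": "-", "\u2014": "-",    # en and em dashes -> hyphen
}

def clean_text(raw: str) -> str:
    """Apply Unicode normalization, then swap characters that may not upload cleanly."""
    text = unicodedata.normalize("NFKC", raw)  # also folds ligatures such as the 'fi' glyph
    for bad, good in REPLACEMENTS.items():
        text = text.replace(bad, good)
    return text

# Read every .txt file in texts_to_upload/ and write a cleaned copy to cleaned/.
source_dir = Path("texts_to_upload")   # hypothetical folder of your OCR'd texts
output_dir = Path("cleaned")
output_dir.mkdir(exist_ok=True)

for path in source_dir.glob("*.txt"):
    cleaned = clean_text(path.read_text(encoding="utf-8", errors="replace"))
    (output_dir / path.name).write_text(cleaned, encoding="utf-8")
```

You would then upload the files from the cleaned folder with Choose Files, exactly as in the steps above.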
Contact
Remember that if you would like help or want to take any of the DSL's tools further in your own analysis, you can always contact Digital Scholarship Services.