Digital Scholar Lab - Collections

Return to the main Gale Digital Scholar Lab tutorial

This tutorial shows you how to create a collection in the Digital Scholar Lab, both by searching Gale Primary Sources and by uploading your own texts.

Build a Collection

The DSL has access to Gale's extensive digital collections of newspapers, books, and other archival material. In this tutorial, we will focus on texts by and about an early twentieth-century scholar named Sir Aurel Stein, who travelled throughout South and Central Asia and China as an agent of the British Empire. Many materials related to him (works by him, newspaper articles about his expeditions, and so on) are available in the DSL, mostly dating from the 1920s to the 1940s. They come from a variety of newspapers (The London Times, Illustrated London News, The Daily Mail, The New York Times) and other digitised holdings from libraries and archives (the British Library, the Smithsonian, the American Antiquarian Society, the National Library of China).

  1. When you’re logged in, you will need to create your corpus (the collection of texts on which you will run statistics and create visualizations). To access the search bar, click on Build in the top-right navigation bar.
    Build button on the DSL homepage
    Although you can use the search bar on the left of the page to get started with a basic search, let’s use the advanced search for a little more precision. Click on View all limiters in Advanced Search under the search bar.
    Click on Advanced Search
  2. You will arrive at the Advanced Search page.
    Advanced search page
    It is structured like a library catalogue search page. You can use Boolean operators like AND, OR, and NOT to create complex search queries. Gale also automatically suggests metadata (e.g. authorship, publication date, language) from their collection as you type.
  3. Note: this is the first screen where you can view DSL's learning videos. Feel free to watch these as you complete these steps to get an introduction to the features available on that page.
    First Learning Center video
  4. Let’s search for Aurel Stein as an Author. Begin by clicking on the dropdown menu on the right of the first field.
    Changing filter from Keyword to Author
  5. Change it from Keyword to Author.
    Select Author as search filter
  6. Then, click in the search bar and type Aurel Stein.
    Once you start typing his name, if you have “Author” selected as the search field, you will see at least four different variations of his name appear.
    Select the first option, “Aurel Stein K. C. I. E.”. 
    Select the first version of Aurel Stein
    Note: DSL automatically suggests metadata like author names from its collections, but only if the correct field (e.g. author, keyword) is selected first. Because different databases have input Stein's name differently, there are four variations of his name and thus four different "authors" when searching. Although this is not ideal, it is common when working with data aggregated from many places. Let’s say that we want everything written by Stein, regardless of how his name appears in the various databases.
  7. Because there are four variations of his name and we want texts that are attributed to any of them, we will use an OR operator. Click on the left dropdown menu and select OR.
    Use OR operation

    Then, repeat the above steps, first changing Keyword to Author, then typing Aurel Stein, and selecting the next variation on his name.
  8. Since there are four variations of his name, we need more than three search terms. Click on Add a Row at the bottom to add a fourth row.
    Add a row

  9. Once you are done, there should be four lines joined with OR. Click Search.
    Click on Search

  10. You now see the search results page.
    Search result page


  11. For each result, you will see its title, its author, its OCR Confidence Percentage (see the note below), and a preview of the text.
    Basic information for each result
  12. Additionally, there is some metadata to the right, including the year of publication and the archive, source, and type of the text.
    Metadata for each result

    Note: OCR refers to Optical Character Recognition. It is a process whereby a program attempts to produce machine-readable text from a scanned document. The OCR Confidence percentage is an overall score that represents Gale’s confidence in the OCR quality of a specific text.

    One factor in their confidence level is the specific OCR algorithm used, since newer OCR algorithms typically perform more accurately than older ones. According to their documentation, over the nearly 20 years Gale has been collecting and scanning documents they have used a variety of OCR algorithms. Gale DSL currently uses Adobe Acrobat with ABBYY5 to OCR scanned documents.

    The OCR Confidence percentage additionally relies on other factors, such as the condition of the original document, the quality of the scan, what kind of text is featured in the document, and whether or not there are images in a document.

    A caution: the confidence level is useful but not perfect, as some documents can have a lower confidence than they deserve (if they feature lots of images), and some documents with high confidence can still include OCR mistakes. In other words, the confidence percentage is not a replacement for human eyes. Some of the oldest scans won’t have a confidence percentage at all, since they predate that system.

  13. Let’s use all of these texts. Click Select All. 
    Click on Select All

  14. Then click Add to Content Set on the top-right corner.
    Click Add to Content Set


  15. Select New Content Set.
    Select new content set


  16. Name it Stein and click Create. Close the notification that follows.
    Name the content set Stein

  17. Now, let's add some additional texts to this corpus. On the main toolbar, click on Build.
    Click on Build button


  18. Then, click on Advanced Search.
    Use advanced search


  19. The first time you used Advanced Search, you found texts by Aurel Stein. Now, let's look for texts about him by searching for his name as a Keyword. As in steps 4-8 above, type Aurel Stein into the search bar, this time with Keyword selected, select one of the variations that appears, and repeat with the next line.
    Search for Aurel Stein as a keyword (not author)


  20. As before, choose the three variations of his name. Be sure to use OR to separate all rows. Then click on Search.
    Use OR to separate and search

  21. Now there are many more texts. The right sidebar menu offers ways of filtering this dataset. Just to keep our English-focused analysis consistent, under Publication Languages, click on English.
    Select Publication Languages filter
    Choose English as publication language

  22. Let's add the remaining documents to our collection. Scroll to the top of the page and click Select All. Then, click Select All (299) Results. The number of results you see may be different.
    Select all documents


  23. Add this to the content set you just created.
    Add documents to the Stein content set


  24. Note that the DSL automatically avoids adding duplicates, notifying you that 286 (of 299) documents were added. Again, your number may be different. Click Close.
    Documents added confirmation window


  25. To view more information on a specific item, click it, and you will be taken to the Doc Explorer view. Click the first item, "Sir Aurel Stein and Central Asia."
    Select the first document


  26. You’ll see the document in its original context on the left, with the text highlighted and the keywords selected. You’ll also get the complete text file on the right.
    Doc viewer, with newspaper scan central and machine-readable text right

  27. By scrolling down, you can see the scanned text on the left, with its text highlighted in light blue, and the specific keywords ("Aurel Stein") highlighted in green.
    Original document with highlighted name

  28. Above, you have options to cite, download, email, and print the text.
    Tool bars in the document page
    If you click on “Learn how this text was created,” you’ll receive basic information on Gale’s OCR process, and a link to further documentation.
    Link for Learn how this text was created

  29. Let’s take a look at our collection. Click on My Content Sets, then click on Stein. You’ll get an overview of where the texts in this corpus come from.
    Navigation bar
    The set Stein under My Content Sets

  30. To return to the list of texts, click on the Documents button.
    Click the document button to see document list

    By clicking on Documents, you can view and manage the collection, removing texts if you wish. Having a collection means that you don’t need to rerun searches every time you log into the DSL; you can edit and refine your collections and rerun past searches.
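
The logic of this section (an OR search over several name variations, followed by adding results to a content set that skips duplicates) can be sketched in Python. This is only an illustration of the idea, not how the DSL is implemented. Apart from "Aurel Stein K. C. I. E.", the name variations are hypothetical placeholders, and the sample documents, the Document class, and the function names are all invented for the example.

```python
from dataclasses import dataclass

# Only the first variation appears in this tutorial; the other three are
# hypothetical placeholders for whatever the DSL suggests as you type.
AUTHOR_VARIANTS = {
    "Aurel Stein K. C. I. E.",
    "Stein, Aurel",
    "Sir Aurel Stein",
    "M. Aurel Stein",
}


@dataclass(frozen=True)
class Document:
    doc_id: str
    title: str
    author: str
    text: str = ""


def search_author_any(docs, variants):
    """Author = v1 OR Author = v2 OR ...: keep documents matching any variant."""
    return [d for d in docs if d.author in variants]


def search_keyword_any(docs, keywords):
    """Keyword = k1 OR Keyword = k2 OR ...: case-insensitive match in the text."""
    lowered = [k.lower() for k in keywords]
    return [d for d in docs if any(k in d.text.lower() for k in lowered)]


def add_to_content_set(content_set, results):
    """Add results keyed by document ID, skipping duplicates; return the count added."""
    added = 0
    for d in results:
        if d.doc_id not in content_set:
            content_set[d.doc_id] = d
            added += 1
    return added


# Invented sample records, standing in for search results.
SAMPLE_DOCS = [
    Document("d1", "Memoir", "Aurel Stein K. C. I. E.", "Memoir by Aurel Stein on maps"),
    Document("d2", "Ruins", "Stein, Aurel", "ruins of Khotan"),
    Document("d3", "News item", "Anonymous", "Sir Aurel Stein returned from Central Asia"),
    Document("d4", "Unrelated", "Anonymous", "weather report"),
]

stein = {}  # the content set, keyed by document ID

by_author = search_author_any(SAMPLE_DOCS, AUTHOR_VARIANTS)
added_first = add_to_content_set(stein, by_author)

about = search_keyword_any(SAMPLE_DOCS, ["Aurel Stein"])
added_second = add_to_content_set(stein, about)

print(f"{added_first} documents were added from the author search")
print(f"{added_second} (of {len(about)}) documents were added from the keyword search")
```

In this sketch, the keyword search finds one document that the author search had already added, so only the new one is counted, which mirrors the "286 (of 299)" message in step 24.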


Upload your own texts

  1. Gale is currently testing a feature whereby you can upload texts of your own and analyze them in the DSL. Let's try it! Begin by clicking on Build in the top toolbar.
    Click Build button

  2. On the right side, there is the Upload box. There are two ways to upload texts: by inputting text directly into DSL or by uploading multiple files simultaneously. We'll try both, but we'll begin with the simpler method. At the top of the Upload box, click on Text Entry.
    Select Text Entry

  3. The Text Entry page has a number of fields. Only the Title and Text fields are necessary, but the other fields are useful, both because they help keep your texts organised and because some of the metadata, such as year and date, are necessary for certain tools. Begin by pasting the following text into the Title and Text fields:
    In the introductory remarks prefixed to this Memoir I have endeavoured to indicate briefly the objects and methods which guided me in the surveys of my three Central-Asian journeys and in the preparation of the maps which contain their final cartographical record.
    It only remains for me to acknowledge with gratitude my manifold obligations for the effective help which alone rendered possible the topographical tasks bound up with my explorations.
    That I was able to plan and carry out those tasks was due to the fact that the Survey of India, accustomed ever since its inception to serve the interests of Help of^nney of geographical research, not only within the vast area forming its own sphere of activity but also beyond the borders of India, supported from the start my aims with the means best suited for them. In Chapter 1, dealing with the history of our surveys, I have had occasion fully to note the services rendered by the experienced Indians whom the various Surveyor Generals deputed with me, and the extent of the help which I received by the provision of instruments, equipment arul funds to meet the cost of their employment. To the Survey of India was due also the compilation and publication of the results brought back by our joint efforts from each successive journey.
    The topographical results thus secured have not only helped me to make my journeys directly profitable for geographical study, they have also greatly facilitated my archaeological explorations in regions which, though largely desolate today in their physical aspects, have yet played a very important part in the history of Asia and its ancient civilizations. But apart from the gratitude I owe for this furtherance of my researches, the fact of my having been able to work in direct contact with the oldest of the scientific departments of India will always be remembered by me with deep satisfaction.
    Ever since in 1899 the proposals for my first Central-Asian journey had received the Government of India's sanction, successive Surveyor Generals did .Surveyor^euorals tlieir best to facilitate the survey tasks of my expeditions. 1 still
    think back gratefully to the very helpful advice and instruction by which the late Colonel St. Georg Gore, R.E., while at Calcutta during the cold weather of 1899-1900, showed his personal interest in the enterprise. His successor as Surveyor General, Colonel F. B. Long, R.E., was equally ready to meet my requests concerning the plans !. had formed for my second and much more extensive expedition of 1906-08.
    But my heaviest debt of gratitude is due to Colonel Sir Sidney Burrard, RE.,
    K.C.S.I., F.R.S., who as Superintendent of the Trigonometrical Survey SidneyCBnrrnrd since 1899 had direct charge of all arrangements for the survey work of my first and second expeditions, and who during his succeeding long term of office as Surveyor General was equally ready to extend to me unfailing support and guidance with regard to the third. Moreover quite as great a stimulus was the thought of his own lifelong devotion to the study of the geographical problems connected with innermost Asia and the great mountain systems which enclose it. I feel hence very grateful for being allowed to dedicate this record of our labours to Sir Sidney Burrard not merely as a most helpful friend and guide but also as a living embodiment of that spirit of scientific research which has never ceased to pervade the Survey of India since the days of Rennell, Lambton, and Everest.
    Copy the Title and Text into boxes

    Note: you have probably noticed that there are some misspellings and other errors in the text above. This is because this text was made by using an OCR algorithm on scanned images of pages. These OCR errors are, unfortunately, a common byproduct of this process, and checking for them manually takes considerable time for even a single text, let alone the hundreds or thousands of texts in a collection. (I removed a small number of illegal characters, including variations on the apostrophe, to make this text compatible with DSL's requirements.) Even so, this is a fairly error-free example. I took this text from Stein's 1923 Memoir on Maps of Chinese Turkistan and Kansu, specifically this copy of Stein's text from the Internet Archive.
  4. Now let's begin adding metadata.
    Metadata fields highlighted
    For each of the following fields, add the following metadata:
    Author: Aurel Stein K. C. I. E.
    Publication Title: Memoir on Maps of Chinese Turkistan and Kansu
    Publisher: Trigonometrical Survey Office
    Document Type: Chapter
    Language: English
    Subject: Geography
  5. Click on the Publication Date field.
    Click on the Publication Date field

    When the calendar window pops up, click on the year. Type 1923 and select that year.
    Select Year

    Next, from the month dropdown menu, select January. In the month view, click on the 1st. 
    Select Month and Date
    Note: we only have the year of publication for this text, but DSL requires a full date, with day and month, for its data. For this reason, we added January 1st, even though in practice you can ignore the month and day.
  6. This text should now have all of its metadata fields complete.
    Click on Create Document button
    Click on the Create Document button at the bottom to upload the text.
    A banner stating "Document Saved" should briefly appear.
    Note: metadata is "data about data," or more specifically, it is information about a dataset (whether that dataset is a book, article, film, image, etc.) such as the author and publication year. It is used by several of DSL's tools (e.g. the Parts of Speech tagger compares the styles of different authors, so it needs texts to have their Author fields filled out). We used "Aurel Stein K. C. I. E." because spelling his name differently, even if it would be recognizable as the same person to a human, will be technically a different author for any computational tools. This particular formulation is one of the existing variations in DSL and is how Stein wrote his name on the title page, so it works best.
  7. Now that this text has been uploaded, let's add it to our collection for further analysis. Click on Add to Content Set.
    Click on Add to Content Set
    From the menu, select Stein.
    Select Stein as the content set

    Once DSL confirms that this text has been added to the Stein collection, click Close.
    Close the pop up window

  8. Let's see where we can review these uploaded texts. Click on Manage Uploads.
    Click on Manage Uploads

    You will now be presented with a list of all texts you have uploaded. So far, there is only one, but we will add more soon. You can also use this page to add texts to Content Sets, Apply Metadata after uploading them, or to Delete them. Click on Build to explore another method of uploading texts.
    Click on Build

  9. First, download these two .txt files by right clicking on the links below and saving them each to your computer (selecting Save Target As or Save Link As):
    download Introductory material
    download Chapter 1, Section 1
    In the Upload box, under Browse for Files to Upload, click Browse.
    Click Browse to choose files

    Navigate to where you downloaded these files, select both of them (hold down CTRL (Windows) or CMD (Mac) to select more than one item at a time), and click Open.
    Select files from local folder

    Note: the example above comes from a Mac computer. Your interface may look different.
  10. DSL will inform you that two texts have been successfully uploaded. Check the box next to Successfully Uploaded (2). Then click on the Add Metadata button.
    Add metadata

  11. In the Apply Metadata window, leave it set to the default radio button, Quick.
    Select the Quick method

    Then, add the metadata as before: 
    Author: Aurel Stein K. C. I. E.
    Publication Date: 01/01/1923
    Publication Title: Memoir on Maps of Chinese Turkistan and Kansu
    Publisher: Trigonometrical Survey Office
    Document Type: Chapter
    Language: English
    Subject: Geography
    Enter all the metadata into blank boxes
    Finally, click Apply Metadata.
  12. The metadata have now been applied, but the new texts have not yet been added to a collection. Check the box next to Successfully Uploaded (2), then click on Add to Content Set.
    Add files to Content Set

    As with step 7 (above), select Stein.
    Select Stein

  13. One final method to apply metadata is to use a spreadsheet. This method is best if you are uploading multiple texts with different metadata. Begin by saving these texts to your computer:
    download khotan.txt
    download panoramas_introductory_note.txt
    Then, in the Upload box, under Browse for Files to Upload, click Browse.
    Click Browse to choose files

    Select both these files and click Open.
    Select files from local folder
  14. As before, click on Successfully Uploaded (2) and then on Add Metadata.
    Add metadata

  15. In the Apply Metadata popup, under "Applying metadata to 2 file(s)," click on the Bulk radio button.
    Select the Bulk method

  16. Under step 1, click on Download Form.
    Click the Download Form button

  17. Depending on your internet browser settings, you might automatically download a file called metadata-template.csv. If you are prompted to download it, choose where you would like to save it and then click Save.
    Save the template form

  18. Open up the file in a program like Excel, Numbers, or OpenOffice Calc.
    Excel interface, with document ID, title, author, and other metadata fields

  19. By using a spreadsheet, you can upload metadata for multiple texts quickly, and unlike the Quick mode, you can give different metadata for each text. This method is especially useful if you already have bibliographic information for many documents from another source, such as an export from bibliographic software like Zotero, Mendeley, or RefWorks.
    Input the following data into the spreadsheet.
    for khotan.txt
    Title: Sand Buried Ruins of Khotan - Chapter IV
    Author: Aurel Stein K. C. I. E.
    Publication Date: 01/01/1904
    Publication Title: Sand Buried Ruins of Khotan
    Publisher: Hurst and Blackett
    Document Type: Chapter
    Language: English
    Subject: Archaeology

    for panoramas_introductory_note.txt
    Title: Mountain Panoramas from the Pamirs and Kwen Lun - Introductory Note
    Author: Aurel Stein K. C. I. E.
    Publication Date: 01/01/1908
    Publication Title: Mountain Panoramas from the Pamirs and Kwen Lun
    Publisher: Royal Geographical Society
    Document Type: Chapter
    Language: English
    Subject: Geography
    Excel with metadata entered, including author, publication information, and subject
    Important notes:
    Do not change any information in the Document ID column, or DSL will be unable to process the data. The Document ID number is different every time you upload a file, even if you upload the same file more than once.

    The Publication Date column only works if the date is in the format DD/MM/YYYY, like 01/01/1904 for the first of January, 1904. As above, you will have to add days and months for the upload to work, even if the original publication only had a year of publication.
    The texts might be in a different order for you than I have listed them. Be sure to match the data to the right row or you'll end up with texts that have the wrong title and year.

  20. Save the spreadsheet. Be sure to save it in .CSV format (which is the format it comes in). Some programs like Excel will ask you to save in another format, but you must save it as a .CSV for DSL to recognise it.
  21. In DSL, in the Apply Metadata window, in step 2, click on Browse.
    Click Choose Files button

  22. Navigate to where you have saved metadata-template.csv, select it, and click Open.
    Select the saved template

  23. Now that step 2 has changed from "No File Chosen" to "metadata-template.csv" click on Apply Metadata.
    Click on Apply Metadata

  24. You will then return to the Build page. Click on Manage Uploads.
    Click on Manage Uploads

    The two newly added texts, with their metadata, will now be there. Check the boxes next to "Sand Buried Ruins of Khotan - Chapter IV" and "Mountain Panoramas..." and click on Add to Content Set.
    Select files and add them to Content Set
    Select Stein. Close the popup that appears.
     You're done! You've now uploaded several texts and added metadata in a variety of ways.
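
If you have many texts, the spreadsheet from steps 18-20 can also be filled in with a short script instead of by hand. The following Python sketch shows one way this might look. It rests on several assumptions that you should check against your own metadata-template.csv: the column names, the idea that the Title column initially holds the uploaded file name, and the Document ID values, which are invented here. It leaves the Document ID column untouched, matches metadata to rows by file name rather than by row order, and writes dates as DD/MM/YYYY.

```python
import csv
import io
from datetime import date

# Assumed column names; compare them with the header row of the real template.
ID_COL = "Document ID"
TITLE_COL = "Title"  # assumed to hold the uploaded file name at first

# Metadata from step 19, keyed by file name.
METADATA = {
    "khotan.txt": {
        "Title": "Sand Buried Ruins of Khotan - Chapter IV",
        "Author": "Aurel Stein K. C. I. E.",
        "Publication Date": date(1904, 1, 1),
        "Publication Title": "Sand Buried Ruins of Khotan",
        "Publisher": "Hurst and Blackett",
        "Document Type": "Chapter",
        "Language": "English",
        "Subject": "Archaeology",
    },
    "panoramas_introductory_note.txt": {
        "Title": "Mountain Panoramas from the Pamirs and Kwen Lun - Introductory Note",
        "Author": "Aurel Stein K. C. I. E.",
        "Publication Date": date(1908, 1, 1),
        "Publication Title": "Mountain Panoramas from the Pamirs and Kwen Lun",
        "Publisher": "Royal Geographical Society",
        "Document Type": "Chapter",
        "Language": "English",
        "Subject": "Geography",
    },
}


def dsl_date(d):
    """Format a date as DD/MM/YYYY, e.g. 01/01/1904."""
    return f"{d.day:02d}/{d.month:02d}/{d.year:04d}"


def fill_template(template_text, metadata):
    """Return the template CSV text with metadata filled in for matching rows."""
    reader = csv.DictReader(io.StringIO(template_text))
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=reader.fieldnames, lineterminator="\n")
    writer.writeheader()
    for row in reader:
        entry = metadata.get(row[TITLE_COL])
        if entry is not None:
            for column, value in entry.items():
                # Never overwrite the Document ID, and skip unknown columns.
                if column == ID_COL or column not in row:
                    continue
                row[column] = dsl_date(value) if isinstance(value, date) else value
        writer.writerow(row)
    return out.getvalue()


# Invented stand-in for the downloaded template; the rows are deliberately
# in a different order from METADATA.
TEMPLATE = (
    "Document ID,Title,Author,Publication Date,Publication Title,"
    "Publisher,Document Type,Language,Subject\n"
    "id-0002,panoramas_introductory_note.txt,,,,,,,\n"
    "id-0001,khotan.txt,,,,,,,\n"
)

filled = fill_template(TEMPLATE, METADATA)
print(filled)
```

To use this with a real file, you would read metadata-template.csv into a string, pass it to fill_template, and write the result back out as a .csv file before uploading it in step 21.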

That's it! You've now uploaded your own texts and located primary sources through advanced searching.

Proceed on to Cleaning