OpenRefine Augmenting Activity 2: Using Reconciliation Services

To learn more about reconciliation services and how you can use them to augment your data, start with the OpenRefine wiki page on reconciliation (https://github.com/OpenRefine/OpenRefine/wiki/Reconciliation). Feel free to read just the introduction and “Basics” sections for now, and return to this resource later if you like.

We are going to try out the reconciliation service that comes with OpenRefine and connects to Wikidata. Wikidata is a free, secondary database that collects structured data to support related websites such as Wikipedia.

The goal of this activity is to reconcile the titles of the books in our dataset with Wikidata. This means we can look up our books in the Wikidata database, link each book to its Wikidata page, and supplement our book information with data from this database. Here is an example of a Wikidata page for one of our books, Pride and Prejudice: https://www.wikidata.org/wiki/Q170583.

Note 1: Complete Augmenting Activity 1 before attempting this activity.

Note 2: In order to complete this activity, you need to be running the latest version of OpenRefine.

In this activity, you are going to:

  1. Reconcile (match) the titles of books with Wikidata
  2. Add new columns of data from Wikidata

  1. Reconcile (match) the titles of books with Wikidata
  1. Make sure that you are using the new project we created from the books.json file in activity 1.
  2. From the title column's drop-down menu, select Reconcile -> Start reconciling...
    Drop down on title (specifically _ - title), go to Reconcile, click on Start reconciling...
  3. You should see the Wikidata service available. This screen is where you could also add links to other reconciliation services by clicking on the Add Standard Service button and providing a URL for the service. We are going to use the Wikidata service, so click on it in the list.
    Under Services, select Wikidata
  4. Here you are given a list of different types that your records could be. Let’s go with literary work. This means that we are matching our titles to literary work entities in the Wikidata database.
    Under the list of entities, select literary work (Q7725634)
  5. We can also provide other information to help with the matching, using the list on the right. Next to our Full Author Name field, check the Include checkbox and start typing the word “author” in the blank field next to it. You should see the option “author...main creator...” appear as you type, so click on it to select it.
    Under "Also use relevant details from other columns", next to Author, check Include, then type author in the box next to it. Select the first item that pops up.
  6. Click on the Start Reconciling button at the bottom of the window.
  7. You should see that values in your title column have changed. If they are now written in blue that means that they matched. They are now actually hyperlinks – click on one to see the corresponding page in Wikidata.
    Some of the titles are now in dark blue, signifying that they are hyperlinked. If you click on one, it will lead you to the Wikidata entry on that work.
  8. Reconciling does not normally match everything perfectly. You should see that 2 new facets have been created on the left. The judgement facet tells you how many matched and how many didn’t match (we can ignore any blanks here). Click on the none option to see the records that didn’t match. The second facet is for the match score. If items have a high score, they are considered more likely to match. Let’s leave this facet alone for now.
    Two reconciliation facets. In the first, "none" is selected.
  9. Now we are just looking at the records that didn’t match. We will have to go through each one and make decisions to find matches manually.
  10. If the title is grey with no options below it other than Create new item, click on Search for match. Try this with the title The Importance of Being Earnest.
    Under The Importance of Being Earnest, click Search for match.
    Here you are presented with a search box. Change what it is searching for by erasing all the text after the word Earnest. Now it should be able to find some potential matches. You can hover over each option to find the one that makes sense to match, in this case the first one, a literary work. Select it. The title is now matched and no longer shows up in our list of unmatched items.
    Select the first option that pops up after shortening the title.
  11. In some cases, several potential matches are listed, each with its match score (out of 100) in parentheses next to it. You can click on each option to decide which is correct. Next to each option there is a box with one checkmark and a box with two checkmarks. If you only want to match this one record/row, select the box with one checkmark. If, on the other hand, the same title appeared several times in our list, we could select the box with two checkmarks to match it everywhere it is found. We know that each book title in our dataset is unique, so we can select the box with one checkmark next to the correct item. Each of the suggested matches is a link to a different Wikidata entry. Be careful when selecting, since the correct option is not always shown. When in doubt, use the Search for match option.
    Demonstration of suggested matches underneath each title. Each suggested title is a link to a Wikidata entry. The suggestions do not always contain the correct match, so exercise caution, and use the Search for match option when in doubt.
  12. Sometimes the options are confusing or may not show up properly. In that case, it is again easier to use Search for match, as we did in our first example, instead of picking an option. Often the book can be found by simplifying the title (removing subtitles and leading articles, such as “the”).
    The Search for match window. Remove "the" from "The Iliad." Then, select the first suggestion that pops up.
    Sometimes the item will be found right away in the search, even though it wasn’t found through the reconciliation process. These are just quirks of the system. So go through the unmatched records and find matches. When done, reset the facets.
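
If you are curious what a reconciliation request for these steps might look like behind the scenes, here is a minimal Python sketch. It assumes the JSON query shape described in the Reconciliation Service API (a query string, a type, and optional property hints), and it assumes that the “author” option from step 5 corresponds to the Wikidata property P50. The function names simplify_title and build_query are made up for illustration and are not part of OpenRefine. Nothing is sent over the network; the sketch only builds and prints the query batch.

```python
import json
import re
from typing import Optional

LITERARY_WORK = "Q7725634"  # the "literary work" type selected in step 4
AUTHOR_PROPERTY = "P50"     # assumption: the "author" property chosen in step 5


def simplify_title(title: str) -> str:
    """Drop a subtitle and a leading article, as suggested in step 12."""
    # Keep only the part before the first colon or semicolon (the main title).
    main = re.split(r"\s*[:;]\s*", title.strip(), maxsplit=1)[0]
    # Remove a leading "the", "a", or "an" followed by whitespace.
    return re.sub(r"^(the|a|an)\s+", "", main, flags=re.IGNORECASE)


def build_query(title: str, author: Optional[str] = None) -> dict:
    """Build one reconciliation query (hypothetical helper, not OpenRefine code)."""
    query = {"query": title, "type": LITERARY_WORK}
    if author:
        # Extra column values are passed as property hints to improve matching.
        query["properties"] = [{"pid": AUTHOR_PROPERTY, "v": author}]
    return query


# A batch of queries keyed by arbitrary identifiers.
batch = {
    "q0": build_query("Pride and Prejudice", "Jane Austen"),
    "q1": build_query(simplify_title("The Iliad")),
}
print(json.dumps(batch, indent=2))
```

Shortening a title before searching is the same trick we used by hand in the Search for match window: a shorter query gives the service fewer chances to disagree with the label stored in Wikidata.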

  2. Add new columns of data from Wikidata
  1. Once you have your data matched to the Wikidata database, a benefit of reconciliation is that you can then easily add additional columns of data to augment your dataset. From the title column's drop-down menu, select Edit columns -> Add columns from reconciled values… (Note: This option only appears in more recent versions of OpenRefine).
    Drop down on title. Edit columns, Add columns from reconciled values...
  2. From this window, you can click on items in the suggested properties list to add those columns of data to your dataset. For example, click on language of work or name. You should see a preview of the data on the right. (Note: Not every book will have data for a given property. Some properties might even be blank for everything in your dataset.) Once you have selected your additional properties, click on OK to add those columns to your dataset.
    Under suggested properties, select "language of work or name", then click OK.
  3. Not only can you select from the suggested properties list, but you can also search for a property. As you search, it should offer a suggestion underneath the search field. If you are unsure what properties are available, check out a sample Wikidata page to see. For example, click on the title for Pride and Prejudice to bring up its linked page.
    Clicking on the title of a work brings you to its Wikidata entry.
    Scroll down the page to see all the properties available. Let’s add one. From the title column's drop-down menu, select Edit columns -> Add columns from reconciled values… Search for “characters” in the search box at the top and select the suggestion that appears. This should add a column with data on the main characters of the book. You should see an example in the preview window on the right. Click on OK to add that column to your dataset.
    This is the Add columns from reconciled values window. First, under Add Property, type characters. Second, select the first suggested option. Third, a preview should appear. Note that not all entries have associated characters, but most will. Finally, click OK.
  4. Try adding more columns from the Wikidata database either by adding suggested properties or by searching. If you add something by mistake, you are able to remove it from the preview window.
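
To make the idea of adding columns from reconciled values more concrete, here is a small Python sketch of the same operation on plain dictionaries. It is only an illustration under stated assumptions: SAMPLE_VALUES is a hand-typed stand-in for what Wikidata might return (not real query results), the qid field and the add_columns function are hypothetical, and joining multiple values with a semicolon is a simplification that OpenRefine itself may lay out differently.

```python
# Hand-typed sample values standing in for Wikidata data (illustrative only).
SAMPLE_VALUES = {
    "Q170583": {  # Pride and Prejudice, the entity linked earlier in this activity
        "language of work or name": ["English"],
        "characters": ["Elizabeth Bennet", "Fitzwilliam Darcy"],
    },
}


def add_columns(rows, properties, values=SAMPLE_VALUES):
    """Return copies of rows with one new column per requested property."""
    extended = []
    for row in rows:
        # Unmatched rows have no entity id, so they get empty cells.
        entity = values.get(row.get("qid"), {})
        new_row = dict(row)  # copy, so the original rows stay untouched
        for prop in properties:
            new_row[prop] = "; ".join(entity.get(prop, []))
        extended.append(new_row)
    return extended


rows = [
    {"title": "Pride and Prejudice", "qid": "Q170583"},
    {"title": "An unmatched title", "qid": None},
]
for row in add_columns(rows, ["language of work or name", "characters"]):
    print(row)
```

Notice that the unmatched row simply ends up with blank cells. This mirrors what you see in the preview window: only rows that were matched during reconciliation can receive data from Wikidata.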

So you can see that although there is some manual work involved in using reconciliation services, they can be an easy way to normalize your data, and then augment it with additional information.

For more reconciliation services to try, check out this list: https://github.com/OpenRefine/OpenRefine/wiki/Reconcilable-Data-Sources

Now you’re ready for Activity 3!
