Table of Contents
Access the High Performance Computing Environment
Accessing the High Performance Computing Environment
If working in high performance computing environment is new to you, we would recommend you attend SciNet workshops to learn more, especially their Intro to SciNet/Niagara/Mist workshop (run periodically) or watch a recording of a previous session.
But here are some steps to get you started:
- You do not need to install any programs or clients to access the environment from a Mac. Access is via Terminal.
- You will use an SSH key to connect. This requires some initial configuration, but once this is done it is both more secure and more convenient. If you have not already generated a key pair, instructions on how to do so can be found here. More detailed instructions are also available on the SciNet wiki. Remember, you'll need create a key-pair on any systems you intend to connect with!
- To login to the remote host, use this command in Terminal:
ssh -i .ssh/myprivatekeyname <computercanadausername>@niagara.scinet.utoronto.ca
. The system will prompt you to enter the passphrase for your key (Note,-i .ssh/myprivatekeyname
is only necessary if you are not using the default key filepath and filename. See complete SSH setup instructions here for more information). - You are now connected to the server
- To log out, type
exit
and press Enter. You are now back in your local environment.
Query the Database via SQL
If SQL is a new concept for you, we would first suggest you learn the basics through a tutorial, such as this one from Tutorial Republic. You may also want to explore the PostgreSQL documentation to help you with your work.
- Once logged in as described earlier, at the prompt, type
module load postgresql
(and press Enter after this and any commands you type into the command prompt going forward)- Note there will be no visible change or messages after running some of these commands
- Type
psql -d cipo
to start up the command-line interface to PostgreSQL and be able to query the Canadian Intellectual Property Office (CIPO) Patent Database - Type
\?
for help with psql commands (whenever you see “—More—” at the bottom of the screen, press the space bar to page through the information) - Let’s try a few of these psql commands. Type
\x
to have a nicer expanded display of the outputs/results - Type
\dt
to see a list of all the tables available to you. To learn more about these tables, you can also consult the documentation - You can type \d <tablename> to display columns for a particular table. Type
\d title
to see all the columns for the title table, for example - Once you have a better understanding of the database’s organization into tables, you can type SQL statements ending with a semi-colon to query the database and see the results. Remember, you can page through results with the spacebar
- Before you begin, you might want to type
\h
for help with SQL commands. This will list all of the SQL commands available. If you want to learn more about a particular command, such as SELECT, type\h select
- Let’s try out a few SQL examples relevant to this particular database. If you want to paste these long statements into Terminal, you can copy them directly from our code snippets.
- If you would like to limit your results to a random sampling (this will speed things up and is great for testing!) the following command can be added to the end of the examples below:
order by random() limit 10
- If you would like to limit your results to a random sampling (this will speed things up and is great for testing!) the following command can be added to the end of the examples below:
- You will know when it is done, as you will see results on the screen or the command prompt will reappear. If you want to re-run a command, a quick way to “re-type” it is to use the up and down arrow keys to cycle through previously entered commands. If the results list seems too long to page through every item, type
q
to stop showing the results. - Important: When querying claim and disclosure tables, these tables should not be accessed by the print to screen statement below, as the large amount of raw text inside of them can break the display. This raw text is large blocks of text that contain HTML throughout. See # 14 below for more information on how to query these tables.
- Note: the queries below are searching for English language values. Each patent document contains a title and an abstract in both English and French. For other fields, values may be in either English or French, and sometimes both.
- Search by Title:
Let’s find patent documents that have the words “solar” and “power” OR "powered" in the title. (The % symbol is a wildcard symbol in SQL when using the ILIKE operator. This also makes your search case insensitive, compared to = which would look for an exact match). TypeSELECT * FROM title WHERE title.text_en ILIKE '%solar%' AND title.text_en ILIKE '%power%';
- to limit your results to a random sample of 10:
SELECT * FROM title WHERE title.text_en ILIKE '%solar%' AND title.text_en ILIKE '%power%' order by random() limit 10;
- to limit your results to a random sample of 10:
- Search by Title words and Document ID:
We can also limit searches based on multiple criteria for different fields. Let’s run the same search as above, but limit it to only publications with a patent number greater than 2 million (Note that document numbers increase over time, where newer patents will have higher document numbers).TypeSELECT * FROM title WHERE title.text_en ILIKE '%solar%' AND title.text_en ILIKE '%power%' AND title.doc_id > 2000000;
- Search by Title words and Document ID, return specific fields only:
We have been selecting all the fields in the title table, but we can instead only pick the ones of interest. Let’s only output the patent document IDs. TypeSELECT title.doc_id FROM title WHERE title.text_en ILIKE '%solar%' AND title.text_en ILIKE '%power%' AND title.doc_id > 2000000;
- Search by Title words and Document ID, but return abstract information as well:
So far these queries have focused on returning data from one table, but you can join tables to get information from multiple tables, such as title and abstract. Let’s run the same search from above, but also get the abstract information in the results. TypeSELECT * FROM title INNER JOIN abstract ON title.doc_id=abstract.doc_id WHERE title.text_en ILIKE '%solar%' AND title.text_en ILIKE '%power%' AND title.doc_id > 2000000;
(note: not all records will contain an abstract)
- Search by Title words and Document ID, but return Abstract and Inventor information as well:
You can join one table to more than one other table to pull in more information into your results. Let’s run the query from example d, but add the inventor information as well. TypeSELECT * FROM title INNER JOIN abstract ON title.doc_id=abstract.doc_id INNER JOIN inventor ON title.doc_id=inventor.doc_id WHERE title.text_en ILIKE '%solar%' AND title.text_en ILIKE '%power%' AND title.doc_id > 2000000;
(Note: This may result in document titles being duplicated if there are multiple inventors on a patent document)
- Search by Title words, Document ID and Inventor country - and return specific fields:
You can also limit searches based on information in multiple tables. Let’s run the same search from above, but also limit to inventors from Canada where the country field in the inventor table is equal to "CA". This time, we're also limiting the fields our query returns, similar to example c. TypeSELECT title.doc_id, title.text_en, title.text_fr, abstract.text_en, abstract.text_fr, inventor.name, inventor.country FROM title INNER JOIN abstract ON title.doc_id=abstract.doc_id INNER JOIN inventor ON title.doc_id=inventor.doc_id WHERE title.text_en ILIKE '%solar%' AND title.text_en ILIKE '%power%' AND title.doc_id > 2000000 AND inventor.country = 'CA';
- Search by Year and PCT status, without returning this information in results:
This can continue to get more complicated. You might want to join a table in order to query a field, but aren’t interested in including the data from that table in the final results. The next example illustrates this and introduces you to nested SELECT statements (which will come in handy later). Often these SQL statements are best deciphered reading from the inside out (i.e., reading what is in the innermost parentheses first and working outwards) and breaking the statement into distinct parts. Note you could also construct this query similar to example f above, the only difference is that the columns from the source table are not included here, but are included in example f. For this example, let’s query the database for patent titles, document IDs and inventor names for patents with a Canadian inventor where the patents are PCT compliant (i.e. filed under the Patent Cooperation Treaty) and the application was filed in 2019 or later. These last bits of information we'll need to verify using the document table. Working inside out, this query finds all the patent documents IDs where the value of pct in the documents table is "t" (this is a boolean field, with values of "t" (True) or "f" (False)), and the value of app_date is greater than 2019-01-01. It then uses these document IDs to subset further, pulling only the information we requested for our final results. None of the information from the document table is returned in those results. TypeSELECT title.doc_id, title.text_en, inventor.name FROM title INNER JOIN inventor ON title.doc_id = inventor.doc_id WHERE inventor.country = 'CA' AND title.doc_id IN (SELECT title.doc_id FROM title INNER JOIN document ON title.doc_id = document.id WHERE document.pct = 't' AND document.app_date > '2019-01-01');
- Search for patents that are related to a subset of other patents:
CIPO data provides parent information. This means that many component patents may be grouped under a single parent patent, or a patent may represent an improvement to an original parent patent. Let’s query the database to find all the patents that have a small subset of patents as parent documents. The subset is similar to example b above, where we'll find all articles that have the word "solar" in the title, but this time we want to retrieve the parent document title information. This could be useful if you want to find parent documents that are related to the development of solar patents, but don't necessary have 'solar' in their name. This query is also similar to example g in that is uses a nested approach. Working inside out, this query finds all the patent documents IDs that match our search, and grabs their parent patent document IDs. It then pulls title table information for these parent patents. These types of queries are intensive and can take a while to run, so this is a very simple and small example to get you started. TypeSELECT * FROM title WHERE title.doc_id IN (SELECT parent_document.parent_app_num FROM parent_document INNER JOIN title ON parent_document.doc_id = title.doc_id WHERE title.text_en ILIKE '%solar%' AND title.text_en ILIKE '%power%');
- Search by Title:
- After running some test SQL statements, if you are happy with the statement, instead of displaying the results on the screen, you can save them to a CSV file instead (unless you are querying claims or disclosure tables - see #14 below). If you want to save the results of our first SQL statement example above to a file called myfirstciporesult.csv, type
\copy (SELECT * FROM title WHERE title.text_en ILIKE '%solar%' AND title.text_en ILIKE '%power%') TO 'mycipoqueryresult.csv' CSV HEADER;
(Notes: You don’t need a semi-colon after the SQL statement in this case, but at the end of the copy command instead. Also, make sure that there is a space between \copy and the open parenthesis. When the command is finished, you will see the word COPY, followed by the number of results.)
- Like when querying claim and disclosure tables, the large amount of raw text inside of these two tables can make it difficult to retrieve and view the data as a CSV. Please use PostgreSQL XML functions to access these data, and save any results as XML documents. For example:
\copy (SELECT * from disclosure where disclosure.doc_id = 3125047) to 'output.xml'
- When you are finished querying the database and saving your results, type
\q
to quit the psql program
Throughout the examples, we have been using plain wildcard pattern matching, but you may want to explore more sophisticated ways to search text as well.
A Note on Query Efficiency: Generally, Postgres is really smart at analyzing what you want to do and querying the database in the most efficient way, so often changing the query structure won't make any difference because Postgres really does the same thing under the hood. One thing that sometimes helps is to increase the number of workers, for example with this command SET max_parallel_workers_per_gather = 16
, but not all operations in a query can be parallelized or parallelized well. Another thing that potentially helps for complex queries is to use temporary tables instead of table variables. For example, rewriting example g to use temporary tables - if this was a large query, it could speed that up from minutes to seconds!
Download the Results
- From the Niagara prompt, type
ls
to list all the files in your personal directory. If you followed the steps above, you should see a csv file you just saved - There are multiple ways to download your data file via Terminal, which are outlined in the SciNet Wiki. If you need to ensure that two datasets remain synchronized, you may want to use rsync or Globus. Otherwise, SFTP or SCP will work well to transfer files back and forth from your local machine to the remove environment.
- To download using SCP:
- Open a new Terminal window that is not connected to Niagara (ie. your local directory), and run the following command:
scp <computecanadausername>@niagara.scinet.utoronto.ca:[filename including extension] /some/local/directory
- For example, to download the files to a SciNet folder in your Documents:
scp doej@niagara.scinet.utoronto.ca:mycipoqueryresult.csv /Users/user/Documents/SciNet
- If prompted, enter your SSH key passphrase
- Note: if you received a permission denied error and are not prompted for your passphrase, try adding -i <privatekeyname> to the scp command:
scp -i <privatekeyname> <computecanadausername>@niagara.scinet.utoronto.ca:[filename including extension] /some/local/directory
- For example, to download the files to a SciNet folder in your Documents:
- Open a new Terminal window that is not connected to Niagara (ie. your local directory), and run the following command:
- You should see a progress bar, which will show 100% once the download is complete
- Now if you go to that directory, you should see your new csv file. Open it up and view your results!
Query the Database via Python
If you would like to programmatically construct your SQL statements (and programmatically manipulate the results), you may prefer to use Python code to query the database.
If Python is new for you, we would first suggest you learn the basics through a tutorial, such as this one from W3Schools. You can also consult our recorded workshop A Friendly Introduction to Python for Absolute Beginners: Part 1, as well as the Setup Instructions (includes how to get slides, workshop files, etc.) & Solutions (packaged in a zip file) for this workshop.
- In your favourite Python editor, write your script and save your file as a .py file. For this example, we will call it mycipopythonscript.py. Here is an example of a Python script that takes a list of companies and finds the patents that they own (download the script - note that you may have to right click to save the Python script instead of viewing the text in a browser tab). This script creates a temporary table of our companies, and then joins that table to the assignee table to find the companies' patent document IDs (this is more efficient than calling multiple SELECT statements in a loop, one for each company). Then this table is joined with the title table to find the patent titles. You will see that this script uses the psycopg2 package and provides information on how to connect to the database. You do not have to specify a username and password, as the system will automatically detect if you have permission. You can use this script and the SQL statement examples above, as guides to create your own Python code to query the database:
# You need a couple of packages to query the database and write a CSV file import psycopg2 import csv # You will need this database name and host information to create a connection to the database database_config = { 'dbname': 'cipo', 'host': 'idb1' } # This is a list of corporate entities we are searching for who are patent owners (assigneees). # Feel free to edit the names to find patent documents owned by companies you are interested in companies = [ 'MAZDA MOTOR CORPORATION', 'TOYOTA JIDOSHA KABUSHIKI KAISHA', 'SUBARU CORPORATION', ] # This section of code uses the psycopg2 package to connect to the database con = psycopg2.connect(**database_config) cur = con.cursor() # This executes a SQL statement that creates a temporary table with our list of corporate entities cur.execute('CREATE TEMPORARY TABLE _company (coname TEXT)') for coname in companies: cur.execute('INSERT INTO _company VALUES (%s)', (coname,)) # This SQL statement joins our list of corporations with the database assigneee table to filter the results # to only the companies we are looking for. This is a more efficient approach than looping through # companies and running multiple SELECT statements cur.execute("SELECT doc_id, name FROM assignee INNER JOIN _company ON assignee.name \ ILIKE '%'||_company.coname||'%'") # This next section of code goes line by line through the results and adds them to a dictionary # data type in python, where the patent document id is the key and the name of the company # is the value. It also prints it out so you can see the data. mylist = dict() while result := cur.fetchone(): print(result) mylist[result[0]] = result[1] # This next section sets up a CSV that we will use to store the results of our final query with open('mycipopythonresults.csv', mode='w', encoding='UTF8', newline='') as csv_file: myheader = ['id','title'] writer = csv.writer(csv_file) writer.writerow(myheader) # This section goes through each item in the dictionary that we created earlier. For each key(docID) # it queries the database to find the title information of their patents. Then it writes that # information into the CSV file. It also prints it out so you can see the data. for x in mylist.keys(): cur.execute("SELECT doc_id, title.text_en, title.text_fr FROM title WHERE title.doc_id=%s", (x,)) while finalresult := cur.fetchone(): print(finalresult) writer.writerow(finalresult) # Finally, all the connections to the database are closed cur.close() con.close()
- Once your Python script is ready, use SCP to to upload from your local computer. This is the same for downloading (as described earlier), but the order of directories is reversed:
scp /your/local/directory/:[filename and extension] <computecanadausername>@niagara.scinet.utoronto.ca:/home/[firstinitialofyourlastname]/<computecanadausername>/<computecanadausername>.
Note: If you are not the Principal Investigator ie. your account was sponsored by another user, you'll need to substitute that person's username in place of the second <computecanadausername>, as well as their first initial in [firstinitialofyourlastname]. In this case:scp /your/local/directory/:[filename and extension] <computecanadausername>@niagara.scinet.utoronto.ca:/home/[firstinitialofyoursponsorslastname]/<sponsorscomputecanadausername>/<computecanadausername>
- For example:
scp /Users/user/Documents/SciNet/mycipopythonscript.py doej@niagara.scinet.utoronto.ca:/home/j/doej/doej
- For example, for a sponsored account (smithp sponsored by doej):
scp /Users/user/Documents/SciNet/mycipopythonscript.p smithp@niagara.scinet.utoronto.ca:/home/j/doej/smithp
- If prompted, enter your SSH key passphrase
- For example:
- Once your script has been uploaded, connect to Niagara as described earlier
- Next we need to set up the environment to run our Python script. Type
module load python/3.9.8
- Next type
virtualenv --system-site-packages myenv
- Next type
source myenv/bin/activate
- Finally type
pip install psycopg2-binary
- Once the package has installed, you are ready to run your Python script. Type
python mycipopythonscript.py
or substitute in the name of your Python script if you called it something else.
(Important Note: If querying is only a small part of the overall task, and the majority of computing effort is going into postprocessing the query results, for example, using natural language processing or graph analysis, to be done in parallel, then there are different ways to run your script that involve submitting it as a job to be run. Feel free to contact us for help.) - It may take a while to run, but when it is finished you will see the command prompt again, and now if type
ls
from the Niagara prompt, you should see a new CSV file created from the Python script. Download the file (as described earlier) and open up the file to see the results
These are just a few examples to help you get started, but of course there is much more you can do. If you have any questions, feel free to contact us.