Data Story Outline

Part 1. The Big Picture

In this first part, we answer the following questions:

  • What does the typical patent holder look like today, and how has that picture evolved since the 1990s?
  • Is a migration of innovators visible in the data over time, e.g. a convergence towards certain innovation centers?
  • How has the number of assignees and inventors evolved over this period?

The goal here is to address the main question that led us to choose this subject for our data story, by clearly showing how innovation today transcends geography.

Approach

This first part focuses on how the data evolves through time from a geographic standpoint, and we decided that maps would be the best way to visualize this evolution. As we have access to clean longitude and latitude data (see the section Data Gathering below), we found the Folium library, and in particular its heat-map plugin, to be the best way to quantify the magnitude of the networks in different geographic zones. One challenge was to show the evolution over the entire time period in an intuitive way. While we first thought about using Folium's layer control so that the user could manually toggle between years, for the purposes of telling a data story, we judged that a timelapse would make the story more fluid. The method we used to realize this is detailed in the section Visualizations below.
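As an illustration, here is a minimal sketch of the kind of single-year heat map we generate (the coordinates and weights are illustrative values taken from the 1990 location counts shown later in this notebook; this is not the actual pipeline code):

import folium
from folium.plugins import HeatMap

# (latitude, longitude, weight) triples: inventor counts per location for one year
locations = [(35.6895, 139.692, 8211),
             (35.4437, 139.638, 4507),
             (37.5833, 139.917, 3699)]

m = folium.Map(location=(30, 0), zoom_start=2)
HeatMap([list(loc) for loc in locations]).add_to(m)
m.save('heatmap_example.html')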

A few time series were also plotted to provide a more quantitative perspective on the data and to give us more confidence in the conclusions we drew from the maps.

Part 2. Peeling back the layers

In this second part, we look at examples of innovation networks for specific patents held by the following assignees:

  • Tencent and Facebook: Comparison of very similar patent applications filed at around the same time. See Story Page
  • Rochester and Dresden Universities: Comparison of similar patents from academic institutions. See Story Page
  • Volkswagen AG, Siemens AG, and the US Navy: Comparison of patents related to autonomous vehicles between two private firms and a US government body. See Story Page
  • Samsung Electronics: Comparison between battery charging technology patents from the same firm, one from 1995 and the other from 2010. See Story Page

By looking at these networks, we answer the following questions:

  • If we take a few different types of companies, government bodies, and academic institutions, and look at the networks supporting some of their patents, what do these networks look like in light of the conclusions from Part 1?
  • Within the same company, how have these networks evolved since the 1990s, if we look at patents similar to those above?
  • Across companies of the same type, to what extent, if any, are their networks similar?

Approach

We came to the conclusion that the most natural approach here was again to explore the data using maps to visualize the innovation networks. The specific patents that serve as starting points for this analysis were chosen manually, using the Google Patents database to look for adequate patents to compare. We again made use of the Folium library, though here we judged that the layer control would be more interesting, as we would only be looking at the 4-5 layers of patent citations supporting each chosen patent.
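As a rough sketch of this layered-map idea (with hypothetical coordinates, not the actual networks or module code), one Folium FeatureGroup per citation layer can be toggled through the layer control:

import folium

m = folium.Map(location=(20, 0), zoom_start=2)
# one FeatureGroup per layer of patent citations, toggled via LayerControl
layers = {'Layer 0': [(37.48, -122.15)],
          'Layer 1': [(39.90, 116.40), (47.61, -122.33)]}
for name, points in layers.items():
    fg = folium.FeatureGroup(name=name)
    for lat, lon in points:
        folium.CircleMarker((lat, lon), radius=3).add_to(fg)
    fg.add_to(m)
folium.LayerControl().add_to(m)
m.save('layers_example.html')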

We thought about representing the networks using a graph-based approach, e.g. with the Python library networkx, but we felt it did not add more information than looking at the distribution of inventor locations with Folium's heat-map plugin.

For both Parts 1 and 2, while the data gathering and data preprocessing were done mostly in parallel and in an iterative fashion, for simplicity we present the data gathering first and then the data preprocessing, both in their final iteration.

Data Gathering

PatentsView Database

The database offers a wide range of features for all patents since 1976, which can be extracted through its API.

The most relevant documentation can be found here:

For each research question, we came up with a list of output fields that the API calls needed to return, as well as a list of input filters that limit the amount of extra preprocessing we had to do, while still providing all the information necessary to cover the topics listed in the first section.

The complete list of fields is as follows:

  • Part 1. For each patent in a given timeframe, we need:

    • cited_patent_number: Patent number of the cited patents.
    • inventor_latitude: Latitude of all the inventors as listed on the selected patent.
    • inventor_longitude: Longitude of all the inventors as listed on the selected patent.
    • patent_type: Category of patent (see below).
    • assignee_organization: Organization name, if assignee is organization.
    • assignee_type: Classification of assignee (see below).
  • For Part 2, the list is the same, except for the last two items, which are not needed there.

The assignee_type field allows us to have a good picture of the typical patent rights holder. The assignee_type categories are as follows:

  • '2': US company or corporation
  • '3': foreign company or corporation
  • '4': US individual
  • '5': foreign individual
  • '6': US government
  • '7': foreign government
  • '8': US county government
  • '9': US state government
  • '1x': part interest (a code prefixed with 1 denotes partial interest)

We also include the patent_type field because we want to be able to distinguish between the major patent categories:

  • 'Defensive Publication': "... an intellectual property strategy used to prevent another party from obtaining a patent on a product, apparatus or method for instance." ( wikipedia )
  • 'Design': "... legal protection granted to the ornamental design of a functional item" ( wikipedia )
  • 'Plant': covering any "new variety of plant" ( wikipedia )
  • 'Reissue': correction of "a significant error in an already issued patent" ( uslegal )
  • 'Statutory Invention Registration': "for publishing patent applications on which they no longer felt they could get patents" ( wikipedia )
  • 'Utility': patent protecting a "useful" invention ( wikipedia )

Utility patents are the most relevant for us. They protect "any new and useful process, machine, manufacture, or composition of matter, or any new and useful improvement thereof" ( 35 U.S. Code § 101 ). We chose to exclude the reissue patent type from our results, as reissues are not truly innovations, but only corrections to already issued patents. This category is thus omitted directly when calling the API.

We chose to query for latitudes and longitudes instead of locations in city-state-country format, as the former are more useful for visualization purposes. Furthermore, during our preliminary data analysis, the data in the city-state-country format was not uniform (i.e., some cities were named in full, while others were abbreviated in different ways). The latitude and longitude data, on the other hand, is uniform, and its missing values were also easier to clean than with the name format.

We also needed to be able to query for patents based on the following filters:

  • patent_number: US Patent number, as assigned by USPTO.
  • app_date: Date the patent application was filed (the filing date).

For the purposes of situating the patent data in time, we chose to consider the date on which the patent application was filed, rather than the date on which it was granted. The reasoning behind this choice is that at the time of the application, the innovation supporting it already exists. So while using the application date reduces the amount of most recent data (post-2014) we can work with, in our opinion it paints a more vivid picture of innovation. Furthermore, as our analysis is performed on data spanning almost three decades, we are confident this choice is not detrimental to our story.

Working with the PatentsView API

The maximum number of results per query through the PatentsView API is 10,000 per page, capped at 10 pages. (As a rule of thumb, to avoid hitting this limit when querying for all patent applications in a given date range, we need to keep the range to about a quarter of a year.) So while the PatentsView API made it easy to extract data, the usage limits meant we had to implement some functions to automate the incremental extraction and saving of all the data we needed.
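For illustration, a single quarter of data could be pulled page by page along the following lines (a simplified sketch of what our pipeline functions do, not their actual code; the query syntax follows the PatentsView API documentation):

import json
import requests

URL = 'https://api.patentsview.org/patents/query'
# all applications filed in Q1 1990, excluding reissues
query = {'_and': [{'_gte': {'app_date': '1990-01-01'}},
                  {'_lte': {'app_date': '1990-03-31'}},
                  {'_neq': {'patent_type': 'reissue'}}]}
fields = ['patent_number', 'patent_type', 'inventor_latitude',
          'inventor_longitude', 'assignee_organization',
          'assignee_type', 'cited_patent_number']

for page in range(1, 11):  # results are capped at 10 pages of 10,000
    r = requests.get(URL, params={'q': json.dumps(query),
                                  'f': json.dumps(fields),
                                  'o': json.dumps({'page': page, 'per_page': 10000})})
    data = r.json()
    if not data.get('patents'):
        break
    # ... save data['patents'] for this page to disk ...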

The following implemented functions can be found in the pipeline.py module.

  • patentsviewAPI: puts together the query string, the output fields string and the options string, and then extracts and saves the data returned by the PatentsView API in json format. The following functions are called by patentsviewAPI:
    • query: Forms the query filter string to pass into get_data.
    • get_data: Extracts and saves data from the PatentsView API.

The saved json data is of the following format:

  • page number in format '1', '2', ...
    • 'patents': list of patents in page. For each patent:
      • 'patent_type'
      • 'inventors': list of inventors listed for the patent. For each inventor:
        • 'inventor_key_id'
        • 'inventor_latitude'
        • 'inventor_longitude'
      • 'assignees': list of assignees listed for the patent. For each assignee:
        • 'assignee_key_id'
        • 'assignee_organization'
        • 'assignee_type'
      • 'cited_patents': list of other patents cited by the current patent. For each cited patent:
        • 'cited_patent_number'
    • 'count': number of results in the page
    • 'total_patent_count': total number of patents referenced in the results

Data Preprocessing

Part 1.

The following functions in the pipeline module are specifically for preprocessing the data required for Part 1:

  • load_data: converts the saved json data from the PatentsView API to the format that is most useful in answering the research topics stated in Part 1, by calling on the following functions:

    • get_full_year_data: Fetches data for a full year, one quarter at a time, to deal with the PatentsView limits.
    • preprocess_data: Preprocesses saved json data from file to the format used for the data analysis.
  • get_ts: Extracts time series data from the dataset. The extracted time series are as follows:

    • Number of patent applications for each year
    • Number of inventors listed in patent applications in each year
    • Number of patents cited in patent applications filed in each year
    • Number of utility patent applications in each year
    • Number of design patent applications in each year
    • Number of individuals listed as assignees in patent applications in each year
  • get_all_locations: Returns list of locations and the number of inventors at each location for the given data, in the format required by the Folium heatmap plugin.

  • get_top_k_locations: Returns list of locations and the number of inventors at each location, considering only the data from the top k assignees, ranked by number of patent applications.

  • get_assignee_ts: Returns the time series of the number of inventors for the specified assignees.

The preprocessing is straightforward, as the PatentsView API returns data in json format. The preprocessed data (output of load_data) has the following structure:

  • for each YEAR
    • dataframe: proportion of patents in each patent type.
    • dataframe: unique list of all inventor locations, with, for each location, the number of inventors listed in patent applications who were based at that location at the time of the application.
    • int: total number of patent applications.
    • int: total number of citations made in all the patent applications.
    • int: total number of inventors listed in the patent applications.
    • dataframe: information on assignees. For each unique assignee:
      • name of the assignee, if an organization
      • type of the assignee (see PatentsView Database section, above)
      • counter containing the number of inventors based in each location
      • number of patent applications
      • number of citations made in patent applications

Missing Data and Inconsistencies

The preprocess_data function takes care of the missing data and inconsistencies that were found during the data exploration phase. These particularities are summarized as follows.

  • We discard the data belonging to assignees which have no assignee_id attributed.
  • We discard the data with NaN as assignee_type.
  • We do not count the cited patents which have no corresponding patent_number; these are a negligible amount.
  • We discard data which have either None or '0.1' in either inventor_latitude or inventor_longitude, as this data is useless for the visualizations.

We show below that the discarded amount is negligible compared to the total amount of data.
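In essence, these rules boil down to simple predicates of the following form (a hypothetical sketch, not the actual preprocess_data code):

# hypothetical helpers illustrating the discard rules above
def keep_inventor(inventor):
    # drop inventors with missing or '0.1' placeholder coordinates
    return (inventor.get('inventor_latitude') not in (None, '0.1')
            and inventor.get('inventor_longitude') not in (None, '0.1'))

def keep_assignee(assignee):
    # drop assignees with no id or no type attributed
    return (assignee.get('assignee_key_id') is not None
            and assignee.get('assignee_type') is not None)

def keep_citation(cited):
    # drop cited patents with no patent number
    return cited.get('cited_patent_number') is not None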

Required imports for loading the data for Part 1.

from pipeline import load_data, get_ts

Define local data folder and data year range.

MY_PATH = '/media/dcm/HDD/ADA_DATA'
MIN_YEAR = 1990
MAX_YEAR = 2016
year_range = range(MIN_YEAR,MAX_YEAR + 1)

Here we load the full dataset. As the data is already all on disk, the preprocessing takes less than an hour to complete. If no data were yet saved to disk, running this next line for the full time range (1990-2016) would take a few hours.

full_year_data = load_data(year_range, MY_PATH)
full_year_data.keys()
dict_keys(['1990', '1991', '1992', '1993', '1994', '1995', '1996', '1997', '1998', '1999', '2000', '2001', '2002', '2003', '2004', '2005', '2006', '2007', '2008', '2009', '2010', '2011', '2012', '2013', '2014', '2015', '2016'])
full_year_data['1990'].keys()
dict_keys(['num_by_patent_type', 'locations', 'num_patents', 'num_citations', 'num_inventors', 'assignees', 'discarded'])
full_year_data['1990']['locations'].sort_values(by=0, ascending=False).head()
                        0
index
(35.6895, 139.692)  8211
(35.4437, 139.638)  4507
(37.5833, 139.917)  3699
(34.6971, 135.498)  2682
(35.5298, 139.702)  1879
full_year_data['1990']['assignees'].sort_values(by='patents', ascending=False).head()
organization type inventors_loc patents citations
338389 Kabushiki Kaisha Toshiba 3 {(35.1814, 136.906): 4, (35.6895, 139.692): 21... 1121 1097
164304 Canon Kabushiki Kaisha 3 {(35.5298, 139.702): 141, (35.4437, 139.638): ... 1055 1047
299786 Hitachi, Ltd. 3 {(35.7225, 140.1): 3, (43.9333, 143.717): 6, (... 1038 1010
222962 Eastman Kodak Company 2 {(43.0987, -77.4419): 38, (42.9981, -78.1875):... 919 907
63181 International Business Machines Corporation 2 {(34.8667, 134.633): 2, (41.584, -73.8087): 8,... 899 887

Verifying the amount of discarded data.

discarded_patents = 0
discarded_citations = 0
discarded_assignees = 0
discarded_inventors = 0

for year in year_range:
    discarded = full_year_data[str(year)]['discarded']
    discarded_patents += discarded[0]
    discarded_citations += discarded[1]
    discarded_assignees += discarded[2]
    discarded_inventors += discarded[3]
discarded_patents, discarded_citations, discarded_assignees, discarded_inventors
(12842, 298046, 122, 83454)

Looking at the time series graphs in the data story, we conclude that these numbers are negligible compared to the total amount of data. To be sure, we need to take into account that the above numbers are aggregated over all years, so if the same assignee were discarded every year, it would be counted 27 times. In the data we do use, we have around 150k patents, 1.25M citations, 20k assignees, and 300k inventors each year. Multiplied by 27 years, these totals make the numbers above negligible.
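A quick back-of-the-envelope check makes this concrete, using the approximate per-year totals quoted above:

# approximate per-year totals from above, aggregated over the 27 years
totals = {'patents': 150_000 * 27, 'citations': 1_250_000 * 27,
          'assignees': 20_000 * 27, 'inventors': 300_000 * 27}
discarded = {'patents': 12842, 'citations': 298046,
             'assignees': 122, 'inventors': 83454}
for key in totals:
    print(f'{key}: {100 * discarded[key] / totals[key]:.2f}% discarded')
# patents: 0.32%, citations: 0.88%, assignees: 0.02%, inventors: 1.03%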

Part 2.

The following functions in the pipeline module are specifically for preprocessing the data required for Part 2:

  • load_layers_data: Loads data from disk and converts the required data from json to dataframe.
    • get_layers_data: Fetches all data, one layer at a time.
      • get_cited_patents_data: Fetches the data for the given list of patent_numbers from the PatentsView API.
      • preprocess_layer_data: Preprocesses saved json data from file to the format used for the data analysis.

The preprocessed data (output of load_layers_data) has the following structure:

  • for each LAYER:
    • a list of the patents cited by the patents in that layer
    • a unique list of inventors and their locations for the patents in that layer

Required import for loading the data for Part 2.

from pipeline import load_layers_data

Comparison between Tencent and Facebook. Both patent applications were filed in 2013, and both are for a display screen with a GUI.

facebook = load_layers_data(filename = 'facebook', patent_number = ['D725666'], layers = 4, data_dir = MY_PATH)
/media/dcm/HDD/ADA_DATA facebook_layer0.json
already on file
/media/dcm/HDD/ADA_DATA facebook_layer1.json
already on file
/media/dcm/HDD/ADA_DATA facebook_layer2.json
already on file
/media/dcm/HDD/ADA_DATA facebook_layer3.json
already on file
saving data
tencent = load_layers_data(filename = 'tencent', patent_number = ['D725670'], layers = 4, data_dir = MY_PATH)
/media/dcm/HDD/ADA_DATA tencent_layer0.json
already on file
/media/dcm/HDD/ADA_DATA tencent_layer1.json
already on file
/media/dcm/HDD/ADA_DATA tencent_layer2.json
already on file
/media/dcm/HDD/ADA_DATA tencent_layer3.json
already on file
saving data
len(tencent['0']['inventors']), len(tencent['3']['inventors'])
(6, 865)
len(facebook['0']['inventors']), len(facebook['3']['inventors'])
(5, 4528)

The data for the remaining comparisons was obtained in the same fashion. We do not include all of the code, as it would be redundant in this notebook.

Visualizations

Part 1.

The following functions in the visualizations module are specifically for Part 1:

  • get_html: Produces an HTML rendering of a Folium map of the distribution of inventors for the given view (All Assignees, all or top K US Assignees, or all or top K Non-US Assignees) and the given year, and saves it to the working directory.

    • If viz is 'All' and K is None, produces a map showing the locations of all inventors
    • If viz is 'All' and K is k, produces a map showing the locations of inventors for the top k assignees, as well as a CSV file containing the list of those assignees and their number of patents, and returns the total number of inventors for those assignees
    • If viz is 'US Assignees' or 'Non-US Assignees', K is not optional
  • get_timeseries_fig: Produces a figure with the following time series plots and saves to working directory:

    • Number of utility patents vs design patents vs other patents
    • Total number of inventors
    • Number of inventors for the top K Non-US/US Assignees
    • Total number of citations
    • Total number of individuals listed as assignees
  • get_assignees_plot: Produces a figure comparing the time series plots of the number of inventors for two lists of assignees, and saves to working directory.

from visualizations import get_html, get_timeseries_fig, get_assignees_plot

The following lines produce the HTML figures for the maps in Part 1. In the data story, only the first map is a live Leaflet map; the rest are timelapse GIFs created manually from these HTML files. To create the timelapses, we first converted the HTML files to PNG using image-online-convert (which was the only good-quality free service that did not impose any limits for this type of job). We then used the open-source image editor GIMP to assemble a timelapse of these images with the year shown.

# World View
for year in year_range:
    get_html(full_year_data,'All',year)
# Zoom on Asia
for year in year_range:
    get_html(full_year_data,'All',year, zoom_on = (25,120,3.5))
# Zoom on Europe
for year in year_range:
    get_html(full_year_data,'All',year, zoom_on = (53,18,3.5))
# Zoom on US
for year in year_range:
    get_html(full_year_data,'All',year, zoom_on = (40,-90,3.5))
# World View, Top 10 Assignees US and Non-US
num_inventors_top10_us = []
num_inventors_top10_nonus = []

for year in year_range:
    num_inventors_top10_us.append(get_html(full_year_data,'US Assignees',year, k = 10))
    num_inventors_top10_nonus.append(get_html(full_year_data,'Non-US Assignees',year, k = 10))
# time series figure
get_timeseries_fig(full_year_data, year_range, num_inventors_top10_us, num_inventors_top10_nonus)

To create the table containing the names of the top 10 assignees for each year, we manually copy-pasted the per-year lists produced by get_html() (when passing an argument for K) into a single CSV file, and then converted that file to HTML using Pandas.

import pandas as pd
# top assignees tables
text_file = open("top_us_table.html", "w")
text_file.write(pd.read_csv(MY_PATH + '/Results/Top10_US.csv', index_col=0).to_html(index = False, 
                                                                                    col_space=200, 
                                                                                    justify = 'left'))
text_file.close()

text_file = open("top_nonus_table.html", "w")
text_file.write(pd.read_csv(MY_PATH + '/Results/Top10_NonUS.csv', index_col=0).to_html(index = False, 
                                                                                    col_space=200, 
                                                                                    justify = 'left'))
text_file.close()

For the last plot of Part 1, we arbitrarily chose a handful of the top assignees and plotted the evolution of their number of inventors.

# evolution in number of inventors for top assignees plot
assignees_us = ['Intel Corporation', 'Amazon Technologies, Inc.','International Business Machines Corporation',
                'Google Inc.', 'General Electric Company', 'QUALCOMM Incorporated', 'Apple Inc.']
assignees_nonus = ['Samsung Electronics Co., Ltd.', 'Canon Kabushiki Kaisha', 'Sony Corporation',
                   'Taiwan Semiconductor Manufacturing Company, Ltd.','Hon Hai Precision Industry Co., Ltd.',
                   'LG Electronics Inc.','Huawei Technologies Co., Ltd.']
get_assignees_plot(full_year_data, assignees_us, assignees_nonus)

Part 2.

The following functions in the visualizations module are specifically for Part 2:

  • save_layers: Produces an HTML rendering of a Folium map, with or without layer control.

    • If layered is True, it produces a single HTML file in which each Leaflet layer shows the locations of the inventors in one layer of the citation network
    • If layered is False, it produces one HTML file per layer
  • compute_similarities: Produces an HTML table containing the percentage of similarity in patent citations between two patent networks, at each layer.

from visualizations import save_layers, compute_similarities
save_layers(facebook, 'facebook', zoom_on = None, layered = True)
save_layers(tencent, 'tencent', zoom_on = None, layered = True)
compute_similarities(facebook, tencent, 'fb_tencent_sim')
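For reference, the layer-wise overlap reported by compute_similarities can be thought of along the following lines (a simplified sketch, not the module's actual code; the 'cited_patents' key follows the layer structure described above):

def layer_similarity(network_a, network_b, layer):
    # percentage of cited patents shared by the two networks at this layer
    cited_a = set(network_a[str(layer)]['cited_patents'])
    cited_b = set(network_b[str(layer)]['cited_patents'])
    union = cited_a | cited_b
    return 100 * len(cited_a & cited_b) / len(union) if union else 0.0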

Again, the visualizations for the remaining comparisons were obtained in the same fashion, so we include only this example here.