Patents are the largest organized longitudinal collection of scientific, engineering, and technology information in the world. Despite all the vocal discourse on declining patent quality, US patents are the gold standard and remain a key to US competitiveness. Patents are classified by the inventions they describe. Patent classification is an important tool in understanding the scope and nature of invention, figuring out what is patentable and what is not, and learning how to build and use new things. Changing how patents are classified, and the system for making classification choices, is a formidable undertaking that will directly impact the findability of patents. Finding patents has a direct impact on patent quality.
The US Patent & Trademark Office (USPTO), in cooperation with the European Patent Office (EPO), is entering the final stage of full implementation of the new Cooperative Patent Classification (CPC) system. The CPC will replace the venerable US Patent Classification System (USPC) on January 1, 2015. The shift will have a broad impact on how people find patents, how academic research on innovation is conducted, and how science and technology policy decisions are made.
High Level Changes
USPTO, in its two External User Days, explained that there are about 260,000 breakdowns in the CPC vs. 150,000 class/subclass combinations in the USPC. The CPC will allow more granular classification because patent examiners (or the classification contractors who support them) will have more choices and can create more refined collections of symbols that define the invention in the patent.
The classification philosophy has shifted as well. The US Patent Classification selections focus on the content of the claims in light of the rest of the patent disclosure (the description, the drawings, and the abstract). The new Cooperative Patent Classification selections are based on the entire patent disclosure in light of the claims. This is a subtle but important difference that should result in more classifications, or at least different classifications, because the content subject to classification is broader.
The classification data has gone from a mostly numeric class/subclass pair called a class to an alphanumeric identifier called a symbol.
What isn't apparent yet is what this will mean in terms of the number and granularity of the classifications that will actually be assigned to patents. Will there be more classifications on patents now that there are more choices? Will the pool of classifications be more diverse or more granular? Classifications are arranged hierarchically. The system uses the concept of indents to create that hierarchy. A more deeply indented classification implies a more granular and specific designation of the invention.
Anecdotally, it seems that there is an explosion of classification data showing up on patents. Patent 8,689,437 (Method for forming integrated circuit assembly) has five US Patent Classifications (USPC) and two International Patent Classifications (IPC), but 49 Cooperative Patent Classification (CPC) symbols. The statistical mapping from the old USPC classes to the new CPC symbols published by USPTO also suggests that there are more CPC symbols for a single USPC class/subclass pair. But it's not clear what this means. Rather than speculating, we decided to figure it out. First, we'll tackle the number of classifications.
This Week's Snapshot
On August 12, 2014, USPTO issued 1,993 utility patents that contained both USPC and CPC classifications (30.4%). For all practical purposes, the distribution of the number of classifications applied to these patents is identical under both classification systems. Two comparisons follow: the first looks at all classification data, and the second breaks the data down by Technology Center.
All Classification Data Compared
What We Found Here
Here is a tabular form of the accompanying box-and-whisker plots.
|Minimum (All patents have at least one classification.)||1||1|
This week, the numbers and distribution of the CPC classifications are not significantly different from those of the USPC classifications.
Classification Data By Technology Center Compared
Tech Center Breakdown
1600 Biotechnology and Organic Chemistry
1700 Chemical and Materials Engineering
2100 Computer Architecture, Software, and Information Security
2400 Computer Networks, Multiplex communication, Video Distribution, and Security
2800 Semiconductors, Electrical and Optical Systems and Components
3600 Transportation, Construction, Electronic Commerce, Agriculture
3700 Mechanical Engineering, Manufacturing, Products
What We Found Here
A feature of notched box-and-whisker plots is that non-overlapping notches in paired plots indicate that the difference in medians of the two groups of data is probably statistically significant. With this in mind, the median numbers of classifications for Tech Centers 1600, 1700, and 2100 differ: the median number of CPC classifications is greater than USPC in TCs 1600 (4 vs 3) and 1700 (5 vs 4), but USPC is greater than CPC in TC 2100 (3 vs 2). For all other TCs, the median values are either identical or within 0.5 (TC 2400). The range of numbers of CPC classifications between the 25th and 75th percentiles is greater than that for the USPC in TCs 1600, 1700, 2800, and 3700. In other words, the CPC classifications in these TCs show a greater spread across the middle half of their distribution than do the USPC classifications on the same patents. Here is the table corresponding to the chart:
Because more classification symbols are available, which makes classification more granular, and because classification is now based on the entire disclosure in the patent, we would expect to see a statistically significantly greater number of CPC classifications than USPC classifications on those patents that contain both. This is not the pattern observed for this week's grants that contain classifications from both systems, which is consistent with the pattern we have observed thus far. For this grant week, there is no statistically significant difference in the number of CPC vs USPC classifications applied to US utility patents. Stay tuned.
The Notched Box and Whisker Plot Explained
The notched box and whisker plot, a variant of the box plot developed by John Tukey, a mathematician and statistician at Bell Labs, is a way to visualize critical information about a dataset quickly. It is particularly good at displaying characteristics of the distribution of the data when you don't know what to expect. It's very useful for comparing two or more datasets.
How It Works
The lower and upper bounds of the box are the 25th percentile (1st quartile) and the 75th percentile (3rd quartile) of the data, so the box covers the middle 50% of the data values.
The bar in the middle of the box is the median, or 50th percentile: the value that splits the data set into two equal parts. The median is a better estimator of the central tendency of a data set than the average because the average is affected by extreme values in the data (such as chemical patents with 50 classifications).
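To illustrate why the median resists extreme values, here is a minimal Python sketch. The classification counts are made up for illustration and are not taken from the grant data; the last value stands in for a patent with 50 classifications.

```python
from statistics import mean, median

# Hypothetical classification counts per patent (illustrative, not real data).
typical = [2, 3, 3, 4, 4, 5]
# The same patents plus one extreme patent carrying 50 classifications.
with_extreme = typical + [50]

def summarize(counts):
    """Return (mean, median) for a list of per-patent classification counts."""
    return mean(counts), median(counts)

typical_mean, typical_median = summarize(typical)
extreme_mean, extreme_median = summarize(with_extreme)

# The mean is pulled far upward by the single extreme patent;
# the median moves only slightly.
print(f"typical:      mean={typical_mean:.2f} median={typical_median}")
print(f"with extreme: mean={extreme_mean:.2f} median={extreme_median}")
```

In this made-up sample, one extreme patent roughly triples the mean while the median shifts by only half a classification.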
The whiskers extending above and below the box show two limits: the 75th percentile plus 1.5 times the interquartile range (IQR, the 75th percentile value minus the 25th percentile value) for the upper whisker, and the 25th percentile minus 1.5 times the IQR for the lower whisker. These limits capture about 99 percent of the data in a normally distributed data set.
Outliers, which are data points beyond these limits, are depicted by dots, with the top dot representing the data maximum and the lowest dot the data minimum. Patent classifications won't have lower dots unless a patent has NO classifications, since every patent needs at least one classification.
The notch shows the 95 percent confidence interval around the median. A consequence of including the notch is that if the notches (i.e., the 95% confidence intervals around the medians) of two data sets do not overlap, there is strong evidence that the medians of the two data sets are statistically significantly different. Formal statistical tests can be used to verify that difference.
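The whisker limits described above can be sketched in a few lines of Python. This is only an illustration: the counts are hypothetical, the function names are ours, and the quartiles use the "inclusive" interpolation method from Python's statistics module, which may differ slightly from the convention used to draw the charts in this post.

```python
from statistics import quantiles

def whisker_limits(values):
    """Return (lower_limit, upper_limit) using the 1.5 x IQR rule."""
    # quantiles(..., n=4) returns the three quartile cut points: Q1, median, Q3.
    q1, _median, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1  # interquartile range: 75th minus 25th percentile
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr

def find_outliers(values):
    """Return the values that fall outside the whisker limits."""
    lower, upper = whisker_limits(values)
    return [v for v in values if v < lower or v > upper]

# Hypothetical classification counts per patent (illustrative, not real data).
counts = [1, 2, 2, 3, 3, 4, 4, 5, 6, 49]

print(whisker_limits(counts))
print(find_outliers(counts))
```

Because every patent has at least one classification, a lower limit below 1 simply means there can be no low outliers in the sample.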
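As a rough sketch of the notch comparison, the Python below uses a common approximation for the notch: the median plus or minus 1.57 times the IQR divided by the square root of the sample size. That formula is an assumption on our part; the plotting software behind the charts in this post may compute its notches differently. The data and function names are hypothetical.

```python
from math import sqrt
from statistics import median, quantiles

def notch(values):
    """Approximate 95% confidence interval around the median.

    Assumes the common approximation median +/- 1.57 * IQR / sqrt(n).
    """
    q1, _median, q3 = quantiles(values, n=4, method="inclusive")
    half_width = 1.57 * (q3 - q1) / sqrt(len(values))
    center = median(values)
    return center - half_width, center + half_width

def notches_overlap(a, b):
    """Return True if the notch intervals of the two samples overlap."""
    a_low, a_high = notch(a)
    b_low, b_high = notch(b)
    return a_low <= b_high and b_low <= a_high

# Hypothetical classification counts for the same patents under two systems.
system_a = [1, 2, 2, 3, 3, 3, 4, 4, 5] * 4
system_b = [v + 2 for v in system_a]  # same shape, median shifted up by 2

print(notch(system_a))
print(notches_overlap(system_a, system_a))  # identical samples overlap
print(notches_overlap(system_a, system_b))  # shifted sample does not
```

Non-overlapping notches are a visual cue rather than a formal test, so a result like the second comparison would still call for a proper significance test on the paired counts.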