Author: Jorik Blaas
Google Cloud Dataproc is the latest publicly accessible beta product in the Google Cloud Platform portfolio, giving users access to managed Hadoop and Apache Spark for at-scale analytics.
In real life, many datasets come in a format that you cannot easily work with directly. Patents are a typical example: they mix textual documents with semi-structured data. The US Patent Office keeps a publicly available database of all patents, and this post shows how Apache Spark, hosted on Cloud Dataproc, can be used to process this huge collection of data.
Setup
First of all, get gcloud up and running. You may need to update the binary to support the gcloud beta dataproc command by running gcloud components update.
Within Dataproc, you will have your own cluster to run Spark jobs on. To provision this cluster, you can simply run:
gcloud beta dataproc clusters create test-cluster
This will create a Cloud Dataproc cluster with 1 master node and 2 workers, each with 4 virtual CPUs. You can also use the web interface in the Developers Console (under Big Data -> Dataproc).
Basic setup
To check that everything works, let’s first submit a simple hello world. Create a file hello.py with the following content:
#!/usr/bin/python
import pyspark
sc = pyspark.SparkContext()
print sc.version
Then submit it to your cluster:
gcloud beta dataproc jobs submit pyspark --cluster test-cluster hello.py
If all is well, you will see your Python file being submitted to the cluster, and after a while the output will show the Apache Spark version number.
Crunching the USPTO patent database
The uspto-pair repository contains a huge number of patent applications and is publicly available as a Google Cloud Storage bucket (gs://uspto-pair/). However, each individual patent is stored as a .zip file, and to transform this data into a tabular format, each file needs to be decompressed and read individually. This is a natural fit for Spark’s bulk data processing capabilities.
The good news is that Dataproc integrates with Google Cloud Storage transparently, so you do not need to worry about transfers from Cloud Storage to your cluster’s HDFS filesystem.
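To convince yourself that the bucket is readable straight from a Spark job, a minimal sketch like the one below lists a few archives together with their sizes (the 059000* wildcard is just an illustrative sample):
import pyspark

sc = pyspark.SparkContext()
# binaryFiles yields (path, bytes) pairs; gs:// paths are read directly on Dataproc
files = sc.binaryFiles("gs://uspto-pair/applications/059000*")
print files.map(lambda x: (x[0], len(x[1]))).take(3)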
Each file in the USPTO database is just a flat zip file, like so:
1610 2013-06-11T13:19:36Z gs://uspto-pair/applications/05900088.zip
Each of these zip files contains a number of tab-separated files:
Archive: 05900088.zip
Zip file size: 1610 bytes, number of entries: 4
?rw------- 2.0 unx 518 b- defN 13-Jun-11 13:19 05900088/README.txt
?rw------- 2.0 unx 139 b- defN 13-Jun-11 13:19 05900088/05900088-address_and_attorney_agent.tsv
?rw------- 2.0 unx 598 b- defN 13-Jun-11 13:19 05900088/05900088-application_data.tsv
?rw------- 2.0 unx 227 b- defN 13-Jun-11 13:19 05900088/05900088-transaction_history.tsv
4 files, 1482 bytes uncompressed, 992 bytes compressed: 33.1%
The main file of interest is the -application_data.tsv file; it contains a tab-delimited record with all the key information for the patent:
Application Number: 05/900,088
Filing or 371 (c) Date: 04-26-1978
Application Type: Utility
Examiner Name: –
Group Art Unit: 2504
Confirmation Number: 3077
Attorney Docket Number: –
Class / Subclass: 250/211.00J
First Named Inventor: JOSEPH BOREL (FR)
Entity Status: Undiscounted
Customer Number: –
Status: Patented Case
Status Date: 03-06-1979
Location: FILE REPOSITORY (FRANCONIA)
Location Date: 04-14-1995
Earliest Publication No: –
Earliest Publication Date: –
Patent Number: 4,143,266
Issue Date of Patent: 03-06-1979
AIA (First Inventor to File): No
Title of Invention: METHOD AND DEVICE FOR DETECTING RADIATIONS
We will iterate over all of the zip files using Spark’s binaryFiles primitive, and for each file use Python’s zipfile module to get at the contents of the TSV file within. Each TSV is then converted into a Python dictionary with key/value pairs mirroring the original structure.
We have written a script that maps the zip files into data, essentially defining the transformation step from raw to structured data. Thanks to the brevity of PySpark code, we can list it here in full:
import pyspark
import zipfile
import cStringIO

sc = pyspark.SparkContext()

# utility function to process a binary zipfile and extract
# the content of -application_data.tsv from it
def getApplicationTSVFromZip(x):
    fn, content = x
    # open the zip file from memory
    zip = zipfile.ZipFile(cStringIO.StringIO(content))
    for f in zip.filelist:
        # extract the first file of interest
        if f.filename.endswith("-application_data.tsv"):
            return zip.open(f).read()
    # fall back to an empty string if the zip contains no application data
    return ""

# given the -application_data.tsv contents, return a dictionary
# with key/value pairs for the data contained in it
def TSVToRecords(x):
    # tab separated records, with two columns, one record per line
    lines = [_.split("\t", 2) for _ in x.split("\n")]
    oklines = filter(lambda x: len(x) == 2, lines)
    # turn them into a key/value dictionary
    d = {a: b for (a, b) in oklines if b != '-'}
    return d

# load directly from Google Cloud Storage;
# take all of the patents starting with number 0600....
q = sc.binaryFiles("gs://uspto-pair/applications/0600*")
d = q.map(getApplicationTSVFromZip).map(TSVToRecords).repartition(64)
d.saveAsPickleFile("gs://dataproc-UUID/uspto-out-0600")
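This script is submitted to the cluster in the same way as hello.py. Once it has finished, it can be worth peeking at the pickled records before converting them to CSV. A minimal sketch, assuming the same placeholder bucket path as above:
import pyspark

sc = pyspark.SparkContext()
# load the pickled dictionaries back and inspect a few records
d = sc.pickleFile("gs://dataproc-UUID/uspto-out-0600")
print d.count()
for record in d.take(2):
    print record.get("Patent Number"), record.get("Title of Invention")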
We can then convert this intermediate output (all the pickled dictionaries) into a CSV file:
import pyspark
import csv
import cStringIO

sc = pyspark.SparkContext()

# serialize a single record (a list of values) into one CSV-formatted line
def csvline2string(one_line_of_data):
    si = cStringIO.StringIO()
    cw = csv.writer(si)
    cw.writerow(one_line_of_data)
    return si.getvalue().strip('\r\n')

# load the pickled dictionaries produced by the previous job
q = sc.pickleFile("gs://dataproc-UUID/uspto-out-0600")
# collect the union of all keys; this becomes the set of CSV columns
keys = q.flatMap(lambda x: x.keys()).distinct().collect()
print keys
# turn each dictionary into a list of values, using '' for missing fields
records = q.map(lambda x: [str(x.get(k) or '') for k in keys])
csvrecords = records.map(csvline2string)
csvrecords.saveAsTextFile("gs://dataproc-UUID/uspto-out-0600-csv")
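Note that saveAsTextFile writes a directory of part-* files rather than a single file. A quick way to check that the export looks right, again assuming the placeholder bucket path used above:
import pyspark

sc = pyspark.SparkContext()
# textFile reads all part-* files in the output directory
lines = sc.textFile("gs://dataproc-UUID/uspto-out-0600-csv")
print lines.count()   # number of exported patent rows
print lines.first()   # one sample CSV line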
The resulting CSV file will contain one row for each patent (the column order simply follows the collected key list):
Confirmation Number,Patent Number,Attorney Docket Number,Location,Class / Subclass,First Named Inventor,Group Art Unit,Status Date,Status,Issue Date of Patent,Filing or 371 (c) Date,Title of Invention,Entity Status,Location Date,Application Type,AIA (First Inventor to File),Application Number,Examiner Name
7981,"4,320,273",I0105991,FILE REPOSITORY (FRANCONIA),219/999.999,MITSUYUKI KIUCHI (JP),2103,03-16-1982,Patented Case,03-16-1982,01-22-1979,APPARATUS FOR HEATING AN ELECTRICALLY CONDUCTIVE COOKING UTENSILE BY MAGNETIC INDUCTION,Undiscounted,06-10-1999,Utility,No,"06/005,574","LEUNG, PHILIP H"
7994,"4,245,042",NONE,FILE REPOSITORY (FRANCONIA),435/030,,1302,09-27-1980,Patented Case,01-13-1981,01-22-1979,DEVICE FOR HARVESTING CELL CULTURES,Undiscounted,06-10-1999,Utility,No,"06/005,587",
Now that all the patent metadata has been extracted from the zip files, we can load it straight into SynerScope Marcato for further exploration. The set contains a bit over 400,000 patents, covering the period from 1979 to 1988.
By selecting a few key fields and pointing Marcato to the online USPTO database, we can quickly set up a dashboard that shows the relation between patent attributes (such as status, class and location) and patent titles, and explore them all by filing or approval date.
With SynerScope Marcato we can also quickly identify temporal trends and explore the activity of certain groups of patents over time. When selecting a group of patents, keywords that identify the subset pop up directly, truly combining analytics and search.
Patents on resin and polymer compositions, filed in classes 260/029, 260/042 and some others in the same range, seem to have dropped off sharply after 1980.
Deleting your Cloud Dataproc cluster
You only need to run Cloud Dataproc clusters while you actually need them. When you are done with your cluster, you can delete it with the following command:
gcloud beta dataproc clusters delete test-cluster
You can also use the Google Developers Console to delete the cluster.