Dataworks Summit Munich and Dreams Coming True

Author: Monique Hesseling

Last week the SynerScope team attended the Dataworks Summit in Munich, "the industry's premier big data community event". It was a successful and well-attended event, and attendees were passionate about big data and its applicability to different industries. The more technical people learned (or, in the case of our CTO and CEO, demonstrated) how to quickly get the most value out of data lakes. Business folks were more interested in sessions and demonstrations on actionable insights from big data, use cases and KPIs. Most attendees came from the EMEA region, although I also regularly detected American accents.

It has been a couple of years since I last attended a Hadoop/big data event (I believe the last one was in 2013), and it was interesting last week to see the field maturing. Only a few years ago, solution providers and sessions focused primarily on educating attendees on the specifics of Hadoop, data lakes, definitions of big data and theoretical use cases: "wouldn't it be nice if we could...". Those days are gone. Already in 2015, Betsy Burton of Gartner noted in her report "Hype Cycle for Emerging Technologies" that big data had moved quickly through the hype cycle and had become a megatrend, touching many technologies and forms of automation. This became obvious at this year's Dataworks Summit. Technical folks asked how to quickly give their business counterparts access to, and control over, big-data-driven analytics. Access control, data privacy and multi-tenancy were key topics in many conversations. Cloud versus on-premises still came up, although the consensus seemed to be that cloud is becoming unavoidable, with some companies and industries adopting it faster than others. Business people inquired about use cases and implementation successes. Many questions dealt with text analysis, although a fair number of people wanted to discuss voice analysis capabilities and options, especially for call center processes. SynerScope's AI/machine learning case study on machine-aided tagging and identification of pictures of museum artifacts also drew a lot of interest. Most business people, however, had a difficult time coming up with business cases in their own organizations that would benefit from this capability.

This leads me to an observation that was also made in some of the general sessions: IT and technical people tend to see Hadoop/data lake/big data initiatives as a holistic undertaking, creating opportunities for all sorts of use cases in the enterprise. Business people tend to drive their innovation through narrowly defined business cases, which forces them to limit the scope to a specific use case. This makes it difficult to justify and fund big data initiatives beyond the pilot phase. We would probably all benefit if both business and IT considered big data initiatives holistically, at the enterprise level. As was so eloquently stated in Thursday's general session panel: "Be brave! The business needs to think bigger. Big Data addresses big issues. Find your dream projects!" I thought it was a great message, and it must be rewarding for everybody working in the field that we can start helping people with their dream projects. I know that at SynerScope we get energized by listening to our clients' wishes and dreams and turning them into realities. There is still a lot of work to be done to fully mature big data and big insights, and to make dreams come true, but we have all come a long way since 2013. I am sure the next step on this journey to maturity will be equally exciting and rewarding.

Artificial Intelligence ready for “aincient” cultures?

Author: Annelieke Nagel

 

Google, Aincient.org, SynerScope and the Dutch National Museum of Antiquities are creating a revolutionary acceleration in antiquities research

Last Monday I was present at the launch of a fantastic initiative for Egyptian art lovers around the world! A more apt setting was hardly possible, as the presentations were held in front of the Temple of Taffeh, an ancient Egyptian temple built by order of the Roman emperor Augustus.

Egyptologist Heleen Wilbrink, founder of Aincient.org; Andre Hoekzema, Google country manager for the Benelux; and Jan-Kees Buenen, CEO of SynerScope, were the presenters that afternoon.

Aincient.org is the driving force behind this pilot project, so all presentations were geared towards explaining the need to protect world heritage by digitally capturing these art treasures and, even more importantly, by making it possible to research them and accelerate discoveries by merging all data sources. Continued progress of this kind of research also depends on the support of outside funds. (If you are interested, please go to www.aincient.org for further information.)

The current online collection of the Dutch National Museum of Antiquities (Rijksmuseum van Oudheden, RMO) consists of around 57,000 items and can now be searched within hours, in a way not previously possible, thanks to SynerScope's powerful software built on top of the Google Cloud Vision API.

The more in-depth technical explanation of the software and partnerships involved was compelling, as it linked artificial intelligence and deep learning with artifacts and an open mind to make this project possible.

This unique pilot program needed to unlock all available data (text, graphs, photos/video, geo, numbers, audio, IoT, biomed, sensors, social) easily and very fast!

The large group of objects (60,000 in this instance; the RMO has another 110,000 still to do) from various siloed databases was categorized and brought together in SynerScope's data visualisation software: images and texts are simultaneously available, linked to time and location indicators. The system indicates which metadata and descriptions certain items have in common, as well as similarities in appearance.
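
To give a feel for the image side of such a pipeline, here is an illustrative sketch (not the project's actual code) of how labels for a single artifact photo could be obtained with the Google Cloud Vision API's Python client; the file name is made up and the exact calls depend on the client library version:

# Illustrative sketch only: label one artifact photo with the Cloud Vision API
# so the labels can be indexed next to the museum's own metadata.
# Assumes the google-cloud-vision client library and credentials are set up;
# the file name is hypothetical.
from google.cloud import vision

client = vision.ImageAnnotatorClient()

with open("artifact_00123.jpg", "rb") as f:
    image = vision.Image(content=f.read())

response = client.label_detection(image=image)
for label in response.label_annotations:
    print(label.description, round(label.score, 2))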

As CEO Jan-Kees Buenen put it: “At SynerScope, we offer quick solutions to develop difficult-to-link data and databases, making them comprehensible and usable”.

Through Aincient.org the RMO online collection can be linked to external databases from other museums around the world. This generated a lot of interest from museums such as Teylers Museum Haarlem and the Stedelijk Museum Amsterdam, and from the Digital Heritage Foundation (Stichting DEN), all of which were present to absorb the state-of-the-art information. Interestingly enough, some of the Egyptologists present expressed slight scepticism about embracing this new technology to unlock the ancient culture.

We will soon see the outcomes of this research used as a source of inspiration for new exhibition topics, and I am sure they will also progressively serve the worldwide research community.

I believe this latest technology is the future of the past!

Innovation in action: Horses, doghouses and winter time…

Author: Monique Hesseling

During a recent long flight from Europe, I read up on my insurance trade publications. And although I now know an awful lot more about blockchain, data security, cloud, big data and IoT than when I boarded in Frankfurt, I felt unsatisfied by my reading (for the frequent flyers: yes, the airline food might have had something to do with that feeling). I missed real-life case studies, examples of all this new technology in action in normal insurance processes, or its integration into down-to-earth daily insurer practices. Maybe not always very disruptive, but at least pragmatic and immediately adding value.

I know the examples I was looking for are out there, so I got together with a couple of insurance and technology friends and we had a great time identifying and discussing them. For example, the SynerScope team in the Netherlands told me that their exploratory analysis of unstructured data (handwritten notes in claims files, pictures) demonstrated that an unexplained uptick in homeowners claims was caused by events involving horses. Now think about this for a moment: in the classical way of analyzing loss causes, we start with a hypothesis and then either verify or falsify it. Honestly, even in my homeland I do not think that any data analyst or actuary would formulate the hypothesis that horses were responsible for an uptick in homeowners losses. And obviously "damage caused by horse" is not a loss category in the structured claims input under homeowners coverage either. So until not too long ago, this loss cause either would not have been recognized as significant, or it would have taken analysts an enormous amount of time and a lot of luck to identify it by sifting through massive amounts of unstructured data. The SynerScope team figured it out with one person in a couple of days. Machine-augmented learning can create very practical insights.

In our talks, we discovered these types of examples all over the world. Here in the USA, a former regional executive at a large carrier told me that she had found an uptick in house fires in the winter in the South. One would assume that people mistakenly set their houses on fire in the winter with fireplaces, electrical heaters and the like to stay warm. Although that is true, a significant part of the house fires in rural areas was caused by people putting heating lamps in doghouses to keep Fido warm. Bad idea. Again, there was no loss code for "heating lamp in doghouse" in structured claims reporting processes, nor was it a hypothesis that analysts thought to pose. So it took years of trending loss data before the carrier noticed this risk and took action to prevent and mitigate these dreadful losses. Exploratory analysis of unstructured claims file information in a deep machine learning environment, augmented with domain expertise and a human eye (as in the horse example I mentioned earlier), would have identified this risk much faster. We went on and on about case studies like these.

Now, although I am a great believer in and firm supporter of healthy disruption in our industry, I think we can support innovation by assisting our carriers with these kinds of very practical use cases and value propositions. We might want to focus on practical applications that can be supported by business cases, augmented with some less business-case-driven innovation and experimentation. I firmly believe that a true partnership between carriers, instech firms and distribution channels, and a focus on innovation around real-life use cases, will allow for fast incremental innovation and will keep everybody enthusiastic about the opportunities of the many new and exciting technologies available. All while doing what we are meant to do: protecting homes, horses and human lives.

First Time Right

First time right: sending the right qualified engineer to the address of installation.

To ensure a continuous and reliable power supply to households in the coming years, energy providers are replacing old meters with new smart meters. The old, traditional meter is not prepared for the smart future and not suitable for new services and applications that help reduce energy consumption.

A wide range of meters of different ages is currently in use in houses across the country. Some of these are too old or too dangerous, which means only engineers holding special certificates can exchange them for the new smart meters. Currently it is guesswork which type of meter will need to be replaced at the address of installation. So it happens too often that an engineer has to leave empty-handed because he is unable to carry out the planned job. This means the resident must be present to open the door twice, which is very inconvenient. The big question is: how do you send the right qualified engineer first time round?

For inventory reasons, energy companies have in recent years asked their engineers to take photographs of the meters they repaired or exchanged. Over the years, pictures of meter boxes in all shapes and sizes were gathered. SynerScope takes these pictures and adds relevant data from open sources, such as the location of homes, their date of construction and pictures of the neighborhoods. In this way SynerScope creates profiles of where a certain type of meter box can be found. Since not all meter boxes are documented, it is now possible, based on these profiles, to predict the type of meter that needs to be replaced in a given home, and thus to send the right engineer first time round, leading to happy faces for both the resident and the energy company.
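
To make that concrete, a prediction of this kind could be set up roughly as in the sketch below. This is illustrative only, not SynerScope's actual model: the file and column names are made up, and we simply train a classifier on the addresses where engineers already photographed and recorded the meter, then apply it to the undocumented addresses.

# Illustrative sketch only (not SynerScope's actual model): predict the meter
# type for undocumented addresses from open data about the home, training on
# the addresses where engineers already photographed and recorded the meter.
# All file and column names are hypothetical.
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

documented = pd.read_csv("documented_meters.csv")     # address, build_year, house_type, district, meter_type
undocumented = pd.read_csv("undocumented_homes.csv")  # address, build_year, house_type, district

features = ["build_year", "house_type", "district"]
X = pd.get_dummies(documented[features])
y = documented["meter_type"]

model = RandomForestClassifier(n_estimators=200, random_state=0)
model.fit(X, y)

# align the one-hot encoded columns of the unseen homes with the training columns
X_new = pd.get_dummies(undocumented[features]).reindex(columns=X.columns, fill_value=0)
undocumented["predicted_meter_type"] = model.predict(X_new)

# dispatch: only engineers certified for the predicted meter type get the job
print(undocumented[["address", "predicted_meter_type"]].head())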

The Panama Papers: advances in technology leave nowhere to hide

Having identified international politicians, business leaders and celebrities involved in webs of suspicious financial transactions, the International Consortium of Investigative Journalists (ICIJ) is now being asked by tax authorities to provide access to the 11 million leaked documents it has been handling over the past year. Meanwhile, conspiracy theories are running wild over the source of the leak, who insisted on communicating only through encrypted channels.

Data leaks are becoming more common but also getting larger. The Panama Papers leak contained 11.5 million documents that were created between the 1970s and late-2015 by Mossack Fonseca. The 2.6 terabytes of data is equivalent to 200 high-definition 1080p movie files and far larger than the Edward Snowden trove.

Mixing different sources of data

The Panama Papers show a world of tax evasion and tax dodging. The actors achieve their goals by establishing networks of offshore companies, some of which completely hide the ultimate beneficial ownership.

Untangling such networks with conclusive proof on individual entities requires a clear view of activities in rich context. Technology helped the ICIJ dig through vast amounts of digital data, but it was still a slow and tedious process.

The latest technology from SynerScope demonstrates how to deliver speed at scale for such tasks. Its ability to link disparate data quickly results in ultra-rich context, delivered at speed.

Fast delivery of rich context

In record time, SynerScope was able to reveal from the Panama Papers1 all entities (people and companies) and the various relationships and locations involved. The relationships and entities are enriched with the unstructured data of the original text and image files.

We show this in the screenshot below by quickly adding in original patent documentation for those owners whose names also appear in the Panama Papers.

SynerScope presents the mixture of data in a single pane of glass, where each tile interacts with the other:

Tile 1: The original document from the Panama Papers.

Tile 2: Helps to determine the topics specific to your selected (orange) sub-network.

Tile 3: Shows the network in detail.

Tile 4: The original USPTO patent document.

Tile 5: Shows the location of selected (orange) versus non-selected (grey) entities in the network.

Tile 6: Shows the selected sub-network (orange) against the network overview of all connections.

SynerScope can add even more context, such as similar data from other leaks (SwissLeaks, LuxLeaks, OffshoreLeaks) and Chamber of Commerce data from various countries, depending on who is looking. Our technology provides context at high speed, saving thousands of man-days of data research.
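
As a rough illustration of the underlying idea (not SynerScope's implementation, and not the actual ICIJ file format; the file and column names below are hypothetical), leak data published as entity and relationship tables can be turned into a graph, from which a selected sub-network is extracted:

# Hedged sketch: build an entity-relationship graph from nodes/edges tables
# and pull out the sub-network around one selected entity.
# File names, column names and the officer name are placeholders.
import pandas as pd
import networkx as nx

nodes = pd.read_csv("nodes.csv")   # assumed columns: node_id, name, type, country
edges = pd.read_csv("edges.csv")   # assumed columns: source, target, relationship

G = nx.Graph()
for _, n in nodes.iterrows():
    G.add_node(n["node_id"], name=n["name"], type=n["type"], country=n["country"])
for _, e in edges.iterrows():
    G.add_edge(e["source"], e["target"], relationship=e["relationship"])

# all entities reachable from a selected officer: the "orange" sub-network
selected = [nid for nid, d in G.nodes(data=True) if d["name"] == "SOME OFFICER NAME"]
subnet = nx.node_connected_component(G, selected[0])  # replace the name with a real one
print(len(subnet), "entities connected to the selected officer")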

 

1 https://panamapapers.icij.org/

SynerScope illuminates Dark Data

Author: Stef van den Elzen

Nearly every company is collecting and storing large amounts of data. One of the main reasons for this is that data storage has become very cheap. However, while storage may be cheap, the data also needs to be protected and managed, which is often not done very well. Obviously, not protecting the data puts your company at risk. More surprisingly, not managing the data brings an even higher risk. If the data is not carefully indexed and stored, it becomes invisible and underutilized, and is eventually lost in the dark. As a consequence, the data cannot be used to the company's advantage to improve business value. This is what is called dark data: "the information assets organizations collect, process and store during regular business activities, but generally fail to use for other purposes" (Gartner).

The potential of dark data is largely untapped; active exploration and analytics enable companies to implement data-driven decision-making, develop strategy, and unlock hidden business value. However, companies face two main challenges: discovery and analysis.

Discovery

Not only is dark data invisible, it is often stored in separate data silos: isolated per process, department, or application, and all treated the same despite wide variation in value. There is no overview of all data sources or of how they are linked and related to each other. Also, because the silos are detached and the data was stored for operational purposes, it often lacks the structure or metadata needed to determine its original purpose. As a consequence, there is no navigation mechanism to effectively search, explore, and select this wealth of data for further analysis.

Analysis

A large portion of this dark data, roughly 80-90%, is unstructured: in contrast to numbers, it consists of text, images, video, etc. Companies lack the infrastructure and tools to analyze this unstructured data. Business users are not able to ask questions of the data directly but need the help of data scientists. Furthermore, it is important not to analyze one data source in isolation, as currently happens with specialized applications, but to link multiple heterogeneous data sources (reports, sensor, geospatial, time-series, images, and numbers) in one unified framework for better contextual understanding and multiple perspectives on the data.
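
To make the idea of linking heterogeneous sources concrete, here is a minimal sketch (illustrative only, not SynerScope's implementation) that joins sensor time-series, geospatial records and free-text reports on a shared identifier; all file and column names are hypothetical:

# Illustrative sketch: link heterogeneous sources on a shared asset identifier
# so they can be explored together. File and column names are hypothetical.
import pandas as pd

sensors = pd.read_csv("sensor_readings.csv", parse_dates=["timestamp"])  # asset_id, timestamp, value
assets = pd.read_csv("asset_locations.csv")                              # asset_id, lat, lon
reports = pd.read_csv("inspection_reports.csv")                          # asset_id, report_text

# one row per asset: latest reading, location, and number of free-text reports
latest = sensors.sort_values("timestamp").groupby("asset_id").last()
combined = (latest.join(assets.set_index("asset_id"))
                  .join(reports.groupby("asset_id").size().rename("n_reports")))
print(combined.head())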

Enlightenment

The SynerScope solution helps companies overcome the challenges of discovery and analysis and simultaneously helps customers with infrastructure and architecture.

SynerScope serves as a data lake and provides a world map of the diverse and scattered data landscape. It shows all data sources, the linkage between them, similarity, data quality, and key statistics. Furthermore, it provides navigation mechanisms and full text search for effortless discovery of potential valuable data. In addition, this platform enables collaboration, data provenance, and makes it easy to augment data. Once interesting data is discovered and quality is assessed it is selected for analysis.

With SynerScope, all data types (numbers, text, images, network, geospatial and sensor data) can be analyzed in one unified framework. Questions can be answered instantly, as they are formed, using intuitive query and navigation interaction mechanisms. Our solution bridges the gap between data scientists and business users and engages a new class of business users to illuminate the dark data silos for a truly data-driven organization. At SynerScope we believe in data as a means, not an end.

Example SynerScope Marcato multi-coordinated visualization setup for rich heterogeneous data analysis; numbers, images, text, geospatial, dynamic network, all linked and interactive.

 

Visual Analytics with TensorFlow and SynerScope

Author: Stef van den Elzen

TensorFlow is an open source software library for numerical computation using data flow graphs. The project was originally developed by the Google Brain team and was recently made open source. Reason enough to experiment with it.

Due to its flexible architecture, we can use it not only for deep learning but also for generic computational tasks, and these can be run on multiple CPUs/GPUs and platforms. By combining these computations with SynerScope's visual front end, which allows for interactive exploration, we have a powerful, scalable data sense-making solution. Let's see how we can do this.

Often when we load a dataset for exploration we do not know exactly what we are looking for in the data. Visualization helps with this by enabling people to look at the data. Interaction gives them techniques to navigate through the data. One of these techniques is selection. Selection, combined with our multiple-coordinated view setup, provides users with a rich context and multiple perspectives on the items they are interested in. One of the insights we are looking for when we make a selection is

“which attribute separates this selection best from the non-selection”.

Or, in other words: which attribute has values for the selection that are clearly different from the values of the non-selection? We can of course see this visually, in a scatterplot or histogram for example, but if we have thousands of attributes it quickly becomes cumbersome to check each attribute manually. We would like to have a ranking of the attributes. We can obtain one by computing the information gain or gain ratio for each attribute. This seems like a good opportunity to test out TensorFlow.
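
For reference, these are the standard definitions (not specific to our implementation): with S the binary selection label and a numeric attribute A split at a threshold t,

H(S) = -\sum_{c \in \{\text{selected},\,\text{non-selected}\}} p_c \log_2 p_c

\mathit{IG}(S, A, t) = H(S) - \sum_{v \in \{A \le t,\; A > t\}} \frac{|S_v|}{|S|}\, H(S_v)

\mathit{GainRatio}(S, A, t) = \frac{\mathit{IG}(S, A, t)}{\mathit{SplitInfo}(A, t)}, \qquad \mathit{SplitInfo}(A, t) = -\sum_{v \in \{A \le t,\; A > t\}} \frac{|S_v|}{|S|} \log_2 \frac{|S_v|}{|S|}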

Implementation

We implemented the computation of the gain ratio in Python/TensorFlow and discuss the different parts below; the full source code is available at the bottom as an iPython notebook file. First we load the needed modules and define functions to compute the entropy, information gain and gain ratio. Next we define some helper functions, for example to sort a matrix by one column, to find split points, and to count the number of selected versus non-selected items. Then we read the data and compute, for each attribute, the gain ratio and the corresponding split point.
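
The notebook itself is linked under Resources rather than reproduced here; the numpy-only sketch below conveys the gist of the per-attribute computation (a simplification of the hybrid TensorFlow/numpy implementation, not the original code):

# Simplified numpy-only sketch of the per-attribute gain-ratio computation
# (the actual implementation moves parts of this into TensorFlow).
import numpy as np

def entropy(labels):
    # Shannon entropy of a 0/1 selection vector
    p = np.bincount(labels, minlength=2) / float(len(labels))
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

def gain_ratio(values, selected):
    """Best gain ratio over all candidate split points of one attribute."""
    order = np.argsort(values)
    values, selected = values[order], selected[order]
    base = entropy(selected)
    best_ratio, best_split = 0.0, None
    for i in range(1, len(values)):
        if values[i] == values[i - 1]:
            continue  # no split point between equal values
        split = (values[i] + values[i - 1]) / 2.0
        left, right = selected[:i], selected[i:]
        w = i / float(len(values))
        info_gain = base - (w * entropy(left) + (1.0 - w) * entropy(right))
        split_info = -(w * np.log2(w) + (1.0 - w) * np.log2(1.0 - w))
        ratio = info_gain / split_info if split_info > 0 else 0.0
        if ratio > best_ratio:
            best_ratio, best_split = ratio, split
    return best_ratio, best_split

# Example: rank attributes of a random matrix by how well they separate
# a selection (here the selection is defined by the first attribute).
X = np.random.rand(100, 3)
sel = (X[:, 0] > 0.5).astype(int)
for j in range(X.shape[1]):
    print(j, gain_ratio(X[:, j], sel))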

Example

Now let's apply this to a dataset. We take a publicly available dataset[1] of car properties and load it into SynerScope. The dataset contains properties such as the weight of the car, its fuel consumption (mpg), the number of cylinders, horsepower, origin, etc. Now we wonder what separates the American cars from the European and Japanese cars. From the histogram in SynerScope Marcato we select the American cars and run the gain ratio computation.

American Cars

Attribute gainRatio splitPoint
displacement   0.116024601751 97.25
mpg 0.0969803909296 39.049
weight 0.0886271435463 1797.5
cylinders 0.08334618870 4.0
acceleration 0.0801976681136 22.850
horsepower 0.0435058288084 78.0
year 0.00601950896808 79.5

We see that displacement and mpg are the most differentiating factors for American cars. We can verify this by plotting them in a scatterplot; see the figure below, where the orange dots are the American cars.

We could also take the cars from 1980 onward and see what separates them most from the other cars. Here we see that, besides year, the miles-per-gallon usage and the number of cylinders are the most differentiating factors. Again we can see this in the scatterplot.

Cars produced after 1980

Attribute gainRatio splitPoint
year 0.338440834596  79.5
mpg 0.113162864283  22.349
cylinders 0.100379880419  4.0
horsepower 0.0872011414011  132.5
displacement 0.0866493413084   232.0
weight 0.0861363235593  3725.0
acceleration 0.0501698542653

Conclusion

As the key focus of TensorFlow is on deep learning and neural networks, it can sometimes require some creativity to handle more generic computation, such as the information gain metric we used as an example. By using a hybrid approach where data is moved between TensorFlow structures and numpy arrays, we were able to make a performant implementation. We are anxiously monitoring further developments, as it is a fast-moving platform, and we hope that some features that currently only exist on the numpy side, such as argsort, will be available in due time.

For now, the hybrid combination works well enough, and using TensorFlow for the computation and SynerScope Marcato for the visual exploration gives us a much faster route to understanding our data and discovering new patterns.

 

Resources
[1] Dataset: https://mlr.cs.umass.edu/ml/datasets/Auto+MPG
[2] Source code (iPython notebook): InformationGain

 

SynerScope empowers Apache Spark on IBM Power8 to truly deliver deep analytics

Author: Jorik Blaas

Let’s start by introducing the three key components:

  1. SynerScope is a deeply interactive any-data visual analytics platform for Big Data sense-making.
  2. Apache Spark is a lightning fast framework for in-memory analytics on Big Data.
  3. IBM Power8 is a high-bandwidth, low-latency, scalable hardware architecture for diverse workloads.

In a world where the speed and volume of data are increasing by the day, being able to scale is an increasingly stringent demand. Scale is not only about being able to store a large amount of data: as data size grows, it gets gradually more difficult to move data. In classic architectures, running analytics was something you did in your analytic data warehouse, and moving an aggregated, filtered or sampled dataset from your main storage into the data warehouse was an acceptable solution. Now that analytics touches a growing number of data sources, each of ever-increasing size, moving the data is less of an option.

To provide fast turnaround times in deep analytics, the computation has to be moved close to the data, not the other way around. Hadoop has brought this technology to general availability with MapReduce over the past half decade, but it has always remained a programming model that is difficult to understand, as its concepts originated in High Performance Computing.

Apache Spark is the game changer currently moving at incredible speed in this space, as it offers an unprecedented open toolkit for machine learning, graph analytics, streaming and SQL.

While most of the world runs Spark on Intel hardware, Spark as a technology is platform independent, which opens the door to alternative platforms such as OpenPOWER. IBM is heavily committed to developing Spark, as announced last June.

After building Apache Spark on our Power8 machine, we were able to instantly run our existing Python and Scala code. We noticed that the Power8 architecture is especially favorable towards jobs with a high memory bandwidth demand. Using a dataset covering a five-year history of GitHub (100 GB of gzipped JSON files), we were able to churn through the entire set in under an hour, processing over 100 million events. After processing, we can load the resulting dataset into SynerScope for deeper inspection.
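
The exact job is not included in this post, but the kind of PySpark aggregation involved looks roughly like the sketch below; the input path and JSON field names are assumptions (the GitHub event format has changed over the years), not our actual pipeline.

# Hedged sketch: count events per repository from gzipped, line-delimited
# GitHub event JSON. The input path and field names are hypothetical.
import json
import pyspark

sc = pyspark.SparkContext()

# textFile reads .gz files transparently; one JSON event per line
events = sc.textFile("hdfs:///data/github/*.json.gz").map(json.loads)

repo_counts = (events
    .map(lambda e: (e.get("repo", {}).get("name", "unknown"), 1))
    .reduceByKey(lambda a, b: a + b))

# print the 20 most active repositories
for name, count in repo_counts.top(20, key=lambda kv: kv[1]):
    print("%s\t%d" % (name, count))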

The image below shows the top 100,000 most active projects, grouped by co-committers. Projects that share committers are placed close to each other. Interestingly, this type of involvement-based grouping shows very clearly how different programmer communities are separated: the island of iPhone development (in orange) is really isolated from the island of Android developers.

With Spark on Power8 we were able to handle a huge dataset, reduce it to its key characteristics, and make sense of complex mixed sources.

 

Analyzing patents with Google Dataproc and SynerScope

Author: Jorik Blaas

Google Cloud Dataproc is the latest publicly accessible beta product in the Google Cloud Platform portfolio, giving users access to managed Hadoop and Apache Spark for at-scale analytics.

In real life, many datasets are in a format that you cannot easily deal with directly. Patents are a typical example: they mix textual documents and semi-structured data. The US Patent Office keeps a publicly available database of all patents, and this post shows how Apache Spark, hosted on Cloud Dataproc, can be used to process this huge collection of data.

Setup

First of all, get gcloud up and running. You may need to update the binary to support the gcloud beta dataproc command by running gcloud components update.

Within Dataproc, you will have your own cluster to run Spark jobs on. To provision this cluster, you can simply run

 

gcloud beta dataproc clusters create test-cluster

 

This will create a Cloud Dataproc cluster with 1 master node and 2 workers, each with 4 virtual CPUs. You can also use the web interface in the Developers Console (under Big Data -> Dataproc).
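
If the defaults do not fit your workload, the cluster size and machine types can be chosen at creation time. For example (check gcloud beta dataproc clusters create --help for the exact flags available in your gcloud version):

gcloud beta dataproc clusters create test-cluster \
    --num-workers 4 \
    --master-machine-type n1-standard-4 \
    --worker-machine-type n1-standard-4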

Basic setup

First of all, let's submit a simple hello world job to check that everything works. Create a file hello.py with the following content:

 

#!/usr/bin/python
import pyspark

sc = pyspark.SparkContext()
print sc.version

Then submit the script to your cluster:

gcloud beta dataproc jobs submit pyspark --cluster test-cluster hello.py

 

If all is well, you will see your Python file being submitted to your cluster, and after a while you will see the output, showing you the Apache Spark version number.

Crunching the USPTO patent database

The uspto-pair repository contains a huge number of patent applications and is publicly available in Google Cloud Storage (gs://uspto-pair/). However, each patent is stored as a separate .zip file, and in order to transform this data into a tabular format, each file needs to be individually decompressed and read. This is a typical match for Spark's bulk data processing capabilities.

The good news is that Dataproc integrates with Google Cloud Storage transparently, so you do not need to worry about transfers from Cloud Storage to your cluster’s HDFS filesystem.

Each file in the USPTO database is just a flat zip file, like so:

 

 1610 2013-06-11T13:19:36Z gs://uspto-pair/applications/05900088.zip

Each of these zip files contains a number of tab-separated files:

 

Archive: 05900088.zip
Zip file size: 1610 bytes, number of entries: 4
?rw------- 2.0 unx 518 b- defN 13-Jun-11 13:19 05900088/README.txt
?rw------- 2.0 unx 139 b- defN 13-Jun-11 13:19 05900088/05900088-address_and_attorney_agent.tsv
?rw------- 2.0 unx 598 b- defN 13-Jun-11 13:19 05900088/05900088-application_data.tsv
?rw------- 2.0 unx 227 b- defN 13-Jun-11 13:19 05900088/05900088-transaction_history.tsv
4 files, 1482 bytes uncompressed, 992 bytes compressed: 33.1%

 

The main file of interest is the -application_data.tsv file; it contains a tab-delimited record with all key information for the patent:

 

Application Number              05/900,088
Filing or 371 (c) Date          04-26-1978
Application Type                Utility
Examiner Name
Group Art Unit                  2504
Confirmation Number             3077
Attorney Docket Number
Class / Subclass                250/211.00J
First Named Inventor            JOSEPH BOREL (FR)
Entity Status                   Undiscounted
Customer Number
Status                          Patented Case
Status Date                     03-06-1979
Location                        FILE REPOSITORY (FRANCONIA)
Location Date                   04-14-1995
Earliest Publication No
Earliest Publication Date
Patent Number                   4,143,266
Issue Date of Patent            03-06-1979
AIA (First Inventor to File)    No
Title of Invention              METHOD AND DEVICE FOR DETECTING RADIATIONS

 

We will iterate through all of the zip files using Spark's binaryFiles primitive, and for each file use Python's zipfile module to get to the contents of the tsv file within. Each tsv will then be converted into a Python dictionary with key/value pairs mirroring the original structure.

We have written a script to map the zip files into data, essentially defining the transformation step from raw to structured data. Thanks to the brevity of PySpark code, we can list it here in full:

import pyspark
import zipfile
import cStringIO

sc = pyspark.SparkContext()

# utility function to process a binary zipfile and extract 
# the content of -application_data.tsv from it
def getApplicationTSVFromZip(x):
  fn,content = x
  # open the zip file from memory
  zip = zipfile.ZipFile(cStringIO.StringIO(content))
  for f in zip.filelist:
    # extract the first file of interest
    if f.filename.endswith("-application_data.tsv"):
      return zip.open(f).read()

# given the -application_data.tsv contents, return a dictionary
# with key/value pairs for the data contained in it
def TSVToRecords(x):
  # tab separated records, with two columns, one record per line
  lines = [_.split("\t", 2) for _ in x.split("\n")]
  oklines = filter( lambda x: len(x)==2, lines )
  # turn them into a key/value dictionary
  d = {a: b for (a, b) in oklines if b != '-'}
  return d

# load directly from the google cloud storage
# take all of the patents starting with number 0600....
q = sc.binaryFiles("gs://uspto-pair/applications/0600*")
d = q.map( getApplicationTSVFromZip ).map( TSVToRecords ).repartition(64)
d.saveAsPickleFile("gs://dataproc-UUID/uspto-out-0600")
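
The script is submitted to the cluster in the same way as the hello-world job above; the file name below is simply what we happen to call it:

gcloud beta dataproc jobs submit pyspark --cluster test-cluster extract_patents.py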

 

We can then convert the intermediate file (with all the pickled dictionaries) into a CSV file:

 

import pyspark
import csv
import cStringIO

sc = pyspark.SparkContext()

# serialize a single record (a list of values) as one CSV-formatted line
def csvline2string(one_line_of_data):
    si = cStringIO.StringIO()
    cw = csv.writer(si)
    cw.writerow(one_line_of_data)
    return si.getvalue().strip('\r\n')

# load the pickled dictionaries produced by the previous job
q = sc.pickleFile("gs://dataproc-33c9ac81-ff09-4b42-be89-57e56c695739-eu/uspto-out-0600-2")
# collect the union of all field names; these become the CSV columns
keys = q.flatMap(lambda x: x.keys()).distinct().collect()
print keys
# turn every record into a row with a value (or '') for each column
records = q.map(lambda x: [str(x.get(k) or '') for k in keys])
csvrecords = records.map( csvline2string )
csvrecords.saveAsTextFile("gs://dataproc-33c9ac81-ff09-4b42-be89-57e56c695739-eu/uspto-out-0600-csv")

 

The resulting csv file will contain one row for each patent:

 

Confirmation Number,Patent Number,Attorney Docket Number,Location,Class / Subclass,First Named Inventor,Group Art Unit,Status Date,Status,Issue Date of Patent,Filing or 371 (c) Date,Title of Invention,Entity Status,Location Date,Application Type,AIA (First Inventor to File),Application Number,Examiner Name
7981,"4,320,273",I0105991,FILE REPOSITORY (FRANCONIA),219/999.999,MITSUYUKI KIUCHI (JP),2103,03-16-1982,Patented Case,03-16-1982,01-22-1979,APPARATUS FOR HEATING AN ELECTRICALLY CONDUCTIVE COOKING UTENSILE BY MAGNETIC INDUCTION,Undiscounted,06-10-1999,Utility,No,"06/005,574","LEUNG, PHILIP H"
7994,"4,245,042",NONE,FILE REPOSITORY (FRANCONIA),435/030,,1302,09-27-1980,Patented Case,01-13-1981,01-22-1979,DEVICE FOR HARVESTING CELL CULTURES,Undiscounted,06-10-1999,Utility,No,"06/005,587",

 

Now that all the patent metadata has been extracted from the zip files, we can load it straight into SynerScope Marcato for further exploration. The set contains a bit over 400,000 patents, covering the period from 1979 to 1988.

By selecting a few key fields, and by pointing Marcato to the online USPTO database, we can quickly set up a dashboard that shows the relation between patent attributes (such as status, class and location) and patent titles, and explore all of these by filing or approval date.

With SynerScope Marcato we can also quickly identify temporal trends and explore the activity of certain groups of patents over time. When selecting a group of patents, key words directly pop up that identify this subset, truly combining analytics and search.

Patents on resin and polymer compositions, filed in categories 260/029, 260/042 and some others in the same range, seem to have dropped off sharply after 1980.

Deleting your Cloud Dataproc cluster

You only need to run Cloud Dataproc clusters when you need them. When you are done with your cluster, you can delete it with the following command:

 

gcloud beta dataproc clusters delete test-cluster

 

You can also use the Google Developers Console to delete the cluster.