
SynerScope addresses your “white space” of unknown big data in your data-lake

The Netherlands, April 4, 2017 – As every organization is fast becoming a digital organization, powerful platforms that extend the use of data are imperative in the enterprise world.

By implementing SynerScope on top of your Hadoop cluster, you can address the white space of unknown data, thanks to the tight integration between the scalability of Hadoop and the best of SynerScope’s Artificial Intelligence (AI), including deep learning. The result is a reduced Total Cost of Ownership for working with Big Data, while value is extracted from your data lake extremely fast.

As data science work moves into your data lake, you will typically run into data latency problems. Hortonworks covers the lifecycle of data as a core infrastructure play: data in motion, data at rest and data analytics.

Ixiwa, SynerScope’s back-end product, supports and orchestrates the data access layers and lets your whole data lake span multiple services.

Hadoop is a platform for distributed storage and processing. Place SynerScope on top of Hadoop and you gain the benefit of deep learning intelligence through SynerScope’s Iximeer, which bootstraps AI projects by providing out-of-the-box interaction between domain experts, analysts and the data itself.

“AI needs good people and good data to get somewhere, so we basically help AI to make the best decision in parallel with first insight, then basic rules, then tuning”, says CTO Jorik Blaas.

We are proud to announce that as of today Hortonworks has awarded SynerScope all four certifications (Operations, Governance, YARN and Security) for our platform, a first among their technology partners.

For more information about the awarded badges, go to https://hortonworks.com/partner/synerscope/

If you want to know more about us, visit us at the DataWorks Summit in Munich, April 5-6. We would be happy to welcome you at our booth 1001 as well as at the IBM booth 704, and we will be presenting the breakout session “A Complete Solution for Cognitive Business” at 12:20pm in room 5.

About SynerScope:

SynerScope enables users to analyze large amounts of structured and unstructured data. Iximeer is a visual analysis platform that presents insights from AI-driven analysis in a uniform contextual environment, linking together various data sources: numbers, text, sensors, media/video and networks. Users can identify hidden patterns across data without specialized skills.

It supports collaborative data discovery, thereby reducing the effort required for cleaning and modelling data. Ixiwa ingests data, generates metadata from both structured and unstructured files, and loads data into an in-memory database for fast interactive analysis. The solutions are delivered as an appliance or in the cloud. SynerScope can work with a range of databases, including SAP HANA as well as a number of NoSQL and Hadoop sources.

SynerScope operates in the following sectors: Banking, Insurance, Critical Infrastructure, and Cyber Security. Learn more at Synerscope.com.

SynerScope has strategic partnerships with Hortonworks, IBM, NVIDIA, SAP, Dell.

SynerScope is now a certified partner of Hortonworks

We are thrilled to announce the certification of SynerScope on the Hortonworks Data Platform! Hadoop-based data can now be accessed directly from SynerScope.

   

Analyzing patents with Google Dataproc and SynerScope

Author: Jorik Blaas

Google Cloud Dataproc is the latest publicly accessible beta product in the Google Cloud Platform portfolio, giving users access to managed Hadoop and Apache Spark for at-scale analytics.

In real life, many datasets come in a format that you cannot easily deal with directly. Patents are a typical example: they mix textual documents with semi-structured data. The US Patent Office keeps a publicly available database of all patents, and this post shows how Apache Spark, hosted on Cloud Dataproc, can be used to process this huge collection of data.

Setup

First of all, get gcloud up and running. You may need to update the binary to support the gcloud beta dataproc command by running gcloud components update.

Within Dataproc, you will have your own cluster to run Spark jobs on. To provision this cluster, you can simply run:

 

gcloud beta dataproc clusters create test-cluster

 

This will create a Cloud Dataproc cluster with 1 master node and 2 workers, each with 4 virtual CPUs. You can also use the web interface in the Developers Console (under Big Data -> Dataproc).

Basic setup

First, let’s submit a simple hello world to check that everything works. Create a file hello.py with the following content:

 

#!/usr/bin/python
import pyspark

# create a Spark context and print the Spark version
sc = pyspark.SparkContext()
print sc.version

Then submit the job to the cluster:

gcloud beta dataproc jobs submit pyspark --cluster test-cluster hello.py

 

If all is well, you will see your Python file being submitted to your cluster, and after a while you will see the output, showing the Apache Spark version number.

Crunching the USPTO patent database

The uspto-pair repository contains a huge number of patent applications and is publicly available as a Google Cloud Storage bucket (gs://uspto-pair/). However, each patent is stored as a separate .zip file, and in order to transform this data into a tabular format, each file needs to be individually decompressed and read. This is a typical match for Spark’s bulk data processing capabilities.

The good news is that Dataproc integrates with Google Cloud Storage transparently, so you do not need to worry about transfers from Cloud Storage to your cluster’s HDFS filesystem.

Each file in the USPTO database is just a flat zip file, like so:

 

 1610 2013-06-11T13:19:36Z gs://uspto-pair/applications/05900088.zip

Each of these zip files contains a number of tab-separated files:

 

Archive: 05900088.zip
Zip file size: 1610 bytes, number of entries: 4
?rw------- 2.0 unx 518 b- defN 13-Jun-11 13:19 05900088/README.txt
?rw------- 2.0 unx 139 b- defN 13-Jun-11 13:19 05900088/05900088-address_and_attorney_agent.tsv
?rw------- 2.0 unx 598 b- defN 13-Jun-11 13:19 05900088/05900088-application_data.tsv
?rw------- 2.0 unx 227 b- defN 13-Jun-11 13:19 05900088/05900088-transaction_history.tsv
4 files, 1482 bytes uncompressed, 992 bytes compressed: 33.1%
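
To get a feel for the data before writing any Spark code, you can inspect a single application locally. The snippet below is a minimal sketch, assuming you have copied one zip file (for instance 05900088.zip) to your working directory, for example with gsutil cp:

#!/usr/bin/python
import zipfile

# open one patent application archive and list its members
zip = zipfile.ZipFile("05900088.zip")
for f in zip.filelist:
    print f.filename, f.file_size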

 

The main file of interest is the -application_data.tsv file; it contains tab-delimited key/value pairs with all the key information for the patent:

 

Application Number: 05/900,088
Filing or 371 (c) Date: 04-26-1978
Application Type: Utility
Examiner Name:
Group Art Unit: 2504
Confirmation Number: 3077
Attorney Docket Number:
Class / Subclass: 250/211.00J
First Named Inventor: JOSEPH BOREL (FR)
Entity Status: Undiscounted
Customer Number:
Status: Patented Case
Status Date: 03-06-1979
Location: FILE REPOSITORY (FRANCONIA)
Location Date: 04-14-1995
Earliest Publication No:
Earliest Publication Date:
Patent Number: 4,143,266
Issue Date of Patent: 03-06-1979
AIA (First Inventor to File): No
Title of Invention: METHOD AND DEVICE FOR DETECTING RADIATIONS

 

We will iterate through all of the zip files using Spark’s binaryFiles primitive and, for each file, use Python’s zipfile module to get to the contents of the tsv file within. Each tsv is then converted into a Python dictionary with key/value pairs mirroring the original structure.
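
For the example patent shown above, the resulting dictionary would look roughly like this (a sketch based on the record above; blank fields are omitted for brevity):

{
  'Application Number': '05/900,088',
  'Filing or 371 (c) Date': '04-26-1978',
  'Application Type': 'Utility',
  'Group Art Unit': '2504',
  'Confirmation Number': '3077',
  'Class / Subclass': '250/211.00J',
  'First Named Inventor': 'JOSEPH BOREL (FR)',
  'Entity Status': 'Undiscounted',
  'Status': 'Patented Case',
  'Status Date': '03-06-1979',
  'Location': 'FILE REPOSITORY (FRANCONIA)',
  'Location Date': '04-14-1995',
  'Patent Number': '4,143,266',
  'Issue Date of Patent': '03-06-1979',
  'AIA (First Inventor to File)': 'No',
  'Title of Invention': 'METHOD AND DEVICE FOR DETECTING RADIATIONS'
}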

We have written a script to map the zip files into data, basically defining the data transformation step from raw into structured data. Thanks to the brevity of PySpark code, we can list it here in full:

import pyspark
import zipfile
import cStringIO

sc = pyspark.SparkContext()

# utility function to process a binary zipfile and extract 
# the content of -application_data.tsv from it
def getApplicationTSVFromZip(x):
  fn,content = x
  # open the zip file from memory
  zip = zipfile.ZipFile(cStringIO.StringIO(content))
  for f in zip.filelist:
    # extract the first file of interest
    if f.filename.endswith("-application_data.tsv"):
      return zip.open(f).read()

# given the -application_data.tsv contents, return a dictionary
# with key/value pairs for the data contained in it
def TSVToRecords(x):
  # tab separated records, with two columns, one record per line
  lines = [_.split("\t", 2) for _ in x.split("\n")]
  oklines = filter( lambda x: len(x)==2, lines )
  # turn them into a key/value dictionary
  d = {a: b for (a, b) in oklines if b != '-'}
  return d

# load directly from the google cloud storage
# take all of the patents starting with number 0600....
q = sc.binaryFiles("gs://uspto-pair/applications/0600*")
# skip any archives that do not contain an -application_data.tsv file
d = q.map( getApplicationTSVFromZip ).filter( lambda x: x is not None ).map( TSVToRecords ).repartition(64)
d.saveAsPickleFile("gs://dataproc-UUID/uspto-out-0600")
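
To check that the conversion produced sensible records, you can read one entry back from the pickle file. The following is a minimal sketch, run as a separate PySpark job and using the same placeholder bucket name as above:

#!/usr/bin/python
import pyspark

sc = pyspark.SparkContext()

# read the pickled dictionaries back and print a single record
sample = sc.pickleFile("gs://dataproc-UUID/uspto-out-0600").take(1)
print sample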

 

Next, we can convert the intermediate file (containing all the pickled dictionaries) into a CSV file:

 

import pyspark
import csv
import cStringIO

sc = pyspark.SparkContext()

# serialize one list of values into a single CSV-formatted line
def csvline2string(one_line_of_data):
    si = cStringIO.StringIO()
    cw = csv.writer(si)
    cw.writerow(one_line_of_data)
    return si.getvalue().strip('\r\n')

q = sc.pickleFile("gs://dataproc-33c9ac81-ff09-4b42-be89-57e56c695739-eu/uspto-out-0600-2")
keys = q.flatMap(lambda x: x.keys()).distinct().collect()
print keys
records = q.map(lambda x: [str(x.get(k) or '') for k in keys])
csvrecords = records.map( csvline2string )
csvrecords.saveAsTextFile("gs://dataproc-33c9ac81-ff09-4b42-be89-57e56c695739-eu/uspto-out-0600-csv")
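
Note that the script above writes only data rows. If you also want the column names as the first line of the output (as in the sample below), one option is to prepend a header record before saving. The following is a sketch that assumes the keys, csvrecords and csvline2string names from the script above and uses a hypothetical output path:

# put the header line in its own single-partition RDD so that it ends up
# in the first output part file (part-00000), ahead of the data rows
header = sc.parallelize([csvline2string(keys)], 1)
withheader = header.union(csvrecords)
withheader.saveAsTextFile("gs://dataproc-UUID/uspto-out-0600-csv-with-header")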

 

The resulting CSV file will contain one row for each patent:

 

Confirmation Number,Patent Number,Attorney Docket Number,Location,Class / Subclass,First Named Inventor,Group Art Unit,Status Date,Status,Issue Date of Patent,Filing or 371 (c) Date,Title of Invention,Entity Status,Location Date,Application Type,AIA (First Inventor to File),Application Number,Examiner Name
7981,"4,320,273",I0105991,FILE REPOSITORY (FRANCONIA),219/999.999,MITSUYUKI KIUCHI (JP),2103,03-16-1982,Patented Case,03-16-1982,01-22-1979,APPARATUS FOR HEATING AN ELECTRICALLY CONDUCTIVE COOKING UTENSILE BY MAGNETIC INDUCTION,Undiscounted,06-10-1999,Utility,No,"06/005,574","LEUNG, PHILIP H"
7994,"4,245,042",NONE,FILE REPOSITORY (FRANCONIA),435/030,,1302,09-27-1980,Patented Case,01-13-1981,01-22-1979,DEVICE FOR HARVESTING CELL CULTURES,Undiscounted,06-10-1999,Utility,No,"06/005,587",

 

Now that all the patent metadata has been extracted from the zip files, we can load it straight into SynerScope Marcato for further exploration. The set contains a bit over 400,000 patents, covering the period from 1979 to 1988.

By selecting a few key fields and pointing Marcato to the online USPTO database, we can quickly set up a dashboard that shows the relation between patent attributes (such as status, class and location) and patent titles, and explore all of these by filing or approval date.

With SynerScope Marcato we can also quickly identify temporal trends and explore the activity of certain groups of patents over time. When selecting a group of patents, keywords that identify this subset pop up directly, truly combining analytics and search.

Patents on resin and polymer compositions, filed in categories 260/029, 260/042 and some others in the same range, seem to have dropped off sharply after 1980.

Deleting your Cloud Dataproc cluster

You only need to run Cloud Dataproc clusters when you need them. When you are done with your cluster, you can delete it with the following command:

 

gcloud beta dataproc clusters delete test-cluster

 

You can also use the Google Developers Console to delete the cluster.