Author: Jorik Blaas
Let’s start by introducing the three key components:
- SynerScope is a deeply interactive any-data visual analytics platform for Big Data sense-making.
- Apache Spark is a lightning fast framework for in-memory analytics on Big Data.
- IBM Power8 is a high-bandwidth low-latency scaleable hardware architecture for diverse workloads.
In a world where the speed and volume of data is increasing by the day, being able to scale is an increasingly stringent demand. Scale is not only about being able to store a large amount of data, but as data size grows, it gets gradually more difficult to move data. In classic architectures, running analytics used to be something that you did in your analytic data-warehouse, and moving an aggregated, filtered or sampled dataset from your main storage into the data-warehouse was an acceptable solution. Now that analytics touches a growing number of data sources, each of ever increasing size, moving the data is less of an option.
To provide fast turnaround time in deep analytics, the computation has to be moved close to the data, not the other way around. Hadoop has brought this technology to general availability with MapReduce over the past half decade, but it always has remained a programming model that was difficult to understand, as the concepts originated in High Performance Computing.
Apache Spark is the game changer currently moving at incredible speed in this space, as it offers an unprecedented open toolkit for machine learning, graph analytics, streaming and SQL.
While most of the world is running Spark on Intel hardware, Spark as a technology is platform independent, which opens the doors for alternative platforms, such as OpenPOWER. IBM is heavily committed to developing Spark, as announced last June.
After building Apache Spark on our Power8 machine, we were able to instantly run our existing python and scala code. We noticed that the Power8 architecture is especially favorable towards jobs with a high memory bandwidth demandUsing a dataset of a five-year history of github, (100GB of gzipped JSON files), we were able to churn through the entire set in under an hour, processing over 100 million events. After processing, we can load the resulting dataset into SynerScope for a deeper inspection.
The image below shows the top 100.000 most active projects, grouped by co-committers. Projects that share committers are close to each other. Interestingly, this type of involvement-based grouping shows very clearly how different programmer communities are separated. The island of iPhone development (in orange) is really isolated from the island of Android developers.
With Spark on Power8, we were able to handle a huge dataset, reduce it into its key characteristics and it allowed us to make sense of complex mixed sources.