Posts

GERRIT VAN DIJK AWARD for Stef van den Elzen

Dr. ir. Stef van den Elzen, VP Engineering at SynerScope, has won the Gerrit van Dijk Award for Science 2017. One of three Dutch Data Science Awards, the Gerrit van Dijk Award goes to the best PhD thesis by a researcher who graduated between January 2014 and January 2017. Stef won with his thesis on interactive visualization techniques, written as part of a successful collaboration between the TU/e and SynerScope.

The Dutch Data Science Awards are an initiative of the Royal Holland Society of Sciences and Humanities (KHMW) and the Big Data Alliance (BDA). The festive ceremony took place on June 8, 2017 in the Hodshon House in Haarlem, The Netherlands.

Best Doctoral Project 2015 by Stef van den Elzen

Stef van den Elzen, Visualisation Architect at SynerScope, has won the Best Doctoral Project 2015 award of the Eindhoven University of Technology (TU/e). The festive closing of the Academic Year 2015 – 2016 at TU/e took place on Friday, 1 July 2016.

“The jury has selected the thesis of dr Stef van den Elzen as the best PhD thesis of 2015. The thesis addressed not only in a scientific sound and fundamental manner the topic of “Interactive Visualization of Dynamic Multivariate Networks”; it did this in a very comprehensive manner that included user studies and also a clear perspective on application areas outside the scope of the thesis.”

We congratulate Stef on this achievement.

View Best Doctoral Award here

 

SynerScope illuminates Dark Data

Author: Stef van den Elzen

Nearly every company is collecting and storing large amounts of data. One of the main reasons is that data storage has become very cheap. However, although storage may be cheap, the data also needs to be protected and managed, which is often not done very well. Obviously, not protecting the data puts your company at risk. More surprisingly, not managing the data brings an even higher risk. If the data is not carefully indexed and stored, it becomes invisible and underutilized, and is eventually lost in the dark. As a consequence, the data cannot be used to the company's advantage to improve business value. This is what is called dark data, which Gartner defines as “the information assets organizations collect, process and store during regular business activities, but generally fail to use for other purposes.”

The potential of dark data is largely untapped; active exploration and analytics enable companies to implement data-driven decision-making, develop strategy, and unlock hidden business value. However, companies face two main challenges: discovery and analysis.

Discovery

Not only is the dark data invisible, it is often stored in separate data silos, isolated per process, department, or application, and all treated the same despite the widespread variation in value. There is no overview of all data sources or of how they are linked and related to each other. Also, because the silos are detached and the data is stored for specific business purposes, it lacks structure and metadata, which makes it hard to determine its original purpose. As a consequence, there is no navigation mechanism to effectively search, explore, and select this wealth of data for further analysis.

Analysis

A large portion, roughly 80-90%, of this dark data is unstructured: in contrast to numbers, it consists of text, images, video, etc. Companies lack the infrastructure and tools to analyze this unstructured data. Business users are not able to ask questions of the data directly but need the help of data scientists. Furthermore, it is important not to analyze one data source in isolation, as currently occurs with specialized applications, but to link multiple heterogeneous data sources (reports, sensor data, geospatial data, time series, images, and numbers) in one unified framework for a better understanding of context and multiple perspectives on the data.

Enlightenment

The SynerScope solution helps companies overcome the challenges of discovery and analysis and simultaneously helps customers with infrastructure and architecture.

SynerScope serves as a data lake and provides a world map of the diverse and scattered data landscape. It shows all data sources, the linkage between them, their similarity, data quality, and key statistics. Furthermore, it provides navigation mechanisms and full-text search for effortless discovery of potentially valuable data. In addition, the platform enables collaboration, supports data provenance, and makes it easy to augment data. Once interesting data is discovered and its quality is assessed, it is selected for analysis.

With SynerScope, all types of data, such as numbers, text, images, networks, geospatial data, and sensor data, can be analyzed in one unified framework. Questions about the data can be answered instantly, while they are being formed, using intuitive query and navigation interaction mechanisms. Our solution bridges the gap between data scientists and business users and engages a new class of business users to illuminate the dark data silos for a truly data-driven organization. At SynerScope we believe in data as a means, not an end.

Example SynerScope Marcato multi-coordinated visualization setup for rich heterogeneous data analysis; numbers, images, text, geospatial, dynamic network, all linked and interactive.

 

Visual Analytics with TensorFlow and SynerScope

Author: Stef van den Elzen

TensorFlow is an open source software library for numerical computation using data flow graphs. The project was originally developed by the Google Brain team and was recently made open source. Reason enough to experiment with it.

Thanks to its flexible architecture, we can use it not only for deep learning but also for generic computational tasks that can be deployed on multiple CPUs/GPUs and platforms. By combining these computational tasks with SynerScope’s visual front end, which also allows for interactive exploration, we get a powerful, scalable data sense-making solution. Let’s see how we can do this.

Often when we load a dataset for exploration we do not know exactly what we are looking for in the data. Visualization helps with this by enabling people to look at the data. Interaction gives them techniques to navigate through the data. One of these techniques is selection. Selection, combined with our multiple-coordinated view setup, provides users with a rich context and multiple perspectives on the items they are interested in. One of the insights we are looking for when we make a selection is

“which attribute separates this selection best from the non-selection”.

In other words, which attribute has values for the selection that clearly differ from the values of the non-selection? We can of course see this visually, for example in a scatterplot or histogram, but with thousands of attributes it quickly becomes cumbersome to check each one manually. We would like to have a ranking of the attributes instead. We can obtain one by computing the information gain or gain ratio of each attribute. This seems like a good opportunity to test out TensorFlow.
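For reference, the two measures can be written out as follows. This assumes the standard decision-tree definitions of information gain and gain ratio for a binary threshold split; the post itself does not spell out the formulas, so the symbols below (S, A, t, p_c) are our own notation.

```latex
% S: all items; each item is labelled "selected" or "non-selected"
% p_c: fraction of items in S with label c
H(S) = -\sum_{c} p_c \log_2 p_c

% Splitting attribute A at threshold t gives S_{\le t} and S_{> t}
\mathrm{IG}(S, A, t) = H(S) - \sum_{v \in \{\le t,\, > t\}} \frac{|S_v|}{|S|}\, H(S_v)

\mathrm{SplitInfo}(S, A, t) = -\sum_{v \in \{\le t,\, > t\}} \frac{|S_v|}{|S|} \log_2 \frac{|S_v|}{|S|}

\mathrm{GainRatio}(S, A, t) = \frac{\mathrm{IG}(S, A, t)}{\mathrm{SplitInfo}(S, A, t)}
```

An attribute ranks high when some threshold t separates the selected from the non-selected items almost cleanly, so that the entropy of the two parts is much lower than the entropy of the whole.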

Implementation

We implemented the computation of the gain ratio in Python/TensorFlow and discuss the different parts below. The full source code is available at the bottom as an iPython notebook file. First we load the needed modules and define functions to compute the entropy, information gain, and gain ratio. Next we define some helper functions, for example to sort a matrix by one column, to find split points, and to count the number of selected versus non-selected items. Then we read the data and compute the gain ratio and the corresponding split point for each attribute.
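The steps described above can be sketched in plain Python. This is a minimal illustration using only the standard library, not the TensorFlow/numpy implementation from the notebook; all function names and the toy data are hypothetical, and the choice of midpoints between consecutive distinct values as candidate split points is an assumption.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a sequence of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def gain_ratio(values, selected, split):
    """Gain ratio of thresholding `values` at `split`, given boolean `selected` labels."""
    left = [s for v, s in zip(values, selected) if v <= split]
    right = [s for v, s in zip(values, selected) if v > split]
    if not left or not right:
        return 0.0  # degenerate split: one side is empty, nothing is gained
    n = len(selected)
    remainder = (len(left) / n) * entropy(left) + (len(right) / n) * entropy(right)
    info_gain = entropy(selected) - remainder
    split_info = entropy([v <= split for v in values])
    return info_gain / split_info


def candidate_splits(values):
    """Assumed candidates: midpoints between consecutive distinct sorted values."""
    distinct = sorted(set(values))
    return [(a + b) / 2 for a, b in zip(distinct, distinct[1:])]


def best_split(values, selected):
    """Return (gain ratio, split point) for the best threshold, or (0.0, None)."""
    best = (0.0, None)
    for split in candidate_splits(values):
        ratio = gain_ratio(values, selected, split)
        if ratio > best[0]:
            best = (ratio, split)
    return best


def rank_attributes(columns, selected):
    """Rank attributes (dict: name -> list of values) by gain ratio, best first."""
    scored = [(name, *best_split(vals, selected)) for name, vals in columns.items()]
    return sorted(scored, key=lambda row: row[1], reverse=True)


# Toy data (hypothetical, not the car dataset): attribute "a" separates the
# selection perfectly at 2.5, attribute "b" does not.
columns = {"a": [1, 2, 3, 4], "b": [1, 3, 2, 4]}
selected = [True, True, False, False]
ranking = rank_attributes(columns, selected)
```

Each row of `ranking` holds the attribute name, its gain ratio, and the split point, in the same layout as the result tables in the example.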

Example

Now let’s apply this to a dataset. We take a publicly available dataset[1] about car properties and load it into SynerScope. This dataset contains properties such as the weight of the car, the mpg usage, number of cylinders, horsepower, origin, etc. Now we wonder what separates the American cars from the European and Japanese cars. From the histogram in SynerScope Marcato we select the American cars and run the gain ratio computation.

American Cars

Attribute      gainRatio          splitPoint
displacement   0.116024601751     97.25
mpg            0.0969803909296    39.049
weight         0.0886271435463    1797.5
cylinders      0.08334618870      4.0
acceleration   0.0801976681136    22.850
horsepower     0.0435058288084    78.0
year           0.00601950896808   79.5

We see that displacement and mpg are the most differentiating factors for American cars. We can verify this by plotting them in a scatterplot. In the figure below, the orange dots are the American cars.

We could also take the cars from 1980 onwards and see what separates them most from the other cars. Here we see that, besides year, miles per gallon and the number of cylinders are the most differentiating factors. Again we can see this in the scatterplot.

Cars produced after 1980

Attribute      gainRatio         splitPoint
year           0.338440834596    79.5
mpg            0.113162864283    22.349
cylinders      0.100379880419    4.0
horsepower     0.0872011414011   132.5
displacement   0.0866493413084   232.0
weight         0.0861363235593   3725.0
acceleration   0.0501698542653

Conclusion

As the key focus of TensorFlow is on deep learning and neural networks, it can sometimes require some creativity to handle more generic computation, such as the information gain metric we used as an example. By using a hybrid approach where data is moved between TensorFlow structures and numpy arrays, we were able to make a performant implementation. We are eagerly following further developments, as it is a fast-moving platform, and we hope that some features that currently only exist on the numpy side, such as argsort, will become available in due time.

For now, the hybrid combination works well enough, and using TensorFlow for the computation and SynerScope Marcato for the visual exploration gives us a much faster route to understanding our data and discovering new patterns.

 

Resources
[1] Dataset: http://mlr.cs.umass.edu/ml/datasets/Auto+MPG
[2] Source code (iPython notebook): InformationGain