Author: Stef van den Elzen
TensorFlow is an open source software library for numerical computation using data flow graphs. This project is originally developed by the Google Brain team and recently made open source. Enough reason to experiment with this.
Due to the flexible architecture we can use this not only for deep learning but also for generic computational tasks that can be employed on multiple CPU/GPUs and platforms. By combining the computational tasks with SynerScope’s visual frontend that also allows for interactive exploration we have a powerful scalable data sense-making solution. Let’s see how we can do this.
Often when we load a dataset for exploration we do not know exactly what we are looking for in the data. Visualization helps with this by enabling people to look at the data. Interaction gives them techniques to navigate through the data. One of these techniques is selection. Selection, combined with our multiple-coordinated view setup, provides users with a rich context and multiple perspectives on the items they are interested in. One of the insights we are looking for when we make a selection is
“which attribute separates this selection best from the non-selection”.
Or in other words what attribute has specific values for the selection that are clearly different from the values of the non-selection. We can of course see this visually in a scatterplot or histogram for example, but if we have thousands of attributes then this quickly becomes cumbersome to check each attribute manually. We would like to have a ranking of the attributes. We can do this by computing the information gain or gain ratio. This seems like a good opportunity to test out TensorFlow.
We implemented the computation of the gain ratio in Python/TensorFlow and discuss the different parts below. The full source code is available at the bottom as an iPython notebook file. First we load the needed modules and define different functions to compute the entropy, information gain, and, gain ratio. Next we define some helper functions for example to sort a matrix for one column, to find splitpoints and to count the number of selected items versus non-selected. Then we read the data and compute for each attribute the gain ratio and the according splitpoint.
Now let’s apply this to a dataset. We take a publicly available dataset about car properties and load these into SynerScope. This dataset contains properties such as the weight of the car, the mpg usage, number of cylinders, horsepower, origin etc. Now we wonder what separates the American cars from the European and Japanese cars. From the histogram in SynerScope Marcato we select the American cars and the gain ratio computation.
We see that displacement and mpg are the most differentiation factors for American cars. We can verify this by plotting these on a scatterplot. See figure below, the orange dots are the American cars.
We could also take the cars from 1980 and thereafter and see what separates them most from the other cars. Here we see that besides year, the miles per gallon usage and cylinders are the most differentiating factors. Again we see this in the scatterplot.
Cars produced after 1980
As the key focus of TensorFlow is on deep learning and neural networks, it can sometimes require some creativity to handle more generic computation, such as the information gain metric we used as an example. By using a hybrid approach where data is moved between TensorFlow structures and numpy arrays, we were able to make a performant implementation. We are anxiously monitoring further developments, as it is a fast-moving platform, and we hope that some features that currently only exist on the numpy side, such as argsort, will be available in due time.
For now, the hybrid combination works well enough, and using TensorFlow for the computation and SynerScope Marcato for the visual exploration gives us a much faster route to understanding our data and discovering new patterns.