Java source code of my project 3: workspace.rar (5101 KB, RAR archive)
In this project I built a density visualization tool: 2-D dotplots that can visualize any dataset in the UCI repository.
The data reader transforms all values, numerical and categorical, to double values in [0,1].
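A minimal sketch of how such a scaling step could look. This is not the reader from workspace.rar: the class name `ColumnScaler`, the min-max rule for numerical columns, and the "index by first appearance" rule for categorical columns are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not taken from the project source.
final class ColumnScaler {

    // Min-max scaling of a numerical column to [0,1].
    // Assumption: a constant column is mapped to 0.5.
    static double[] scaleNumeric(double[] values) {
        double min = Double.POSITIVE_INFINITY;
        double max = Double.NEGATIVE_INFINITY;
        for (double v : values) {
            min = Math.min(min, v);
            max = Math.max(max, v);
        }
        double range = max - min;
        double[] out = new double[values.length];
        for (int i = 0; i < values.length; i++) {
            out[i] = range == 0.0 ? 0.5 : (values[i] - min) / range;
        }
        return out;
    }

    // Categorical column: each distinct label gets an index in order of
    // first appearance, and the indices are spread evenly over [0,1].
    static double[] scaleCategorical(String[] labels) {
        List<String> categories = new ArrayList<>();
        for (String label : labels) {
            if (!categories.contains(label)) {
                categories.add(label);
            }
        }
        int k = categories.size();
        double[] out = new double[labels.length];
        for (int i = 0; i < labels.length; i++) {
            out[i] = k <= 1 ? 0.5 : categories.indexOf(labels[i]) / (double) (k - 1);
        }
        return out;
    }
}
```

With this rule, a categorical column with three labels ends up at 0, 0.5 and 1, so it can be plotted on the same axes as a numerical column.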
The idea of a dotplot is that dots stack on top of each other when they fall within the predefined dot radius. The resulting dot column is then centered on the centroid of its group.
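The stacking idea can be sketched in 1-D as follows. This is an illustration only, under two assumptions: values are grouped greedily from left to right, and a group spans at most one window (2 * dot radius) from its first value. The names `DotStacker` and `Column` are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical 1-D sketch, not the project's implementation.
final class DotStacker {

    // One dot column: its x position (the group centroid) and its height.
    static final class Column {
        final double center;
        final int count;

        Column(double center, int count) {
            this.center = center;
            this.count = count;
        }
    }

    // Greedy left-to-right grouping: a value joins the current group while it
    // lies within one window (2 * radius) of the group's first value.
    static List<Column> stack(double[] values, double radius) {
        double[] sorted = values.clone();
        Arrays.sort(sorted);
        List<Column> columns = new ArrayList<>();
        int i = 0;
        while (i < sorted.length) {
            double start = sorted[i];
            double sum = 0.0;
            int count = 0;
            while (i < sorted.length && sorted[i] - start <= 2 * radius) {
                sum += sorted[i];
                count++;
                i++;
            }
            // The column is centered on the centroid of its group.
            columns.add(new Column(sum / count, count));
        }
        return columns;
    }
}
```

For example, `DotStacker.stack(new double[]{0.10, 0.12, 0.50, 0.90, 0.14}, 0.05)` gives three columns: one of height 3 near 0.12, and two of height 1 at 0.50 and 0.90.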
The following figure depicts 1-D dotplots. Input data: Prof. Andy's electricity usage in 2007 and 2008. From this presentation, we can easily see that he used more electricity in 2007 than in 2008.
We can start by comparing dotplots and histograms. Dotplots provide a better density estimator for three reasons:
+ They reveal where the data are, while a histogram just averages over the grid.
+ They present a more stable density when the window size changes.
+ They give a lower integrated square error.
For example, take a simple dataset that contains only 4 data points. The next series of figures depicts the histogram and the dotplot as the window size is increased (window size = 2 * dot radius).
The first button loads any UCI dataset that the user wants to visualize.
On the second tab, the user can visualize data using dotplots alone.
The combo box lets the user select one of 6 different color transfer functions: rainbow, bipolar, red-blue, circular, temperature, gray.
The color table lets the user select the color for a certain class by moving a slider over the selected color scale.
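To illustrate what a color transfer function does with the slider value, here is a small sketch for the two simplest scales, gray and red-blue. It does not use the VTK classes from the project. The class `ColorScale`, the linear interpolation, and the direction of the red-blue scale (red at 0, blue at 1) are assumptions.

```java
// Hypothetical sketch: maps a slider position t in [0,1] to an RGB triple
// with components in [0,1].
final class ColorScale {

    // Linear interpolation between two RGB colors; t is clamped to [0,1].
    static double[] lerp(double[] from, double[] to, double t) {
        double c = Math.max(0.0, Math.min(1.0, t));
        double[] rgb = new double[3];
        for (int i = 0; i < 3; i++) {
            rgb[i] = from[i] + (to[i] - from[i]) * c;
        }
        return rgb;
    }

    // Gray scale: black at 0, white at 1.
    static double[] gray(double t) {
        return lerp(new double[]{0, 0, 0}, new double[]{1, 1, 1}, t);
    }

    // Red-blue scale (assumed direction): red at 0, blue at 1.
    static double[] redBlue(double t) {
        return lerp(new double[]{1, 0, 0}, new double[]{0, 0, 1}, t);
    }
}
```

The other scales (rainbow, bipolar, circular, temperature) would need more than two control colors, which this sketch leaves out.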
The set of 3 combo boxes associated with the x, y, z axes lets the user pick any combination of variables in the dataset to visualize.
The table on the right lets the user see the instances in the dataset. This table is directly connected to the VTK visualization:
+ Whenever the user selects a dot, a column or a class, the instances associated with that dot, column or class are highlighted in the table.
+ Whenever the user selects a row or a set of rows, the dots associated with these rows are colored black in the vtkPanel.
Sliders allow the user to change:
+ Opacity of the dotplots.
+ Dot compression: allows dots to overlap.
+ Z scale: spheres become ellipsoids.
The controller allows the user to:
+ Turn the axes and the input data points on or off.
+ Change the dot selection mode (single dot, column or class).
+ Run a simulation that eliminates a dot class from the visualization so that the user can focus on the remaining classes (see next figure).
The Max number of Dots spinner lets the user control the maximum number of instances from the dataset to visualize. Some UCI datasets contain thousands of instances, which can clutter the visualization.
Dot radius lets the user define the radius of the dots. For example, if the user increases the dot radius, a dot can contain more points, so we see fewer but taller dot columns.
As some of the previous figures show, dot stacks can overlap. The maximum overlap between 2 neighbouring dots is less than 1 dot radius. These overlapped areas are hidden from the user's view.
To give a better feeling for the density, I propose 2 algorithms that move the dot stacks within an acceptable distance, such that the integrated square error after moving the dot columns is still smaller than that of the histogram.
The first algorithm is hill climbing: iterate through all dot stacks. At each stack, consider 8 different moving directions and select the direction that reduces the overlapping area the most. This algorithm is depicted in the next figure.
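One pass of the hill climbing step could be sketched like this. It is a simplified illustration, not the code from the project: it treats each stack as a circle of equal radius in the x-y plane, uses a fixed step length, and does not enforce the limit on how far a stack may move. The names `HillClimb`, `pass` and `step` are hypothetical.

```java
// Hypothetical sketch of one hill climbing pass over all dot stacks.
final class HillClimb {

    // The 8 candidate moving directions (normalized before use).
    static final int[][] DIRS = {
        {1, 0}, {1, 1}, {0, 1}, {-1, 1}, {-1, 0}, {-1, -1}, {0, -1}, {1, -1}
    };

    // Intersection area of two circles of radius r whose centers are d apart.
    static double overlapArea(double d, double r) {
        if (d >= 2 * r) {
            return 0.0;
        }
        return 2 * r * r * Math.acos(d / (2 * r)) - 0.5 * d * Math.sqrt(4 * r * r - d * d);
    }

    // Total overlap of stack i with all other stacks if it sat at (x, y).
    static double totalOverlap(double[][] pos, int i, double x, double y, double r) {
        double sum = 0.0;
        for (int j = 0; j < pos.length; j++) {
            if (j != i) {
                sum += overlapArea(Math.hypot(x - pos[j][0], y - pos[j][1]), r);
            }
        }
        return sum;
    }

    // Moves each stack one step in the direction that reduces its overlap the
    // most, or leaves it in place if no direction helps. Updates pos in place
    // and returns the number of stacks that moved.
    static int pass(double[][] pos, double r, double step) {
        int moved = 0;
        for (int i = 0; i < pos.length; i++) {
            double best = totalOverlap(pos, i, pos[i][0], pos[i][1], r);
            double bestX = pos[i][0];
            double bestY = pos[i][1];
            for (int[] d : DIRS) {
                double len = Math.hypot(d[0], d[1]);
                double nx = pos[i][0] + step * d[0] / len;
                double ny = pos[i][1] + step * d[1] / len;
                double overlap = totalOverlap(pos, i, nx, ny, r);
                if (overlap < best) {
                    best = overlap;
                    bestX = nx;
                    bestY = ny;
                }
            }
            if (bestX != pos[i][0] || bestY != pos[i][1]) {
                pos[i][0] = bestX;
                pos[i][1] = bestY;
                moved++;
            }
        }
        return moved;
    }
}
```

Calling `pass` repeatedly until it returns 0 gives a simple stopping rule: the positions have reached a local minimum of the overlap for the chosen step length.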
The second algorithm is integrated overlapping area: the moving direction is the combined vector of all pushing vectors. Each pushing vector is created by an overlapping neighbouring stack, and the length of the pushing vector is proportional to the overlapping area. This algorithm is depicted in the next figure.
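The combined pushing vector can be sketched as below. Again this is only an illustration under assumptions: stacks are treated as equal circles in the x-y plane, each push points from the neighbour's center towards the stack being moved, and the proportionality constant is 1. The class `PushVector` is hypothetical.

```java
// Hypothetical sketch of the combined pushing vector for one stack.
final class PushVector {

    // Intersection area of two circles of radius r whose centers are d apart.
    static double overlapArea(double d, double r) {
        if (d >= 2 * r) {
            return 0.0;
        }
        return 2 * r * r * Math.acos(d / (2 * r)) - 0.5 * d * Math.sqrt(4 * r * r - d * d);
    }

    // Sum of the pushes on stack i. Each overlapping neighbour contributes a
    // unit vector pointing away from it, scaled by the overlapping area.
    static double[] push(double[][] pos, int i, double r) {
        double px = 0.0;
        double py = 0.0;
        for (int j = 0; j < pos.length; j++) {
            if (j == i) {
                continue;
            }
            double dx = pos[i][0] - pos[j][0];
            double dy = pos[i][1] - pos[j][1];
            double d = Math.hypot(dx, dy);
            // Coincident centers give no direction; distant stacks do not push.
            if (d == 0.0 || d >= 2 * r) {
                continue;
            }
            double area = overlapArea(d, r);
            px += area * dx / d;
            py += area * dy / d;
        }
        return new double[]{px, py};
    }
}
```

A stack with one neighbour to its right and one above it is pushed down and to the left by equal amounts, and a stack with no overlapping neighbours gets a zero vector and stays where it is.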
Future improvements:
+ These are asymmetric dotplots, so I will provide a symmetric view of the dotplots.
+ Build a simple classifier using a VTK widget and run tests on it to see which instances are misclassified by looking at the dataset table on the right.
+ Integrate a penalty function into the moving vector so that a dot column that has moved far away from its original centroid slows down. For example, a square or cubic penalty function. This makes sure that the dot column does not move more than 1 dot radius away from its original position.
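The penalty idea from the last item is not implemented yet, but one possible shape is sketched here. The damping factor 1 - (displacement / radius)^power, the final clamp to the radius, and the names `MovePenalty`, `damping` and `dampedStep` are all assumptions, with power 2 for a square penalty and 3 for a cubic one.

```java
// Hypothetical sketch of a penalty on the moving vector.
final class MovePenalty {

    // Damping factor in [0,1]: 1 at the original centroid, 0 once the column
    // has moved a full dot radius away from it.
    static double damping(double displacement, double radius, int power) {
        double ratio = Math.min(1.0, displacement / radius);
        return 1.0 - Math.pow(ratio, power);
    }

    // Applies the damped move to the current position. As a safety net, the
    // result is clamped so that it never lies more than one radius from the
    // original centroid, even if a single move is very long.
    static double[] dampedStep(double[] origin, double[] current, double[] move,
                               double radius, int power) {
        double displacement = Math.hypot(current[0] - origin[0], current[1] - origin[1]);
        double k = damping(displacement, radius, power);
        double nx = current[0] + k * move[0];
        double ny = current[1] + k * move[1];
        double dx = nx - origin[0];
        double dy = ny - origin[1];
        double dist = Math.hypot(dx, dy);
        if (dist > radius) {
            nx = origin[0] + dx * radius / dist;
            ny = origin[1] + dy * radius / dist;
        }
        return new double[]{nx, ny};
    }
}
```

With a square penalty, a column that has already moved half a radius keeps 75% of its moving vector, and one that has moved a full radius stops.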
This is my 2nd VTK project. Thanks to Alexandro for helping on this project!