Visualisation of High-Dimensional Data

Humans cannot really interpret data beyond four or five dimensions, and some of us struggle with just three.

Unfortunately, most real-world data has many more dimensions than that. However, all hope is not lost. Using some simple linear algebra, you can process the data and visualise it in a more intuitive manner.

The Data

For this particular problem we are going to take some data from the UCI Machine Learning Repository.

The Covertype Data Set appears to be a good choice. I've selected it because:

  • It has a great number of instances.
  • It has more than 50 dimensions.
  • All the attributes are quantitative.
  • It has no missing values.
  • All instances are already classified.

There are too many instances in the data set to list on this page, so let's take a look at a random selection of 5 instances:

You will notice that the Wilderness Areas and the Soil Types are binary attributes (i.e. if a certain Soil Type is present in that area, then it is 1, otherwise it is 0).

One might think that using an enumeration would be a better idea (e.g. Soil Type = 5). However, this is not the case.

Apart from not being able to select multiple types at the same time, the problem with using an enumeration is that the attribute would lose its quantitative property. The soil types do not exist on a spectrum, and any ordering would make no mathematical sense. Thus, it is essential that the attributes are recorded in this manner.
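
Incidentally, here is a minimal sketch (not from the original article) of how this binary, one-hot encoding works for the 40 soil types; the function name and the example index are purely illustrative:

// A minimal sketch of the binary (one-hot) encoding of the 40 soil types.
// The attribute becomes 40 columns: a 1 in the matching column, 0 elsewhere.
function oneHotSoilType(soilTypeIndex) {
  var row = [];
  for (var i = 0; i < 40; i++) {
    row.push(i === soilTypeIndex ? 1 : 0);
  }
  return row;
}

// e.g. soil type index 5 becomes [0, 0, 0, 0, 0, 1, 0, ..., 0]
var soilColumns = oneHotSoilType(5);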

The last attribute is the class of the instance. We won't be using this in our calculations, but we will use it to choose the color in the visualisation.

Dimensionality Reduction

Dimensionality reduction removes the dimensions that carry the least information. It is very useful for algorithms whose runtime depends on the size of the input space: each dimension added to the data increases the size of that space exponentially, a problem known as the curse of dimensionality.
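
To get a feel for this, here is a tiny illustrative snippet (not from the original article); the choice of 10 bins per attribute and the roughly 54 dimensions of this dataset are just assumptions for the example:

// If each attribute is split into a fixed number of bins, the space has bins^d cells.
function cellCount(bins, dimensions) {
  return Math.pow(bins, dimensions);
}

cellCount(10, 3);  // 1,000 cells
cellCount(10, 54); // 1e54 cells -- impossible to cover with any real dataset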

Principal Component Analysis

Principal component analysis (PCA) is a procedure that decorrelates the attributes of our instances. The assumption is that our attributes are somehow linearly correlated. If that is indeed true, it is possible to represent the data in fewer dimensions by using a new set of uncorrelated attributes.

Implementation

We will be using the numeric.js library for vector and matrix math support. It does not have as many features as NumPy, but it has enough for our purposes.

Before we begin, we must first normalise all our attributes. We do so with a literal translation of the standard score formula, z = (x − μ) / σ, where μ is the column mean and σ is the column standard deviation:

// Calculates the col-wise mean of matrix X
function mean(X){  
  var T = numeric.transpose(X);
  return T.map(function(row){ return numeric.sum(row) / X.length; });
}

// Calculates the col-wise std of matrix X
function std(X){  
  var m = mean(X);
  var sq = numeric.sub(mean(numeric.mul(X,X)), numeric.mul(m,m));
  return sq.map(function(x){ return Math.sqrt(x); });
}

// Normalises the columns of matrix X
function normalize(X) {  
  var m = mean(X);
  X = X.map(function(row){ return numeric.sub(row, m); });
  var s = std(X);
  X = X.map(function(row){
    return numeric.div(row, s).map(function(x) {
      if (isNaN(x)) return 0;
      else return x;
    });
  });
  return X;
}

Then, we can implement our PCA by using Singular Value Decomposition (SVD) on our matrix to get both U and Σ:

X = UΣVᵀ

We can then calculate our score matrix, T, by using this formula:

T = UΣ

// Returns the score matrix resulting from the PCA of X
function pca(X) {  
  var svd = numeric.svd(X);
  return numeric.dot(svd.U, numeric.diag(svd.S));
}
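
As a side note, the singular values returned by numeric.svd also tell us how much of the total variance each principal component captures. Here is a small optional sketch (not part of the original code) of how that could be computed:

// Fraction of the total variance captured by each principal component,
// computed from the singular values S (the variance along component i is proportional to S[i]^2).
function explainedVariance(S) {
  var squared = numeric.mul(S, S);
  var total = numeric.sum(squared);
  return squared.map(function(v) { return v / total; });
}

// e.g. explainedVariance(numeric.svd(matrix).S).slice(0, 3) gives the
// fraction of variance captured by the first three components.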

Finally, we put it all together:

// We load our matrix somehow
var matrix = [[...],  
              [...],
               ...
              [...]];

// Normalise the matrix col-wise
matrix = normalize(matrix);

// Project the matrix using principal components as basis
var pc = pca(matrix);  
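
Assuming numeric.svd returns the singular values in descending order (as is conventional), the columns of the score matrix are ordered by decreasing variance, so keeping the first three columns gives us the 3D coordinates to plot. A quick sketch (the variable name is just illustrative):

// Keep only the first three principal components for the 3D visualisation
var points3d = pc.map(function(row) { return row.slice(0, 3); });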

The Result

First, let us simply plot the first 3 dimensions in our dataset to see if that would make any sense without any processing.

The different colors represent the 7 classes of the instances in the dataset. There already appears to be some meaning to the positions of the classes, but the information is encoded in only one direction; the other axes appear to be meaningless.

Below is the Covertype dataset reduced to 3 dimensions by using the method described above. You can use your mouse to rotate and scale the view:

As you can see, there is a marked improvement: the separation between the classes has clearly increased. Although some classes are still not separable, instances of the same class generally end up in the same vicinity.

We could, for example, use a clustering algorithm like k-means to get the centroids of clusters made up of multiple classes (as opposed to one class per cluster). By calculating the Euclidean distance between a sample and each centroid, we can assign it to a cluster of classes, and then use another method to distinguish between the classes within that cluster, or simply predict the class with some amount of certainty.
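
As an illustration of that idea, here is a minimal sketch (not from the original article) of nearest-centroid classification in the reduced space; the centroids here would come from a clustering step such as k-means:

// Euclidean distance between two points (arrays of equal length)
function euclideanDistance(a, b) {
  var d = numeric.sub(a, b);
  return Math.sqrt(numeric.sum(numeric.mul(d, d)));
}

// Returns the index of the centroid closest to the sample
function nearestCentroid(sample, centroids) {
  var best = 0;
  for (var i = 1; i < centroids.length; i++) {
    if (euclideanDistance(sample, centroids[i]) < euclideanDistance(sample, centroids[best])) {
      best = i;
    }
  }
  return best;
}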

Oh, by the way, just in case you think there is no useful information in the 44 sparse columns of binary attributes, this is what it would look like if you ignored them:

It turns out the information in those 44 columns cannot be ignored, even though they consist only of 1s and 0s.
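
For reference, this is roughly how that comparison could be produced, assuming the quantitative attributes are the first 10 columns of the matrix (a sketch, not the original code):

// Drop the 44 binary columns and run the same pipeline on what remains
var quantitativeOnly = matrix.map(function(row) { return row.slice(0, 10); });
var pcQuantitative = pca(normalize(quantitativeOnly));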

Relevant Modules in NUS

Interested in learning how to interpret large amounts of data using math?

You can learn about the concepts in the following modules:

  • MA1101R Linear Algebra 1
  • ST2132 Mathematical Statistics (or ST2334 Probability and Statistics)
  • CS3244 Machine Learning

MA1101R should teach you what you need to know about vector spaces, bases, eigenvectors and the like. The statistics modules will help you quantify and derive meaning from things like variance and normalisation. Both provide a very useful foundation for further modules.

You can learn about classification and regression analysis in CS3244. The models that you can build have useful applications in many other fields, such as the lucrative field of Quantitative Finance.


I hope you've enjoyed this demonstration.

If you have any feedback/questions or if you noticed any mistakes in my article, please contact me at fazli[at]sapuan[dot]org.
