Multivariate Categorical Data

Hi data-enthusiasts,

Recently, I stumbled across an irritatingly odd problem at work concerning the visualization of multidimensional nominal data. In my job as a data scientist, I had to create a visualization for our PowerPoint reports that provides an overview of our current projects and their status. As I started to work with the data, I realized how many attributes/tags our projects have and how much information we could provide about them. This multidimensional project data got me wondering whether there is a way to visualize it more efficiently. Since there are some very effective preprocessing and analysis tools for dimensionality reduction with ordinal and numeric data (such as PCA or LDA), I was wondering whether there is a dimensionality reduction technique for nominal data. Does any of you know of such a thing, or of further information/research about it?
However, the goal of dimensionality reduction is still to reduce the number of dimensions while preserving as much information as possible, so it is difficult for me to imagine how this could work with categorical data. But anyway, let me know what you think, because our PowerPoint reports could really use a breath of fresh air.
Oh, and by the way, a belated happy new year!

Dimensionality reduction can mean (i) selecting some variables and removing others, or (ii) finding groups of variables and combining the variables in each group into a single variable. For nominal variables, you can certainly do (i), and a parallel coordinates plot can help you with this task. For (ii), you can use statistical measures to help identify groups of variables, formulate functions for combining them (e.g., using OR, AND, XOR, NOT), and transform each group into a new nominal variable, an ordered ranking, or a numerical variable. Because good analysts understand the semantics of these variables, they can define such functions more intelligently than a data mining algorithm, which usually does not have such knowledge. Mathematically, human knowledge is a collection of variables that an algorithm usually does not know.
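To make option (ii) a bit more concrete, here is a minimal sketch of an analyst-defined combining function. The project records, the attribute names (delayed, over_budget, understaffed) and the rule itself are all made up for illustration; in practice the rule would come from your own understanding of what the tags mean.

```python
# Hypothetical project records; attribute names and values are illustrative only.
projects = [
    {"name": "A", "delayed": True,  "over_budget": False, "understaffed": False},
    {"name": "B", "delayed": False, "over_budget": False, "understaffed": False},
    {"name": "C", "delayed": True,  "over_budget": True,  "understaffed": False},
    {"name": "D", "delayed": False, "over_budget": False, "understaffed": True},
]

def risk_level(project):
    """Combine three binary attributes into one ordered nominal variable."""
    # AND: both schedule and budget problems at the same time.
    if project["delayed"] and project["over_budget"]:
        return "high"
    # OR: at least one of the three problems is present.
    if project["delayed"] or project["over_budget"] or project["understaffed"]:
        return "medium"
    return "low"

# Three dimensions are reduced to a single derived variable per project.
reduced = {p["name"]: risk_level(p) for p in projects}
print(reduced)
```

The derived variable is ordered (low < medium < high), so it can be mapped to a single visual channel such as colour in a report.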

If you are really keen on algorithmic help, you may consider using a decision tree or random forest algorithm. The resulting decision tree essentially selects “important” variables and organises them into a tree that represents their level of importance. The underlying mathematics is entropy calculation, which is about “information”. Importance implies information preservation. If your data is sparse, the entropy calculation (like all types of statistics) will not be accurate; humans can use visualization to address this problem. Please see: G. K. L. Tam, et al. “An analysis of machine- and human-analytics in classification.” IEEE Transactions on Visualization and Computer Graphics, 23(1):71-80, 2017.
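As a rough illustration of the entropy calculation, the sketch below ranks nominal attributes by information gain with respect to a target variable, which is the criterion many decision tree learners use when choosing a split. It only covers a single split, not a full tree or a random forest, and it assumes you have a target to predict (here a hypothetical "status" column); the records and attribute names are invented.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy (in bits) of a list of nominal labels."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def information_gain(rows, attribute, target):
    """Reduction in target entropy obtained by splitting on one attribute."""
    base = entropy([r[target] for r in rows])
    groups = {}
    for r in rows:
        groups.setdefault(r[attribute], []).append(r[target])
    # Weighted average entropy of the target within each attribute value.
    remainder = sum(len(g) / len(rows) * entropy(g) for g in groups.values())
    return base - remainder

# Hypothetical project records; "status" plays the role of the target.
rows = [
    {"department": "R&D",   "client": "internal", "status": "on_track"},
    {"department": "R&D",   "client": "external", "status": "on_track"},
    {"department": "Sales", "client": "internal", "status": "delayed"},
    {"department": "Sales", "client": "external", "status": "delayed"},
]

gains = {a: information_gain(rows, a, "status") for a in ("department", "client")}
ranking = sorted(gains, key=gains.get, reverse=True)
print(ranking, gains)
```

In this toy data "department" fully determines the status while "client" carries no information about it, so "department" ranks first. With only a handful of rows per value, such estimates become unreliable, which is the sparsity caveat mentioned above.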

Of course, the data mining community has their own forum for offering advice on algorithmic solutions usually without using much visualization or human knowledge.

Dear dydent,

You can use dimensionality reduction/multidimensional projection techniques like MDS, Isomap, t-SNE, UMAP or the like to get a visual map overview of your categorical data. Note that the 2D layouts produced by dimensionality reduction are not trustworthy in general: distortions can hide actual patterns (like groups or outliers) or generate false patterns. See a recent survey on that topic [1].
Most of these techniques can handle a similarity or distance matrix as input (notice that PCA of a set of numerical vectors V gives the same result as Classical Multidimensional Scaling (cMDS) applied to the Euclidean distance matrix of all pairs of vectors in V). So your problem reduces to finding a way to measure similarities between any pair of instances described by your variables. For numerical variables, you can use the Euclidean norm or another p-norm if the variables are not semantically linked. If the variables are linked (time slots in a time series, pixels of an image, or bins of a histogram), you can use specific similarity measures like Dynamic Time Warping, the Wasserstein distance, or measures between discrete sequences [2]. For categorical variables, you can use the Hamming distance (each variable is a position, and letters code for the different levels of that variable) or other measures like the ones compared in [3]. If you have mixed variables, you can compute a weighted sum of the similarities you get for each group of homogeneous variables.
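A minimal sketch of the distance-matrix step, under a few assumptions of mine: the Hamming distance is normalized by the number of variables, the numeric variable is already scaled to [0, 1], the two parts are combined as a weighted sum of distances (rather than similarities), and the weights of 0.5 each are arbitrary. The records and function names are hypothetical.

```python
def hamming(a, b):
    """Fraction of categorical positions at which two records differ."""
    if len(a) != len(b):
        raise ValueError("records must have the same number of variables")
    return sum(x != y for x, y in zip(a, b)) / len(a)

def euclidean(a, b):
    """Euclidean distance between two numeric tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def mixed_distance(p, q, w_cat=0.5, w_num=0.5):
    """Weighted sum of the categorical and numeric distances (weights assumed)."""
    return w_cat * hamming(p["cat"], q["cat"]) + w_num * euclidean(p["num"], q["num"])

# Hypothetical projects: three nominal tags plus one numeric value in [0, 1].
projects = [
    {"cat": ("R&D", "internal", "on_track"), "num": (0.2,)},
    {"cat": ("R&D", "external", "on_track"), "num": (0.2,)},
    {"cat": ("Sales", "external", "delayed"), "num": (0.8,)},
]

# Full pairwise distance matrix, symmetric with zeros on the diagonal.
D = [[mixed_distance(p, q) for q in projects] for p in projects]
for row in D:
    print([round(d, 3) for d in row])
```

A matrix like D is the kind of input that MDS-style projection methods accepting precomputed distances expect; the choice of weights changes the resulting layout, so it is worth trying a few settings.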

All the best

[3] Shyam Boriah, Varun Chandola, Vipin Kumar. Similarity Measures for Categorical Data. 2008.