Multivariate Categorical Data

Hi data-enthusiasts,

Recently, I stumbled across an irritatingly odd problem at work concerning the visualization of multidimensional nominal data. In my job as a data scientist, I had to create a visualization for our PowerPoint reports that provides an overview of our current projects and their status. As I started working with the data, I realized how many attributes/tags our projects have and how much information we could provide about them. This multidimensional project data got me wondering whether there is a way to visualize it more efficiently. Since there are some very effective preprocessing and analysis tools for dimensionality reduction with ordinal and numeric data (such as PCA or LDA), I was wondering whether a dimensionality reduction technique exists for nominal data. Does any of you know of such a thing, or of further information/research about this?
However, the goal of dimensionality reduction remains to reduce the number of dimensions while preserving as much information as possible. Hence it is difficult for me to imagine how this could work with categorical data. But anyway, let me know what you think, because our PowerPoint reports could really use a breath of fresh air.
Oh, and by the way, a belated Happy New Year!

Dimensionality reduction can mean (i) selecting some variables and removing others, or (ii) finding groups of variables and combining the variables in each group into a single variable. For nominal variables, you can certainly do (i), and a parallel coordinates plot can help you with this task. For (ii), you can use some statistical measures to help identify groups of variables, formulate functions for combining them (e.g., using OR, AND, XOR, NOT), and transform each group into a new nominal variable, an ordered ranking, or a numerical variable. Because good analysts understand the semantics of these variables, they can define such functions more intelligently than a data mining algorithm, which usually does not have such knowledge. Mathematically, human knowledge amounts to a collection of additional variables that an algorithm usually does not know.
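To make (ii) a bit more concrete, here is a minimal Python sketch. It assumes Cramér's V as the "statistical measure" (my choice for illustration, other association measures would work too) and uses a logical OR to merge two tags. The project records, the tag names (has_budget, has_sponsor, funded) and the helper names are all hypothetical.

```python
from collections import Counter
from math import sqrt

def cramers_v(xs, ys):
    """Cramér's V between two nominal sequences: 0 = no association, 1 = perfect."""
    n = len(xs)
    joint = Counter(zip(xs, ys))
    count_x, count_y = Counter(xs), Counter(ys)
    chi2 = 0.0
    for x, cx in count_x.items():
        for y, cy in count_y.items():
            expected = cx * cy / n
            observed = joint.get((x, y), 0)
            chi2 += (observed - expected) ** 2 / expected
    k = min(len(count_x), len(count_y)) - 1
    return sqrt(chi2 / (n * k)) if k > 0 else 0.0

def combine_or(records, keys, new_key):
    """Replace several boolean tags by one tag that is True if any of them is True."""
    combined = []
    for record in records:
        merged = {k: v for k, v in record.items() if k not in keys}
        merged[new_key] = any(record[k] for k in keys)
        combined.append(merged)
    return combined

# Hypothetical project records, invented for illustration only.
projects = [
    {"has_budget": True,  "has_sponsor": False, "region": "EU"},
    {"has_budget": False, "has_sponsor": True,  "region": "EU"},
    {"has_budget": False, "has_sponsor": False, "region": "US"},
    {"has_budget": True,  "has_sponsor": True,  "region": "US"},
]

# An analyst who knows that both tags mean "money is secured" merges them.
reduced = combine_or(projects, ["has_budget", "has_sponsor"], "funded")
```

The association measure only suggests candidate groups. Whether OR, AND or some other function is the right way to merge them is exactly the semantic decision that the analyst is better placed to make.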

If you are really keen on algorithmic help, you may consider using a decision tree or random forest algorithm. The resulting decision tree essentially selects “important” variables and organises them into a tree that represents their level of importance. The underlying mathematics is entropy calculation, which is about “information”. Importance implies information preservation. If your data is sparse, the entropy calculation (like all types of stats) will not be accurate; humans can use visualization to address this problem. Please see: G. K. L. Tam, et al. “An analysis of machine- and human-analytics in classification.” IEEE Transactions on Visualization and Computer Graphics, 23(1):71-80, 2017.
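As a rough illustration of the entropy idea, the sketch below ranks nominal attributes by information gain with respect to a target variable. This is only the split criterion that an ID3-style decision tree applies at each node, not a full tree or a random forest, and it assumes you have a target to predict (here a hypothetical "status" column). The records and attribute names are made up.

```python
from collections import Counter, defaultdict
from math import log2

def entropy(labels):
    """Shannon entropy (in bits) of a sequence of nominal labels."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def information_gain(records, attribute, target):
    """Reduction in target entropy after splitting the records on one attribute."""
    labels = [r[target] for r in records]
    groups = defaultdict(list)
    for r in records:
        groups[r[attribute]].append(r[target])
    remainder = sum(len(g) / len(labels) * entropy(g) for g in groups.values())
    return entropy(labels) - remainder

# Hypothetical project records, invented for illustration only.
records = [
    {"team": "A", "region": "EU", "status": "on_track"},
    {"team": "A", "region": "US", "status": "on_track"},
    {"team": "B", "region": "EU", "status": "delayed"},
    {"team": "B", "region": "US", "status": "delayed"},
]

# Attributes with higher gain preserve more information about the target.
ranking = sorted(
    ["team", "region"],
    key=lambda a: information_gain(records, a, "status"),
    reverse=True,
)
```

With only four records, the estimates are obviously fragile, which is the sparsity problem mentioned above: on small or sparse data, a high gain can easily be a coincidence.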

Of course, the data mining community has its own forums for offering advice on algorithmic solutions, usually without much use of visualization or human knowledge.