Multivariate Categorical Data

min.chen · February 14, 2020, 10:17am

Dimensionality reduction can mean (i) selecting some variables and removing others, or (ii) finding groups of variables and combining the variables in each group into a single variable. For nominal variables, you can certainly do (i), and a parallel coordinates plot can help you with this task. For (ii), you can use some statistical measures to help identify groups of variables, formulate functions for combining the variables in a group (e.g., using OR, AND, XOR, NOT), and transform the result into a new nominal variable, an ordered ranking, or a numerical variable. Because good analysts understand the semantics of these variables, they can define such functions more intelligently than a data mining algorithm, which usually does not have such knowledge. Mathematically, human knowledge can be seen as a collection of additional variables that an algorithm usually does not know.
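To make (ii) a little more concrete, here is a minimal sketch in Python. It assumes mutual information as the statistical measure (one possible choice among several) and uses a small made-up table; the variable names and the `combine` helper are hypothetical, not from any particular library.

```python
from collections import Counter
from math import log2


def entropy(values):
    """Shannon entropy (in bits) of a sequence of nominal values."""
    n = len(values)
    return -sum((c / n) * log2(c / n) for c in Counter(values).values())


def mutual_information(xs, ys):
    """I(X;Y) = H(X) + H(Y) - H(X,Y); larger means more strongly associated."""
    return entropy(xs) + entropy(ys) - entropy(list(zip(xs, ys)))


def combine(rows, names, func):
    """Apply an analyst-defined function to the named variables of each row."""
    return [func(*(row[name] for name in names)) for row in rows]


# Hypothetical records with three binary nominal variables.
rows = [
    {"owns_car": True,  "has_garage": True,  "owns_bike": True},
    {"owns_car": True,  "has_garage": True,  "owns_bike": False},
    {"owns_car": True,  "has_garage": True,  "owns_bike": True},
    {"owns_car": False, "has_garage": False, "owns_bike": False},
    {"owns_car": False, "has_garage": False, "owns_bike": True},
    {"owns_car": False, "has_garage": True,  "owns_bike": False},
]

car = [r["owns_car"] for r in rows]
garage = [r["has_garage"] for r in rows]
bike = [r["owns_bike"] for r in rows]

# Step 1: a statistical measure hints at which variables belong together.
mi_car_garage = mutual_information(car, garage)
mi_car_bike = mutual_information(car, bike)

# Step 2: the analyst, who knows what the variables mean, picks the function.
car_related = combine(rows, ["owns_car", "has_garage"], lambda a, b: a or b)
```

In this toy table `owns_car` shares more information with `has_garage` than with `owns_bike`, so those two are candidates for grouping. Whether OR, AND, or something else is the right way to merge them is a semantic decision that the measure itself cannot make.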

If you are really keen on algorithmic help, you may consider using a decision tree or random forest algorithm. The resulting decision tree essentially selects “important” variables and organises them into a tree that represents their level of importance. The underlying mathematics is entropy calculation, which is about “information”: importance implies information preservation. If your data is sparse, entropy calculation (like all types of statistics) will not be accurate, and humans can use visualization to address this problem. Please see: G. K. L. Tam, et al. “An analysis of machine- and human-analytics in classification.” IEEE Transactions on Visualization and Computer Graphics, 23(1):71-80, 2017.
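As a rough illustration of the entropy calculation behind this, the sketch below ranks nominal variables by information gain, which is the criterion an entropy-based decision tree would use to choose its first split. The data and the `information_gain` helper are made up for illustration; a real analysis would more likely rely on an existing decision tree implementation.

```python
from collections import Counter, defaultdict
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a sequence of class labels."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())


def information_gain(rows, feature, target):
    """Reduction in target entropy after splitting rows on one nominal feature."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[feature]].append(row[target])
    n = len(rows)
    # Weighted average entropy of the target within each group.
    remainder = sum(len(g) / n * entropy(g) for g in groups.values())
    return entropy([row[target] for row in rows]) - remainder


# Hypothetical data set: two nominal variables and a class label.
rows = [
    {"colour": "red",    "shape": "round", "label": "apple"},
    {"colour": "red",    "shape": "round", "label": "apple"},
    {"colour": "yellow", "shape": "long",  "label": "banana"},
    {"colour": "yellow", "shape": "round", "label": "apple"},
    {"colour": "green",  "shape": "long",  "label": "banana"},
    {"colour": "green",  "shape": "round", "label": "apple"},
]

gains = {f: information_gain(rows, f, "label") for f in ("colour", "shape")}
# Variables sorted from most to least informative about the label.
ranking = sorted(gains, key=gains.get, reverse=True)
```

Here `shape` separates the labels completely, so it ranks above `colour`. With only six rows the estimates are fragile: changing one record could reorder the ranking, which is the sparsity problem mentioned above.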

Of course, the data mining community has its own forums for offering advice on algorithmic solutions, usually without using much visualization or human knowledge.