The volume of digitally stored data grows exponentially each year (terabytes, petabytes, exabytes… many of these words are barely a decade old!), but how can we draw conclusions that turn these vast amounts of data into actionable insights?
To help analyze these data sets, mathematical tools have been, and continue to be, extensively developed, often at the expense of simplicity and model interpretability.
Giuseppe Nuti (Global Head of the Central Risk Book and Data Analysis at UBS), along with Lluis Antoni Jimenez and Ingrid Cross of the Data Science team in the UBS Strategic Development Lab, recently set out to build a tool (links to the article and code at the bottom) that extracts information from big data sets based on very simple rules, in a way that can still be easily interpreted:
- Is a value greater or less than a meaningful threshold?
- Can we split the data set into smaller subsets whose data appear more likely once they are subdivided?
For instance, suppose we have a set of people's ages and we want to know whether each person plays with toys. We will probably find that the subset of people under 15 years old plays with toys, while the other group does not. Hence, 15 years is a meaningful value that we can use to infer new information. In particular, for a population of n people, we would define age as our feature "x" and the outcome "y" as whether they play with toys ("y = 1") or not ("y = 0").
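To make the toy example concrete, here is a minimal sketch (not the authors' code, with made-up ages and outcomes) of searching for a threshold on "x" that best separates the two outcomes:

```python
# Made-up data: feature x is age, outcome y is 1 if the person plays with toys.
ages = [3, 5, 8, 12, 14, 16, 21, 30, 42, 55]
plays = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]

def split_accuracy(threshold):
    """Fraction of people correctly labelled by the rule 'x < threshold => y = 1'."""
    correct = sum((age < threshold) == bool(y) for age, y in zip(ages, plays))
    return correct / len(ages)

# Try a cut between every pair of adjacent ages and keep the best one.
candidates = [(a + b) / 2 for a, b in zip(ages, ages[1:])]
best = max(candidates, key=split_accuracy)
print(best, split_accuracy(best))  # the cut at 15 separates the two groups perfectly
```

With this data, the rule "age < 15" labels every person correctly, which is exactly the kind of simple, human-readable threshold the tool is after.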
Under these simple rules, we can measure probabilistically if and where the splits are more likely to happen. These likelihood metrics power an algorithm that builds the rules automatically and produces inferred predictions. And what is the most appealing aspect of this method? When we need to understand what lies behind the model, we can simply print the rules out and apply some common sense.
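One standard way to measure this probabilistically (a sketch of the general Bayesian idea, not necessarily the authors' exact formulation) is to compare the marginal likelihood of the outcomes when the data is kept whole versus when it is split at the candidate threshold, using a Beta prior on each group's success probability:

```python
from math import lgamma, exp

def log_marginal(k, n, a=1.0, b=1.0):
    """Log marginal likelihood of k successes in n Bernoulli trials
    under a Beta(a, b) prior on the success probability."""
    return (lgamma(a + b) - lgamma(a) - lgamma(b)
            + lgamma(a + k) + lgamma(b + n - k) - lgamma(a + b + n))

# Made-up counts: 5 people under 15 (all play), 5 people 15 or over (none play).
left_k, left_n = 5, 5      # plays / total below the threshold
right_k, right_n = 0, 5    # plays / total above the threshold

no_split = log_marginal(left_k + right_k, left_n + right_n)
split = log_marginal(left_k, left_n) + log_marginal(right_k, right_n)

# A Bayes factor above 1 says the data is more likely under the split model,
# so the rule "age < 15" is worth keeping.
print(exp(split - no_split))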