Einstein Knew the Secret to Data Science. Do You?

When does your analysis stop delivering additional value and simply become more complicated? Einstein knew the answer. Do you? Read more to find out.

The proliferation of the term Big Data has popularized a number of trends. One of them is to throw every available data variable and the latest sophisticated technique at any problem, often producing complicated models that are confusing and hard to explain. Watching this happen raises a set of questions: How do we know when we have enough variables, and when adding new ones contributes no useful information? When is the current model good enough? Is there value in adding another modeling term? Is there a good stopping rule? How do we prevent overkill in our modeling?

A good guide for knowing when to stop looking for a better solution is to seek elegance and simplicity through analytic parsimony. According to Webster’s, parsimony is “economy in the use of means to an end.” Long before data science became popular, parsimony was revered and prized by philosophers and scientists: William of Ockham’s razor, “Entities should not be multiplied unnecessarily”; Thomas Aquinas’s observation, “If a thing can be done adequately by means of one, it is superfluous to do it by means of several; for we observe that nature does not employ two instruments where one suffices”; and the maxim attributed to Albert Einstein, “Everything should be made as simple as possible, but not simpler.” As data scientists, we should follow this guidance toward parsimony: the simplest explanation that remains effective delivers the most value.

How can we apply this principle in our data analysis work? As we iterate through our analysis options and solutions, our work is most parsimonious at the point where adding the next variable or feature no longer adds value or effectiveness but, instead, introduces more noise and confusion into the explanation. This can be seen clearly in the following example of model building:

In the illustration, as the analyst builds more complex models (the blue line), the training error decreases with each iteration. To achieve that lower error, each successive model typically adds variables and complexity over the prior one. The validation runs of these models (the orange line) show similar improvement through the early iterations, but after the sixth iteration the validation error begins to rise: the more complex models are increasingly picking up noise and spurious signals rather than real structure. Thus, the most parsimonious model is the one from iteration six, even though model 10 initially appeared better on the training data.
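This stopping rule is easy to reproduce in code. The sketch below is illustrative rather than the article’s original example: it assumes scikit-learn is available, generates a synthetic noisy sine-wave dataset, and uses polynomial degree as a stand-in for the “iterations” of increasing model complexity. It tracks training and validation error for each candidate and keeps the model where validation error bottoms out.

```python
# A minimal sketch of the validation-based stopping rule, assuming
# scikit-learn. The dataset and model family are hypothetical stand-ins
# for any sequence of increasingly complex models.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.metrics import mean_squared_error

# Synthetic data: a smooth signal plus noise.
rng = np.random.default_rng(0)
X = np.sort(rng.uniform(-3, 3, size=(120, 1)), axis=0)
y = np.sin(X).ravel() + rng.normal(scale=0.3, size=120)

X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.3, random_state=0
)

best_degree, best_val_error = None, np.inf
for degree in range(1, 11):  # ten iterations of increasing complexity
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(X_train, y_train)
    train_error = mean_squared_error(y_train, model.predict(X_train))
    val_error = mean_squared_error(y_val, model.predict(X_val))
    print(f"degree={degree:2d}  train MSE={train_error:.3f}  "
          f"val MSE={val_error:.3f}")
    # Training error keeps falling as complexity grows, but validation
    # error eventually rises; the most parsimonious model sits at the
    # minimum of the validation curve.
    if val_error < best_val_error:
        best_degree, best_val_error = degree, val_error

print(f"Most parsimonious model: degree {best_degree} "
      f"(val MSE {best_val_error:.3f})")
```

Running this, the training error (the blue line in the illustration) falls monotonically with degree, while the validation error (the orange line) traces the U-shaped curve described above; the degree at the bottom of that curve is the parsimonious choice.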

In conclusion, the operative guideline is: Keep It Simple. Know when you are overdoing it. Set up checks, such as the validation curve above, that tell you when your analysis adds no further explanatory power and simply becomes more complicated.