How Much Data Do You Really Need?

We live in a world where knowledge is believed to equal power. As a result, most of us abhor the uncertainty of not knowing. When we have questions, we want to turn to data (cold, hard facts) to provide answers. When we don’t have enough data, we develop an almost insatiable thirst for more information. However, because we intrinsically believe that simply “filling in the blanks” will allow us to make better decisions and take smarter actions, we tend to overestimate the relevance and value of more data.

“Big Data” has become the buzzword du jour in almost every industry. Consequently, many companies are enamored with data science and are investing large sums in initiatives to collect and store more data. Some of these firms are no strangers to highly sophisticated predictive modeling, while others have yet to perform any kind of real analysis, but collectively they all adhere to the tenet that more data is the Holy Grail of analytics.

Alas, nothing could be further from the truth.


As datasets become larger and more complex, they risk becoming less meaningful and more prone to misinterpretation. For this reason, it’s important to first determine which business questions to address and how answering them might benefit the organization. Not all questions need to be answered, especially if it’s unclear how much value the answers truly offer.

Faced with the uncertainty of not having all the facts, decision makers often pursue more data, believing it to be relevant when, in reality, it would have no impact whatsoever. Subconsciously, the mere emphasis on missing data can lead people, once they obtain it, to base choices on it that they would not otherwise have made. In essence, when data is not readily available, the desire to delve deeper is fueled by an assumption that what’s out there is potentially so valuable that one simply cannot afford to make a decision without it.

Data is generally considered relevant if it could influence a decision, even if only in a subtle way. Data is considered instrumental if it could alter a decision entirely. For example, a company’s decision to launch a new product may depend on whether consumer panels respond to it favorably. In this scenario, feedback data is instrumental: the product will only be launched if consumers like it. On the other hand, if the company intends to proceed with the launch regardless of panel opinion, feedback data is relevant but non-instrumental: it may affect packaging or marketing, but ultimately the product will still appear on the shelves.
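The distinction can be sketched in a few lines of Python. Everything below (the function name, the scores, the 0.5 and 0.8 cutoffs) is invented purely to illustrate the two roles feedback data can play:

```python
# Toy sketch (all names and thresholds are made up for illustration).
# Instrumental feedback can flip the launch decision itself; relevant but
# non-instrumental feedback only tunes details such as packaging.
def launch_plan(panel_score, feedback_is_instrumental, cutoff=0.5):
    if feedback_is_instrumental and panel_score < cutoff:
        return {"launch": False}  # feedback vetoes the launch entirely
    # The launch proceeds either way; feedback still shapes the rollout.
    packaging = "premium" if panel_score >= 0.8 else "standard"
    return {"launch": True, "packaging": packaging}

print(launch_plan(0.3, feedback_is_instrumental=True))   # launch is cancelled
print(launch_plan(0.3, feedback_is_instrumental=False))  # launch proceeds anyway
```

The same panel score leads to opposite launch outcomes depending on whether the feedback is instrumental, which is exactly the distinction drawn above.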

So how do you decide what data might be relevant or instrumental?

Focus on the Problem

Although statistics teaches us that relevance can be determined through correlation, homogeneity of variance, and regression analysis, it also depends on the problem you’re trying to solve or the question you’re trying to answer. There are no firm rules, but focusing on the problem or question usually reveals that anything in a dataset that doesn’t contribute to an answer or solution is insignificant, and therefore irrelevant. That doesn’t mean the data is unimportant; it’s just not useful for what you’re trying to accomplish.
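As a minimal sketch of the correlation approach, consider a synthetic dataset where revenue is constructed to depend on two columns and not on a third. The column names and the 0.2 cutoff are illustrative, not a firm rule:

```python
# Rank candidate columns by absolute correlation with the outcome and keep
# only those that clear a problem-dependent relevance cutoff.
import numpy as np

rng = np.random.default_rng(0)
n = 500
ad_spend   = rng.normal(100, 20, n)
page_views = rng.normal(500, 50, n)
noise      = rng.normal(0, 1, n)  # unrelated to the outcome by construction
revenue    = 3 * ad_spend + 0.5 * page_views + rng.normal(0, 10, n)

candidates = {"ad_spend": ad_spend, "page_views": page_views, "noise": noise}
scores = {name: abs(np.corrcoef(col, revenue)[0, 1])
          for name, col in candidates.items()}

relevant = [name for name, r in scores.items() if r > 0.2]
print(relevant)
```

A screen like this is only a first pass: a column that survives it may still be irrelevant to the business question, and correlation can miss nonlinear relationships entirely.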

For example, in a dataset of online sales, some transactions may be total anomalies. Others may contain obvious errors, be they random or systematic. Such records are considered irrelevant and must be corrected or removed, because unwanted variance could skew the underlying distribution and introduce bias into predictive models. If, however, the business goal is to analyze sales that deviate from the norm, anomalies become highly relevant. When the problem changes, so does the perspective, and the perspective determines which data is meaningful.
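That cleaning step can be sketched with one common rule of thumb, the 1.5×IQR fence (the sale amounts below are invented). Whether the flagged rows are discarded or become the object of study depends entirely on the business question:

```python
# Flag sale amounts outside the 1.5 * IQR fences as anomalies.
import numpy as np

amounts = np.array([25, 30, 28, 27, 32, 29, 31, 26, 30, 5000.0])

q1, q3 = np.percentile(amounts, [25, 75])
iqr = q3 - q1
lo, hi = q1 - 1.5 * iqr, q3 + 1.5 * iqr

mask = (amounts >= lo) & (amounts <= hi)
cleaned, anomalies = amounts[mask], amounts[~mask]
print(anomalies)  # the 5000.0 transaction falls outside the fences
```

If the goal is forecasting typical sales, you would model `cleaned`; if the goal is fraud or outlier analysis, `anomalies` is precisely the subset you keep.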


Another issue with collecting more data is that it’s easy to hit a point of diminishing returns. Although the cost of data storage has decreased, many businesses wholly underestimate the total investment required for a Big Data initiative. Not only are new technology stacks needed to process and analyze massive datasets; organizations also need more skilled employees to fully leverage those technologies and derive insights from the data.

In conclusion: Big Data can quickly become overwhelming if you simply collect and store more data without considering whether the data is relevant and instrumental. Do not assume that more data equals better analytics. Instead, take the time to trim datasets to a manageable size and learn to use them more efficiently.

About Me: I am a native South African who got started in software when I landed my first data analysis job after college. The intersection of data, software and medicine fascinates me and I love to write software that makes a difference. I am available for hire!

