In a previous post about the importance of data quality, we discussed why poor data quality hinders, and often prevents, analysts from performing successful analyses that yield reliable insights. To fully understand a dataset, a data analyst must interact with the data and explore the analysis variables, e.g. number of sales or order amount.
This can be very difficult when data sources are disjointed, unreliable, and in formats not conducive to plotting or descriptive statistics. As a result, analysts typically spend up to 85% of the time allotted for analysis on locating, normalizing, and cleaning data. As data grows, the room for error grows just as fast. New technologies are used to capture and store data, input errors occur, and variables are added, removed, or become obsolete whenever product or service offerings change.
Since most organizations today have multiple data sources of different ages, sizes, and levels of complexity, data cleaning can seem daunting, costly, and time-consuming. It is therefore a good idea to start small and focus efforts on a dataset that could potentially answer the most pertinent business questions.
A worthwhile practice is to implement staging tables or data middleware to load data in batches and perform initial transformations. Such a repository is especially helpful when attempting to identify weaknesses in data acquisition, as well as factors that contribute to poor data quality. At this stage, a scrubbing tool can often be used to evaluate the data based on preset rules. Once data is readily available in a usable format, the data cleaning process generally involves looking for and correcting obvious errors, outliers, missing values, and duplicates.
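Rule-based scrubbing at the staging layer can be as simple as checking each staged record against a set of field-level predicates. The sketch below is illustrative only; the field names, sample records, and rules are assumptions, not part of any particular tool.

```python
# Hypothetical sketch: applying preset validation rules to staged records.
# The field names, rules, and sample data are illustrative assumptions.

records = [
    {"order_id": "1001", "amount": "59.90", "country": "SE"},
    {"order_id": "1002", "amount": "-5.00", "country": "SE"},
    {"order_id": "", "amount": "12.50", "country": "??"},
]

# Preset rules: each maps a field to a predicate that must hold.
rules = {
    "order_id": lambda v: v.isdigit(),
    "amount": lambda v: float(v) >= 0,
    "country": lambda v: v.isalpha() and len(v) == 2,
}

def scrub(record):
    """Return the list of fields in a record that violate a preset rule."""
    violations = []
    for field, check in rules.items():
        try:
            if not check(record[field]):
                violations.append(field)
        except (ValueError, KeyError):
            # Unparseable or absent values count as violations too.
            violations.append(field)
    return violations

for rec in records:
    bad = scrub(rec)
    if bad:
        print(rec["order_id"] or "<missing>", "failed:", ", ".join(bad))
```

Running a pass like this over each batch surfaces systematic acquisition problems (for example, a source that always sends negative amounts) before the data reaches analysts.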
A common issue with raw data is lack of structure. Files may lack separators, headers, and labels, or contain wrong data types and errors in character encoding. If an integrated dataset is not consistently formatted, it should be transformed into a rectangular set that enforces conventions and constraints. This process helps to make a dataset “technically correct”, but data is only truly consistent when missing values, duplicates, and outliers have been augmented, removed, or updated. Variable values must be consistent within records, across records, and across datasets.
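Making a dataset “technically correct” mostly means parsing raw text into a rectangular table with explicit types and explicit missing values. A minimal sketch, assuming semicolon-delimited sales data with hypothetical column names:

```python
# Illustrative sketch: coercing raw text rows into a rectangular, typed table.
# The column names, delimiter, and coercion scheme are assumptions.
import csv
import io

raw = """order_id;sale_date;amount
1001;2023-01-05;59.90
1002;2023-01-06;abc
1003;;12.50
"""

def to_float(value):
    """Coerce to float; return None for anything unparseable."""
    try:
        return float(value)
    except ValueError:
        return None

rows = []
reader = csv.DictReader(io.StringIO(raw), delimiter=";")
for row in reader:
    rows.append({
        "order_id": int(row["order_id"]),
        "sale_date": row["sale_date"] or None,  # empty string -> explicit missing
        "amount": to_float(row["amount"]),
    })
```

Note that bad values become `None` rather than silently remaining strings: the structure is now enforced, while the missing and invalid values are left visible for the later cleaning steps.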
The next step in data cleaning is to perform a missing data analysis to locate missing values and see if any patterns of missing data exist. Data may be missing due to lack of knowledge, data entry and processing issues, or programming errors. If records with missing values caused by input problems are still valid for analysis, the missing fields should be imported in a way that keeps the rest of the dataset aligned. It is key, however, to determine whether missing values are random, which typically have a limited effect, or systematic, which have a much larger effect on analysis and predictive modeling and usually indicate a more serious problem. Also consider the type of missing values and the general importance of the variable. Decide whether missing values should be imputed with replacement values, manipulated, or removed from the dataset.
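A missing data analysis can start with a simple per-variable tally, followed by imputation where it is justified. The sketch below uses mean imputation on a hypothetical numeric column; the field names and data are assumptions for illustration.

```python
# Minimal sketch of a missing-data pass: count missing values per variable,
# then impute a numeric column with its mean. Field names are hypothetical.
from statistics import mean

dataset = [
    {"region": "north", "sales": 120.0},
    {"region": "north", "sales": None},
    {"region": "south", "sales": 80.0},
    {"region": None,    "sales": None},
]

# Step 1: locate missing values and look for patterns (here, per column).
missing = {col: sum(1 for r in dataset if r[col] is None)
           for col in dataset[0]}
print(missing)  # {'region': 1, 'sales': 2}

# Step 2: impute the numeric variable with the mean of observed values.
observed = [r["sales"] for r in dataset if r["sales"] is not None]
fill = mean(observed)
for r in dataset:
    if r["sales"] is None:
        r["sales"] = fill
```

Counting per column (and per combination of columns) is also a quick way to spot systematic missingness, e.g. a variable that is always empty for one region would suggest a capture problem rather than random gaps.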
Outlier handling is usually the final step in data cleaning. Outliers are values that fall far outside the accepted normal range and influence or skew the results of a statistical analysis. Descriptive statistics can be used to check that data fall within logical or acceptable limits. These summaries will identify instances where an age value is 700 years or price is an unrealistic or implausible amount. Again, such records can be disregarded or corrected based on preset rules.
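Both kinds of checks, hard logical limits and statistical summaries, are easy to express in code. A hedged sketch, where the age limits, sample values, and 3-standard-deviation threshold are illustrative choices rather than fixed rules:

```python
# Sketch: flagging values outside logical limits and beyond a deviation
# threshold. The limits, threshold, and data are illustrative assumptions.
from statistics import mean, stdev

ages = [34, 41, 29, 700, 38, 45, 31]

# Rule 1: hard logical limits (an age value of 700 years is impossible).
LOGICAL_MIN, LOGICAL_MAX = 0, 120
out_of_range = [a for a in ages if not LOGICAL_MIN <= a <= LOGICAL_MAX]

# Rule 2: descriptive statistics -- flag plausible values that still lie
# more than 3 standard deviations from the mean.
plausible = [a for a in ages if LOGICAL_MIN <= a <= LOGICAL_MAX]
mu, sigma = mean(plausible), stdev(plausible)
stat_outliers = [a for a in plausible if abs(a - mu) > 3 * sigma]

print("outside logical limits:", out_of_range)
print("statistical outliers:", stat_outliers)
```

In practice the logical-limit pass should run first, as impossible values like the 700 would otherwise inflate the mean and standard deviation and mask genuine statistical outliers.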
Data cleaning is a multi-step process that requires attention to detail and an ongoing commitment to making changes at the data-source level. Use data cleaning methods that make sense for a particular dataset, and communicate the importance of data quality to data users and anyone in the organization involved in data capture, entry, or storage.
Summing It Up
Every day, more and more companies start to tap into the power of data science. Without clean and reliable data, however, analytics are likely to take much longer, cost significantly more, and yield only limited benefit. Don’t make the mistake of focusing on the “big” in “big data”, because you could easily end up with a 1 PB dataset and yet only be able to use 5 KB of it.