Garbage In, Garbage Out: Why Data Quality is the Foundation for Good Analytics

“Analytics is the discovery and communication of meaningful patterns in data.” (Wikipedia, May 2014)

What Is Analytics?

Per its definition, analytics is an umbrella term for the process of gaining knowledge from data and communicating meaningful insights. Companies are increasingly turning to analytics of business data to evaluate and/or improve their marketing mix, sales efforts, inventory management, etc. Business analytics encompasses a variety of techniques, including data analysis, data mining, quantitative statistical analysis, and predictive modeling.

Data analysis is a series of steps for reviewing, cleaning, modifying, and modeling data in order to visualize trends, make predictions, or take certain actions. Arguably, the most important aspect of data analysis is assessing data quality. Data quality focuses on ensuring that the data used for analysis is considered “fit for use” by data consumers. This means that the data is accurate, complete, relevant, and readily accessible in a format that can be used for analysis.

Cleaning Your Data

Industry reports suggest that more than 60% of company data sources contain a surprisingly large number of data quality issues. Cleaning data is therefore an essential part of data analysis. Data cleaning is typically a two-step process: first detect the errors in a dataset, then correct them.

Frequency counts are often used to assess data quality and detect errors such as:

  • Inaccurate data entry of raw values.
  • Character variables that contain invalid values.
  • Numeric values that fall outside certain ranges.
  • Missing values.
  • Duplicate entries.
  • Values that violate rules for uniqueness.
  • Invalid date values.
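
The checks listed above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the records, the field names (`id`, `state`, `price`, `sold`), the list of valid state codes, and the price range are all made up for the example and would need to come from your own data and business rules.

```python
from collections import Counter
from datetime import datetime

# Hypothetical records; every field name and value here is invented for illustration.
records = [
    {"id": 1, "state": "CA", "price": 350000, "sold": "2013-04-12"},
    {"id": 2, "state": "ca", "price": 410000, "sold": "2013-02-30"},  # wrong case, impossible date
    {"id": 3, "state": "XX", "price": -5, "sold": "2013-07-01"},      # invalid code, price out of range
    {"id": 3, "state": "NY", "price": None, "sold": "2013-09-15"},    # duplicate id, missing price
]

VALID_STATES = {"CA", "NY"}        # assumed list of allowed values
PRICE_RANGE = (10_000, 5_000_000)  # assumed plausible range


def is_valid_date(text, fmt="%Y-%m-%d"):
    """Return True if text parses as a real calendar date in the given format."""
    try:
        datetime.strptime(text, fmt)
        return True
    except (TypeError, ValueError):
        return False


# Frequency count of a character variable; unexpected values stand out immediately.
state_counts = Counter(r["state"] for r in records)
invalid_states = {s: n for s, n in state_counts.items() if s not in VALID_STATES}

# Missing values and numeric values outside the assumed range.
missing_price = sum(1 for r in records if r["price"] is None)
out_of_range = [
    r["id"] for r in records
    if r["price"] is not None and not PRICE_RANGE[0] <= r["price"] <= PRICE_RANGE[1]
]

# Uniqueness rule: any id that appears more than once is a duplicate.
id_counts = Counter(r["id"] for r in records)
duplicate_ids = [i for i, n in id_counts.items() if n > 1]

# Invalid dates, such as February 30.
bad_dates = [r["id"] for r in records if not is_valid_date(r["sold"])]

print("invalid states:", invalid_states)
print("missing prices:", missing_price)
print("out-of-range prices (ids):", out_of_range)
print("duplicate ids:", duplicate_ids)
print("invalid dates (ids):", bad_dates)
```

Each check only detects a problem. How to correct it (fix the value, drop the record, or go back to the source) is a separate decision.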

Statistical Analysis

Descriptive statistics such as the mean, median, and standard deviation, along with the maximum and minimum values, can also be used to summarize large amounts of data. For instance, consider a dataset of sales of single-family homes by zip code for the previous calendar year.

Ideally, there would be few or no missing values, especially for important variables like sales price. Additionally, variables such as square footage or number of days listed should contain only positive numeric values. Since sales prices are generally expected to fall inside a reasonable range of values, the mean sales price is expected to be greater than the standard deviation. If it is not, this suggests a problem with extreme minimum or maximum values. Finally, comparing two variables that normally move together, such as sales price and square footage, can reveal outliers whose values fall far outside the expected range.
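
As a rough sketch of these checks on the home-sales example, the snippet below uses Python's standard `statistics` module. The sample rows, the column order, and the "three times the median price per square foot" outlier rule are assumptions chosen for illustration, not a standard.

```python
import statistics

# Hypothetical rows: (sale_price, square_feet, days_listed); None marks a missing value.
sales = [
    (250_000, 1500, 30),
    (310_000, 1900, 45),
    (275_000, 1650, 12),
    (None, 1400, 60),       # missing sale price
    (295_000, 1800, -3),    # negative days listed is impossible
    (9_500_000, 1700, 20),  # price far out of line with the home's size
    (330_000, 2100, 25),
]

# Missing values in the most important variable.
prices = [price for price, _, _ in sales if price is not None]
missing_prices = len(sales) - len(prices)

# Square footage and days listed should be strictly positive.
non_positive = [row for row in sales if row[1] <= 0 or row[2] <= 0]

# Descriptive summary of sale price.
summary = {
    "min": min(prices),
    "median": statistics.median(prices),
    "mean": statistics.mean(prices),
    "stdev": statistics.stdev(prices),
    "max": max(prices),
}

# Rule of thumb from the text: a standard deviation larger than the mean
# hints at extreme minimum or maximum values.
spread_warning = summary["stdev"] > summary["mean"]

# Compare two related variables: price per square foot should be roughly stable.
OUTLIER_FACTOR = 3  # assumed threshold, tune for your own data
ratios = [(price / sqft, (price, sqft, days))
          for price, sqft, days in sales
          if price is not None and sqft > 0]
typical = statistics.median(ratio for ratio, _ in ratios)
outliers = [row for ratio, row in ratios
            if ratio > typical * OUTLIER_FACTOR or ratio < typical / OUTLIER_FACTOR]

print("missing prices:", missing_prices)
print("rows with non-positive values:", non_positive)
print("summary:", summary)
print("stdev exceeds mean:", spread_warning)
print("price-per-square-foot outliers:", outliers)
```

The median is used as the "typical" price per square foot because, unlike the mean, it is not pulled upward by the very outlier the check is trying to find.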

Quality Matters

Analysts who work with poor-quality datasets often have to spend as much as half of their analysis time on data cleaning in order to avoid drawing erroneous conclusions that could lead to costly mistakes. When using inferential statistics, it is even more important to ensure that a dataset is as complete, correct, and relevant as possible.

It truly doesn’t matter whether you’re a large company with multiple data warehouses or a start-up that uses spreadsheets; the most challenging part about using data is deciding where to focus your efforts. Make data quality a priority if you rely on data to take actions, make decisions, or predict outcomes.

About Me: I am a native South African who got started in software when I landed my first data analysis job after college. The intersection of data, software and medicine fascinates me and I love to write software that makes a difference. I am available for hire!
