“Analytics is the discovery and communication of meaningful patterns in data.” (Wikipedia, May 2014)
What Are Analytics
By this definition, analytics is an umbrella term for the process of gaining knowledge from data and communicating meaningful insights. Companies are increasingly turning to analytics of business data to evaluate or improve their marketing mix, sales efforts, inventory management, and other operations. Business analytics encompasses a variety of techniques, including data analysis, data mining, quantitative statistical analysis, and predictive modeling.
Data analysis is a series of steps for reviewing, cleaning, modifying, and modeling data in order to visualize trends, make predictions, or take certain actions. Arguably, the most important aspect of data analysis is assessing data quality. Data quality focuses on ensuring that the data used for analysis are considered “fit for use” by data consumers. This means that the data are accurate, complete, relevant, and readily accessible in a format that can be used for analysis.
Cleaning Your Data
Industry reports suggest that more than 60% of company data sources contain a surprisingly large number of data quality issues. Cleaning data is therefore an essential part of data analysis. Data cleaning is typically a two-step process: first detect the errors in a dataset, then correct them.
Frequency counts are often used to assess data quality and detect errors.
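As a rough illustration of a frequency count, here is a minimal Python sketch. The `zip_codes` list, the `frequency_counts` helper, and the five-digit rule are all assumptions made up for this example; they are not taken from any particular dataset or tool.

```python
from collections import Counter

# Hypothetical zip code column as it might arrive from a spreadsheet export.
zip_codes = ["60614", "60614", "60657", "", "6O614", "60657", "60614", None]


def frequency_counts(values):
    """Count each distinct value, grouping None and blank strings as 'MISSING'."""
    normalized = [
        "MISSING" if v is None or str(v).strip() == "" else str(v).strip()
        for v in values
    ]
    return Counter(normalized)


counts = frequency_counts(zip_codes)

# List values from most to least frequent; rare values often deserve a second look.
for value, n in counts.most_common():
    print(f"{value}: {n}")

# Assumed rule for this sketch: a valid entry is exactly five digits.
suspect = [
    v for v in counts
    if v != "MISSING" and not (len(v) == 5 and v.isdigit())
]
print("Entries to review:", suspect)
```

A value that appears only once in a column where repeats are expected, or a large count of missing entries, is usually the first hint that something needs correcting.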
Descriptive statistics such as mean, median, standard deviation, as well as maximum and minimum values can also be used to provide simplified summaries of large amounts of data. For instance, consider a dataset of sales of single-family homes by zip code for the previous calendar year.
Ideally, there would be few or no missing values, especially for important variables like sales price. Additionally, it can be assumed that variables such as square footage or number of days listed would only have positive numeric values. Since sales prices are generally expected to fall inside a reasonable range of values, the mean sales price is expected to be greater than the standard deviation. If it is not, this would suggest an issue with extreme minimum or maximum values. Finally, by comparing two variables that have a general pattern of association between them, outliers with values far outside the expected range can also be identified.
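The checks described above can be sketched in a few lines of Python. Everything here is an assumption made for illustration: the `sales` records, the column names, and especially the factor of 3 used to flag outliers, which is an arbitrary threshold rather than an established rule.

```python
import statistics

# Hypothetical records for illustration: (sale_price, square_feet, days_listed).
# None marks a missing value.
sales = [
    (250000, 1800, 30),
    (310000, 2200, 45),
    (None, 1500, 12),
    (275000, -1600, 20),
    (9900000, 1900, 60),
    (295000, 2100, 25),
]
columns = ("sale_price", "square_feet", "days_listed")

# 1. Count missing values per column.
missing = {
    name: sum(1 for row in sales if row[i] is None)
    for i, name in enumerate(columns)
}

# 2. Flag rows where square footage or days listed is present but not positive.
non_positive = [
    row for row in sales
    if any(v is not None and v <= 0 for v in row[1:])
]

# 3. Compare the mean and standard deviation of the sale prices that are present.
prices = [row[0] for row in sales if row[0] is not None]
mean_price = statistics.mean(prices)
std_price = statistics.stdev(prices)  # sample standard deviation
extreme_values_suspected = std_price >= mean_price

# 4. Use two associated variables (price and square footage) to spot outliers.
# The factor of 3 around the median is an arbitrary threshold for this sketch.
valid = [
    row for row in sales
    if row[0] is not None and row[1] is not None and row[1] > 0
]
price_per_sqft = [row[0] / row[1] for row in valid]
typical = statistics.median(price_per_sqft)
outliers = [
    row for row, ratio in zip(valid, price_per_sqft)
    if ratio > 3 * typical or ratio < typical / 3
]

print("Missing values per column:", missing)
print("Rows with non-positive values:", non_positive)
print("Extreme values suspected:", extreme_values_suspected)
print("Price-per-square-foot outliers:", outliers)
```

In this made-up sample, a single very large sale price is enough to push the standard deviation above the mean, and the same record stands out again when price is compared with square footage.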
Analysts who use datasets with poor data quality often have to spend as much as half the time needed for analysis on data cleaning in order to avoid drawing erroneous conclusions that could lead to costly mistakes. When using inferential statistics, it is even more important to ensure that a dataset is as complete, correct, and relevant as possible.
It truly doesn’t matter whether you’re a large company with multiple data warehouses or a start-up that uses spreadsheets; the most challenging part about using data is deciding where to focus your efforts. Make data quality a priority if you rely on data to take actions, make decisions, or predict outcomes.