Amazon CloudSearch With Ruby And Asari: Moving Towards 1.0

When you need to offer advanced search in your Ruby-based application you have a number of options. People will often look at a Lucene-based solution such as Solr or Elasticsearch. The only problem is that you will need to set up your own instances or spend a lot of money for someone else to provide a search index that may be in a different location from the rest of your servers.


If you use a lot of Amazon services, the Amazon CloudSearch offering might be a great solution for you. Up until recently CloudSearch was based on a proprietary tech stack. Now CloudSearch also uses Lucene/Solr as its engine. Long term, hopefully they will expose their services using the standard Solr interfaces and APIs, which would make integrating as easy as dropping in the Sunspot gem and using them like any other Solr server. In the meantime we are stuck using the Amazon CloudSearch APIs.

Active Asari

As is the case with many Amazon services, using the CloudSearch API directly is not the most intuitive experience, especially as a Rubyist. Two years ago Tommy Morgan created the Asari gem to make life easier for us. While Asari makes things a lot easier, there were some features that we needed for a project that went beyond the gem.

In CloudSearch you create a domain that contains fields that you search against. We wanted to be able to easily create dev, test and staging domains in an automated fashion without needing to create custom scripts for each change. In the end we decided to use YAML files that specify the names, types and options for each field in the domain index.
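
As a rough illustration, here is the kind of field definition such a file might hold; the domain name, field names, and exact keys below are invented for this sketch and may not match Active Asari's actual configuration format:

require 'yaml'

# Hypothetical field definitions for one CloudSearch domain; in a real app this
# YAML would live in its own config file rather than a heredoc.
definition = YAML.safe_load(<<~CONFIG)
  products:
    name:
      type: text
    price:
      type: int
    created_at:
      type: date
CONFIG

# Each entry names an index field and the type/options it should be created with.
definition['products'].each do |field, options|
  puts "#{field}: #{options['type']}"
end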

Because this was making Asari opinionated we decided to make it an add-on gem called Active Asari. While we had been in touch with Tommy while Active Asari was under development, it wasn’t until it was released and we talked some more that we both came to the conclusion that the two projects should merge.

Heading Towards 1.0

Two months ago I joined the Asari project as a full committer. After adding Travis-CI support and going through the backlog of pull requests and issues, I got to work on merging Active Asari with Asari. Now if you want to use the goodness of Active Asari to specify fields to be indexed via a configuration file, or to update your Amazon indexes programmatically, you can do it without needing to include another gem.

2013-01-01 API Support

The previous version of Asari only supported version 2011-02-01 of the Amazon CloudSearch API. 1.0 introduces support for the 2013-01-01 version of the API, which adds more data types and a bunch of other new features. You can go here for a full list. To maintain backwards compatibility, Asari defaults to the 2011-02-01 version. To use the 2013-01-01 version, simply set an environment variable called CLOUDSEARCH_API_VERSION to the new version.
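
For example, in a deployment script or an initializer you might do something like this (setting it in the shell that launches the app works just as well):

# Opt in to the newer CloudSearch API version before Asari makes any requests;
# the shell equivalent is `export CLOUDSEARCH_API_VERSION=2013-01-01`.
ENV['CLOUDSEARCH_API_VERSION'] = '2013-01-01'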

The Roadmap


At the time of this writing 1.0 has not been released, but it can be used by referencing the repo and commit_id here. The only feature gap between the 2011-02-01 API support and 2013-01-01 is geospatial searching. The old version had to use the INT data type and put logic around it to provide support for these kinds of searches. In the 2013-01-01 version Amazon added a latlon data type and built-in proximity search functions. Asari currently supports the creation of a latlon field, along with creating and deleting data in the index. It does not yet support searching against these fields.
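
As a sketch of what working with such a field looks like (the domain endpoint and field names are placeholders, and the add_item/remove_item calls are quoted from memory, so double-check them against the README):

require 'asari'

asari = Asari.new('my-domain-xxxxxxxxxx.us-east-1.cloudsearch.amazonaws.com')

# Index a document whose schema includes a 2013-01-01 latlon field.
asari.add_item('store-42', name: 'Downtown Store', location: '33.7490,-84.3880')

# Remove the document from the index again.
asari.remove_item('store-42')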

Faceting has also been requested. If we have time we would like to add that in, but a firm decision has not been made as to whether it will make the cutoff. Pull requests are welcome. Either way, if you want to use Amazon CloudSearch and Ruby, Asari will make your life easier. It is actively developed, well tested and running in production today. A very special thanks to Playon Sports and Treehouse Software for contributing back to this project. You can find it over here.


Are You Data Literate?

We live in a world where data has become more important than ever. In fact, the World Economic Forum calls data the new “It currency”, every bit as valuable as oil and gold. Yet, despite the fact that data has an enormous impact on our daily lives and the way we do business, many people don’t really know how to read data. They don’t always know how to interpret statistics or graphs, and they don’t recognize when data is used incorrectly or in a misleading way.

Literacy is defined as the ability to read and write; using the written or printed word to gain and share knowledge. Similarly, data literacy is the ability to read and communicate data as information and use it to draw conclusions and make decisions. Data literacy therefore involves two skills: an understanding of statistics as well as critical thinking.

Seeing the Big Picture


Brian Fantana: “They’ve done studies, you know. 60% of the time, it works every time.”
Ron Burgundy: “That doesn’t make sense.”

(Anchorman: The Legend of Ron Burgundy, 2004)

Many people are generally averse to dealing with numbers, but keep in mind that you need statistics to make sense of data, and statistics is, at its core, a mathematical science. Hence, you have to get comfortable with basic math in order to become truly data literate. Here are a few other aspects to pay attention to.

Consider context

A single statistic doesn’t mean much on its own. Always consider the context:

– Is the latest figure different from previous ones?
– Does the statistic seem to vary over time?
– How long has data been collected for?
– Is the statistic comparing “apples to apples”?
– Can the statistic be measured in another way?
– What are the assumptions behind the statistic?
– Does the statistic tell the whole story?

Bear in mind that statistics focuses on reaching general conclusions from limited data. It can be very difficult to draw conclusions that would remain appropriate and true even if you significantly increased the size of the sample dataset or used the entire population. In other words, trends observed in a dataset of 100,000 customers may disappear when you look at 5,000,000 records or use data for every single customer on the planet.

Don’t jump to conclusions


The human brain excels at spotting patterns, which makes it easy to presume that perceived patterns in random data might be trends. That is precisely why we use statistical tests: to determine if, or when, such patterns are, in fact, trends. Consider how, why, and when a statistic was created, and whether the creator might have a hidden agenda. It’s easy to read too much into a statistic when certain facts were withheld for shock value or to sell a product. Finally, watch out for ambiguous conclusions like “could lead to”, “might be responsible for”, or “typical”.

Be clear about terminology

Some statistical terms do not mean what you think they do. For example, when a relationship between variables is statistically significant, it does not necessarily imply that the effect is huge or important. Instead, it means that it’s highly unlikely that the relationship is simply due to chance. Another concept that is often misunderstood is that correlation does not equal causation. Just because there’s an association of some sort between two variables, it doesn’t mean that one causes the other. There might be a third variable that affects both, or the correlation could simply be a coincidence.
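
As a tiny worked example of the "coincidence" point, the sketch below computes a Pearson correlation for two made-up series in Ruby; even a near-perfect value here says nothing about whether one variable drives the other:

# Pearson correlation coefficient for two equal-length samples.
def correlation(xs, ys)
  n = xs.size.to_f
  mean_x = xs.sum / n
  mean_y = ys.sum / n
  covariance = xs.zip(ys).sum { |x, y| (x - mean_x) * (y - mean_y) }
  spread_x = Math.sqrt(xs.sum { |x| (x - mean_x)**2 })
  spread_y = Math.sqrt(ys.sum { |y| (y - mean_y)**2 })
  covariance / (spread_x * spread_y)
end

ice_cream_sales = [12, 18, 24, 30, 36]
shark_attacks   = [1, 2, 3, 4, 5]

# Both rise with summer temperatures, so they correlate perfectly in this toy
# data, yet neither causes the other.
puts correlation(ice_cream_sales, shark_attacks).round(3)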


Using Google Glass For Reveal.js Speaker Notes

I have been a Glass Explorer since last November and regularly give talks at conferences and meetups. One of the first things that I really wanted was a way to have my speaker notes synced up with Glass while I am speaking. Initially I created a prototype using Wearscript, Kaas and Keynote. You can read about that here. While the solution worked well, it doesn't support the latest version of Keynote because Apple removed a bunch of AppleScript Keynote API hooks in the latest version.


Even though I love Keynote I decided to look at some other tools and decided to try Reveal.js. Reveal.js is an HTML- and JavaScript-based presentation framework that makes it relatively straightforward to create presentations that will run in any browser. It also already has a plugin that allows you to view speaker notes on another device on the same network. Basically, all you need to make it work is a browser app on Glass that can be pointed at the right URL using a barcode, plus some Glass-specific formatting of the speaker notes, and you have the perfect presentation aid.

Step 1 – Install the Glassware

Currently this is not software that is available in the Glass store. Honestly I'm not sure if it would make sense to make it an app store app, since the average user of Reveal.js is going to be a technical person who can edit HTML, etc.

First you will need to get the current version of the Android SDK. Once you have it unzipped in a directory you will need to add the adb tool to your path, e.g.:


export PATH=$PATH:~/Downloads/adt-bundle-mac-x86_64-20140321/sdk/platform-tools

Your Glass will need to be in debug mode to sideload the app. To do that go to your settings menu, swipe over to device info, tap, and then swipe to debug mode and tap to turn it on. Next you will need to grab a USB cable and connect your Glass to your computer. Make sure you select the option to trust the computer.

Now go over to this repo and grab the two files that are in the apks directory. At the command prompt, cd to the directory that has the APKs and then install them on your device via adb:


adb install CaptureActivity.apk
adb install revealjs.apk

If everything is working correctly you should be able to tap into your app menu at the "ok glass" prompt and see the reveal.js app. When you tap into it, a barcode scanner should come up.

Installing reveal.js

You will want to clone this repo into a local directory. Edit anything that you want for your presentation, updating the style, etc., in the main index.html and associated style sheets. You will also need to include the following required scripts by adding these dependencies:


Reveal.initialize({
    ...

    dependencies: [
        { src: 'socket.io/socket.io.js', async: true },
        { src: 'plugin/notes-server-glass/client.js', async: true }
    ]
});

Install Node.js. Then….


npm install
node plugin/notes-server-glass

You will want to determine the IP address of the machine you are running the presentation on, then open up a browser at http://your-ip:1947/ . You will see your main presentation along with two pop-up windows. One will be the slide notes formatted for Glass and the other will be a QR code.
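
If you have Ruby handy, one quick way to find a likely address is its standard library; a small sketch (on a machine with several network interfaces you may still need to pick the right one yourself):

require 'socket'

# Print the machine's private IPv4 addresses; one of these is normally the
# address to plug into http://your-ip:1947/ for devices on the local network.
Socket.ip_address_list.select(&:ipv4_private?).each { |address| puts address.ip_address }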


Make sure your Glass is on the same network as your computer, open the reveal.js app on your Glass, scan the QR Code and your notes should come up on your Glass. Now you can use a remote and as you flip through your slides a preview of the next one and any speaker notes will appear on your device.

Customizing Your Presenter View

If you want to change how your notes are displayed you can modify the notes.html file in the plugins/notes-server-glass directory. Our current formatting was done for function over appearance and doesn't always display the entire next slide on your screen if you have a lot of notes.

We would love to move this beyond Reveal.js and offer a solution that can be used with other presentation tools. Pull requests are always welcome, and feel free to reach out to us if you have other ideas.


RubyNation Recap


Recently I had the honor of speaking for a second time at RubyNation in Washington, DC. RubyNation is a two-day, dual-track conference that focuses on Ruby and related technologies.

Speaking On The Second Day

This year my talk wasn't until the second day. On one hand that was a good thing because I really needed the extra day to tweak my presentation; on the other hand I wasn't able to see some of the talks. While the entire lineup was interesting, two that I really wish I could have seen were the presentations by Sarah Allen and Davy Stevenson.

Other talks of note included John Paul Ashenfelter's "Machine Learning for Fun and Profit", Alex Rothenberg's "Don't Let the Cocoa API Crush Your RubyMotion Code", Evan Light's "Remote Pairing from The Comfort of Your Own Shell", Justin Searls' "Breaking Up With Your Test Suite" and Yoko Harada's "It's Java, But Wait, It's Ruby". Russ Olsen closed things out with one of the best keynotes I have ever seen, called "To The Moon!"


Beyond Web Development

It was great to see a number of presentations about using Ruby for things that go beyond web development, such as TV displays, big data, mobile applications and artificial intelligence.

Some of my takeaways include:

  • More groups are starting to use Ruby for things that go beyond web development.
  • Hooking into existing libraries for things like AI algorithms and Big Data processing are great ways to allow Ruby to do what it does best, yet work with other technologies which are better suited for the lower level work.
  • There is a general undercurrent that we as a community are going to need to move beyond web development to keep Ruby relevant.
  • Gray and the entire team keep outdoing themselves every year and it was great seeing them again!

Ruby Friends

Between RubyNation and Ruby DCamp I have had an opportunity to get to know some wonderful people in the DC community. Every time I go up there it is nice to see old Ruby friends and to make new ones!


How Much Data Do You Really Need?

We live in a world where knowledge is believed to equal power. As a result, most of us abhor the uncertainty of not knowing. When we have questions, we want to turn to data – cold, hard facts – to provide answers. In the event we don't have enough data, we develop an almost insatiable thirst for more information. However, because we intrinsically believe that simply "filling in the blanks" will allow us to make better decisions and take smarter actions, we tend to overestimate the relevance and value of more data.

“Big Data” has become the buzzword du jour in almost every industry. Consequently, many companies are enamored with data science and are investing large sums in initiatives to collect and store more data. Some of these firms are no strangers to highly sophisticated predictive modeling, while others have yet to perform any kind of real analysis, but collectively they all adhere to the tenet that more data is the Holy Grail of analytics.

Alas, nothing could be further from the truth.

Relevancy

As datasets become larger and more complex, they risk becoming less meaningful and more prone to misinterpretation. For this reason, it's important to first determine which business questions to address, and how answering them might benefit the organization. Not all questions need to be answered, especially if it's unclear how much value they truly offer.

Faced with the uncertainty of not having all the facts, decision makers often pursue more data, believing it to be relevant, when, in reality, it would have no impact whatsoever. Subconsciously, the mere emphasis on missing data can lead a person to use that data to make choices he or she would not otherwise have made. In essence, when data is not readily available, the desire to delve deeper is actually fueled by an assumption that what’s out there is potentially so valuable that one simply cannot afford to make a decision without it.

Data is generally considered relevant if it could impact a decision, albeit only in a subtle way. Data is considered instrumental if it could alter a decision entirely. For example, a company’s decision to launch a new product may depend on whether consumer panels respond to it favorably. In this scenario, feedback data is instrumental — the product will only be launched if consumers like it. On the other hand, if the company intends to continue with the launch regardless of panel opinion, feedback data is relevant but non-instrumental – it may affect packaging or marketing, but ultimately the product will still appear on the shelves.

So how do you decide what data might be relevant or instrumental?

Focus on the Problem

Although statistics teaches us that relevancy can be determined through correlation, homogeneity of variance, and regression analysis, it also depends on the problem you’re trying to solve or the question you’re trying to answer. There really are no firm rules, but focusing on the problem or question usually allows you to see that anything in a dataset that doesn’t contribute to an answer or solution is insignificant, and therefore irrelevant. It doesn’t mean the data is not important; it’s just not useful in terms of what you’re trying to accomplish.

For example, in a dataset of online sales, some transactions may be total anomalies. Others may contain obvious errors, be they random or systematic. Such records are considered irrelevant and must be corrected or removed, because unwanted variance could skew the underlying distribution and introduce bias into predictive models. If, however, the business goal is to analyze sales that deviate from the norm, anomalies become highly relevant. Hence, when the problem changes, the perspective also changes, and that determines which data is meaningful.

ROI

Another issue with collecting more data is that it's easy to hit a point of diminishing returns. Although the cost of data storage has decreased, many businesses wholly underestimate the total investment required for a Big Data initiative. Not only are new technology stacks needed to process and analyze massive datasets; such an initiative also creates a need for more skilled employees within the organization to fully leverage these technologies and derive insights from the data.

In conclusion: Big Data can quickly become overwhelming if you simply collect and store more data without considering whether the data is relevant and instrumental. Do not assume that more data equals better analytics. Instead, take the time to cut datasets down to a manageable size and learn to use them more efficiently.


Getting Started With Google Glass Development Using Ruby, the Mirror API and Heroku

While Google Glass is an Android device, you don't need to be an Android developer to start creating apps for it.

What is the Mirror API?

The Mirror API is a Google API based cloud service that hooks into a Glass device using OAuth to sync via the user's Google account. Since this Google account is needed to set up Glass, users can grant permission to an app via an OAuth sign-in, and then behind the scenes the API is used to add cards to their timeline, get location updates, etc.
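
For example, once a user has granted access, inserting a simple card into their timeline with the google-api-client gem looks roughly like the sketch below; the calls are quoted from memory and the OAuth authorization step is omitted, so treat it as an outline rather than working Glassware:

require 'google/api_client'

# A sketch only: the client is built here, but the OAuth dance that authorizes
# it for the user's timeline is left out (the quickstart project covers that).
client = Google::APIClient.new(application_name: 'mirror-quickstart-ruby')
mirror = client.discovered_api('mirror', 'v1')

client.execute(
  api_method:  mirror.timeline.insert,
  body_object: { 'text' => 'Hello from the Mirror API!' }
)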

Getting Started

To get started you will need to set up a Google developer account if you don't already have one and then log in to the Google APIs Console here. Once you are in the console, click "Create Project" to create a new project.

Next, click on "APIs & auth", make sure that "APIs" is selected, scroll down to "Google Mirror API", and toggle it on.

Now go to "Credentials" and click "Create new Client ID". You will want to choose a web application. For "Authorized Javascript Origins" you will want to enter any servers you are going to use. While working on this sample app I used http://localhost and https://my-app.herokuapp.com. For the "Authorized Redirect URI" you will want to have your OAuth callback endpoint. In my case they were http://localhost/oauth2callback and https://my-app.herokuapp.com/oauth2callback.

Now you will want to clone the mirror-quickstart-ruby project from my Github repo here. Google officially deprecated their example and it was missing a couple of important things for people to get up to speed quickly. I updated the example to make it easy to deploy onto Heroku by adding Active Record and Postgres support.

Running locally

In order to run this locally you will need to have Postgres installed. You will also need to set the following environment variables:


export PG_USER=
export PG_PASS=
export PG_PORT=
export RACK_HTTP=http

Next you will need to update the client_secrets.json file with the Client ID and Client Secret from the "Credentials" tab under "APIs & auth" in the developers console.


{
    "web": {
        "client_id": "your client id",
        "client_secret": "your client secret",
        "redirect_uris": [
            "http://localhost:9292/oauth2callback"
        ],
        "auth_uri": "https://accounts.google.com/o/oauth2/auth",
        "token_uri": "https://accounts.google.com/o/oauth2/token"
    }
}

Then all that you need to do is to install the gems, start the app and hit it with your browser.


bundle install
rackup

Running on Heroku

Some Mirror API features, such as subscribing to location updates, require an https endpoint with a verified certificate. Heroku automatically gives you a verified https endpoint when you use the default https://your_app.herokuapp.com address. To run the app on Heroku you will need to make a couple of small changes. Once you have your Heroku app created you will need to set the following environment variable on the dyno using this command.


heroku config:set RACK_HTTP=https

You will also need to modify the client_secrets.json so that your redirect_uris has only the https endpoint for your heroku app.


{
    "web": {
        "client_id": "your client id",
        "client_secret": "your client secret",
        "redirect_uris": [
            "https://.herokuapp.com/oauth2callback"
        ],
        "auth_uri": "https://accounts.google.com/o/oauth2/auth",
        "token_uri": "https://accounts.google.com/o/oauth2/token"
    }
}

Now push the app up to Heroku, and if all works correctly you should see a prompt to authenticate using OAuth; you can then add and delete cards, follow your location, and so on.

Next Steps

If you are a regular Rails user it should be pretty easy to integrate into your app. You also might want to use one of the off-the-shelf gems for OAuth, etc. Applying ERB templates to the cards would also be handy, and there are a lot of boilerplate things that would be great to have in a gem. While the Mirror API is not as fully featured as programming directly with the GDK, it makes certain apps that require notifications really easy to develop. To explore it in more detail have a look at the documentation here.


A Sane OAuth Federation Strategy With Doorkeeper in Ruby

There are a lot of articles out there about setting up one server with Doorkeeper to offer OAuth support in Ruby projects. But when you start to federate your OAuth credentials across services it turns into the wild west. I saw a few examples where people would set up Doorkeeper on every service. That may work for some, but for us it didn't feel like the right solution.

In our case we wanted to allow the user to log in once and use any of our services, with OAuth providing the security for all of them. Setting up Doorkeeper for each service would get crazy after a while: now the client has to maintain a separate set of tokens for every service, and if you want to sync them up to federate a logout you have extra layers of complexity.

Delegation

One way to solve this could be to use delegation. In this scenario all of the services can take OAuth requests, but they pass those requests on to one service that is responsible for all OAuth functionality as the system of record. This is in addition to regular requests that include the authentication token, which would require a call out to the main OAuth service.

There are a few problems with this. First, the main OAuth service is going to get a lot of traffic. Second, that OAuth service is a major point of failure. And third, how do you handle race conditions when a token expires and a refresh token is requested?

A Hybrid Approach

We ended up adopting a hybrid approach to get the best of both worlds.

All of our OAuth functionality is centralized in one service. All calls for grant tokens, and token requests from a grant or refresh token, go against that one OAuth service. The other services each contain a data store (you could use Redis or a SQL database for this). When a request comes in with an OAuth authentication token, if it's the first time this token has been used with this service, a call is made to the main OAuth service over a secure connection to a custom endpoint. The endpoint verifies that the token is valid. If it is, the service returns any user-specific metadata we need along with the expiration DateTime. These values are written to the data store on the calling service and the request is serviced successfully. On subsequent requests with the same token the values from the data store are used and no calls go back to the main OAuth service. When the token expires the service returns a 401 and the client is responsible for using the refresh token (if you are using one) to get another authentication token from the Doorkeeper-based OAuth service.

Show Me the Code

In our services we add a method to our ApplicationController that looks something like the following:


class ApplicationController < ActionController::API
  def check_authorization
    authorization = request.headers['Authorization']
    if authorization
      @user = User.where(token: authorization).last 
      if !@user or @user.expires_at < DateTime.now
        party_response = HTTParty.get("http://our_service_url/check_key.json", query: {'signature' => our_unique_req_signature, 'oauth_token' => authorization})        
        parsed_response = party_response.parsed_response
        if parsed_response['user_id']
          @user = User.where(user_id: parsed_response['user_id']).last
          @user ||= User.new
          @user.update_attributes(user_id: parsed_response['user_id'],
                                email: parsed_response['email'],
                                token: authorization,
                                expires_at: Marshal.load(parsed_response['expires_at'].force_encoding('UTF-8')))
        else
          response.status = 401
          render json: {authorized: false} and return
        end
      end
    else
      response.status = 401
      render json: {authorized: false} and return
    end
  end
end

This code is then used as a before_filter for any actions that need to be protected.
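
In a downstream service that looks something like the following; the controller and action are invented for the example:

class WidgetsController < ApplicationController
  # Every action in this controller now requires a valid OAuth token;
  # check_authorization renders a 401 and halts the chain otherwise.
  before_filter :check_authorization

  def index
    render json: { widgets: [], requested_by: @user.user_id }
  end
end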

In our OAuth service we add an action to verify the signature.


  def check_key
    if signature_valid? params['signature']
      access_token = Doorkeeper::AccessToken.where(token: params['oauth_token']).last
      provider = nil
      if access_token and access_token.resource_owner_id
        provider = Provider.find access_token.resource_owner_id
      end

      if provider and provider.user_metadata
        metadata = JSON.parse provider.user_metadata
        expires_at = access_token.created_at + access_token.expires_in
        metadata['expires_at'] = Marshal.dump(expires_at.to_s.force_encoding("ISO-8859-1"))  
        render json: metadata
      else
        response.status = 401
        render json: {}
      end
    else
      response.status = 401
      render json: {}
    end    
  end

The Provider object is whatever you set up your resource_owner_id to point to in Doorkeeper. The Doorkeeper::AccessToken is where Doorkeeper stores all of its access tokens.
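
For reference, the link between Doorkeeper and that Provider record is made in Doorkeeper's initializer; a minimal sketch, assuming a session-based login during the authorization flow (the Provider lookup and login_url are illustrative):

# config/initializers/doorkeeper.rb
Doorkeeper.configure do
  # Whatever this block returns becomes the token's resource_owner_id,
  # which check_key later uses to look the Provider back up.
  resource_owner_authenticator do
    Provider.find_by(id: session[:provider_id]) || redirect_to(login_url)
  end
end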

Now we have one system of record for all OAuth authentication, things are federated, and we have maintained a secure system that can handle a large volume of requests. The only thing the client needs to worry about is one OAuth token instead of three.


Before You Can Think Big Data, You Need to Think Big Clean!

“Big Data” and “data science” have become buzzwords and focus areas for many companies. However, the key to predictive analytics is not data size or more sophisticated tools; it’s clean data.

Quality Matters

In a previous post about the importance of data quality, we discussed why poor data quality hinders, and often prevents, analysts from performing successful analyses that yield reliable insights. In order to fully understand a dataset, it is imperative for a data analyst to interact with the data and explore analysis variables, e.g. number of sales or order amount.

This can be very difficult when data sources are disjointed, unreliable, and in formats not conducive to plotting or descriptive statistics. As a result, it is typical for analysts to spend up to 85% of the time specified for analysis on locating, normalizing, and cleaning data. As data grows, the room for error grows equally fast. New technologies are used to capture and store data, input errors occur, and variables are added, removed, or become obsolete whenever product or service offerings change.

Start Small

Since most organizations today have multiple data sources of different ages, sizes, and levels of complexity, data cleaning can seem daunting, costly, and time-consuming. It is therefore a good idea to start small and focus efforts on a dataset that could potentially answer the most pertinent business questions.

A worthwhile practice is to implement staging tables or data middleware to load data in batches and perform initial transformations. Such a repository is especially helpful when attempting to identify weaknesses in data acquisition, as well as factors that contribute to poor data quality. At this stage, a scrubbing tool can often be used to evaluate the data based on preset rules. Once data is readily available in a usable format, the data cleaning process generally involves looking for and correcting obvious errors, outliers, missing values, and duplicates.

Structure

A common issue with raw data is lack of structure. Files may lack separators, headers, and labels, or contain wrong data types and errors in character encoding. If an integrated dataset is not consistently formatted, it should be transformed into a rectangular set that enforces conventions and constraints. This process helps to make a dataset "technically correct", but data is only truly consistent when missing values, duplicates, and outliers have been augmented, removed, or updated. Variable values must be consistent within a record, across records, and across datasets.

Missing Data

The next step in data cleaning is to perform a missing data analysis to locate missing values and see if any patterns of missing data exist. Data may be missing due to lack of knowledge, data entry and processing issues, or programming errors. If missing values due to input problems are valid for analysis, missing fields should be imported in ways that would keep the rest of the dataset aligned. It is, however, key to determine if missing values are random, which typically have a limited effect, or systematic, which have a much larger effect on analysis and predictive modeling and usually indicate a more serious problem. Also consider the type of missing values and the general importance of the variable. Decide whether missing values should be imputed with replacement values, manipulated, or removed from the dataset.
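
As a small illustration of that first pass, the sketch below counts missing values per field over an array of hashes in plain Ruby; the record layout is made up:

# Count missing (nil or blank) values per field to spot patterns of missingness.
records = [
  { id: 1, email: 'a@example.com', age: 34,  zip: '30301' },
  { id: 2, email: nil,             age: nil, zip: '30305' },
  { id: 3, email: '',              age: 27,  zip: nil }
]

missing_counts = Hash.new(0)
records.each do |record|
  record.each { |field, value| missing_counts[field] += 1 if value.nil? || value == '' }
end

missing_counts.each { |field, count| puts "#{field}: #{count} missing of #{records.size}" }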

Outliers

Outlier handling is usually the final step in data cleaning. Outliers are values that fall far outside the accepted normal range and influence or skew the results of a statistical analysis. Descriptive statistics can be used to check that data fall within logical or acceptable limits. These summaries will identify instances where an age value is 700 years or price is an unrealistic or implausible amount. Again, such records can be disregarded or corrected based on preset rules.
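
A simple range check of that kind might look like the following sketch; the fields and limits are invented for the example:

# Flag records whose values fall outside plausible limits for manual review.
sales = [
  { id: 1, age: 42,  price: 215_000 },
  { id: 2, age: 700, price: 198_500 }, # age is clearly a data entry error
  { id: 3, age: 35,  price: -1 }       # a negative price is implausible
]

outliers = sales.select { |row| !(18..110).cover?(row[:age]) || row[:price] <= 0 }
outliers.each { |row| puts "Record #{row[:id]} needs review: #{row.inspect}" }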

Data cleaning is a multi-step process that requires attention to detail and an ongoing commitment to making changes at the data source level. Use data cleaning methods that make sense for a particular dataset and communicate the importance of data quality to data users and anyone in the organization involved in data capture, entry, or storage.

Summing It Up

In closing: Every day, more and more companies start to tap into the power of data science. That said, without clean and reliable data, analytics are likely to take much longer, cost significantly more, and only yield limited benefit. Don’t make the mistake of focusing on the “big” in “big data”, because you could easily end up with a 1 PB dataset and yet only be able to use 5 KB.


Garbage In, Garbage Out: Why Data Quality is the Foundation for Good Analytics

“Analytics is the discovery and communication of meaningful patterns in data.” (Wikipedia, May 2014)

What Are Analytics?

Per its definition, analytics is an umbrella term for the process of gaining knowledge from data and communicating meaningful insights. Companies are increasingly turning to analytics of business data to evaluate and/or improve their marketing mix, sales efforts, inventory management, etc. Business analytics encompasses a variety of techniques, including data analysis, data mining, quantitative statistical analysis, and predictive modeling.

Data analysis is a series of steps for reviewing, cleaning, modifying, and modeling data in order to visualize trends, make predictions, or take certain actions. Arguably, the most important aspect of data analysis is assessing data quality. Data quality focuses on ensuring that data used for analysis is considered “fit for use” by data consumers. This means that the data are accurate, complete, relevant, and readily accessible in a format that can be used for analysis.

Cleaning Your Data

Data_Cleansing_Cycle_350px
Industry reports suggest that more than 60% of company data sources contain a surprisingly large number of data quality issues. Cleaning data is therefore an essential part of data analysis. Data cleaning is typically a two-step process: first detect errors in a dataset, then correct them.

Frequency counts are often used to assess data quality and detect errors such as the following (see the short sketch after this list):

  • Inaccurate data entry of raw values.
  • Character variables that contain invalid values.
  • Numeric values that fall outside certain ranges.
  • Missing values.
  • Duplicate entries.
  • Values that violate rules for uniqueness.
  • Invalid date values.
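
For instance, a quick frequency count over a single column will surface several of these issues at once; a sketch in plain Ruby with made-up values:

# Tally raw values for one field to spot invalid entries, duplicates, and
# inconsistent spellings before any real analysis starts.
states = ['GA', 'GA', 'ga', 'Georgia', 'G A', nil, 'GA']

frequencies = states.each_with_object(Hash.new(0)) { |value, counts| counts[value] += 1 }
frequencies.sort_by { |_, count| -count }.each do |value, count|
  puts "#{value.inspect}: #{count}"
end
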
Statistical Analysis

Descriptive statistics such as mean, median, and standard deviation, as well as maximum and minimum values, can also be used to provide simplified summaries of large amounts of data. For instance, consider a dataset of sales of single-family homes by zip code for the previous calendar year.

Ideally, there would be few or no missing values, especially for important variables like sales price. Additionally, it can be assumed that variables such as square footage or number of days listed would only have positive numeric values. Since sales prices are generally expected to fall inside a reasonable range of values, the mean sales price is expected to be greater than the standard deviation. If not, this would suggest an issue with extreme minimum or maximum values. Finally, by comparing two variables with a general pattern of association between them, outliers with values far outside the expected range can also be identified.

Quality Matters

Analysts who use datasets with poor data quality often have to spend as much as half the time needed for analysis on data cleaning in order to avoid drawing erroneous conclusions that could lead to costly mistakes. When using inferential statistics, it is even more important to ensure that a dataset is as complete, correct, and relevant as possible.

It truly doesn't matter whether you're a large company with multiple data warehouses or a start-up that uses spreadsheets; the most challenging part about using data is deciding where to focus your efforts. Make data quality a priority if you rely on data to take actions, make decisions, or predict outcomes.


HTTParty Cookies and Devise

So you are working in Ruby, you like to party, you like HTTP and you like cookies. For many Ruby developers HTTParty is a great way to make HTTP calls from your Ruby code to a web service. When you have an unsecured web service, using it is really straightforward. When you are working with secured web services you will need to pass in security credentials. Sometimes it will be a token that is passed as a parameter, sometimes services will require that you set an authentication header, and others rely on information being set in cookies.

Why Cookies?

Many services will use a token passed in as a parameter, or in the case of OAuth you will set an Authorization header with various keys. In our case we had a service that allows users to log in via Devise, and another that handles universal login for a number of services and can use our Devise-based login as one of many mechanisms for authenticating into our system. The details of that solution are the subject of a future post.

In our system the user logs into the Devise-based system and Devise sets a cookie in the user's browser. The user is then redirected to our OAuth system, which verifies the credentials by making a request with the session cookie back to the Devise system. Once that comes back successful, an OAuth grant is created, a cookie is set in the user's browser for use by other web services, and the user is redirected to the page they were initially authenticating for.

Devise used to have a token mechanism for authentication called TokenAuthenticatable, but in more recent versions it has been taken out. Since we already have a browser that has the session cookie, instead of re-inventing the wheel we decided to create a verification endpoint on the Devise-based system that checks whether the user is logged in and sends a success message if they are. Then our OAuth provider can read the cookie, use HTTParty to send a request with that cookie to the Devise system, verify it, and create its OAuth grants. But how do you send cookies with HTTParty?

Show Me The Code

Install HTTParty

    
    gem install httparty
    

The Right Way

If you only need to send the cookie once you can pass it as a cookies parameter to a call. For example, if you are performing a GET request your request might look like this.

    
    HTTParty.get "http://www.purrprogramming.com", cookies: {cat_password: 'dogs_drool'}
    

Some Interesting Side Effects

You might be tempted to use HTTParty by including it in your class.

    
    class WildPurrProgrammingParty
      include HTTParty
    
      def self.perform_session_actions(catnip_cookie_value)  
        self.get 'http://www.purrprogramming.com', cookies: { _catnip_session: catnip_cookie_value} 
      end
    end
    

When we run the code:

    
    WildPurrProgrammingParty.perform_session_actions 'dogs drool'
    

The cookies passed to the server will look like this:

    
    {"_catnip_session" => "dogs drool"}
    

But if you execute the request again, the cookies look like this:

    
    {"request_method"=>"GET", "_catnip_session"=>"dlpvODRxdlpSQW5Eb3QyeWZ6bHpkUU9qU2FydGJIWllRdVVIbVFqaTdUMVE2eWs1eUdkSHBkR2JtSXBpRy9qZnhUMzljVlM1dVArOU5yRUVJeEY4bkNSM2xucnAzbUxFT2xyb1VobklVd0lIOHZHRDh2aWQvdTF6WTIvZ2U4ajAxTXBndFJmNWRhM3d5N3YzL0RQbVdHTWhCWjhzSzY1K0RBd0twM1l6VmhqckZvVk5TYzhtNEQ1SnFxWUZXc28vQkk4Sm9jK0EyUUNkbm1yNDIyKy84eC9JTEtweU1TZklRa3hKUDMzYkFkaz0tLXk4THJPS2NwaHFERTZkdXV1Q3lUYlE9PQ==--390aeaf728762b133a0543d7b7f7ed811be530c0", "HttpOnly"=>""}
    

What happened is that the cookies returned from the first request overwrote the cookies you passed into the get request on the second call. In this scenario no cookies passed into the get command will be set after you make your first call. And because you are calling this at the class level, this actually sets these cookies for all instances of this class.

This is the correct way to do it:

    
    class WildPurrProgrammingParty
      include HTTParty
    
      def perform_session_actions(catnip_cookie_value)
        self.class.cookies(_catnip_session: catnip_cookie_value)
        self.class.get 'http://www.purrprogramming.com', {}
      end
    end
    

The takeaway from this is that you can set cookies with HTTParty; the safest way is to pass them directly in the call at the class level, but be careful if you set them via an include into your class.
