How Good Management Can Produce Bad Data

As a long-time analytics practitioner, I am well aware of the dangers of using data without fully understanding where it comes from and how it is generated.

There are numerous ways in which the data one gets isn’t quite what it seems: the same data item may be named differently in different systems, different items may share the same name, or an item may be defined in ways subtly different from what common sense would suggest.

All these (and more) are well-known issues that surround data and explain why seasoned analytics experts claim that the lion’s share of an analytics project is likely to be data-cleansing, transformation etc. rather than modeling.

But recently I came across an unusual source of bad data: good management.

We have been working with a retailer on ways to assign their store point-of-sale transactions to households so that we can analyze a family’s purchase patterns across multiple shopping trips.

A common problem in this sort of exercise is the need to group individual shoppers into households. Since different members of a household may have different names, credit cards etc., the customer is often asked for a phone number at the checkout by the cashier who’s ringing up the sale.

Using third-party databases of landline numbers and mobile numbers, we can identify which of the supplied phone numbers is a landline number. Armed with this, we can collect all the transactions with the same landline number and infer that all these purchases were made by the same household.
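
To make the grouping step concrete, here is a minimal Python sketch. It assumes each transaction is a record with a phone field and that the third-party lookup has already been reduced to a set of known landline numbers. The field names (txn_id, phone) and the sample numbers are made up for illustration and are not the retailer's actual schema.

```python
from collections import defaultdict


def group_by_landline(transactions, landlines):
    """Map each known landline number to the ids of the transactions that carry it."""
    households = defaultdict(list)
    for txn in transactions:
        phone = txn.get("phone")
        # Mobile numbers and missing numbers are skipped: only a landline
        # is treated as evidence of a shared household.
        if phone in landlines:
            households[phone].append(txn["txn_id"])
    return dict(households)


# Hypothetical sample data.
transactions = [
    {"txn_id": 1, "phone": "6175550101"},
    {"txn_id": 2, "phone": "6175550199"},  # not in the landline set
    {"txn_id": 3, "phone": "6175550101"},
    {"txn_id": 4, "phone": None},          # cashier captured no number
]
landlines = {"6175550101"}

print(group_by_landline(transactions, landlines))
# {'6175550101': [1, 3]}
```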

Can you spot the weak link in this straightforward scheme?

The cashier has to remember to ask the customer for their phone number. It is extra work for the cashier and when there’s a long line of impatient customers in front of you, it is easy to forget.

So what can we do? Incentives to the rescue!

Management sensibly (after all, they were heeding the legendary Peter Drucker’s advice: “What gets measured gets managed”) decided to give store associates a cash bonus based on how many phone numbers they captured.

As expected, the phone number capture rate went up after the incentives were put in place and the retailer was able to assign many more transactions to households than before.

But we noticed some oddities:

  • Some households visited a single store twenty or thirty times a day!
  • Some households had several hundred store transactions annually!
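
Oddities like these can be surfaced with simple counts. The sketch below is one way to do it in Python, not the check we actually ran: it counts transactions per phone number per store per day and per phone number per year, and flags any number that exceeds a threshold. The thresholds (10 a day, 300 a year), the field names, and the ISO date strings are all assumptions chosen for illustration.

```python
from collections import Counter


def flag_suspicious_households(transactions, max_per_store_day=10, max_per_year=300):
    """Return the phone numbers whose transaction counts look implausible for one household."""
    per_store_day = Counter()
    per_year = Counter()
    for txn in transactions:
        phone = txn["phone"]
        per_store_day[(phone, txn["store_id"], txn["date"])] += 1
        per_year[(phone, txn["date"][:4])] += 1  # "YYYY-MM-DD" -> "YYYY"

    flagged = {phone for (phone, _, _), n in per_store_day.items() if n > max_per_store_day}
    flagged |= {phone for (phone, _), n in per_year.items() if n > max_per_year}
    return flagged


# Hypothetical sample: one number appears 25 times in one store on one day.
sample = (
    [{"phone": "6175550101", "store_id": "S1", "date": "2010-03-01"}] * 25
    + [{"phone": "6175550102", "store_id": "S1", "date": "2010-03-01"}] * 2
)

print(flag_suspicious_households(sample))
# {'6175550101'}
```

A flagged number is only a candidate for review. A genuinely large or frequent-shopping household could trip the same thresholds.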

We studied these odd cases and discovered something interesting: these “crazy shopping” households were really dozens of households rolled into one! The reason these distinct households were grouped together was that they had a common phone number.

And how did they end up with a common phone number?

Because the cashier who rang up their purchases punched in the same phone number for everyone.

Perhaps these customers declined to supply a phone number, perhaps the cashier neglected to ask. Who knows?

Whatever the reason, for a small number of cashiers, it was just too tempting to simply punch in a fake phone number and make their bonus rather than do the right thing.

After this came to light, the retailer was able to mitigate this “phone number fraud” by first cross-checking every entered phone number against a list of store phone numbers, cashier phone numbers and other known numbers. This was a good first step, but it was not enough. We are continuing to refine the fraud mitigation algorithm using data mining techniques.
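
The cross-check itself is easy to sketch. The Python below is an illustration under stated assumptions, not the retailer's screening code: it assumes 10-digit North American numbers, normalizes whatever the cashier typed down to bare digits, and rejects anything that is malformed or appears on a blocklist of store and cashier numbers. The function names and the sample numbers are hypothetical.

```python
import re


def normalize_phone(raw):
    """Reduce a typed phone number to 10 digits, or return None if it is malformed."""
    digits = re.sub(r"\D", "", raw or "")
    # Drop a leading country code of 1 (assumes North American numbers).
    if len(digits) == 11 and digits.startswith("1"):
        digits = digits[1:]
    return digits if len(digits) == 10 else None


def is_usable_phone(raw, blocklist):
    """True if the number is well formed and is not a known store or cashier number."""
    phone = normalize_phone(raw)
    return phone is not None and phone not in blocklist


# Hypothetical blocklist of store and cashier numbers.
blocklist = {normalize_phone(n) for n in ["(617) 555-0100", "617-555-0142"]}

print(is_usable_phone("617.555.0100", blocklist))  # False: it is a store number
print(is_usable_phone("617-555-0177", blocklist))  # True
print(is_usable_phone("555-0177", blocklist))      # False: too short
```

Normalizing before the lookup matters because the same number can be typed with or without punctuation or a leading 1. A check like this only catches numbers already on the list, which is why it was not enough on its own.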

What did I learn from this experience?

I have resolved that whenever I am working with data that was created by people (rather than produced by machines), I will try to understand if the data may be distorted by incentives affecting the behavior of the person(s) creating the data.

And the next time a cashier is ringing up your purchases in a store, see if he/she is entering what looks like a 10-digit number without even asking you 🙂

11 thoughts on “How Good Management Can Produce Bad Data”

  2. @narayan,

    > the sales clerk goes “have a good day, mr. ven … venk … wait a minute! don’t tell me, i can do this! .. venkasumbarmay .. oh $#*& it! have a good day, sir!”

    Hilarious!!

    -Rama

  3. my local grocery store has a rather simple solution to the problem: every time i check out, i punch in my phone number to get a few bucks off each bill (there are substantial differences between standard prices and “member” prices). clearly, it is in my interest to make sure i punch it in right.

    of course, they registered our address and phone number once; ever since then, they can consolidate all our household purchases regardless who goes, what form of payment they use, etc.

    i guess this works only if (a) both the store and the customer are willing to go through the pain and expense of registering and (b) the store is willing to give the customer some incentive to give them their phone number every time they shop there. (there is a hidden downside to this: when it prints out the receipt, the sales clerk goes “have a good day, mr. ven … venk … wait a minute! don’t tell me, i can do this! .. venkasumbarmay .. oh $#*& it! have a good day, sir!”)

  4. @Sriram: Re: your idea of a privacy-supporting “universal” shopping identifier – see http://www.myreceipts.com, a promising startup with pilots in place with Whole Foods, Best Buy, Amazon.com and others. The attraction is that it’s an entirely opt-in process, so you automatically solve the motivational issue. The downside is coverage – it will become truly useful only if it gains widespread acceptance.

    Rama, nice to see you back on the blogging saddle!

  5. Sriram,

    I am glad that my screening algorithm doesn’t have to go up against you! 🙂

    Your thoughts on common fraud patterns are spot-on. We’re seeing progressions, last-two-digit swapping and so on. There is a diminishing returns effect here, of course, and we just want to capture the bulk of the patterns quickly and ignore the rest.

    > If I am asked for a phone number at a grocery store, I am as likely to make up something as the sales clerk is.

    Customers do make up numbers (111-111-1111 is the most common) rather than decline and the screening algorithm worries about that too.

    > Unlike a phone number, e-mail address, etc, this payphrase serves the purpose of telling the store who I am without telling them how they can reach me.

    The payphrase idea brings its own set of issues. For the payphrase to work, different members of the household all have to use the same phrase when they shop, the cashier has to type it in exactly the same way and so on.

    > So all I have to do at Shop Right next time is say “Unintended Consequences” when the sales clerk asks me for my two-word family identifying phrase and I am all set

    🙂

    Rama

  6. Rama,

    You are being too charitable. The post should be titled “How Unintended Consequences can produce bad data”

    Here are a few reactions that I would look out for as you are refining the screening algorithm for fraud detection (OR what would I do if I were a rational, amoral sales clerk):

    a) Arithmetic Progressions: Instead of entering the same phone number, the clerks start creating simple patterns. The oldest trick in the book is a simple AP with a d of 1. This is the algo I use to “change” my password at work. IT is happy, and it is cognitively simple.

    b) Numerograms: Play a bit of switcheroo with the last 4 digits.

    c) Subtle variation on the AP: Increase the area code by 1

    d) Pseudo-random number generation: Keep the area code the same, and just make up the next 7 numbers.

    e) DOBs of ex-girlfriends/boyfriends + 2 randomly generated numbers.

    Needless to say, (d) & (e) will be the hardest to catch.

    A couple of more fundamental issues with the “ask for phone number” method. First, if I am asked for a phone number at a grocery store, I am as likely to make up something as the sales clerk is. Secondly, with the migration to mobile phones, aren’t phone numbers as poor a proxy for household membership as credit cards or names?

    I have been thinking about this problem as a shopper for a while. I like the deals that I get from grocery stores with my preferred membership card. I am also notoriously forgetful when it comes to carrying the store-specific cards with me. I have hoped for a while that I could use a common identifier across all stores without compromising my identity. Something like Amazon’s “One Click Payphrase” (https://payments.amazon.com/sdui/sdui/helpTab/Checkout-by-Amazon/Advanced-Integration-Help/Set-Up-PayPhrase-1-Click-and-Express-Checkout).

    The advantage of this method is that I do not feel I am giving away a piece of my identity. Unlike a phone number, e-mail address, etc, this payphrase serves the purpose of telling the store who I am without telling them how they can reach me. So all I have to do at Shop Right next time is say “Unintended Consequences” when the sales clerk asks me for my two-word family identifying phrase and I am all set…..

  7. @Satish: Great questions.
    > What if they had said “the phone numbers you entered could be audited and could be grounds for termination if entered fraudulently”?!

    Makes sense but there's a technical issue here. While it is not difficult to check if a number is a real phone number, it is not easy or cheap to check if it is indeed the phone number of that particular customer.

    > Perhaps a bigger problem with data entered by humans is “fat fingering” type of errors. In the specific case you depicted, what if the cashier’s intentions were genuine but he just entered one or two numbers incorrectly and did not cross-check with the customer?

    That happens a lot but it is easy to detect and correct for since that same phone number won't be associated with that same cashier *across* transactions.

    > the POS UI should pick up on all previous known customer phone numbers in an “auto-complete” fashion

    Very good idea but the retailer doesn't do it due to privacy reasons. They don't want anyone, including the cashier, to have access to the phone number database.

    > sales clerks need to understand and appreciate WHY they need to do certain things. In this case, if they were told that it is important to capture numbers accurately because…..they may do the job more conscientiously. Otherwise, they couldn’t care a rat’s backside about what number they capture.

    In this case, the sales clerks are actually aware of why it is needed and most of them do follow through appropriately. But there's a minority that still doesn't give a rodent's posterior 🙂

  8. And one more thing… sales clerks need to understand and appreciate WHY they need to do certain things. In this case, if they were told that it is important to capture numbers accurately because…..they may do the job more conscientiously. Otherwise, they couldn’t care a rat’s backside about what number they capture.

    I have seen grocery store clerks scan one item and enter a “X 2” to account for another item that looks similar but has a different bar code. If only they knew the impact on the inventory system, they may minimize such errors.

  9. Great write-up Rama! Companies struggle to do ETL (even though there are lots of tools for the same) because of all the quirks like you pointed out. In this specific example, management seems to have ignored the “stick” when setting up the procedure. This needs to go hand-in-hand with the “carrot”. What if they had said “the phone numbers you entered could be audited and could be grounds for termination if entered fraudulently”?! Perhaps a bigger problem with data entered by humans is “fat fingering” type of errors. In the specific case you depicted, what if the cashier’s intentions were genuine but he just entered one or two numbers incorrectly and did not cross-check with the customer? Here is another thought… maybe the POS UI should pick up on all previous known customer phone numbers in an “auto-complete” fashion. This may even speed up data entry and ensure re-use of actual customer numbers.

    Satish
