How to Use Causal Inference in Day-to-Day Analytical Work (Part 1 of 2)

Analysts and data scientists operating in the business world are awash in observational data: data that’s generated in the course of the operations of the business. This is in contrast to experimental data, where subjects are randomly assigned to different treatment groups, and outcomes are recorded and analyzed (think randomized clinical trials or A/B tests).

Experimental data can be expensive or, in some cases, impossible/unethical to collect (e.g., assigning people to smoking vs non-smoking groups). Observational data, on the other hand, are very cheap since they are generated as a side effect of business operations.

Given this cheap abundance of observational data, it is no surprise that ‘interrogating’ this data is a staple of everyday analytical work. And one of the most common interrogation techniques is comparing groups of ‘subjects’ — customers, employees, products, … — on important metrics.

Shoppers who used a “free shipping for orders over $50” coupon spent 14% more than shoppers who didn’t use the coupon.

Products in the front of the store were bought 12% more often than products in the back of the store.

Customers who buy from multiple channels spend 30% more annually than customers who buy from a single channel.

Sales reps in the Western region delivered 9% higher bookings-per-rep than reps in the Eastern region.

Comparisons are very useful and give us insight into how the system (i.e. the business, the organization, the customer base) really works.

And these insights, in turn, suggest things we can do — interventions — to improve outcomes we care about.

Customers who buy from multiple channels spend 30% more annually than customers who buy from a single channel.

30% is a lot! If we could entice single-channel shoppers to buy from a different channel the next time around (perhaps by sending them a coupon that only works for that new channel), maybe they will spend 30% more the following year?

Products in the front of the store were bought 12% more often than products in the back of the store.

Wow! So if we move weakly-selling products from the back of the store to the front, maybe their sales will increase by 12%?

These interventions may have the desired effect if the data on which the original comparison was calculated is experimental (e.g., if a random subset of products had been assigned to the front of the store and we compared their performance to the ones in the back).

But if our data is observational (some products were selected by the retailer to be in the front of the store for business reasons; given a set of channels, some customers self-selected to use a single channel while others used multiple channels), we have to be careful.


Why? Because comparisons calculated from observational data may not be real. They may NOT reflect how your business really works, and acting on them may get you into trouble.
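To make the danger concrete, here is a minimal simulation sketch of the store-placement example. Everything in it is an assumption made for illustration: the function name `observed_lift`, the "popularity" variable, and all of the numbers are made up, and the true effect of front-of-store placement is deliberately set to zero. In the observational version, the retailer tends to put already-popular products up front; in the randomized version, placement is a coin flip.

```python
import random
from statistics import mean

def observed_lift(randomized, true_lift=0.0, n_products=20_000, seed=0):
    """Return the front-vs-back sales lift seen in simulated data.

    Hypothetical setup: each product has a hidden 'popularity' that drives
    its sales. Placement changes sales only by `true_lift` (0.0 = no effect).
    """
    rng = random.Random(seed)
    front, back = [], []
    for _ in range(n_products):
        popularity = rng.gauss(100, 20)  # hidden driver of sales
        if randomized:
            # Experimental data: placement is a coin flip.
            in_front = rng.random() < 0.5
        else:
            # Observational data: popular products tend to be placed up front.
            in_front = popularity + rng.gauss(0, 10) > 110
        sales = popularity * ((1 + true_lift) if in_front else 1.0)
        (front if in_front else back).append(sales)
    return mean(front) / mean(back) - 1

naive = observed_lift(randomized=False)
experimental = observed_lift(randomized=True)
print(f"observational comparison: {naive:.1%} lift")
print(f"randomized comparison:    {experimental:.1%} lift")
```

Under these assumptions, the observational comparison shows a sizable "lift" even though placement does nothing, because the front-of-store products were more popular to begin with. The randomized comparison stays close to zero.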

How can we tell if a comparison is trustworthy? Read the rest of the post on Medium to learn how.


Create a Common-Sense Baseline First

When you set out to solve a data science problem, it is very tempting to dive in and start building models.

Don’t. Create a common-sense baseline first.

A common-sense baseline is how you would solve the problem if you didn’t know any data science. Assume you don’t know supervised learning, unsupervised learning, clustering, deep learning, whatever. Now ask yourself, how would I solve the problem?
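As one possible illustration (not necessarily the kind of baseline the full post has in mind), a baseline for a classification problem can be as simple as "always predict the most common outcome". The sketch below uses made-up churn labels; the names `majority_baseline` and `accuracy` and the toy data are hypothetical.

```python
from collections import Counter

def majority_baseline(train_labels):
    """Build a 'model' that ignores its input and predicts the most common label."""
    most_common_label, _ = Counter(train_labels).most_common(1)[0]
    return lambda _features: most_common_label

def accuracy(predict, rows, labels):
    """Fraction of rows for which the prediction matches the true label."""
    hits = sum(predict(row) == label for row, label in zip(rows, labels))
    return hits / len(labels)

# Toy, made-up data: most customers stay, a few churn.
train_labels = ["stay"] * 9 + ["churn"]
test_rows = [{"tenure_months": m} for m in (1, 5, 12, 30, 48)]
test_labels = ["churn", "stay", "stay", "stay", "stay"]

baseline = majority_baseline(train_labels)
baseline_accuracy = accuracy(baseline, test_rows, test_labels)
print(f"baseline accuracy: {baseline_accuracy:.0%}")
```

Any model built afterwards has to beat this number to justify its extra complexity.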

Read the rest of the post on Medium

I have data. I need insights. Where do I start?

This question comes up often.

It is typically asked by starting data scientists, analysts and managers new to data science. Their bosses are under pressure to show some ROI from all the money that has been spent on systems to collect, store and organize the data (not to mention the money being spent on data scientists).

Sometimes they are lucky – they may be asked to solve a very specific and well-studied problem (e.g., predict which customer is likely to cancel their mobile contract). In this situation, there are numerous ways to skin the cat and it is data science heaven.

But often they are simply asked to “mine the data and tell me something interesting”.

Where to start?

This is a difficult question and it doesn’t have a single, perfect answer. I am sure experienced practitioners have evolved many ways to do this. Here’s one way that I have found to be useful … (read the rest of the post on Medium)

Handy Command-Line One-liners for Starting Data Scientists

[6/5/2017 update: I was asked if I had a PDF version of the one-liners below. Here it is: Data-Science-One-Liners.pdf]

Experienced data scientists use Unix/Linux command-line utilities (like grep, sed and awk) a great deal in everyday work. But starting data scientists, particularly those without programming experience, are often unaware of the power and elegance of these utilities.

When interviewing candidates for data scientist positions, I ask simple data manipulation questions that can be done with a command-line one-liner. But often the answer is “I will fire up R, import the CSV into a data frame, and then …” or “I will load the data into Postgres and then …”.

The command line can be much simpler and faster, especially for getting large data files ready for consumption by specialized tools like R. For example, rather than try to load a million-row CSV into R and sample 10% of it, you can quickly create a 10% sample using this one-liner … (read the rest of the post on Medium)
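The one-liner itself is not included in this excerpt, and the sketch below is not it. It is a separate Python illustration of the same underlying idea: stream through the file one line at a time and keep each data row with probability 0.10, so the full file never has to be loaded into memory. The function name `sample_lines` and the in-memory demo data are assumptions made for this example.

```python
import io
import random

def sample_lines(lines, fraction=0.10, seed=None, keep_header=True):
    """Yield the header (optionally) plus a random ~`fraction` of the remaining lines."""
    rng = random.Random(seed)
    it = iter(lines)
    if keep_header:
        header = next(it, None)
        if header is not None:
            yield header
    for line in it:
        # Each row is kept independently, so the sample size is only ~fraction.
        if rng.random() < fraction:
            yield line

# Demo on an in-memory "CSV" with 10,000 made-up rows.
demo = io.StringIO("id,value\n" + "".join(f"{i},{i * i}\n" for i in range(10_000)))
sampled = list(sample_lines(demo, fraction=0.10, seed=42))
print(len(sampled) - 1, "data rows kept")
```

For a real file, the same generator can be fed an open file handle and written out with `writelines`. Because each row is kept independently, the result is approximately, not exactly, 10% of the rows.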

AlphaGo is Here. What’s Next?

One of the most dramatic events in 2016 was the triumph of Google DeepMind’s AlphaGo AI program against Lee Sedol of South Korea, one of the world’s top Go players.

This was a shock to many. Chess fell to AI many years ago, but Go was thought to be safe from AI for a while. AlphaGo’s success set off a flurry of questions. Is AI much further along than we think? Are robots with human-level intelligence just around the corner?

Experts have lined up on both sides of these questions and there’s no shortage of perspectives. I wanted to share two that particularly resonated with me.

In an Edge interview on big data and AI (which is a great read in its entirety, btw), Gary Marcus of NYU highlights a key requirement of systems like Google DeepMind’s Atari AI and AlphaGo AI.

You’d think if it’s so great let’s take that same technique and put it in robots, so we’ll have robots vacuum our homes and take care of our kids. The reality is that in the [Google DeepMind] Atari game system, first of all, data is very cheap. You can play the game over and over again. If you’re not sticking quarters in a slot, you can do it infinitely. You can get gigabytes of data very quickly, with no real cost.

If you’re talking about having a robot in your home? – I’m still dreaming of Rosie the robot that’s going to take care of my domestic situation – you can’t afford for it to make mistakes. The DeepMind system is very much about trial and error on an enormous scale. If you have a robot at home, you can’t have it run into your furniture too many times. You don’t want it to put your cat in the dishwasher even once. You can’t get the same scale of data.

This is certainly true in my experience. Without lots and lots of data to learn from, the fancy machine learning/deep learning stuff doesn’t work as well (this is not to say that data is everything; many math/CS tricks contributed to the breakthroughs but lots of data is a must-have).

So is that it? In situations where we can’t have “trial-and-error on an enormous scale”, are we basically stuck?

Perhaps not. Machine learning researcher Paul Mineiro acknowledges this …

In the real world we have sample complexity constraints: you have to perform actual actions to get actual rewards.

… and suggests a way around it.

However, in the same way that cars and planes are faster than people because they have unfair energetic advantages (we are 100W machines; airplanes are much higher), I think “superhuman AI”, should it come about, will be because of sample complexity advantages, i.e., a distributed collection of robots that can perform more actions and experience more rewards (and remember and share all of them with each other).

AIs remembering and sharing with each other. That’s a cool idea.

Perhaps we can’t reduce the total amount of trial-and-error necessary for AIs to learn, but maybe we can “spread the data-collection pain” to thousands of AIs, learn from the pooled data, and push the learning back out to all the AIs and run this loop continuously. If my robot bumps into the furniture, maybe yours won’t have to.

Come to think of it, this “remembering and sharing with each other” is one of the arguments that have been put forth for how Homo sapiens evolved from their humble beginnings to today, where they can build things like AlphaGo.