Using Expected Value for Classifier Use in Business Problems

“Make $141k today!” – Data scientist to client

I’ve been reading Data Science for Business, by Provost and Fawcett, a very useful book that explains some of the most important principles and topics in data science. The authors’ language and structure help a lot in developing an intuitive understanding of key data science concepts like model tuning, model evaluation, and various models themselves like decision trees, linear models, and k nearest neighbors. I highly recommend the book if you’re someone who works with data scientists, if you’re a beginner data scientist, or even if you’re a data science expert who’s looking for a good resource to refresh your fundamentals with.

I found one chapter particularly interesting because it talks about a framework, or way of thinking, that I haven’t really heard about elsewhere. While specific tactics, such as how different kinds of models work, are definitely important and a large part of what a data scientist needs to know and be able to do, I think higher level strategy is also important. The framework is highly practical, which fits the authors’ theme for the book: that data science isn’t just about analyzing data, but also about understanding the business problem in an analytical way. I wished there were something tangible and interactive to go along with their explanations in this chapter (and others), so I decided to create a guide of sorts: this blog post plus an interactive Jupyter Notebook you can download and play with. The blog post provides context if you haven’t read the corresponding chapter in the book yet, so the Jupyter Notebook is near the end.

If you have the book already, this blog post corresponds to the latter “half” of Chapter 7, “Decision Analytic Thinking I: What Makes a Good Model?”. This guide, and especially the Jupyter Notebook, assume that the reader already has some familiarity with the basic ideas of machine learning, such as supervised learning (specifically classification), data pre-processing, holdout set testing, and model evaluation.

You’ll meet them soon

When applying data science to solve business problems: what is the real goal?

As with any sort of problem, you have to uncover what the real goal of a data analytic project is. It can be tempting to get caught up in the surface level question or jump straight into solutions.

For example, questions about customers come up a lot in business: which customers are most likely to churn? Which customers are most receptive to upselling? The idea is that once we can predict which customers are most likely to be upsold, we can call them, try to get them to buy more items like an add-on for the thingamajig they just bought, and generate more revenue for the business. Let’s run with this “upselling” case as an example.

The real business goal for answering “which customers are most receptive to upselling?” is so that we can not only generate more revenue from upselling customers, but also maximize the profit generated from our efforts. Not all customers will be equally likely to be upsold (some are curmudgeons, others might have a real need for the other products we’re selling), those who we do upsell could purchase different amounts of stuff, and the act of upselling costs us time and money (which can also be variable). So how do we even structure a problem like this, and then decide what to do?

Introduction to the expected value framework, and how it helps break down problems

Let’s introduce the expected value framework, and weave it into how we’d structure and break down our business objective for this “upselling” project.

As a quick refresher:

expected value (of a variable) – a predicted value of a variable, calculated as the sum of all possible values, each multiplied by the probability of its occurrence

Basically: what do we anticipate, or expect, the value of some variable to be, given that there is some uncertainty about which outcome will happen?
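
To make the definition concrete, here is a minimal sketch in Python. The function name `expected_value` and the fair-die example are my own illustrations, not something from the book.

```python
def expected_value(outcomes):
    """Return the sum of value * probability over (value, probability) pairs."""
    total_probability = sum(probability for _, probability in outcomes)
    if abs(total_probability - 1.0) > 1e-9:
        raise ValueError("probabilities must sum to 1")
    return sum(value * probability for value, probability in outcomes)


# Illustrative example (an assumption, not from the book): a fair six-sided die.
die = [(face, 1 / 6) for face in range(1, 7)]
die_ev = expected_value(die)
print(round(die_ev, 2))  # 3.5
```

No single roll ever comes up 3.5, which is the point: the expected value is the long-run average over many rolls, not a prediction for any one of them.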

Frame the question in terms of expected value

Back to our upselling question. Each customer has their own probability of being upsold, and a likely amount that they will be upsold for; there’s also a cost to upselling, which we may have to eat if we call a customer who doesn’t want to buy anything else from us. So, thinking in terms of expected value, each customer will have an expected profit, given that we reach out to that customer to try and upsell them. More specifically:

E(Profit) = p_u * Profit_{upsell} + (1 - p_u) * Profit_{no upsell}

Which means that, assuming we reach out to a customer, the expected value of profit (E(Profit)) equals the probability of upselling the customer (p_u) times the profit we’d get from upselling the customer, plus the probability of failing to upsell the customer (1 minus the probability of upselling the customer) times the profit we’d get from failing to upsell the customer.

Breaking out profit in each potential outcome:

E(Profit) = p_u * (v_u - c) + (1 - p_u) * (-c)

Where v_u is the value, or revenue generated, from upselling the customer, and c is the cost of trying to upsell the customer (we assume the cost is constant across customers for simplicity). Notice in the second half of the equation that if we fail to upsell the customer, the outcome is that we get $0 in revenue and eat the cost (-c) of trying.

Now, the path to reaching our original business goal, to maximize total profits, is clear: try to upsell all customers where the expected profit of trying to upsell each one is greater than 0 (assuming we don’t have any budget or constraint on how many customers we can upsell to).

Expected value breaks the problem down for us

Also, thinking in terms of expected value has now broken up the problem nicely for us: to figure out the expected profit of trying to upsell a customer, figure out (1) the probability that upselling will work p_u, (2) the value of a successful upsell v_u, and (3) the cost c of trying to upsell a customer.

Now, we can go more low level and think about how we might address each piece analytically. We can build a machine learning model, a classifier, on historical customer data of which kinds of customers were successfully upsold and which kinds weren’t, to address (1) and generate a predicted p_u, or probability that upselling will work, for each customer. For simplicity, we’ll assume that both (2) and (3) are constant across all customers, but technically, you could build another model to predict (2), the value of a successful upsell for a given customer.

More specifically, for (1), our historical customer data is a snapshot of all customers that we’ve previously tried to upsell to, at time t. One column in the data is whether or not (e.g. a 1 or -1, or 1 or 0) we were able to successfully upsell each customer by some future date t+1, say 3 months later; this is the target variable. The other columns, or features, contain data on each customer before time t, such as number of previous purchases, number of times customer has been back to our online store, shipping zip code (which we can estimate income level with), etc.
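
As a rough sketch of piece (1), the snippet below trains a classifier and reads off a predicted p_u for each customer in a holdout set. Everything about the data is an assumption for illustration: the two features, the synthetic target, and the choice of scikit-learn’s logistic regression are mine, and the notebook’s actual model may differ.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n_customers = 1000

# Hypothetical features, measured before time t.
previous_purchases = rng.poisson(3, n_customers)
store_visits = rng.poisson(5, n_customers)
X = np.column_stack([previous_purchases, store_visits])

# Synthetic target: 1 = upsold by time t+1, 0 = not upsold.
logits = 0.6 * previous_purchases + 0.2 * store_visits - 3.5
y = (rng.random(n_customers) < 1 / (1 + np.exp(-logits))).astype(int)

# Hold out a quarter of the customers to stand in for "new" customers.
X_train, X_holdout, y_train, y_holdout = train_test_split(
    X, y, test_size=0.25, random_state=0
)

model = LogisticRegression().fit(X_train, y_train)

# predict_proba returns one column per class; column 1 is P(upsold = 1).
p_u = model.predict_proba(X_holdout)[:, 1]
print(p_u[:5].round(2))
```

The important output here is the probability, not the hard yes/no label, because the expected value calculation needs p_u itself.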

Now we have a structure, thanks to EV (expected value), for evaluating whether we should try to upsell any individual customer in order to maximize company profits.


Let’s plug in some numbers to see how we might use our structure to make decisions on whether we should try to upsell a customer or not.

Take Customer A. Based off of what we know about other customers that are similar to him, our machine learning model predicts that he has a 91% chance of being upsold, if we call him.

Mr. Moneybags with a monocle and mustache

Let’s assume that if we upsell a customer, they will spend $100 to buy an add-on to the thingamajig they already bought. Let’s also assume that on average, it takes a 30 minute phone call at a salesperson’s hourly wage of $30 / hour, to try to upsell someone, so the cost of upselling is $15.

Therefore, the expected profit for trying to upsell Customer A will be:

E(Profit_A) = 0.91 * (\$100 - \$15) + 0.09 * (-\$15) = \$76

And since the expected profit is positive, it is worth it to try and upsell him, because on average (if we keep trying to upsell people like him), we will generate $76 in profits each time for the company.

Now let’s look at Customer B. Based off of what we know about other customers that are similar to her, our machine learning model predicts that she has a 4% chance of being upsold, if we call her.

“Don’t try to upsell me”

So, the expected profit for trying to upsell Customer B will be:

E(Profit_B) = 0.04 * (\$100 - \$15) + 0.96 * (-\$15) = -\$11

We should not try to upsell customers like Customer B, because on average, we will lose $11 each time.

If we do this expected value calculation for each customer we’re thinking about upselling to, we can arrive at a subset of customers where the expected profit of upselling each one is positive, and thus if we try to upsell all of them, our expected total profit will be maximized.
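
The same calculation is easy to sketch in code. Customers A and B use the numbers from above; customers C and D, and the function name `expected_profit`, are made up for illustration.

```python
def expected_profit(p_upsell, value=100.0, cost=15.0):
    """Expected profit of trying to upsell one customer.

    A success earns (value - cost); a failure earns nothing and loses the cost.
    """
    return p_upsell * (value - cost) + (1 - p_upsell) * (-cost)


# A and B come from the worked example; C and D are hypothetical.
customers = {"A": 0.91, "B": 0.04, "C": 0.20, "D": 0.10}

profits = {name: expected_profit(p) for name, p in customers.items()}
to_call = [name for name, profit in profits.items() if profit > 0]

for name, profit in profits.items():
    print(name, round(profit, 2))
print("Try to upsell:", to_call)
```

Because the expression simplifies to p_u * v_u - c, the break-even probability with these numbers is c / v_u = 15 / 100 = 0.15: anyone with a predicted p_u above 15% is worth a call, under the constant value and cost assumptions.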

See this Jupyter Notebook for a full example of training a machine learning model on historical customer data to predict whether or not a customer will be upsold, and the associated probabilities of each outcome happening. These probabilities, along with the expected value framework, are then used to show which customers we should try to upsell to maximize our company’s profit.


Note that using the expected value framework to calculate something like expected profit depends entirely on two things: the probabilities of different outcomes (e.g. a customer successfully being upsold or not) and the benefit or cost of each outcome. Both can be estimated with models and comprehensive data, but not always very well, and sometimes not at all. This is where both business and data understanding come into play: a data scientist has to understand what data is available and what it can be used for, and also understand how the business works so that accurate cost/benefit numbers can be gathered. This also means that the results of using expected value are sensitive to changes in either type of variable, probabilities or cost/benefit numbers. Though the expected value framework can be a practical and structured way to break down a business analytic problem, the data scientist may have to use other methods to inform action if they don’t have enough confidence in the probability or cost/benefit estimates. Like all things in life, there is no one size fits all approach: the EV framework is a tool in a data scientist’s big toolbox.
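
One simple way to see this sensitivity is to recompute the call/don’t-call decision over a small grid of estimates. The grid values below are arbitrary assumptions chosen for illustration; only the $100 upsell value comes from the example above.

```python
def expected_profit(p_upsell, value, cost):
    """Expected profit of one upsell attempt."""
    return p_upsell * (value - cost) + (1 - p_upsell) * (-cost)


value = 100.0                       # from the worked example
probabilities = [0.10, 0.20, 0.30]  # hypothetical estimates of p_u
costs = [12.0, 15.0, 25.0]          # hypothetical estimates of c

# decisions[(p, c)] is True when calling has positive expected profit.
decisions = {
    (p, c): expected_profit(p, value, c) > 0
    for p in probabilities
    for c in costs
}

for p in probabilities:
    row = ["call" if decisions[(p, c)] else "skip" for c in costs]
    print(f"p_u={p:.2f}:", row)
```

A customer estimated at p_u = 0.20 is worth calling at a cost of $15 but not at $25, so an error in either estimate can flip the decision for customers near the break-even point.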

Thanks for reading! I’m always open to questions, suggestions, or other kinds of feedback.

List of thought experiments for making hard life decisions

Ruth Chang – How to Make Hard Choices TED Talk

We all know how hard making decisions about our own lives can be sometimes, such as decisions about our careers or our relationships.

Here’s a list of several thought experiments I’ve come across over the years that have personally given me more perspective, making hard decision making a little bit easier sometimes. Though they’re all slightly different, they seem to operate similarly, cutting out fear and external influences to drill into what our deepest personal values are.

  1. Jeff Bezos’s regret minimization framework.
  2. David Brooks’s suggestion to ask “what do I admire?”, not “what do I desire?”.
  3. Ruth Chang’s idea that every hard choice is an opportunity to “become the authors of our own lives”. Watch her full TED Talk (15 minutes); it’s amazing.

I’m not sure if any of these will always give the “right” answer, and I also think that these thought experiments are just part of the puzzle to improve decision making about one’s own life. As Kahneman, Mauboussin, and Munger suggest, we should use a rational decision making framework or even a checklist* because humans are very prone to cognitive biases and shortcuts that can lead to bad decisions. Even as just a piece of the puzzle, these thought experiments have allowed me to think about decisions from different perspectives, which is always valuable.

Please add any other relevant thought experiments, and/or thoughts about decision making!

*I personally use a checklist similar to WRAP, which is simple to remember and covers a majority of the most common cognitive traps we can fall into. The Heath brothers describe WRAP more in Decisive. Using their terminology, the above thought experiments could belong to the “A” step of WRAP, or “attaining distance/perspective”.

Creating a stock market sentiment Twitter bot with automated image processing

One of the side projects I worked on in the past handful of months was Mr. Market Feels: a stock market sentiment Twitter bot that used automated image processing to extract and tweet the value of CNN Money’s Fear and Greed Index every day.


There have been attempts to backtest the predictive power of the Fear and Greed Index when buying and selling the overall stock market index depending on the value (the results suggest there isn’t much edge for that particular strategy). Anecdotally though, I’ve found the CNN Fear and Greed Index (what I’ll call FGI for short) to be a pretty good indicator of when this bull market has bottomed out during a short-term retracement, and when I used to have more time, I used it to trade options with decent success. Going to CNN’s website every day to check the FGI was a pain, and I also wanted the numerical values in case I wanted to run some analyses in the future, so I wondered if I could automatically extract the daily Fear and Greed Index values.


I saw this as a fun and short coding project that would help me and others while giving me practice with image processing, so I dove in.

The goal was to extract the FGI “value” and “label” from CNN’s site every day. The value of the Index is 53 and the label is “Neutral” in the snapshot of the FGI below:

Extracting the FGI value and label isn’t as easy as using OCR (optical character recognition) on the image and getting the results: for one, there is a lot of extraneous text in the image. Two: the pixel location of the value and label that we want changes as the FGI changes. Three: the relative position of the value and label also changes as the FGI changes. You can see points two and three in the image below: now, the FGI label (“Extreme Fear”) is to the top left of the FGI value (1). In the original image, the FGI label (“Neutral”) is directly right of the FGI value (53).

Why does all of this matter? Because for clean OCR, images need to be standardized. Or at least they do for Tesseract, the open source OCR engine created by Google. In Tesseract’s case, images of text shouldn’t contain any other artifacts (that the engine might try to interpret as text), should be scaled large enough, should have as much image contrast as possible (e.g. black text on white), and should be either horizontally or vertically aligned.

Most of the pre-processing of the FGI images, like the ones above, to standardize them for Tesseract was straightforward enough. Without going into way too much detail, I used the Python Pillow library to automatically convert the image to black and white, apply image masks to eliminate extraneous parts of the image–like the “speed dial” and the “historical FGI table” on the right hand side–and crop the image down to leave only the FGI value and label, like this:

Or this:

Here’s where challenge number three came up: the FGI value and label aren’t always either horizontally or vertically aligned, and this reduced Tesseract’s accuracy. For example, in the first image, the FGI label is diagonal from the FGI value. Running Tesseract OCR on it returns “NOW:[newline]Extreme[newline]Fear”, which completely misses the value “10” because of the diagonal alignment. You can try out Tesseract OCR with the above images, or with your own, here.
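
For readers who want a feel for the pre-processing described earlier (black and white conversion, masking, cropping), here is a rough Pillow sketch. It runs on a small synthetic image built in memory, not on a real FGI screenshot, and all coordinates, sizes, and the threshold are assumptions for illustration rather than the values the bot used.

```python
from PIL import Image, ImageDraw

# Synthetic stand-in for a downloaded FGI image (the real layout differs).
image = Image.new("RGB", (200, 100), "white")
draw = ImageDraw.Draw(image)
draw.rectangle([20, 30, 50, 70], fill=(40, 40, 40))     # pretend value/label text
draw.rectangle([120, 10, 190, 90], fill=(200, 30, 30))  # pretend "speed dial"

# 1. Convert to grayscale.
gray = image.convert("L")

# 2. Mask the extraneous region by painting it white (hypothetical coordinates).
ImageDraw.Draw(gray).rectangle([110, 0, 199, 99], fill=255)

# 3. Threshold to pure black and white for maximum contrast.
bw = gray.point(lambda px: 0 if px < 128 else 255)

# 4. Crop to the bounding box of the remaining dark pixels.
#    getbbox() finds non-zero pixels, so invert first to make the text non-zero.
bbox = bw.point(lambda px: 255 - px).getbbox()
cropped = bw.crop(bbox)

# 5. Scale up so the text is large enough for OCR.
scaled = cropped.resize((cropped.width * 3, cropped.height * 3))
print(bbox, scaled.size)
```

The resulting image would then be handed to Tesseract; that step is left out here because it needs the Tesseract binary installed.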

An Interdisciplinary Solution of Sorts

One solution to the challenge above was to split the resulting image into two images, one with the FGI value and a separate one with the label, so that Tesseract could be run on both and know that both images were either horizontally or vertically aligned. Basically, from a single FGI image, I wanted two images that looked like these:


In thinking about ways to implement that, I first thought about the principles of unsupervised clustering, from the field of machine learning. With clustering, the intermediate, processed FGI image could be segmented and split appropriately by finding the cluster of pixels that corresponded to the FGI value (“10”), and the other cluster of pixels that corresponded to the FGI label (“Now: Extreme Fear”).

Turns out that using the k-means clustering algorithm for image segmentation is pretty common practice.

First, a copy of the image was “pixelated” to ensure that the k-means algorithm would converge on the two correct clusters every time:

Then, applying k-means to find the centroids of the two clusters and deriving the separating partition resulted in an image that looked like this:

From there, the original black and white FGI image could be split along the partition line, which would result in the desired two images: one for the FGI value and one for the FGI label. From here, Tesseract would always have these two standardized images as inputs and would be able to cleanly extract the FGI value and label.
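
Here is a simplified sketch of the clustering idea in plain Python. It is not the bot’s code: the dark-pixel coordinates are made up, the pixelation step is skipped, and the centroids are seeded deterministically with the leftmost and rightmost pixels, which is my own shortcut for this toy layout.

```python
def two_means(points, iterations=20):
    """Cluster 2-D points into two groups with a basic k-means loop (k = 2)."""
    # Deterministic seeding (an assumption): leftmost and rightmost points.
    centroids = [min(points), max(points)]
    clusters = [[], []]
    for _ in range(iterations):
        clusters = [[], []]
        for x, y in points:
            distances = [(x - cx) ** 2 + (y - cy) ** 2 for cx, cy in centroids]
            clusters[distances.index(min(distances))].append((x, y))
        new_centroids = [
            (sum(x for x, _ in c) / len(c), sum(y for _, y in c) / len(c))
            if c else centroids[i]
            for i, c in enumerate(clusters)
        ]
        if new_centroids == centroids:
            break  # assignments have stopped changing
        centroids = new_centroids
    return centroids, clusters


def bounding_box(cluster):
    """Return (left, top, right, bottom) of a cluster, usable as a crop box."""
    xs = [x for x, _ in cluster]
    ys = [y for _, y in cluster]
    return min(xs), min(ys), max(xs) + 1, max(ys) + 1


# Made-up dark pixels: a "value" blob at the lower left and a "label" blob
# at the upper right, i.e. diagonal from each other.
value_pixels = [(x, y) for x in range(2, 6) for y in range(6, 10)]
label_pixels = [(x, y) for x in range(12, 20) for y in range(1, 4)]

centroids, clusters = two_means(value_pixels + label_pixels)
boxes = [bounding_box(c) for c in clusters]
print(centroids)
print(boxes)
```

Assigning each pixel to its nearest centroid is equivalent to splitting along the perpendicular bisector between the two centroids, which plays the role of the partition line described above.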


Finally, I put the script onto a web server, told a cron job to run it daily, and hooked it up to Twitter’s APIs to automatically post to Mr. Market Feels, named after Ben Graham’s moody Mr. Market.

I just finished reading Poor Charlie’s Almanack (an amazing book full of wisdom and life principles) so Charlie Munger’s multidisciplinary approach to life is on my mind. Though this project was probably a little less multidisciplinary than he means because machine learning and image processing are closely related fields, I still saw it as an example of how broad and varied knowledge and skills can come together to solve a problem effectively. To quote Munger on specialized knowledge: “To the man with only a hammer, every problem looks like a nail.”

Technologies used:

Learning from machine learning: ensembling, and other important skills

In my downtime, I’ve been using Kaggle to get better at applying machine learning to solve problems. The process is not only teaching me new technical skills, but also reminding me of some useful principles that can be applied elsewhere. To keep things digestible, this is the second post of two (the first one is here).

A short list of important skills for a data scientist

When trying to get better at a skill, I try to tackle the highest leverage points–here’s what I’ve been able to gather about three skills that are important in being a data scientist*, from talking with others and reading about machine learning, and experiencing it firsthand with the client projects I do.

  1. Feature engineering
  2. Communication (includes visualization)
  3. Ensembling

The first two are relatively self-explanatory; ensembling brings some pretty interesting concepts that apply to decision-making, in my opinion.

*I’ll be referring to the “applier of machine learning” aspect of “data science”.

Feature engineering

Feature engineering is the process of cleaning, transforming, combining, disaggregating, etc. your data to improve your machine learning model’s predictive performance. Essentially, you’re using existing data to come up with new representations of the data in the hopes of providing more signal to the model. The related practice of feature selection removes less useful features, thus feeding the model less noise, which is also good. The practitioner’s own domain knowledge and experience are used a lot here to engineer features in a way that will improve the model’s performance instead of hurting it.

There are a few tactics that can be generally applied to engineer better features, such as normalizing the data to help certain kinds of machine learning models perform better. But usually, the largest “lift” in performance comes from engineering features in a way that’s specific to the domain or even problem.

An example is using someone’s financial data to predict likelihood of default, on a loan for example. You might have the person’s annual income and monthly debt payments (e.g. for auto loans, mortgages, credit cards, the new loan they’re applying for), but those somewhat closer to the lending industry will tell you that a “debt to income ratio” is a better metric for predicting default, because it essentially measures how capable the person is of paying off their debt, all in one number. After calculating it, a data scientist would add this feature to the training data, and would likely find that their machine learning model performs better at predicting default.
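
As a small sketch of that feature, assuming the ratio is defined as monthly debt payments divided by monthly income (lenders may define it slightly differently), with made-up applicants and column names:

```python
def add_debt_to_income(row):
    """Return a copy of the row with a debt-to-income ratio feature added."""
    monthly_income = row["annual_income"] / 12
    return {**row, "debt_to_income": row["monthly_debt_payments"] / monthly_income}


# Hypothetical applicants with the same debt payments but different incomes.
applicants = [
    {"annual_income": 60000, "monthly_debt_payments": 1500},
    {"annual_income": 120000, "monthly_debt_payments": 1500},
]

engineered = [add_debt_to_income(row) for row in applicants]
for row in engineered:
    print(row)
```

The two applicants carry identical debt payments, but the ratio (0.30 vs. 0.15) shows in a single number that one of them is stretched twice as thin.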

As such, feature engineering (and in fact, most of machine learning) is more of an art than a science, where a creative spark for an innovative way to engineer a domain specific feature is more effective than hard and fast rules. They say feature engineering can’t be taught from books, only experience, which is why I think Kaggle is in an interesting position because they’re essentially crowdsourcing the best machine learning methodologies for all sorts of problems and domains. There’s a treasure trove of knowledge on there, and if structured a little better, Kaggle could contribute a lot to machine learning education.


What potentially useful features/data could we engineer from timestamp strings? We could generate year, month, day, day of week, etc. numeric data columns–much more readable by a machine learning model.
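
A minimal sketch using only the standard library; the timestamp format and the choice of features (including the `is_weekend` flag) are assumptions for illustration.

```python
from datetime import datetime


def timestamp_features(timestamp, fmt="%Y-%m-%d %H:%M:%S"):
    """Split a timestamp string into numeric columns a model can use."""
    dt = datetime.strptime(timestamp, fmt)
    return {
        "year": dt.year,
        "month": dt.month,
        "day": dt.day,
        "day_of_week": dt.weekday(),  # Monday = 0 ... Sunday = 6
        "hour": dt.hour,
        "is_weekend": int(dt.weekday() >= 5),
    }


features = timestamp_features("2016-05-27 17:40:51")
print(features)
```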

Communication

During a recent chat with one of the core developers of the Python scikit-learn package, I asked what he thought some of the most important skills for a data scientist are. I sort of expected technical skills, but one of the first things that came up was communication, or being able to convey findings and why those findings matter to both internal and external stakeholders, like customers. This one’s self explanatory–what good is data if you can’t act upon it?

In fact, it seems like communicating well might be even more important for data scientists than it is for professions like programmers or designers, because there’s a larger gap between result and action. For example, with a design or app, a decision maker can look at it or play around with it to understand it reasonably well and make a decision, whereas a decision maker usually can’t just see a bunch of numbers that were spit out by a machine learning model and know what to do: how are those numbers actionable, why should someone believe those numbers, etc. Visualization is a piece of this, as it’s choosing the right charts, design, etc. to communicate your data’s message most effectively.

Ensembling

In machine learning, an ensemble is a collection of models that can be combined into something that performs better than the individual models.

An example: one way this is done is via the voting method. The different base, or “level 0”, models each make a prediction on, say, whether a person is going to go into default in the next 90 days. Model A predicts “yes”, model B predicts “yes”, and model C predicts “no”. The final decision then becomes the majority vote, here “yes”.
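
The voting example above fits in a few lines of Python. The model names are placeholders, and the tie-breaking behavior (the first label encountered wins) is a side effect of `Counter`, not a recommendation.

```python
from collections import Counter


def majority_vote(predictions):
    """Return the most common prediction among the base models."""
    return Counter(predictions).most_common(1)[0][0]


# Predictions from three hypothetical level 0 models for one person.
votes = {"model_a": "yes", "model_b": "yes", "model_c": "no"}
final_decision = majority_vote(votes.values())
print(final_decision)  # yes
```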

There are many other ways of ensembling models together. An important and powerful one is called stacking, which applies another machine learning model–called a “generalizer”, or “level 1” model–to the predictions of the base models themselves. This often works better than the voting method because you’re letting the level 1 machine learning model decide which level 0 models to believe more than others based on the training data you feed into the system, instead of arbitrarily saying “the majority rules”.



A high level flow chart of how stacking works.
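
As a hedged sketch of stacking, the snippet below uses scikit-learn’s `StackingClassifier` on a synthetic dataset. The choice of base models (a random forest and k nearest neighbors), the logistic regression generalizer, and the dataset are all assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic binary classification data standing in for, say, default data.
X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Level 0: the base models.
level0 = [
    ("forest", RandomForestClassifier(n_estimators=50, random_state=0)),
    ("knn", KNeighborsClassifier()),
]

# Level 1: the generalizer, trained on cross-validated level 0 predictions.
stack = StackingClassifier(
    estimators=level0,
    final_estimator=LogisticRegression(),
    cv=5,
)
stack.fit(X_train, y_train)

accuracy = stack.score(X_test, y_test)
print(round(accuracy, 3))
```

The `cv=5` setting matters: the level 1 model is trained on out-of-fold predictions, so it learns how much to trust each base model on data that model has not already seen.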

Ensembling is a key technique in machine learning to improve predictive performance. Why does it work? We all have an intuitive understanding for why it should work, because it’s a decision making framework we all have probably used, or been a part of, before. Different people know different things, and so may make different decisions given a particular problem. When we combine them in some way–like a majority vote in Congress or at the company we work at–we “diversify” away the potential biases and randomness that come from just following one decision maker. Then, if you add in some mechanism to learn which decision makers should have their decisions weighted more than others based off of past performance, the system can become even more predictive–what areas could benefit from this improved, performance based decision-making process?*

*Proprietary trading companies, where every trade is a data point and thus generated very frequently, do this more intelligent way of ensembling, in a way, by allocating more money to traders who’ve performed better than others historically. A trader who is maybe slightly profitable but makes uncorrelated trades–for example by trading in another asset class–will still be given a decently sized allocation, because his trades hedge other traders’ trades, thus improving the overall performance of the prop trading company. Analogously, in machine learning, ensembling models that make uncorrelated predictions improves overall predictive performance.
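
A quick calculation illustrates why uncorrelated predictions help. It assumes an idealized case that real models rarely meet: each model is right 70% of the time (an arbitrary number) and their errors are fully independent.

```python
from itertools import product


def majority_accuracy(accuracies):
    """Probability that a majority of independent models is correct."""
    total = 0.0
    # Enumerate every combination of right (True) / wrong (False) models.
    for outcome in product([True, False], repeat=len(accuracies)):
        probability = 1.0
        for correct, accuracy in zip(outcome, accuracies):
            probability *= accuracy if correct else 1 - accuracy
        if sum(outcome) > len(accuracies) / 2:
            total += probability
    return total


ensemble_accuracy = majority_accuracy([0.7, 0.7, 0.7])
print(round(ensemble_accuracy, 3))  # 0.784
```

Three independent 70% models vote their way to about 78%. If the three models always made the same mistakes, the vote would stay at 70%, which is why diversity among base models matters as much as their individual accuracy.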


Here are some resources related to the topics described above that were recommended to me and that I found most useful. I hope they’re helpful to you too.

  • A good overview of the principles of data science and machine learning for non-technical and technical folk alike: Data Science for Business
  • Code example of stacking done with sklearn models
  • An important thing for a data scientist to have before any of the stuff above is a good understanding of statistics; Elements of Statistical Learning is a detailed survey of the statistical underpinnings of machine learning.

Learning from machine learning: deliberate practice

In my downtime, I’ve been using Kaggle to get better at applying machine learning to solve problems. The process is not only teaching me new technical skills, but also reminding me of some useful principles that can be applied elsewhere. To keep things digestible, this is the first post of two.

Deliberate practice, with Kaggle

Deliberate practice–practice that is repeatable, hard, and has fast feedback (e.g. with a coach)–is needed to master any skill. Kaggle provides a great medium for machine learning deliberate practice: you can still solve the problems from old competitions, read about what the top performers did, and get instant feedback on how well your machine learning model performed vs. other people’s.

Aside from accessible deliberate practice, self-learning this way has another big benefit over some of the in-person data science/machine learning classes I’ve observed: the student has control. I can learn as fast or as slow as I need to. I can learn about what I want: not only about what I find most interesting, but about what the top performers on Kaggle and other experts are doing to be successful.

I attempt to solve a machine learning problem on Kaggle, see how I performed, read about and take notes on what the top performers did, and fill in my knowledge gaps with lots of research on Google, continuously cycling between writing down questions about new terms or concepts that come up and answering them. The self-paced, deliberate nature of this learning avoids what Sal Khan calls “Swiss cheese gaps” in education–though of course, it is up to the learner him/herself to stay disciplined and engaged.

The “cycle” of deliberate practice described. Important things to note: it is closed, which allows for learning from feedback, and it is fast, which allows for that learning to happen quickly and to be timely.

Something like Khan Academy provides a great structure for self-paced, deliberate-practice-oriented learning for more “traditional” academic topics. I see opportunity for more things like it in other educational areas. Also, if anyone has found any helpful tools for self-learning, I would love to hear about them. I personally use a lot of Google Docs for note-taking, mind42 for topic hierarchies, pinboard to keep track of my online research, and sometimes Quizlet to help me memorize things.

Next: 80/20-ing machine learning

In the next post, I will get slightly more technical and into some of the “highest leverage” machine learning concepts and skills, as well as share some resources (including advice from one of the most helpful machine learning educators and practitioners I’ve had the pleasure to interact with). There should also be at least one principle/mental model for those less interested in the technicals of machine learning. As always, please be critical and feel free to discuss anything and everything, I love learning from other perspectives.

My Attempt To Make Clinical Trials More Efficient

cr net screenshot

For a few months, on nights and weekends while working at my most recent job, I worked on a project to help make clinical trials more efficient, and even built a prototype (the screenshot above, you can play around with it here)–I gave it the memorable and exciting name “Clinical Research Network”.

Though my project didn’t “succeed” in the traditional sense, I learned a lot about this interesting area of health/biotech, and got to practice several important product development skills. The following are the important parts of my story, but warning, it’s still a long post.

Clinical trials have a hard time recruiting enough patients, which causes a lot of waste.

I received an email from HeroX one day about a competition to see who could come up with the best idea to help clinical trials recruit more patients. Intrigued, I did more research on the problem, and decided to enter the competition: worst case I would spend a little time writing a proposal that didn’t win, but still get to learn more about this fascinating problem.

As discussed in a previous post, roughly 10% of clinical trials terminate unsuccessfully because they’re unable to recruit enough patients for the study. There are roughly a thousand new clinical trials every year, and since a clinical trial costs on average $30M-$40M, a lot of money is spent on clinical trials that don’t end up contributing much to the advancement of science and medicine.*
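
As a back-of-the-envelope check using only the rough figures quoted above (this is not the Google Sheets model mentioned later, and the variable names are mine):

```python
# Rough figures from the paragraph above.
failure_rate = 0.10            # share of trials that fail to recruit enough patients
trials_per_year = 1000         # new clinical trials per year
cost_low, cost_high = 30e6, 40e6  # average cost per trial, in dollars

failed_trials = failure_rate * trials_per_year
wasted_low = failed_trials * cost_low
wasted_high = failed_trials * cost_high

print(f"~{failed_trials:.0f} failed trials per year")
print(f"~${wasted_low / 1e9:.0f}B to ${wasted_high / 1e9:.0f}B spent on them per year")
```

That puts the rough order of magnitude at $3B to $4B a year, under the simplifying assumption that a terminated trial costs as much as an average one.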

The HeroX competition’s more quantifiable goal was to come up with ideas that could double the patient recruitment rate from 3% to 6%, patient recruitment rate being defined as number of patients who participate in clinical trials / total number of patients out there. The more patients participate in clinical trials, the faster medical research accelerates.

*The numbers used to “size up” the problem are very rough, and taken from various sources. My model also did not account for the fact that a lot of clinical trials that do complete successfully still have trouble recruiting patients fast enough, so go way over-schedule and over-budget. But the order of magnitude should be close. See the model for more details.

Questioning assumptions, asking why

The problem was framed so that solutions tackling recruitment came to mind first, e.g. increasing patient awareness of clinical trials through tools, advertising, etc., or connecting patients to clinical trials automatically by leveraging EMR data.

But I wanted to understand the problem at a deeper level, rather than taking things at face value. I put together a simple model in Google Sheets and let the numbers shed some light on the problem. Interestingly, even if all clinical trials were able to recruit enough patients with a wave of a magical wand, the patient recruitment rate would only increase by 4%, much less than the competition’s desired 100% increase, or doubling, of the patient recruitment rate. This suggests that if we really want to accelerate medical research and get more of the patient population to participate in clinical trials, we’re not only going to need to recruit patients better, but we’ll also need a lot more clinical trials, and clinical trials that happen faster and more efficiently.

Screenshot of Patient Recruitment Model

I wrote a proposal for the competition, submitted it, and…

What idea did I submit?

An idea for a SaaS product that would mine/learn from all the data we have on previous clinical trials (a lot of it public), and help pharmaceutical companies and investigators learn from the past. This product would essentially be a search engine on top of a “similarity graph”, where pharma and/or doctors/investigators could describe their clinical trial, and see other trials that were similar in some way (perhaps disease treated, or similar inclusion/exclusion criteria), and learn from what made those clinical trials succeed or fail.

Why did I submit that?

  1. There’s a lot of data out there on clinical trials, some of it publicly available. There has to be some sort of knowledge we can learn from all the clinical trials we’ve already conducted, from both the successes and failures.
  2. Clinical trials face many different obstacles to recruiting patients, mostly because they themselves are very different–different populations, different diseases, different treatments, different investigators running the trial, different locations. But this doesn’t mean that trials aren’t similar to other trials in some way, so something that worked for one trial could also work for another, depending on how they’re similar.
  3. As mentioned before, I realized that the actual clinical trial process needs to be faster, more efficient, and cheaper to drive a meaningful acceleration of medical research. This was a tool that pharma and investigators/doctors could use to both plan and run a clinical trial more efficiently.

My idea didn’t win any of the prizes for the competition, but that’s ok.

If interested, you can see the winning entries (as well as the “top 10”, not sure where all the other entries went).

Getting out of the office

I asked for feedback on how my entry was judged, but didn’t get anything back. Still following my curiosity for the problem, I decided to talk to more people actually involved in clinical trials–I had originally found out about the competition two weeks before the deadline, so given some more time I felt I could come up with something more useful.

I developed a script to scrape for investigator contact info, and was able to gather a good list of physicians in the NYC area. I also used Mechanical Turk to fill in what I wasn’t able to scrape, such as a doctor’s research institution. After writing a bunch of emails to request to meet, one doctor actually got back to me! After that it was a bit easier, as I would ask the doctors if they knew anyone else I could talk to, and also name-drop the institutions I had visited already. I got to speak to a couple ex-pharma individuals from this effort too.

The two biggest things I learned from speaking to the handful of physicians and ex-pharma folk:

  1. Physicians don’t really talk to and learn from each other when it comes to clinical trials, e.g. about patient recruitment best practices. They’re extremely busy, and there isn’t really an incentive to help another physician who may be seen as a “competitor” (both in terms of revenue and research).
  2. Though investigators (physicians) recruit patients for a clinical trial, pharma and “contract research organizations” (CROs) recruit the investigators to run a clinical trial (among a ton of other stuff to set up and support the trial). It seemed that industry’s methods for investigator selection were pretty manual: they would rely on their own personal, immediate networks, maybe look at which investigators they worked with in the past.

Building something fast

I decided to build an MVP that was based on my learnings. There’s a lot that can be improved in the clinical trials process, so I thought about leverage, and a decision tree: decisions made earlier in a process can have a big impact on the decisions made later. This early task of “investigator selection” that pharma does when setting up a clinical trial (point 2) sounded like a good one to try and tackle with technology. It also isn’t something that investigators themselves are super concerned with, which would get around the obstacles discovered in point 1. There’s a lot of public data out there on clinical trials and on the research that came out of the trials (PubMed), so I wanted my tool to leverage this data.

I threw together something really quickly using Flask, the Python framework. Use cases: pharma could type in a drug and find the researchers who published the most research on that drug–those physicians might be good candidates as investigators for a clinical trial that used that drug (to perhaps treat a different disease). Patients could type in the disease they had and find the physicians who were perhaps the most knowledgeable on that disease. On the backend, data was scraped from PubMed, and essentially just restructured to be more useful for this particular case.
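The "find the researchers who published the most on a drug" lookup can be sketched in a few lines. This is only an illustration of the idea, not the code behind the tool: the records, the field names (`drug`, `authors`), and the `top_researchers` helper are all made up for the example.

```python
from collections import Counter

# Hypothetical records standing in for restructured PubMed data; the field
# names ("drug", "authors") are assumptions, not the original schema.
PUBLICATIONS = [
    {"drug": "drug_a", "authors": ["Dr. Lee", "Dr. Patel"]},
    {"drug": "drug_a", "authors": ["Dr. Lee"]},
    {"drug": "drug_a", "authors": ["Dr. Lee", "Dr. Gomez", "Dr. Patel"]},
    {"drug": "drug_b", "authors": ["Dr. Gomez"]},
]


def top_researchers(publications, drug, limit=10):
    """Return (author, publication_count) pairs for one drug, most prolific first."""
    counts = Counter()
    for publication in publications:
        # Case-insensitive match on the drug name typed into the search box.
        if publication["drug"].lower() == drug.lower():
            counts.update(publication["authors"])
    return counts.most_common(limit)


ranking = top_researchers(PUBLICATIONS, "drug_a")
print(ranking)
```

A Flask route would then just pass the search term to a function like this and render the ranking.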

I started showing the “Clinical Research Network” to people in the biotech space to see what they thought…

The end?

…and I quickly found out that several companies, both small and large, were tackling this exact problem. They had way better credentials, more money, and free snacks at the office–how can I compete with free snacks?

So I put this project on hold, mulled over the possibility of working for them, and decided to move onto other ideas I was thinking about. I like writing post-mortems for my projects, and one of the biggest learnings was that I seemed to have “overextended” myself in a sense: I felt like my struggle was a very steep uphill climb from the beginning because I didn’t have the industry credentials and I didn’t yet have the industry network, very important aspects in an industry like biotech and healthcare.

Overall, the project was a great learning experience, and I got to practice several problem solving skills I find powerful and fun.

Pharma Paid Physicians $6.5B in 2014 – Looking Into The Open Payments Dataset

My friend Jesse introduced me to the Open Payments Dataset, which tracks the details of all payments made by “applicable” healthcare manufacturers (like pharmaceutical companies, medical device manufacturers) to any doctor they work with. A federal program maintains this database, which is a product of the Sunshine Act, part of the Affordable Care Act.

Why does this database exist? Basically because of the incentives created by industry being able to pay doctors to work on things that will ultimately help industry–like new drugs or medical devices. The hope is that more transparency will reduce any harmful influence that industry could have on medical research, education, and clinical decision making. In the words of Senator Grassley, co-author of the Sunshine Act:

Disclosure brings about accountability, and accountability will strengthen the credibility of medical research, the marketing of ideas and, ultimately, the practice of medicine. The lack of transparency regarding payments made by the pharmaceutical and medical device community to physicians has created a culture that this law should begin to change substantially. The reform represented in the Grassley-Kohl Sunshine Law is in patients’ best interest.

The healthcare industry pays physicians a lot, almost $6.5B in 2014 alone. What is being paid for though (or, what does industry report the payments are for)? Who’s getting paid, and how much? I decided to do a quick analysis to start answering these questions and to see if there was anything interesting at a high level.

Most top paid physicians get paid royalties or license fees

The most a single physician got paid in 2014 was almost $44M. The interesting thing is that for this physician and several other top paid physicians, almost the entire total came from payments that were categorized in this unhelpfully-named category, “Compensation for services other than consulting, including serving as faculty or as a speaker at a venue other than a continuing education program” (orange).

A large majority of the other top paid physicians got paid primarily from “Royalty or License” (green), which makes sense: a surgeon may invent a new surgical technique and license it to a medical device company.

Another interesting phenomenon is that a handful of doctors in the top 100 earners were paid by industry solely for their research (purple). The status quo of industry having all the money and thus paying/funding research–sometimes both the design of and execution of the research–can create incentives with negative consequences for the validity of the results.

You can play around with the charts like the one below by zooming, mousing over data points to see their values, and showing/hiding different data series by clicking on each one in the legend. Physician names have been replaced with numbers for anonymity.

Chart embedded below, or link

Orthopedic surgeons received the most industry payments, followed by cardiovascular physicians

Orthopedic surgeons received the most money from industry, almost twice the amount that cardiovascular physicians received, in 2014. Interestingly, most of the payments to orthopedic surgeons, and other types of surgeons, were for royalties or licenses (green), whereas most payments for physicians–cardiovascular and otherwise–were for “Compensation for services other than consulting” (orange), “Research” (purple), and “Consulting” (purple).

Click to show interactive chart (some labels are crazy long so embedding didn’t look good. “A&O” stands for “Allopathic & Osteopathic Physicians”):
Payment Received by Physician Specialty in 2014 (Top 50)

The healthcare industry pays a lot of money for research

Out of the $6.5B total payments to physicians in 2014, $3.2B, or almost half, of those payments were for research. We can see this when aggregating the payments by the name of the drug or device manufacturer: companies like Genentech, Pfizer, and Novartis dominate the dollar amount of payments made to physicians, and most of their payments are for “Research” (brown). Further down the line, you can see medical device manufacturers like Stryker and Medtronic paying physicians mostly for “Royalty and License” (green).

Click to show interactive chart:

Payment Sources in 2014 (Top 50)

Physicians in CA received, by far, the most money from industry.

The graph below shows how much money physicians received for research and “general” payments (any payment that isn’t classified as “Research”), grouped by the state they work in; the size of each bubble represents the number of physicians in that state.

CA had significantly more physicians receive payments (8081) than the runner-up state, NY (5981), and thus the physicians that worked in CA received a lot more money from industry, in aggregate.

Payments Received by State
Though drilling into state by state differences in the data (e.g. the dominant “purpose” CA physicians vs. physicians in other states get paid for) is an exercise for another time, we get a hint for why this phenomenon might exist by looking at the teaching hospitals that were affiliated with the physicians who got paid by industry the most.

Click to show interactive chart:

Payment Sources in 2014 (Top 50)

Physicians affiliated with the City of Hope National Medical Center in Los Angeles received the most industry payments, by far, and almost all of it from royalties or license fees (green). Genentech has been known to pay massive royalties for the drugs developed at City of Hope, including the crazy expensive cancer treatments Herceptin and Avastin.

Do physicians get rewarded with fancy dinners and extravagant trips?

By looking at the data, we can find which physicians got paid the most for “Entertainment”, “Food and Beverage”, and “Travel and Lodging”. But we won’t know for sure, because remember, all this payment data is reported by the healthcare industry themselves, and while there are some financial penalties for inaccurate reports, I don’t see an easy way for the government to verify the validity of the data.

The “worst offenders” were essentially given, by industry, $60 meals three times a day for every day of the year, went on $590 per day trips, and spent $43 a day (about $300 a week) on entertainment and fun. Sounds like the life (except a little more on the entertainment and fun please).


There’s a lot of money being transferred from the healthcare industry to physicians, which means a ton of data since all of this has to be reported now. In fact, I didn’t even touch another part of the dataset, how much ownership each physician has in a particular drug or device manufacturer, which could give even more color on misaligned incentives. Also, without aggregation of some of the data fields, the raw, transaction/payment level data took up close to 6GB of space, and I didn’t want to spin up a Spark cluster or something. Luckily, the Open Payments site provides a web service that allowed me to aggregate and filter the raw data, dramatically reducing the dataset’s size.
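To give a feel for why aggregation shrinks the data so much, here is a minimal pandas sketch of rolling payment-level rows up to one row per state and payment category. It runs locally on a tiny made-up table rather than through the Open Payments web service, and the column names are illustrative assumptions, not the real schema.

```python
import pandas as pd

# Hypothetical payment-level rows; column names are illustrative and do not
# match the real Open Payments schema.
payments = pd.DataFrame({
    "physician_id": [1, 1, 2, 3, 3, 3],
    "state": ["CA", "CA", "NY", "CA", "CA", "CA"],
    "nature_of_payment": [
        "Royalty or License", "Food and Beverage", "Consulting",
        "Research", "Research", "Food and Beverage",
    ],
    "amount_usd": [1000.0, 60.0, 500.0, 2000.0, 1500.0, 40.0],
})

# Collapse to one row per (state, nature of payment) instead of one row per
# payment, keeping the total dollars and the number of distinct physicians.
by_state_and_nature = (
    payments
    .groupby(["state", "nature_of_payment"], as_index=False)
    .agg(total_usd=("amount_usd", "sum"),
         physicians=("physician_id", "nunique"))
    .sort_values("total_usd", ascending=False)
)
print(by_state_and_nature)
```

The same grouping keys (state, specialty, manufacturer, payment category) are the ones the charts in this post are built on.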

With the Sunshine Act being first introduced in 2007, then shot down, then enacted as part of the ACA in 2010, and with the Centers for Medicare and Medicaid Services (CMS) now responsible for collecting this data on top of everything else it does, hopefully we find some useful applications for the Open Payments dataset.

This analysis and post were done pretty quickly, many thanks to Carol for giving me some immediate ideas and feedback! And to iPython Notebook, and the pandas and plotly libraries.

Learnings from being on my own

Ernest was a baller.

I’m always looking to learn and grow as much as I can, and so am now working for myself. I’m currently consulting for other businesses, doing product development and/or data analysis, since I have a generalist software + statistics background. I see it as a great way to work with different, awesome people, on different problems, while learning about different industries: it’s a way for me to take lots of little bets in my journey of doing interesting things, finding my passion/what I want to focus on, and becoming the best version of myself.

Here are some of the biggest things I’ve learned so far, even though it’s only been a short amount of time. Hopefully they are helpful and mostly generalizable, but everyone’s life is different so your mileage may vary.

1. Reflect on when in your life you’ve felt happiest and most fulfilled.

I looked back on my life and thought about when I really felt the most alive, happy, and fulfilled. For me, it came down to experiences where I manifested my dreams, despite any perceived risk. Of course, I could not have done it without the help and support of friends and family and partners-in-crime–I feel life is so much less meaningful without others–but it was not being dependent on anyone but myself in taking action to maximize my potential that made me feel fulfilled*.

For example, one of the first pieces of software I ever developed was a math flashcards application built in Visual Basic, with cheesy cartoon characters and everything. As a middle schooler who had just learned how to program, I was super proud of it and really excited whenever I got to work on it, because I had come up with the idea and it was up to me to manifest and build my own “dream”.

Another time when I felt happy and fulfilled was the period of a year or two of learning how to pick up girls. That itself is a story for another time, but again, I loved the experience of facing and overcoming perceived risk, via action, to become the best version of myself. There’s no doubt that I felt a lot of discomfort in a countless number of situations. But, especially in situations where the perceived risk is high but the real risk is low, the pain of regret usually hurts more than the pain of failure.

As a result, my overarching goal in life is to maximize the time I spend on these types of experiences.

What experiences have made you feel the most fulfilled in life?

2. Think about death.

Jeff Bezos, Steve Jobs, the ancient Stoics, and many others have used the tactic of thinking about death when examining life.

I like Bezos’s thought experiment the best for decision making, and I use it all the time: visualize that you are old and on your deathbed–would you regret having made decision A vs. decision B (vs. decision C, etc.)?

We all die someday. The inevitability is out of our control. So why not try to live the best life you can live?

3. Do things that make you happy, every day.

About a week after leaving my job, one random day, I felt like I was in a deep rut: negative emotions like fear and self-doubt were spiraling out of control in my head. I needed to change things up–being in such a bad mood wasn’t moving me forward in life at all.

Taking 10 minutes to meditate helped (Tara Brach has some great guided meditations, Headspace is also great for beginners).

I hadn’t listened to any music in several days, so I put on some EDM, changed my environment a little, and cranked on work for a bit at a coffee shop. Those of you who’ve worked in a library and/or coffee shop before, it’s strangely motivating isn’t it?

I went to the gym in the late afternoon, which also helped because it took my mind off negative emotions and gave me a sense of progress.

Later that night, I went to an event and met new people. It was great to put myself in their shoes for a little and understand what they’re up to, and what they care about most.

Thanks for reading!

The new journey has only just begun, but those are the practices and mindsets I’ve implemented that have helped me so far. As always, advice is useless if you don’t internalize it, make it part of your mindset, and practice it.

Have a safe and relaxing holiday season!


*Reminds me of Rand’s Objectivism, I guess

What I learned from my side project in education technology: Formata

Screenshots of what the student would see, taken from the deck I sent to teachers.

Last winter, I built an MVP for an ed-tech product, called Formata. Here’s what it was, why I did it, and what I learned from it.

Why Education

I had been (and still am) trying little side projects in different industries because I like learning about and understanding new things. At the time, I had done some stuff in productivity and fintech, and I knew I wanted to have an impact on education eventually in my life. It’s been so influential on me and is a huge lever to get us closer to what I call “opportunity equality” worldwide, so I decided to do a small project in education this time.

Principles of Educational Impact

I did a little thought experiment: I imagined myself as a middle school kid again, and thought about what influenced me the most, in my education. “My teachers” was the answer. Students spend the majority of their week day in school, and it’s the teachers that interact with them, and understand each and every child. I saw it first hand on a farm on the other side of the world: way more than the facilities and the curriculum, it’s the teacher that inspires the student and really has an impact on him or her.

Next, I asked, “Ok, so if teachers have the most impact on a child’s education, what makes a good teacher? What does “good” even mean? And how do you measure it?” I did some research, and came across the Gates Foundation’s Measures of Effective Teaching project, a project backed by hundreds of millions of dollars and pursuing these exact questions. Awesome!

Some more research led me to the interesting and sometimes controversial world of teacher evaluation. Traditionally, teachers have been evaluated by two methods: student test scores (also known as “value added”), and observations by someone like the principal. The thought is basically that student test scores, as the outcome of a teacher’s teaching, should correlate with his or her teaching ability. Sometimes, administration has a rubric for what they think makes a teacher good, and so a few times a year, the principal might sit in on a class for 15 or so minutes to observe and evaluate the teacher.

There are some fundamental issues with both methods, which I’ll mention briefly. It’s hard to see the principal observing each teacher a few times a year, for 15 minutes, having any strong relationship with how good the teacher actually is. The Gates Foundation has done research that shows that teacher observations are less reliable than test scores; however, tests on which teachers are usually evaluated (usually state-wide standardized ones) only happen once every year, and if they know this is tied to their employment, there’s a strong incentive to “teach to the test”.

Who interacts with teachers the most? Who would be best at evaluating them? The students themselves. Again, the Gates Foundation did a bunch of research on what exactly students should evaluate teachers on, sort of quantifying the aspects of a good teacher. They narrowed the most important characteristics down to what they called the “7 C’s”: caring, control, captivate, clarify, confer, consolidate, and challenge. Structured in the right way (e.g. low-stakes and anonymized, so the students aren’t incentivized to fudge), student perception questionnaires that asked about these characteristics were pretty reliable in discerning high performing teachers from the rest.

Building A Product

I noticed that in the Gates Foundation’s research, the student perception surveys were being administered with pen, paper, envelopes, stickers, etc. I felt like the surveys could be administered much more efficiently with technology; the results could also be tabulated and organized much better for teachers and administrators to learn from.

To further validate my idea, I went to a bunch of ed-tech meet-ups, talking to teachers and asking them what they thought about my idea. They all agreed that having more feedback, more frequently, on their teaching would be helpful.

I thought this was a pretty quick MVP to build, I could even do some of the analysis of feedback for the teachers manually myself at first. All the teacher would have to do was give me the email addresses of his/her students, and I could auto-generate emails and questionnaires, send them off, and aggregate the results.
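The "aggregate the results" step could look something like the sketch below, which averages anonymous ratings for each of the 7 C's. The 1 to 5 rating scale, the response format, and the `summarize_responses` helper are assumptions made for illustration; the MVP's actual questionnaire format isn't described here.

```python
from collections import defaultdict
from statistics import mean

SEVEN_CS = ["caring", "control", "captivate", "clarify",
            "confer", "consolidate", "challenge"]


def summarize_responses(responses):
    """Average each of the 7 C's across anonymous student responses.

    Each response maps a category to a 1-5 rating (an assumed scale).
    Questions a student skipped are simply left out of the average.
    """
    ratings = defaultdict(list)
    for response in responses:
        for category, rating in response.items():
            if category in SEVEN_CS:
                ratings[category].append(rating)
    return {c: round(mean(ratings[c]), 2) for c in SEVEN_CS if ratings[c]}


# Two hypothetical students; no names or emails are stored with the ratings.
responses = [
    {"caring": 5, "clarify": 3},
    {"caring": 4, "clarify": 2, "challenge": 4},
]
summary = summarize_responses(responses)
weakest_area = min(summary, key=summary.get)
print(summary, weakest_area)
```

The lowest-scoring category is the kind of thing a per-teacher visualization could highlight.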

Visualizations of student feedback I could generate for teachers, so they could pinpoint where to work on

Moving On

After a month of reaching out to teachers, those who I already met or knew and also those who I didn’t, and sending them my slide deck about Formata and its benefits, I finally got a few who said they were willing to try it. They were extremely busy though (all teachers are overworked), and had to get permission from their department heads, who had to get permission from the principal, to use it. Their effort fizzled out, and I did a re-evaluation of my own time, and moved on.

What I Learned

I learned about a lot of different things, but overall, I think this project reinforced two principles for me:

  • Ask better questions when doing customer development, and solve a problem.
    • My idea never really solved an important problem for my target audience, teachers. I should’ve talked to more administrators, who may care more about teacher evaluation. Also, you’re bound to get positive but not very useful answers when you ask someone what they think about your idea: whether it solves a big enough problem for them to actually integrate your product into their life is a different story. Not solving an important enough problem for teachers coupled with lots of bureaucracy and the fact that they’re overworked was not a recipe for excited users.
  • Keep doing things, don’t worry about failure.
    • I got to learn about an important and fascinating area of education by doing this project. I also got to learn about the realities of the space. I learned more about the power of customer development: that through observation and/or asking better questions, you can get to true pain points that people will pay you to solve. I learned that some types of problems and tasks excite me more than others. This project was also a great way for me to practice first principles thinking.

Thanks for reading this journal of sorts.

Cancer clinical trials and the problem of low patient accrual

Inspired by this contest to come up with ideas to increase the low amount of patient accrual for cancer clinical trials, I decided to look more into the data. Bold, by the way, is one of my all time favorite books, and was co-authored by the creator of the website, the xprize Foundation, and co-founder of Planetary Resources: Peter Diamandis. Truly someone to look up to.

Anyways, the premise of the contest is that over 20% of cancer clinical trials don’t complete, so the time and effort spent is wasted. The most common reason for this termination is the clinical trial not being able to recruit enough patients. Just how common is the low accrual reason though? And are there obvious characteristics of clinical trials that can help us better predict which ones will complete successfully, and what does that suggest about building better clinical trial protocols? I saw this as an opportunity to explore an interesting topic, while playing around with the trove of publicly available clinical trial data and various Python data analysis libraries: seaborn for graphing, scikit-learn for machine learning, and the trusty pandas for data wrangling.

Basic data characteristics

I pulled the trials for a handful of the cancers with the most clinical trials (completed, terminated, and in progress), got around 27,000 trials, and observed the following:

  • close to 60% of the studies are based in the US*
*where a clinical trial is “based” can mean where the principal investigator (the researcher who’s running the clinical trial) is based. The data doesn’t give the country in which the principal investigator’s institution is located, so as a proxy, I used the country with the largest number of hospitals at which the study could recruit patients.
  • almost 25% of all US based trials ever (finished and in progress) are still recruiting patients


  • of those trials that are finished and have results, close to 20% terminated early, and 80% completed successfully (which matches the numbers the contest cited)


  • almost 50% of all US based trials are in Phase II, almost 25% are in Phase I


  • and interestingly, the termination rate does not differ very significantly across studies in different phases


Termination reasons

Next, I was interested in finding out just how common insufficient patient accrual was as a trial termination reason vs. other reasons. This was a little tricky, as the registry gives principal investigators a free-form text field to enter their termination reason. So “insufficient patient accrual” could be described as “Study closed by PI due to lower than expected accrual” or “The study was stopped due to lack of enrollment”. So I used k-means clustering (after term frequency-inverse document frequency feature extraction) of the termination reasons to find groups of reasons that meant similar things, and then manually de-duped the groups (e.g. combining the “lack of enrollment” and “low accrual” groups into the same group because they meant the same thing).
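For readers who want to try the tf-idf plus k-means step, here is a minimal scikit-learn sketch. The six termination reasons are made up, and the number of clusters (2) and the `random_state` are arbitrary choices for the toy data, not the settings used in the analysis.

```python
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical free-text termination reasons (not taken from the real data).
reasons = [
    "Study closed due to low accrual",
    "Terminated because of low accrual",
    "Closed early due to low accrual",
    "Sponsor withdrew funding",
    "Funding withdrawn by sponsor",
    "Sponsor funding ended",
]

# Turn each reason into a tf-idf vector, then group similar vectors.
vectors = TfidfVectorizer().fit_transform(reasons)
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = kmeans.fit_predict(vectors)

# Print the reasons grouped by cluster so the groups can be reviewed by hand.
for cluster_id in sorted(set(labels)):
    print(cluster_id, [r for r, l in zip(reasons, labels) if l == cluster_id])
```

Clusters only capture shared wording, so phrases like "lack of enrollment" and "low accrual" can land in different groups, which is why a manual merge of the groups is still needed afterwards.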

I found that about 52% of terminated clinical trials end because of insufficient patient accrual. This implies that about 10% of clinical trials that end (either successfully, or because they’re terminated early) do so because they can’t recruit enough patients for the study.
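The 10% figure is just the product of two shares. A quick check, using the 80.6% completion rate quoted later in this post for the share of finished trials that completed:

```python
# Share of finished trials that terminated early (80.6% completed).
share_terminated = 1 - 0.806

# Share of terminated trials whose stated reason was insufficient accrual.
share_low_accrual_among_terminated = 0.52

# Share of all finished trials that ended because of insufficient accrual.
share_low_accrual_overall = share_terminated * share_low_accrual_among_terminated
print(f"{share_low_accrual_overall:.1%}")  # prints 10.1%
```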


Predicting clinical trial termination?

The registry provides a bunch of information on each clinical trial–trial description, recruitment locations, eligibility criteria, phase, sponsor type (industry, institutional, other) to name a few–which raises the question: can this information be used to predict whether a trial will terminate early, specifically because of low patient accrual? Are there visible aspects of a clinical trial that are related to a higher or lower probability that it fails to recruit enough patients? One might think that the complexity of trial eligibility criteria and the number of hospitals from which the trial can recruit could be related to sufficient patient accrual.

Here was my attempt to get at a solution to this question analytically: fitting/training a logit regression multi class classifier–whether a trial would be “completed”, “terminated because of insufficient accrual”, or “terminated for other reasons”–on a random partition of clinical trial data, and measuring its accuracy at classifying out-of-sample clinical trials. The predictors were of two types: characteristic (e.g. phase, number of locations, sponsor type, etc.) and “textual”, or features extracted from text based data like the study’s description and eligibility criteria. Some of these features came from a similar tf-idf vectorization process as described in the k-means section above, other features were the simple character lengths of these text blocks. Below is a plot showing the relationship between two of these features: length of the eligibility criteria block of text, and length of the study’s title, two metrics that perhaps get at the complexity of a clinical trial.
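A rough sketch of this kind of setup in scikit-learn is below: a multi-class logistic regression on a mix of categorical, numeric, and tf-idf text features, scored on a held-out split. Everything about the data is invented for illustration (the column names, the three template trials, the label strings), so it shows the shape of the pipeline rather than reproducing the analysis.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy trials built from three templates; real data is far messier.
templates = [
    ("Phase 2", "Industry", 40,
     "adults with measurable disease and adequate organ function", "completed"),
    ("Phase 1", "Institutional", 2,
     "rare mutation required, no prior therapy, age over 70",
     "terminated_low_accrual"),
    ("Phase 3", "Industry", 15,
     "adults with measurable disease", "terminated_other"),
]
rows = []
for i in range(20):
    for phase, sponsor, n_locations, criteria, outcome in templates:
        rows.append({"phase": phase, "sponsor_type": sponsor,
                     "n_locations": n_locations + i % 5,
                     "criteria": criteria, "outcome": outcome})
trials = pd.DataFrame(rows)
trials["criteria_length"] = trials["criteria"].str.len()

# "Characteristic" features plus "textual" features in one design matrix.
features = ColumnTransformer([
    ("categorical", OneHotEncoder(handle_unknown="ignore"),
     ["phase", "sponsor_type"]),
    ("numeric", StandardScaler(), ["n_locations", "criteria_length"]),
    ("text", TfidfVectorizer(), "criteria"),
])
model = make_pipeline(features, LogisticRegression(max_iter=1000))

X = trials.drop(columns="outcome")
y = trials["outcome"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y)

model.fit(X_train, y_train)
predictions = model.predict(X_test)
accuracy = accuracy_score(y_test, predictions)
print(f"out-of-sample accuracy: {accuracy:.3f}")
```

The toy classes are trivially separable, so the accuracy printed here says nothing about real trials; the point is only how the features and the holdout evaluation fit together.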


The result: the logit model could only predict correctly whether trials would complete successfully, terminate because of low accrual, or terminate for other reasons 83.6% of the time. This is a pretty small improvement over saying “I think this trial will complete successfully” to every trial you come across, in which case you would be correct 80.6% of the time (see the Completed vs. Terminated pie chart above). Cancer clinical trials are very diverse, so it makes sense that there don’t seem to be any apparent one-size-fits-all solutions to improving patient accrual.
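To make the comparison with the "always guess completed" baseline concrete, here is a small sketch. The label counts are illustrative, chosen per 1,000 finished trials so that they match the percentages in this post; `majority_class_accuracy` is a helper written for the example.

```python
from collections import Counter


def majority_class_accuracy(labels):
    """Accuracy of always predicting the most common label."""
    most_common_count = Counter(labels).most_common(1)[0][1]
    return most_common_count / len(labels)


# Illustrative counts per 1,000 finished trials (80.6% completed, and roughly
# 52% of the remaining terminations attributed to low accrual).
labels = (["completed"] * 806
          + ["terminated_low_accrual"] * 101
          + ["terminated_other"] * 93)

baseline = majority_class_accuracy(labels)
model_accuracy = 0.836  # the logit model's out-of-sample accuracy

# Improvement in percentage points, and as a share of the baseline's errors.
lift_points = (model_accuracy - baseline) * 100
error_reduction = (model_accuracy - baseline) / (1 - baseline)
print(f"baseline {baseline:.1%}, lift {lift_points:.1f} points, "
      f"error reduction {error_reduction:.1%}")
```

Seen this way, the model removes only a modest share of the mistakes the naive guess would make.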