Learning from machine learning: ensembling, and other important skills

In my downtime, I’ve been using Kaggle to get better at applying machine learning to solve problems. The process is not only teaching me new technical skills, but also reminding me of some useful principles that can be applied elsewhere. To keep things digestible, this is the second post of two (the first one is here).

A short list of important skills for a data scientist

When trying to get better at a skill, I try to tackle the highest leverage points–here’s what I’ve been able to gather about three skills that are important in being a data scientist*, from talking with others and reading about machine learning, and experiencing it firsthand with the client projects I do.

  1. Feature engineering
  2. Communication (includes visualization)
  3. Ensembling

The first two are relatively self-explanatory, ensembling brings some pretty interesting concepts that apply to decision-making, in my opinion.

*I’ll be referring to the “applier of machine learning” aspect of “data science”.

Feature engineering

Feature engineering is the process of cleaning, transforming, combining, disaggregating, etc. your data to improve your machine learning model’s predictive performance. Essentially, you’re using existing data to come up with new representations of the data in the hopes of providing more signal to the model–feature selection is removing less useful features, thus feeding the model less noise, which is also good. The practitioner’s own domain knowledge and experience is used a lot here to engineer features in a way that will improve the model’s performance instead of hurt it.

There are a few tactics that can be generally applied to engineer better features, such as normalizing the data to help certain kinds of machine learning models perform better. But usually, the largest “lift” in performance comes from engineering features in a way that’s specific to the domain or even problem.

An example is using someone’s financial data to predict likelihood of default, on a loan for example. You might have the person’s annual income and monthly debt payments (e.g. for auto loans, mortgages, credit cards, the new loan they’re applying for), but those somewhat closer to the lending industry will tell you that a “debt to income ratio” is a better metric for predicting default, because it essentially measures how capable the person is of paying of his/her debt, all in one number. After calculating it, a data scientist would add this feature to the training data, and would find that their machine learning model performs better at predicting default.

As such, feature engineering (and in fact, most of machine learning) is sort of an art vs. a science, where a creative spark for an innovative way to engineer a domain specific feature is more effective than hard and fast rules. They say feature engineering can’t be taught from books, only experience, which is why I think Kaggle is in an interesting position because they’re essentially crowdsourcing the best machine learning methodologies for all sorts of problems and domains. There’s a treasure trove of knowledge on there, and if structured a little better, Kaggle could contribute a lot to machine learning education.


What potentially useful features/data could we engineer from timestamp strings? We could generate year, month, day, day of week, etc. numeric data columns–much more readable by a machine learning model.


During a recent chat with one of the core developers of the Python scikit-learn package, I asked what he thought some of the most important skills for a data scientist are. I sort of expected technical skills, but one of the first things that came up was communication, or being able to convey findings and why those findings matter to both internal and external stakeholders, like customers. This one’s self explanatory–what good is data if you can’t act upon it.

In fact, it seems like communicating well for data scientists might be even more important than it is for professions like programmers or designers because there’s a larger gap between result and action. For example, with a design or app, a decision maker can look at it or play around with it do understand it reasonably well to make decision, whereas a decision maker usually can’t just see a bunch of numbers that were spit out by a machine learning model and know what to do: how are those numbers actionable, why should someone believe those numbers, etc. Visualization is a piece of this, as it’s choosing the right charts, design, etc. to communicate your data’s message most effectively.


In machine learning, an ensemble is a collection of models that can be combined into something that performs better than the individual models.

An example: one way this is done is via the voting method. The different base, or “level 0”, models each make a prediction on, say, whether a person is going to go into default in the next 90 days. Model A predicts “yes”, model B predicts “yes”, and model C predicts “no”. The final decision then becomes the majority vote, here “yes”.

There are many other ways of ensembling models together. An important and powerful one is called stacking, and it is applying another machine learning model–called a “generalizer”, or “level 1” model–on the predictions of the base models themselves. This is better than the voting method because you’re letting the level 1 machine learning model decide which level 0 models to believe more than others based on the training data you feed into the system, instead of arbitrarily saying “the majority rules”.



A high level flow chart of how stacking works.

Ensembling is a key technique in machine learning to improve predictive performance. Why does it work? We all have an intuitive understanding for why it should work, because it’s a decision making framework we all have probably used, or been a part of, before. Different people know different things, and so may make different decisions given a particular problem. When we combine them in some way–like a majority vote in Congress or at the company we work at–we “diversify” away the potential biases and randomness that comes from just following one decision maker. Then, if you add in some mechanism to learn which decision makers should have their decisions weighed more than others based off of past performance, the system can become even more predictive–what areas could benefit from this improved, performance based decision-making process?*

*Proprietary trading companies, where every trade is a data point and thus generated very frequently, do this more intelligent way of ensembling, in a way, by allocating more money to traders who’ve performed better than others historically. A trader who is maybe slightly profitable but makes uncorrelated trades–for example by trading in another asset class–will still be given a decently sized allocation, because his trades hedge other traders’ trades, thus improving the overall performance of the prop trading company. Analogously, in machine learning, ensembling models that make uncorrelated predictions improves overall predictive performance.


Here are some resources related to the topics described above that were recommended to me and that I found most useful, I hope they’re helpful to you too.

  • A good overview of the principles of data science and machine learning for non-technical and technical folk alike: Data Science for Business
  • Code example of stacking done with sklearn models
  • An important thing for a data scientist to have before any of the stuff above is a good understanding of statistics, Elements of Statistical Learning is a detailed survey of the statistical underpinnings of machine learning.