One of the side projects I worked on in the past handful of months was Mr. Market Feels: a stock market sentiment Twitter bot that used automated image processing to extract and tweet the value of CNN Money’s Fear and Greed Index every day.
There have been attempts to backtest the predictive power of the Fear and Greed Index when buying and selling the overall stock market index depending on the value (the results suggest there isn’t much much edge for that particular strategy). Anecdotally though, I’ve found the CNN Fear and Greed Index (what I’ll call FGI for short) to be a pretty good indicator of when this bull market has bottomed out during a short-term retracement, and when I used to have more time, have used it to trade options with decent success. Going to CNN’s website every day to check the FGI was a pain, and I also wanted the numerical values in case I wanted to run some analyses in the future, so I wondered if I could automatically extract the daily Fear and Greed Index values.
I saw this as a fun and short coding project that would help me and others while giving me practice with image processing, so I dove in.
The goal was to extract the FGI “value” and “label” from CNN’s site every day. The value of the Index is 53 and the label is “Neutral” in the snapshot of the FGI below:
Extracting the FGI value and label isn’t as easy as using OCR (optical character recognition) on the image and getting the results: for one, there is a lot of extraneous text in the image. Two: the pixel location of the value and label that we want changes as the FGI changes. Three: the relative position of the value and label also changes as the FGI changes. You can see points two and three in the image below: now, the FGI label (“Extreme Fear”) is to the top left of the FGI value (1). In the original image, the FGI label (“Neutral”) is directly right of the FGI value (53).
Why does all of this matter? Because for clean OCR, images need to be standardized. Or at least they do for Tesseract, the open source OCR engine created by Google. In Tesseract’s case, images of text shouldn’t contain any other artifacts (that the engine might try to interpret as text),should be scaled large enough, have as much image contrast as possible (e.g. black text on white), and be either horizontally or vertically aligned.
Most of the pre-processing of the FGI images, like the ones above, to standardize them for Tesseract was straight forward enough. Without going into way too much detail, I used the Python Pillow library to automatically convert the image to black and white, apply image masks to eliminate extraneous parts of the image–like the “speed dial” and the “historical FGI table” on the right hand side–and crop the image down leave only the FGI value and label, like this:
Here’s where challenge number three came up: the FGI value and label aren’t always either horizontally or vertically aligned, and this reduced Tesseract’s accuracy. For example, in the first image, the FGI label is diagonal from the FGI value. Running Tesseract OCR on it returns “NOW:[newline]Extreme[newline]Fear”, which completely misses the value “10” because of the diagonal alignment. You can try out Tesseract OCR with the above images, or with your own, here.
An Interdisciplinary Solution of Sorts
One solution to the challenge above split the resulting image into two images, one with the FGI value and a separate one with the label, so that Tesseract could be run on both and know that both images were either horizontally or vertically aligned. Basically, from a single FGI image, I wanted two images that looked like these:
In thinking about ways to implement that, I first thought about the principles of unsupervised clustering, from the field of machine learning. With clustering, the intermediate, processed FGI image could be segmented and split appropriately by finding the cluster of pixels that corresponded to the FGI value (“10”), and the other cluster of pixels that corresponded to the FGI label (“Now: Extreme Fear”).
Turns out that using the k-means clustering algorithm for image segmentation is pretty common practice.
First, a copy of the image was “pixelated” to ensure that the k-means algorithm would converge on the two correct clusters every time:
Then, applying k-means to find the centroids of the two clusters and deriving the separating partition resulted in an image that looked like this:
From there, the original black and white FGI image could be split along the partition line, which would result in the desired two images: one for the FGI value and one for the FGI label. From here, Tesseract would always have these two standardized images as inputs and would be able to cleanly extract the FGI value and label.
I just finished reading Poor Charlie’s Almanack (an amazing book full of wisdom and life principles) so Charlie Munger’s multidisciplinary approach to life is on my mind. Though this project was probably a little less multidisciplinary than he means because machine learning and image processing are closely related fields, I still saw it as an example of how broad and varied knowledge and skills can come together to solve a problem effectively. To quote Munger on specialized knowledge: “To the man with only a hammer, every problem looks like a nail.”
Also published on Medium.