Scraping Horoscopes
Astrology drives me bananas. Don’t get me wrong - I understand why people enjoy it, and why someone might check their horoscope regularly. People do lots of weird things, but there’s always some underlying set of psychological explanations for these behaviors. For the astrologically inclined, the idea that the world is governed by the positions of planets and stars is a comforting one. At least, it’s a hell of a lot more comforting than the truth - that the world is a pretty chaotic place, and trying to understand it or make specific predictions (about the behaviors of individual agents, for example) is close to impossible.
But actually buying into it? And actively arguing for its veracity? Come on.
Anyway, I wanted to practice doing some web scraping and some simple modeling of text data. This post is about the former. I thought astrological forecasts, being published daily, would make a great thing to look at. It was a little difficult to find historical astrological forecasts. I guess this shouldn’t be surprising. How many horoscope readers really want to go back to last week’s horoscope to see if the predictions came true? Probably vanishingly few.
Fortunately, the New York Post does something other than make awesome puns - they have their horoscopes archived going back to at least December 2013. Even better, they do it with nicely formatted URLs. For example, here are the URLs for Aries from 12/01/2013 to 12/05/2013:
http://nypost.com/horoscope/aries-12-01-2013/
http://nypost.com/horoscope/aries-12-02-2013/
http://nypost.com/horoscope/aries-12-03-2013/
http://nypost.com/horoscope/aries-12-04-2013/
http://nypost.com/horoscope/aries-12-05-2013/
Given those five examples, I'm sure you can guess what the link is for any given horoscope on any date between 12/01/2013 and the present day. Because they're so regularly formatted, it won't be hard to write a little script to go to each webpage and pull out the information we're interested in.
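Just to make that pattern concrete, here's a quick sketch of how you might generate those URLs in Python (the URL template comes straight from the examples above; the helper function and variable names are mine):

```python
from datetime import date, timedelta

# The pattern from the URLs above: sign, then month-day-year
URL_TEMPLATE = "http://nypost.com/horoscope/{sign}-{date}/"

def horoscope_url(sign, day):
    """Build the NY Post horoscope URL for a given sign and date."""
    return URL_TEMPLATE.format(sign=sign, date=day.strftime("%m-%d-%Y"))

# The five Aries URLs above, reconstructed:
start = date(2013, 12, 1)
urls = [horoscope_url("aries", start + timedelta(days=i)) for i in range(5)]
```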
There’s two other considerations we’ll need here. First, if you’ve spent much time surfing the internet, you know that sometimes pages don’t work. You have to refresh for the page to load, or the page is inexplicably removed. If I’m going to automatically grab text from over 6000 individual pages (12 signs * ~1.5 years of horoscopes), I need to build in something to account for the fact that not every page I visit will be a success.
Second, once we successfully visit a page, I'm not going to be interested in all of it. There are a lot of advertisements, links to other punny articles, buttons to like this or subscribe to that, and all the glorious, chaotic noise that is the internet. I need some way of wading through this mess and pulling out only the information I want.
To solve the former problem, we'll turn to the Python module requests. This module gives us a nice way to ping a webpage and see if it answers back.
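Something like this does the trick (the variable names here are my own choices):

```python
import requests

# One of the Aries pages from above
url = "http://nypost.com/horoscope/aries-12-01-2013/"
page = requests.get(url)  # send an HTTP GET request and keep the response
```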
Now this new variable, page, can tell us whether the request actually succeeded.
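requests exposes that information as the response's status code, which is the standard way to check:

```python
print(page.status_code)  # 200 -- the standard "everything is fine" response
print(page.ok)           # True for any non-error status code
```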
If, on the other hand, we try to visit a bad page, we get something slightly different.
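For instance, requesting a URL with an impossible date (a made-up example) should come back with an error status instead of a 200:

```python
bad_page = requests.get("http://nypost.com/horoscope/aries-13-45-2013/")
print(bad_page.status_code)  # an error code, most likely 404
print(bad_page.ok)           # False
```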
Great! We’ve solved the first problem. For the second, we need two things.
First, we need to know what part of the webpage we want. Because most of the stuff you see on the internet is built out of rigidly structured HTML, we can use that structure to automatically select the information we want. There are two ways I do this: SelectorGadget, or Chrome's developer tools (open Chrome, then go to View -> Developer -> Developer Tools). If we look at our horoscope pages, we see that the content we're interested in (the text of the horoscope) is contained within a 'p' element, which is itself contained within a 'div' element with the class 'entry-content'.
Now that we know where our wanted content lives, we can use the wonderfully named Beautiful Soup module to grab it.
Confession: these two things always involve a lot of trial and error for me. I'm no expert on this stuff; I just use them as tools to greater ends.
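Here's a sketch of that step, using the 'entry-content' div and 'p' tag we identified above (if the NY Post has since changed its markup, these selectors would need updating):

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup(page.text, "html.parser")

# Grab the div that holds the article body, then the paragraph inside it
content_div = soup.find("div", class_="entry-content")
horoscope_text = content_div.find("p").get_text() if content_div else None
print(horoscope_text)
```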
There we are! That's the stuff we wanted! Now we can put all of this together to scrape the rest of the data. I tried to run it a couple of times, but the scraper broke down after a couple of hours, so instead of doing it in one big loop like the one pasted below, I did it sign-by-sign over the course of a couple of days.
Anyway, here’s the whole thing. If you want the compiled data, you can find it in pipe-delimited format here. Next time, I’ll do some actual work with these data.
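Here's a sketch of that full loop, pieced together from the steps above; details like the one-second pause between requests and the output filename are illustrative choices rather than anything essential:

```python
import csv
import time
from datetime import date, timedelta

import requests
from bs4 import BeautifulSoup

SIGNS = ["aries", "taurus", "gemini", "cancer", "leo", "virgo",
         "libra", "scorpio", "sagittarius", "capricorn", "aquarius", "pisces"]

def scrape_horoscopes(start, end, outfile="horoscopes.psv"):
    """Scrape every sign's horoscope between start and end (inclusive)
    and write the results to a pipe-delimited file."""
    with open(outfile, "w", newline="") as f:
        writer = csv.writer(f, delimiter="|")
        writer.writerow(["sign", "date", "text"])
        day = start
        while day <= end:
            for sign in SIGNS:
                url = "http://nypost.com/horoscope/{}-{}/".format(
                    sign, day.strftime("%m-%d-%Y"))
                page = requests.get(url)
                if not page.ok:
                    continue  # skip pages that fail to load
                soup = BeautifulSoup(page.text, "html.parser")
                div = soup.find("div", class_="entry-content")
                if div is None or div.find("p") is None:
                    continue  # skip pages without the expected structure
                text = div.find("p").get_text().strip()
                writer.writerow([sign, day.isoformat(), text])
                time.sleep(1)  # be polite to the server
            day += timedelta(days=1)

# e.g. scrape_horoscopes(date(2013, 12, 1), date(2013, 12, 5))
```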