This is a continuation of a series of blog posts in which I work with some
horoscopes I scraped from the New York Post’s website. In the last post, I
showed that the language didn’t really contain any information that would allow
us to identify which sign the particular horoscope came from. However, that
doesn’t mean the language doesn’t contain any information.
Conveniently, we also have the publication date for each horoscope. Better yet,
there are 12 months in the year, just as there are 12 astrological signs. Both
problems therefore have the same number of classes, so we can directly compare
how well we can classify on zodiac sign (not well at all) with how well we can
classify on the month of the year.
First, let’s pull out just the month of the year from our data.
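
As a minimal sketch of this step, assuming each scraped record carries its publication date as an ISO-style string (the `horoscopes` list, its field names, and the date format below are hypothetical stand-ins, not the actual scraped data):

```python
from datetime import datetime

# Hypothetical stand-in records; the real scraped data may be shaped differently.
horoscopes = [
    {"date": "2013-12-01", "sign": "aries", "text": "A quiet start to the month."},
    {"date": "2014-01-15", "sign": "taurus", "text": "Plans made now will last."},
    {"date": "2014-06-30", "sign": "gemini", "text": "Travel brings good news."},
]

# Parse each date string and keep only the month number (1 = January).
months = [datetime.strptime(h["date"], "%Y-%m-%d").month for h in horoscopes]

print(months)
```

The resulting `months` list lines up one-to-one with the horoscope texts, so it can be dropped in wherever the zodiac-sign labels were used before.
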
Now, we can repeat our classification procedure with this new set of labels that
indicate the month in which the horoscope was written.
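
Here is a rough sketch of what such a procedure could look like with scikit-learn. The TF-IDF features, the naive Bayes classifier, and the toy data are all assumptions for illustration; the model used in the previous post may well differ.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import classification_report
from sklearn.model_selection import cross_val_predict
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Toy stand-in data (hypothetical): horoscope texts and their publication months.
texts = [
    "new year resolutions bring fresh energy",
    "a cold january morning favors planning",
    "winter resolutions need patience this year",
    "january snow slows your new plans",
    "summer heat sparks a july romance",
    "a july holiday brings warm surprises",
    "long summer days favor july travel",
    "warm july evenings invite romance",
]
months = [1, 1, 1, 1, 7, 7, 7, 7]

# Bag-of-words features feeding a simple classifier (an assumed model choice).
model = make_pipeline(TfidfVectorizer(), MultinomialNB())

# Each horoscope is predicted by a model that did not see it during training.
predicted = cross_val_predict(model, texts, months, cv=2)

# Per-class precision, recall, and F1.
print(classification_report(months, predicted, zero_division=0))
```

Swapping `months` for the zodiac-sign labels is the only change needed to run the same comparison on the other set of labels.
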
HA! The text tells us more about the month of the year than it does about the
astrological sign being discussed. Man, my job is cool.
Just in case you don’t remember (or you never looked), here’s what this
classification looks like when there is no real relationship between the
horoscope and the month it was published. We can simulate this by shuffling the
labels so that each one is paired with a random horoscope rather than with the
one it truly belongs to.
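
The shuffling itself can be sketched in a few lines. The `shuffle_labels` helper and the toy labels are hypothetical; the point is only that the class frequencies stay the same while the pairing with the texts is destroyed.

```python
import random
from collections import Counter

# Toy month labels (hypothetical), in the same order as the horoscope texts.
months = [1, 1, 2, 2, 3, 3, 4, 4, 5, 5, 6, 6]

def shuffle_labels(labels, seed=0):
    """Return a shuffled copy of labels; the original list is left untouched."""
    shuffled = list(labels)
    random.Random(seed).shuffle(shuffled)
    return shuffled

shuffled_months = shuffle_labels(months)

# Same number of horoscopes per month as before; only the pairing has changed.
print(Counter(shuffled_months) == Counter(months))
```

Fitting the same classifier on the texts paired with `shuffled_months` then gives a chance-level baseline to compare against.
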
I would say that this pretty convincingly shows that the horoscopes contain
more information about the month in which they were published than about the
astrological sign.
Just to be complete, let’s use a random forest as well, just like we tried in
the last post.
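
A sketch of the random forest variant, again on hypothetical toy data. The TF-IDF features, the number of trees, and the macro averaging of the scores are assumptions rather than the settings from the previous post.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import f1_score, precision_score
from sklearn.model_selection import cross_val_predict
from sklearn.pipeline import make_pipeline

# Toy stand-in data (hypothetical): horoscope texts and their publication months.
texts = [
    "new year resolutions bring fresh energy",
    "a cold january morning favors planning",
    "winter resolutions need patience this year",
    "january snow slows your new plans",
    "summer heat sparks a july romance",
    "a july holiday brings warm surprises",
    "long summer days favor july travel",
    "warm july evenings invite romance",
]
months = [1, 1, 1, 1, 7, 7, 7, 7]

# Same features as before, with a random forest as the classifier.
forest = make_pipeline(
    TfidfVectorizer(),
    RandomForestClassifier(n_estimators=100, random_state=0),
)
predicted = cross_val_predict(forest, texts, months, cv=2)

# Macro averaging weights every month equally, regardless of its size.
precision = precision_score(months, predicted, average="macro", zero_division=0)
f1 = f1_score(months, predicted, average="macro", zero_division=0)
print(f"precision={precision:.2f}  f1={f1:.2f}")
```
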
A random forest seems to give us slightly better precision in this case, but
the F1 score is the same. There’s a problem here, however. Unlike when we were
classifying by zodiac sign, our classes are not roughly equal in size.
Specifically, there are fewer horoscopes for the months of June through
November. This could be (and almost certainly is) biasing our learner, and it
is an important factor to consider when fitting these kinds of models.
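
One way to see the imbalance, and one common way to compensate for it, is to count the examples per month and weight each class inversely to its size. The labels below are hypothetical, and reweighting is a suggested option rather than something done in this post. As far as I know, the formula matches the heuristic behind scikit-learn’s `class_weight="balanced"` setting.

```python
from collections import Counter

# Hypothetical month labels: months 6 and 7 have half as many examples.
months = [1] * 4 + [2] * 4 + [6] * 2 + [7] * 2

counts = Counter(months)
n_samples = len(months)
n_classes = len(counts)

# Inverse-frequency weights: rarer months count for more during fitting.
weights = {
    month: n_samples / (n_classes * count)
    for month, count in counts.items()
}

for month in sorted(weights):
    print(month, counts[month], weights[month])
```

A dictionary like `weights` could be passed as the `class_weight` argument of classifiers that support it, such as `RandomForestClassifier`.
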