Thursday 20 January 2011

Horoscoped

Horoscoped: "

Horoscoped - Do horoscopes really just all say the same thing?

Do horoscopes really all just say the same thing? We scraped & analysed 22,000 to see.


See our completed meta-horoscope chart and make up your own mind.


We’ve also created a single meta-prediction out of the most common words.




How we did it


Horoscoped - Scraping 22,000 horoscopes

How do you gather 22,000 horoscopes? Obviously you could manually cut and paste them from one of the many online Zodiac pages. But that, we calculated, would take about a week of solid work (84.44 hours). So we engaged the services of arch-coder Thomas Winningham to do a bit of hacking.


Yahoo Shine kindly archive their daily predictions in a simple and very hackable format (example). Thank you! So Thomas wrote a Python script to screen-scrape 22,186 horoscopes into a single massive spreadsheet. Screen-scraping is pulling the text off a website after it’s displayed. Python is a programming language. You can use it to write scripts that only gather the specific text you want. Then you run it multiple times so it mines an entire website.
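
To give a feel for what "only gather the specific text you want" means, here is a minimal sketch using only Python's standard library. It is not Thomas's script (that is in the gists linked further down). The page markup, the `ParagraphText` class and the `extract_paragraphs` function are all invented for this example, and a real page would need its own rules for finding the prediction text.

```python
from html.parser import HTMLParser


class ParagraphText(HTMLParser):
    """Collect the text found inside <p> tags and ignore everything else."""

    def __init__(self):
        super().__init__()
        self.inside_p = False  # True while we are between <p> and </p>
        self.paragraphs = []   # one string per paragraph found

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self.inside_p = True
            self.paragraphs.append("")

    def handle_endtag(self, tag):
        if tag == "p":
            self.inside_p = False

    def handle_data(self, data):
        if self.inside_p:
            self.paragraphs[-1] += data


def extract_paragraphs(html):
    """Return the paragraph text of a page with whitespace tidied up."""
    parser = ParagraphText()
    parser.feed(html)
    parser.close()
    return [" ".join(p.split()) for p in parser.paragraphs if p.strip()]


# Hypothetical page markup, made up for this example.
SAMPLE_PAGE = """
<html><body>
  <div class="nav">Home | Love | Career</div>
  <p>Today is a good day to   talk things over.</p>
  <div class="ads">Buy now!</div>
</body></html>
"""

print(extract_paragraphs(SAMPLE_PAGE))
```

The navigation and advert text is thrown away and only the paragraph survives, which is the whole point of scraping rather than copying the page wholesale.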


Well, it’s not quite that easy. Big sites like Yahoo have ‘rate-limiting’ on their servers. That means if you request pages too many times too quickly, the server assumes you’re a hacker and deploys all kinds of anti-hacking counter-measures. Initially, Thomas set his scraping speed too high (one request every tenth of a second) and his IP got instantly banned from Yahoo for 24 hours. After some experimenting (and more bans), he found that a two-second delay between scrapes prevented the defense mechanisms from kicking in. The script was set to run in the background (while we smoked cigars and discussed the empire). 12 hours later, we had our 22,000 horoscopes in a single file!
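
The pacing idea can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the script that was actually run: `scrape_politely` and `estimated_hours` are hypothetical names, and the `fetch` function is passed in by the caller so the loop can be tried without touching any real website.

```python
import time


def scrape_politely(urls, fetch, delay_seconds=2.0, sleep=time.sleep):
    """Fetch each URL in turn, pausing between requests.

    `fetch` is any function that takes a URL and returns the page text.
    `sleep` can be swapped out so the pacing can be checked without waiting.
    """
    pages = []
    for index, url in enumerate(urls):
        if index > 0:
            sleep(delay_seconds)  # wait before every request except the first
        pages.append(fetch(url))
    return pages


def estimated_hours(page_count, delay_seconds=2.0):
    """Lower bound on run time: the pauses alone, ignoring download time."""
    return max(page_count - 1, 0) * delay_seconds / 3600


# 22,186 pages with a two-second pause between them.
print(round(estimated_hours(22186), 1))
```

The pauses alone for 22,186 pages come to a little over 12 hours, which squares with the run time quoted above.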


We can’t share the 9.5MB spreadsheet with you because it’s Yahoo’s copyright. But here are the Python scripts should you feel like recreating the experiment.


https://gist.github.com/776219

https://gist.github.com/776228


Filtering it down


Horoscoped - Filtering 22,000 horoscopes

So every different type of horoscope got sucked up – career, teen, love, daily overview. Who knew there were so many? It was felt, though, that career & love predictions would have their internal biases i.e. lots of mentions of work, career, love, marriage etc. So we opted to just analyse the generic daily horoscopes for each sign. A total of 4,380 (365 per star sign).
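
As a rough sketch of that filtering step, assuming the scraped rows carry a category label (the field names `sign`, `category` and `text` and the label "daily overview" are assumptions for this example, not the real spreadsheet columns):

```python
SIGNS = ["aries", "taurus", "gemini", "cancer", "leo", "virgo",
         "libra", "scorpio", "sagittarius", "capricorn", "aquarius", "pisces"]


def keep_daily(rows, wanted="daily overview"):
    """Keep only the rows whose category is the generic daily horoscope."""
    return [row for row in rows if row["category"] == wanted]


# Made-up rows standing in for the scraped spreadsheet.
sample_rows = [
    {"sign": "aries", "category": "daily overview", "text": "A calm day."},
    {"sign": "aries", "category": "love", "text": "Romance is near."},
    {"sign": "libra", "category": "career", "text": "Work goes well."},
    {"sign": "libra", "category": "daily overview", "text": "Seek balance."},
]

daily = keep_daily(sample_rows)
print([row["sign"] for row in daily])

# Sanity check on the total: 12 signs x 365 days.
print(len(SIGNS) * 365)
```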


Word Analysis Version 1


We used an online tool called TagCrowd to find the most common words. I prefer it to Wordle. You get better control over any ‘noise’ in the signal, because you can filter out not only common words (“and”, “for”, “is” etc) but also a special ‘stoplist’ of words you’ve chosen.
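
If you would rather do the counting yourself than use an online tool, the same idea can be sketched in Python. This is not how TagCrowd works internally (we don't know that). `top_words` is a hypothetical helper, and the short `COMMON_WORDS` set is only a stand-in for a proper list of common English words.

```python
import re
from collections import Counter

# A tiny illustrative set; a real list of common words would be much longer.
COMMON_WORDS = {"and", "for", "is", "the", "a", "to", "you", "your"}


def top_words(text, n=50, stoplist=()):
    """Return the n most frequent words, skipping common and stoplisted ones."""
    words = re.findall(r"[a-z']+", text.lower())
    ignored = COMMON_WORDS | set(stoplist)
    counts = Counter(word for word in words if word not in ignored)
    return counts.most_common(n)


sample = ("You really feel the love today. "
          "Love is for you, and love is really quite near.")

print(top_words(sample, n=3))
print(top_words(sample, n=3, stoplist={"really", "quite"}))
```

The second call shows the effect of a custom stoplist: words you have decided are noise drop out of the ranking and others move up to take their place.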


So we took the 50 most common words for each sign to see if any of them were unique to that sign. This is what was revealed:


Horoscoped - Unique words in top 50 words in predictions of each star sign


You can see the full data in a Google spreadsheet here.
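
Finding the words unique to each sign comes down to comparing each sign's top-word list against everyone else's. A minimal sketch, assuming you already have the lists in a dictionary (`unique_top_words` is a hypothetical name and the words below are invented, not taken from our data):

```python
def unique_top_words(top_words_by_sign):
    """Map each sign to the words that appear in no other sign's list."""
    unique = {}
    for sign, words in top_words_by_sign.items():
        # Gather every word used by any of the other signs.
        others = set()
        for other_sign, other_words in top_words_by_sign.items():
            if other_sign != sign:
                others.update(other_words)
        # Keep the original order of the words that are left over.
        unique[sign] = [word for word in words if word not in others]
    return unique


# Invented lists, far shorter than a real top 50.
example = {
    "aries": ["feel", "love", "energy"],
    "libra": ["feel", "love", "balance"],
    "leo": ["feel", "party", "love"],
}

print(unique_top_words(example))
```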


Word Analysis Version 2


It struck me that several words in the top 50 – like “someone”, “really”, “quite” – were just qualifiers and not really that revealing. You’d find them in any English word analysis.


So we stripped those kinds of words out (see our stoplist). And lo! A fresh set of unique, revealing and more accurate words appeared in the top words per sign.


Horoscoped - Unique words in top 50 words in predictions of each star sign


Can I just say that I have no personal interest in horoscopes. I don’t know what the various characteristics of each star sign are meant to be. So you’ll have to tell me if any of this corresponds to folklore.


This was the data we used to create our meta-chart. Check out the final image. Or see all the data in this Google spreadsheet.



Meta-Prediction


One more thing though. This analysis appears to reveal something. The bulk of the words in horoscopes (at least 90%) are the same. That’s not a full, proper statistical analysis. (If you are a statistician and you want to do a proper analysis, please get in touch.)
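
For anyone who wants to poke at the overlap themselves, here is one crude way to put a number on it: for each sign, the fraction of its top words that also turn up in every other sign's list. This is only one of several possible measures and is not the calculation behind the 90% figure. `shared_fraction` is a hypothetical name and the words are invented.

```python
def shared_fraction(top_words_by_sign):
    """Fraction of each sign's top words that appear for every sign."""
    word_sets = {sign: set(words) for sign, words in top_words_by_sign.items()}
    if not word_sets:
        return {}
    # Words present in every single sign's list.
    common = set.intersection(*word_sets.values())
    return {
        sign: len(common & words) / len(words)
        for sign, words in word_sets.items()
        if words  # skip empty lists to avoid dividing by zero
    }


# Invented lists for illustration only.
example_lists = {
    "aries": ["feel", "love", "today", "energy"],
    "libra": ["feel", "love", "today", "balance"],
}

print(shared_fraction(example_lists))
```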


The cool thing is, once you’ve isolated the most common words, you can actually write a generic, meta prediction that would apply to all star signs, every day of the year. Here it is.


Horoscoped - Meta-prediction made from most common words in 4,000 star sign predictions


The Future


As ever, I’ve laid out my whole process and all the data here: http://bit.ly/horoscoped.

That way it’s all balanced and you can make up your own mind. Typical Libran!





Concept & research & design: David McCandless

Additional design: Matt Hancock

Additional research: Miriam Quick

Hacking: Thomas Winningham

Source: Yahoo Shine Horoscopes

Code & Scripts: https://gist.github.com/776219 and https://gist.github.com/776228

Data & workings: bit.ly/horoscoped





"