Tuesday, March 18, 2014

The Pantheon


I was reading my Sunday New York Times when my heart skipped a beat.   There in the magazine was an article ``Who's More Famous than Jesus?'' which had to, just had to, be about our Who's Bigger rankings.

Well, it wasn't.  A project at MIT called Pantheon was the source of the article.  Pantheon also analyzes Wikipedia data to rank the fame of historical figures.  I will confess to a little Schadenfreude in reading the comments complaining about their rankings, including:

  • Bias towards Americans in particular and the Western world in general.
  • Too few women among the highly ranked.
  • Occasional mechanical misclassifications of individuals, gleefully pointed out (particularly problematic was identifying John Wayne Gacy as a comedian instead of a serial killer).
  • Too much being made of small differences between the rankings of closely matched people.
  • The unreliability of Wikipedia as a source for analyzing world culture.
This all sounded very familiar, because these comments have been made about our rankings as well.

It seems worthwhile to compare our rankings and methodology with those underlying Pantheon.  There are several differences between our approaches to using Wikipedia as a resource:
  • Languages -- Pantheon makes use of the multiplicity of Wikipedia language editions in its analysis.   To be ranked as truly famous one must appear in at least 25 different language editions.  This would make the rankings more inclusive of world opinion than our English-only analysis, although reader comments still complain about the Anglo-centric bias of the results.
  • Variables -- Of the Wikipedia variables we employ in our rankings (two forms of PageRank, hits, edits, and article lengths), Pantheon employs only page hits.  Thus their notion of Fame is more akin to our notion of Celebrity (which loads heavily on hits).  Gravitas is the other component of historical significance, which we found loads most heavily on PageRank.  Thus we would expect their rankings to favor popular culture more than ours do.
  • Corrections for Time -- Pantheon employs an exponential decay model of fame in an attempt to correct for the recency bias of fame.   This overcompensates for the passage of time: six of the Pantheon top ten were ancient Greeks, with three others (Jesus, Confucius, and Julius Caesar) living 2,000 or more years ago.   The most recent member of the Pantheon top ten only gets us to the Renaissance (Leonardo da Vinci).  Our aging model is more sophisticated, and calibrated to appearances of names in 200 years of scanned books / Google Ngrams.
  • Validation -- Their website includes an analysis of how their rankings compare to performance in three sports domains: Formula 1 racing, tennis, and swimming.  Our book discusses how our rankings compare to sports statistics (particularly with respect to baseball), but we also perform a more general set of validation tests, including correlations against 35 published rankings, prices of collectables including paintings and autographs, and public opinion polls.
To their credit, their website is fun to play with and features a host of interesting visualizations.
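
To see why an exponential correction for time can overcompensate, here is a minimal sketch. It is not Pantheon's actual formula: the function name, the half-life parameter, and its value of 500 years are all assumptions chosen for illustration. The idea is that if attention is assumed to halve every fixed number of years, then undoing that decay multiplies the raw score by a factor that doubles with every half-life.

```python
import math

def age_corrected_score(raw_score, years_ago, half_life_years=500.0):
    """Undo an assumed exponential decay of attention (illustrative only).

    half_life_years is a made-up parameter, not a value taken from Pantheon.
    """
    decay = math.exp(-math.log(2) * years_ago / half_life_years)
    return raw_score / decay

# Boost applied to the same raw score at different ages.
for years_ago in (100, 500, 2000):
    print(years_ago, round(age_corrected_score(1.0, years_ago), 2))
```

Under these assumed numbers, a figure from 2,000 years ago receives a 16-fold boost over a contemporary with the same raw score, which is the kind of effect that could push ancient figures to the top of a list.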

But how good are the rankings?  It is easy to cherry-pick any set of rankings for things that look weird.  They name Rasmus Lerdorf (developer of the programming language PHP, whom I had frankly never heard of) among their top 11,000 people, on the strength of his being in more than 25 Wikipedia editions (he is actually in 31).  By comparison, we have him as the 51,670th most significant figure.  They rank Justin Bieber at 671 to our 8633, and Johnny Depp at 203 to our 2739, suggesting an overemphasis on celebrity at the expense of gravitas.
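
One rough way to size up these gaps is the ratio of the two ranks. The snippet below uses only the numbers quoted above; the helper name `rank_ratio` is mine, and a ratio of ranks is just one convenient summary, not a measure either project uses.

```python
# Ranks quoted in the post: (Pantheon rank, our rank).
ranks = {
    "Justin Bieber": (671, 8633),
    "Johnny Depp": (203, 2739),
}

def rank_ratio(pantheon_rank, our_rank):
    """How many times further down our list a person sits than on Pantheon's."""
    return our_rank / pantheon_rank

for name, (pantheon_rank, our_rank) in ranks.items():
    print(f"{name}: {rank_ratio(pantheon_rank, our_rank):.1f}x")
```

Both celebrities sit roughly 13 times further down our list than Pantheon's.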

But the right way to compare rankings is through validation measures.   This takes work, but I hope we can do such a study soon.  We will report our results here when we do.
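
As a sketch of what such a study might compute, the snippet below scores a ranking against a reference list using Spearman's rank correlation. This is an assumption about the measure, not a description of a study we have run. The function name and the toy data (people A through D) are hypothetical, and the formula as written assumes no ties.

```python
def spearman_rho(ranking_a, ranking_b):
    """Spearman rank correlation between two rankings of the same people.

    Each argument maps a name to a rank (1 = best). Only names present in
    both rankings are compared, and ranks are assumed to have no ties.
    """
    shared = sorted(set(ranking_a) & set(ranking_b))
    n = len(shared)
    if n < 2:
        raise ValueError("need at least two shared names")

    def rerank(ranking):
        # Re-rank within the shared names so ranks run 1..n in both lists.
        ordered = sorted(shared, key=lambda name: ranking[name])
        return {name: pos for pos, name in enumerate(ordered, start=1)}

    ra, rb = rerank(ranking_a), rerank(ranking_b)
    d2 = sum((ra[name] - rb[name]) ** 2 for name in shared)
    return 1 - 6 * d2 / (n * (n * n - 1))

# Toy data only: a candidate ranking versus a hypothetical reference list.
candidate = {"A": 1, "B": 2, "C": 3, "D": 4}
reference = {"A": 2, "B": 1, "C": 3, "D": 4}
print(spearman_rho(candidate, reference))
```

A value near 1 means the two lists order people almost identically, and a value near -1 means they order them in reverse. Running the same comparison for both sets of rankings against the same reference lists would show which agrees better with outside opinion.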