Thursday, March 20, 2014

Time Magazine's College Rankings

Time Magazine has launched an interactive feature ranking colleges by the prominence of the Wikipedia pages of their living graduates. Harvard appears to be the top dog by this measure, just edging past Stony Brook (which again failed to make its way into the NCAA basketball tournament, the event that inspired Time's feature).

Their ranking methodology includes certain Wikipedia variables analogous to those we have used, including article length and the number of links into and out of the page -- the link counts serving as a poor man's version of PageRank. But PageRank is a much better measure of meaningful importance: links into a page only matter if they come from prominent individuals, and links out have little obvious meaning except that their number should correlate strongly with article length.
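To make the difference concrete, here is a minimal sketch of PageRank by power iteration next to a raw count of incoming links. The page names, the toy link graph, and the damping value of 0.85 are illustrative assumptions, not data from Time's feature or from our own analysis. In this toy graph, page "B" has more incoming links than page "A", but "A" is linked from a heavily linked page and so ends up with the higher PageRank.

```python
def all_pages(links):
    """Every page that appears as a source or a target of a link."""
    return sorted(set(links) | {t for targets in links.values() for t in targets})


def in_link_counts(links):
    """Count incoming links per page, ignoring who the links come from."""
    counts = {page: 0 for page in all_pages(links)}
    for targets in links.values():
        for target in targets:
            counts[target] += 1
    return counts


def pagerank(links, damping=0.85, iterations=50):
    """Power-iteration PageRank; `links` maps each page to the pages it links to."""
    pages = all_pages(links)
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}
    for _ in range(iterations):
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # A page splits its damped rank evenly among the pages it links to.
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # A page with no outgoing links spreads its rank over all pages.
                share = damping * rank[page] / n
                for target in pages:
                    new_rank[target] += share
        rank = new_rank
    return rank


# Hypothetical toy graph: "hub" is widely linked and links only to "A";
# "B" is linked from two pages that nothing else links to.
links = {
    "fan1": ["hub"], "fan2": ["hub"], "fan3": ["hub"], "fan4": ["hub"],
    "hub": ["A"],
    "obscure1": ["B"], "obscure2": ["B"],
    "A": [], "B": [],
}

counts = in_link_counts(links)
ranks = pagerank(links)
print("in-links:", counts["A"], counts["B"])
print("PageRank:", round(ranks["A"], 3), round(ranks["B"], 3))
```

Counting links would put "B" ahead of "A"; PageRank reverses the order because it weighs each link by the standing of the page it comes from.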

The other aspect of such an analysis is properly attributing alumni to schools.   The Wikipedia categories give fairly unreliable annotations, although after checking I can confirm that Pat Benatar in fact did attend Stony Brook for a year before dropping out.  I guess we "hit her with our best shot".

Tuesday, March 18, 2014

The Pantheon

I was reading my Sunday New York Times when my heart skipped a beat. There in the magazine was an article, "Who's More Famous than Jesus?", which had to, just had to, be about our Who's Bigger rankings.

Well, it wasn't. A project at MIT called Pantheon was the source of the article. Pantheon also uses analysis of Wikipedia data to rank the fame of historical figures. I will confess to a little Schadenfreude in reading the comments complaining about their rankings, including:

  • Their bias towards Americans in particular and the Western world in general.
  • Too few women in highly ranked places.
  • Occasional mechanical misclassifications of individuals, gleefully pointed out (particularly problematic was identifying John Wayne Gacy as a comedian instead of a serial killer).
  • Too much being made of small differences between the rankings of closely matched people.
  • The unreliability of Wikipedia as a source for analyzing world culture.
This all sounded very familiar, because these comments have been made about our rankings as well.

It seems worthwhile to compare our rankings and methodology with those underlying Pantheon. There are several differences between our approaches to using Wikipedia as a resource:
  • Languages -- Pantheon makes use of the multiplicity of Wikipedia language editions in its analysis.   To be ranked as truly famous one must appear in at least 25 different language editions.  This would make the rankings more inclusive of world opinion than our English-only analysis, although reader comments still complain about the Anglo-centric bias of the results.
  • Variables -- Of the Wikipedia variables we employ in our rankings (two forms of PageRank, hits, edits, and article lengths), Pantheon employs only page hits. Thus their notion of Fame is more akin to our notion of Celebrity (which loads heavily on hits). Gravitas is the other component of historical significance, which we found loading most heavily on PageRank. Thus we would expect their rankings to emphasize popular culture more heavily than ours do.
  • Corrections for Time -- Pantheon employs an exponential decay model of fame in an attempt to correct for the recency bias of fame.   This overcompensates for the passage of time: six of the Pantheon top ten were ancient Greeks, with three others (Jesus, Confucius, and Julius Caesar) living 2,000 or more years ago.   The most recent member of the Pantheon top ten only gets us to the Renaissance (Leonardo da Vinci).  Our aging model is more sophisticated, and calibrated to appearances of names in 200 years of scanned books / Google Ngrams.
  • Validation -- Their website includes an analysis of how their rankings compare to performance in three sports domains: Formula 1 racing, tennis, and swimming.  Our book discusses how our rankings compare to sports statistics (particularly with respect to baseball), but we also perform a more general set of validation tests, including correlations against 35 published rankings, prices of collectables including paintings and autographs, and public opinion polls.
To their credit, their website is fun to play with and features a host of interesting visualizations.
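A toy model shows how an exponential time correction can overcompensate. The sketch below is not Pantheon's published formula: it assumes that attention decays as exp(-age / tau), so that undoing the decay multiplies raw attention by exp(age / tau). The function name, the time constant tau, and the three figures with their page-hit counts and ages are all made up for illustration.

```python
import math


def age_corrected(raw_attention, age_years, tau_years):
    """Undo an assumed exponential decay of attention with time constant tau."""
    return raw_attention * math.exp(age_years / tau_years)


# Hypothetical figures: (label, raw page hits, years since they lived).
figures = [
    ("modern celebrity", 1_000_000, 30),
    ("renaissance artist", 200_000, 500),
    ("ancient philosopher", 50_000, 2400),
]


def ranking(tau_years):
    """Labels ordered from highest to lowest age-corrected score."""
    ordered = sorted(
        figures,
        key=lambda f: age_corrected(f[1], f[2], tau_years),
        reverse=True,
    )
    return [label for label, _, _ in ordered]


# A short time constant lets age swamp raw attention.
print(ranking(500))
# A long time constant leaves the raw ordering nearly unchanged.
print(ranking(5000))
```

With tau set to 500 years the ancient figure jumps to the top despite having one twentieth of the celebrity's hits, while with tau set to 5,000 years the correction barely changes the order. This is why the correction needs calibrating against an external signal rather than having its constant chosen by hand.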

But how good are the rankings? It is easy to cherry-pick any set of rankings for things that look weird. They name Rasmus Lerdorf (developer of the programming language PHP, whom frankly I had never heard of) among their top 11,000 people, on the strength of being in more than 25 Wikipedia editions (he is actually in 31). By comparison, we have him as the 51,670th most significant figure. They rank Justin Bieber at 671 to our 8633, and Johnny Depp at 203 to our 2739, suggesting an over-emphasis on celebrity at the expense of gravitas.

But the right way to compare rankings is through validation measures.   This takes work, but I hope we can do such a study soon.  We will report our results here when we do.
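One common validation measure is the rank correlation between a ranking and an independent reference ordering of the same people. The sketch below computes Spearman's rank correlation in plain Python. It is a minimal illustration and not the study itself: the reference ordering and the two candidate rankings are invented numbers, and the formula used assumes there are no ties.

```python
def ranks_of(values):
    """Rank positions (1 = smallest value). Assumes all values are distinct."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0] * len(values)
    for position, index in enumerate(order, start=1):
        ranks[index] = position
    return ranks


def spearman(xs, ys):
    """Spearman's rank correlation for two equal-length lists without ties."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two lists of equal length, at least 2 items")
    n = len(xs)
    rank_x, rank_y = ranks_of(xs), ranks_of(ys)
    d_squared = sum((a - b) ** 2 for a, b in zip(rank_x, rank_y))
    return 1 - 6 * d_squared / (n * (n * n - 1))


# Hypothetical data: positions of the same five people in a reference
# ordering (for example a published list) and in two candidate rankings.
reference = [1, 2, 3, 4, 5]
ranking_a = [1, 2, 3, 5, 4]
ranking_b = [3, 1, 5, 2, 4]

rho_a = spearman(reference, ranking_a)
rho_b = spearman(reference, ranking_b)
print(round(rho_a, 2), round(rho_b, 2))  # 0.9 0.3
```

The ranking that agrees more closely with the reference gets the higher coefficient. Repeating this against many independent references gives a far fairer comparison than picking out individual odd placements.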