Wednesday, April 23, 2014

Re-ranking the Pantheon

At the suggestion of Cesar Hidalgo, the leader of the Pantheon project, we repeated our previous analysis, this time restricted to the top 1000 people in the Pantheon rankings.   This better captures the people their rankings consider important, so differences in our relative rankings become more meaningful.

First we look at the people from this pool whom our methods rank higher than Pantheon does.  By definition, all of these people are highly regarded by both rankings.  It is clear that we rank American and British leaders higher than they do, because we analyze only the English Wikipedia:

Diff     Panrank BigRank Name                              Who's Dat?
860      907     47      Woodrow Wilson                    U.S. President 
841      996     155     Edward I of England              British King
776      961     185     Leonhard Euler                       Mathematician
674      697     23      Theodore Roosevelt                 U.S. President
634      799     165     John Milton                             British Poet/Philosopher
600      985     385     Alexander II of Russia            Russian Czar
583      789     206     Edward VI of England           British King
556      666     110     Dwight D. Eisenhower           U.S. President
553      970     417     John Dewey                            American Educator
550      954     404     Alexander I of Russia             Russian Czar
542      636     94      Harry S. Truman                      U.S. President                    
539      654     115     Bill Clinton                              U.S. President
538      889     351     Francis I of France                  French King
536      936     400     Soren Kierkegaard                  Danish Philosopher  
530      563     33      Charles Dickens                       British Writer
524      594     70      William the Conqueror             British King
509      815     306     Jacques Cartier                         French explorer of America
505      742     237     Henry IV of France                 French King
503      677     174     Geoffrey Chaucer                    British Writer
498      616     118     Lewis Carroll                           British Writer
495      762     267     Alfred the Great                       British King
486      962     476     Eleanor of Aquitaine                French/British Queen Consort
446      809     363     George H. W. Bush                  U.S. President
442      983     541     Archduke Franz Ferdinand      Proximate cause of WWI
441      900     459     John Wayne                             U.S. actor and "Duke"
439      545     106     Alexander Graham Bell           Inventor of the telephone

Still, these are figures who are generally quite familiar to me: I've heard of all of them, although I would not be confident in my ability to tell one Alexander from the other.  By contrast, among the figures they rank much higher than we do, there are several whom I could not place, or whom I would place as celebrities more than historical figures:

-7960    673     8633    Justin Bieber                             Teenaged popular singer
-8008    943     8951    Haruki Murakami                     Japanese novelist
-8460    850     9310    Carus                                          Short-ruling Roman Emperor
-8463    765     9228    Antisthenes                               Greek Philosopher
-8601    880     9481    Jenna Jameson                          American porn star
-8630    734     9364    Anacreon                                   Greek Poet
-8746    363     9109    Anaximenes of Miletus          Greek Philosopher
-8836    352     9188    James, son of Alphaeus          One of Jesus' twelve apostles
-8932    919     9851    Polykleitos                                Greek sculptor
-9008    934     9942    Lysippos                                    Greek sculptor
-9674    851     10525   Carinus                                     Roman Emperor with Carus (above)
-9866    671     10537   Hor-Aha                                    Egyptian Pharaoh
-10628   920     11548   Kaka                                         Brazilian soccer player
-10696   775     11471   Orhan Pamuk                          Turkish novelist
-11153   839     11992   Abu Nuwas                              Classical Arabic poet
-11722   906     12628   Trebonianus Gallus               Short-ruling Roman Emperor
-11771   560     12331   Praxiteles                                 Greek sculptor
-11834   368     12202   Vitellius                                   Very short-ruling Roman Emperor
-13291   607     13898   Gaius Maecenas                      Roman political advisor
-14507   701     15208   Milan Kundera                        Contemporary Czech novelist
-14571   843     15414   Emir Kusturica                        Bosnian filmmaker
-16783   610     17393   Paulo Coelho                          Brazilian novelist
-19060   820     19880   Monica Bellucci                     Italian actress and model
-21652   737     22389   Francois Villon                       French poet of the Middle Ages
-22604   974     23578   Pedro Almodovar                   Spanish Film director
-22754   935     23689   Quintillus                                Short-lived Roman Emperor
-26427   963     27390   Jean Reno                                French actor

This roster makes clear the differences in our models for aging historical reputations.   About half of these historically-overvalued people are relatively minor figures from ancient times: short-lived Emperors and second-tier philosophers/poets/artists.  Many of the rest are contemporary celebrities who don't really belong in anyone's top thousand historical figures, like porn star Jenna Jameson.

There are also a few international artists of real stature (including Orhan Pamuk, Milan Kundera, and Pedro Almodovar) who might be undervalued by the English Wikipedia relative to international editions.  Still, I think our rankings place them in the right order of magnitude.

Wednesday, April 16, 2014

Ranking the Pantheon

A previous post described the MIT Pantheon, another project which used Wikipedia data to rank historical figures.   We (meaning Charles, of course) extracted their rankings and matched them to our historical significance rankings, so we could compare them.   There is some subtlety in algorithmic name matching, such as determining whether our "Jesus" is the same person as their "Jesus Christ", but we succeeded in matching 10,116 of the Pantheon names to our Who's Bigger rankings.  This is roughly 90% of the total, providing a reasonable basis for comparison.
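To give a flavor of the name-matching problem, here is a minimal sketch in Python. It is not the code Charles actually used: the normalization rules and the tiny `ALIASES` table are hypothetical, chosen only to show how "Jesus Christ" and "Jesus" might be reconciled and how accented spellings can be folded together.

```python
import re
import unicodedata

# Hypothetical alias table: normalized variant -> canonical normalized name.
ALIASES = {"jesus christ": "jesus"}


def normalize(name):
    """Lowercase, strip accents and punctuation, collapse spaces, apply aliases."""
    decomposed = unicodedata.normalize("NFKD", name)
    no_accents = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    cleaned = re.sub(r"[^a-z0-9 ]+", " ", no_accents.lower())
    key = " ".join(cleaned.split())
    return ALIASES.get(key, key)


def match_names(pantheon_names, our_names):
    """Return (matched pairs, unmatched Pantheon names) using normalized keys."""
    index = {normalize(n): n for n in our_names}
    matched, unmatched = [], []
    for name in pantheon_names:
        key = normalize(name)
        if key in index:
            matched.append((name, index[key]))
        else:
            unmatched.append(name)
    return matched, unmatched
```

Exact matching on normalized keys is deliberately conservative: it will miss some true matches, but it rarely pairs two different people, which matters more when the matched ranks are going to be compared.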

First off: it is clear that there is substantial agreement between our placements of historical figures, with a Spearman rank correlation of 0.65 between us and them.   Both sets of rankings incorporate aging as part of the methodology, so much of this agreement rests on our shared preference for the tried and true.  The Who's Bigger rankings of these figures have a rank correlation of 0.58 with year-of-birth (older historical figures being more highly ranked), while the comparable number is 0.53 with Pantheon.
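For readers who want to see what a Spearman rank correlation involves, here is a minimal sketch in Python. It assumes no tied values (true of two strict rankings) and uses the textbook formula 1 - 6*sum(d^2)/(n(n^2-1)), where d is the difference between an item's two positions after re-ranking within the matched set. The function names are ours for illustration; this is not the code behind the numbers above.

```python
def to_ranks(values):
    """Positions 1..n by ascending value; assumes no ties."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0] * len(values)
    for position, i in enumerate(order, start=1):
        ranks[i] = position
    return ranks


def spearman(xs, ys):
    """Spearman rank correlation for two equal-length, tie-free sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    rx, ry = to_ranks(xs), to_ranks(ys)
    d_squared = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1 - 6 * d_squared / (n * (n * n - 1))
```

Re-ranking inside the matched set matters here, because the raw ranks come from lists of very different lengths.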

More revealing is to look at the extremes: the figures to whom we assign very different ranks than they do.   In particular, we computed the difference between our ranks (Pantheon - us) and present the figures with the largest and smallest differences.   This is not a perfect statistic, since Pantheon ranks fewer than 12,000 people while our numbers go well above 800,000.   But it is revealing nonetheless.
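The difference statistic is simple enough to sketch in a few lines of Python. The tuple layout and function names below are assumptions made for illustration; the sample rows in the usage note are taken from the tables in this post.

```python
def rank_differences(matched):
    """matched: iterable of (name, pantheon_rank, our_rank).

    Returns (diff, pantheon_rank, our_rank, name) rows, largest diff first,
    where diff = pantheon_rank - our_rank.
    """
    rows = [(p - o, p, o, name) for name, p, o in matched]
    rows.sort(key=lambda row: row[0], reverse=True)
    return rows


def extremes(matched, k):
    """Return the k largest and the k smallest differences."""
    if k <= 0:
        return [], []
    rows = rank_differences(matched)
    return rows[:k], rows[-k:]
```

Calling `extremes` with k=1 on the rows for John Marshall (10521, 401), Donald Bradman (11184, 1126), Alexandra Stan (8052, 480293) and Petr Jiracek (11270, 814711) puts Marshall at the positive extreme and Jiracek at the negative one. A positive difference means we rank the person higher than Pantheon does.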

Diff       Panrank  BigRank       Name                                    Who's Dat?
10120     10521       401        'John Marshall'                            Chief Justice of the US Supreme Court
10058     11184     1126        'Donald Bradman'                       Great Cricket champion
10027     10823       796        'William H. Seward'                    U.S. Secretary of State (bought Alaska)
9963       11077     1114        'Gough Whitlam'                         Australian Prime Minister
9933       10812       879        'John Churchill 1st Duke of Marlborough'     English Statesman
9915       10802       887        'George Washington Carver'       African-American Inventor
9886       10405       519        'Tipu Sultan'                                Ruler of the Kingdom of Mysore
9735       10146       411        'John Jay'                                     Early U.S. Statesman
9536        9935        399         'John C. Calhoun'                       U.S. Senator /VP (nullification)
9454        9886        432         'Susan B. Anthony'                     U.S. Suffragist (women's right to vote)
9439      11243      1804         'Alexander Mackenzie'                Second Prime Minister of Canada
9243      10064        821         'Abigail Adams'                           Wife of President John Adams
9215      10729      1514         'Robert Menzies'                          Longest serving Australian Prime Min.
9207      10917      1710         'Robert Byrd'                               Long-serving U.S. Senator
9175      10406      1231         'Sojourner Truth'                         African-American abolitionist
9171      10562      1391         'Lucille Ball'                                TV Comedian (I Love Lucy)
9171        9330        159         'John A. Macdonald'                   First Prime Minister of Canada
9165      10466      1301         'Edmund Barton'                         First Prime Minister of Australia
9130      10318      1188         'Mary Todd Lincoln'                   Wife of President Abraham Lincoln
9008      10086      1078         'Svetlana Kuznetsova'                 Russian tennis star

Almost all of these figures are from the English-speaking world: United States, Canada, Australia, Great Britain.   It is no surprise that our methods (which only analyze the English language Wikipedia) generally rank these people higher than Pantheon (which analyzes editions from all languages).  I personally recognize 14 of the 20 names here, and think they are generally quite Big, although I cringe a bit where some of our rankings are clearly too high (particularly Sultan and Kuznetsova).

The major American figures here are generally from the 19th century, which makes sense given the difference between our aging model and the one employed in Pantheon (full disclosure: Pantheon has recently changed its rankings, and what we have here may not be their current rankings).   In particular, our rankings have fully discounted a historical figure 160 years after birth, while theirs continued historical discounting arbitrarily far into the past.   Thus 19th century figures have generally achieved steady state by our analysis, so we value them relatively higher than Pantheon would.

The other side of the coin is the people whom Pantheon ranks very much higher than we do.   The figures below all ranked in the bottom half of Wikipedia figures by our analysis, yet were identified by Pantheon among the 12,000 most interesting figures for analysis:

Diff            Panrank  BigRank       Name                                    Who's Dat?
-472241      8052         480293  'Alexandra Stan'                       Romanian singer and model
-484757      11086      495843  'Serge Haroche'                         French Nobel Prize winner in Physics, 2012
-493874      9471        503345  'Lola Pagnani'                            Italian actress
-495688      11148      506836  'Stephane Lannoy'                    French soccer referee
-497360      10133      507493  'Olivier Giroud'                         French soccer player
-517354      11160      528514  'Wouter Weylandt'                   Belgian professional cyclist killed in 2011
-525449      9576        535025  'Nathalia Dill'                            Brazilian television actress
-525475      10601      536076  'Milos Zeman'                           Current president of the Czech Republic
-525633      11232      536865  'David J. Wineland'                  Nobel Prize winner in Physics, 2012
-526148      10774      536922  'Gianluca Ramazzotti'             Italian actor
-555909      11029      566938  'Linda Maria Baros'                 Contemporary French poet
-558970      10942      569912  'Jules A. Hoffmann'                  French Nobel Prize winner in Medicine, 2011
-573789      11144      584933  'Pastora Soler'                           Spanish Eurovision singer
-581161      11286      592447  'Sun Yang'                                Chinese Olympic swimmer
-601660      10310      611970  'Kevin Grosskreutz'                German soccer player
-607491      11318      618809  'Missy Franklin'                      American Olympic Swimmer, 2012
-613223      11278      624501  'Brian Kobilka'                        American Nobel Prize winner in Chemistry, 2012
-632278      11224      643502  'Lobsang Sangay'                    Prime minister in exile for Tibet
-685152      10556      695708  'Berenice Bejo'                         French-Argentine actress
-689256      11296      700552  'Vaclav Pilar'                            Czech soccer player
-693543       9577       703120  'Raphael Varane'                      French soccer player
-717448       11231     728679  'Ludmilla Radchenko'            Russian model and actress
-751460       10907     762367  'Anton Lamazares'                   Contemporary Spanish painter
-803441       11270      814711  'Petr Jiracek'                             Czech soccer player

These people are generally Europeans, who have the easiest time rising to the Pantheon Wikipedia language threshold.   They are also all very contemporary figures, many of whom achieved their greatest renown for achievements occurring after the Wikipedia edition we analyzed in our rankings (October 11, 2010), so presumably they would be ranked somewhat higher if we reran our analysis today.

However, I personally recognized only one name here, and it required some prompting. Berenice Bejo was the lead actress in "The Artist" which, by the way, was a wonderful picture.   These people would generally not be in my 12,000 most significant (or famous) historical figures, but Pantheon's objectives are somewhat different than ours.   My guess is that both groups are content with our ranking differences given our different motivations.

Monday, April 7, 2014

Big Data Done Wrong?

An Op-Ed piece in today's New York Times by Gary Marcus and Ernest Davis presents Who's Bigger as the seventh of eight (or nine) problems with Big Data, specifically "giving scientific-sounding solutions to hopelessly imprecise questions".  They acknowledge that we get many things right, but complain about "egregious errors".

But guys: given a 379-page book with thousands of rankings to pick from, your killer example is that we ranked Francis Scott Key at position 19 on the poets list?   If you don't have a complaint until position 19 on one of several dozen tables in our book, well, we must be doing pretty darn good.

But their chosen example is illuminating, because it gets to the heart of what our rankings are and are not designed to do.  Our book carefully claims to measure "historical significance" or "meme strength", not "importance", as they insist on misrepresenting it in the article.

So how historically durable will the Francis Scott Key meme be, say 100 years from now?   If there is still a United States stuck with the same national anthem (I'd take that bet), then we can be pretty certain the Marcus and Davis great-great-great-grandchildren will learn Key's words and the story behind his work.

"Oh say can you see?"  Only if you are willing to look at what data is actually trying to tell you.