Saturday, March 9, 2013

Impressions: How to Lie with Statistics

I just finished reading the old classic "How to Lie with Statistics" (Linky), a good basic intro to statistics for non-statisticians. It is not a sophisticated text, and is meant to give people from all walks of life a simple hold on basic statistical maladies.

The book starts with the premises that: "The secret language of statistics, so appealing in a fact-minded culture, is employed to sensationalize, inflate, confuse, and oversimplify. Statistical methods and statistical terms are necessary in reporting the mass data of social and economic trends, business conditions, "opinion" polls, the census. But without writers who use the words with honesty and understanding and readers who know what they mean, the result can only be semantic nonsense.", and "A well-wrapped statistic is better than Hitler's "big lie"; it misleads, yet it cannot be pinned on you." In this direction, the book illustrates and explores issues in biased sampling, systematic/interviewer bias in opinion polls, the empirical meaning of probability as the frequency of occurrences of the said event, convergence issues, missing p-values in some of the grand claims in media, going from statistical mean to realization values, misleading graphics, unmarked axes, missing ranges/spreads instead of a single mean value, strawman arguments, fudging figures to report one's point-of-view, how so much sensationalism is tied to politicking, misleading comparisons in the before-and-after genre, disease/epidemic statistics, etc.

The grand-daddy of them all is the classical post hoc fallacy (Linky): attributing the wrong meaning to a statistic by not understanding the difference between correlation and causation. For example, the standard fare, the smoking and low grades example, could be read as "smoking causes poor grades," or "poor grades cause people to smoke," or as showing no causal relation either way. Choose the one that pleases your ideology, agenda and propaganda intent! One would assume that anyone making strong causal attributions would bring in solid evidence (and if they cannot, one would hope that they would remain agnostic), but in this world of 140-character attention spans, even good old statisticians fudge data. Lamentable, but such is the sad nature of the game that is life! While good intentions may lead good people to make unneeded causal attributions, statistically it is still a falsity.

I will leave the wide variety of stories that Darrell Huff describes to the reader's imagination. But being a book from the 1950s, it illustrates, even today, far more about the US of the 50s than people might care to see. Here are some of my impressions, not necessarily statistical, but impressions in any case:
1) Today, a mean stands for the arithmetic mean unless explicitly stated otherwise, and it is rather difficult to imagine readers confusing the mean with the median or the mode. Surprisingly, that confusion was common in the 50s, when the three terms were often used interchangeably, possibly because of the new-found fancy for the Gaussian distribution, where the three quantities coincide. It must be noted that the mean is preferred when the data are spread fairly evenly around the center with no extreme values, whereas the median is preferred in scenarios with outliers or heavy skew, while the mode can do what neither can in the case of categorical variables.
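
To make the difference concrete, here is a minimal Python sketch. The income figures are invented purely for illustration and do not come from the book; the only thing assumed is Python's standard `statistics` module.

```python
from statistics import mean, median, mode

# Hypothetical yearly incomes (in thousands) for ten households:
# most earn modestly, one earns far more than the rest.
incomes = [20, 20, 20, 25, 30, 35, 40, 50, 60, 700]

avg = mean(incomes)      # arithmetic mean, dragged upward by the one outlier
mid = median(incomes)    # middle value, barely moved by the outlier
common = mode(incomes)   # most frequent value

print(avg, mid, common)  # 100 32.5 20

# The mode is the only one of the three that works on categories.
# (Hypothetical survey answers.)
favorite = mode(["tea", "coffee", "tea", "juice"])
print(favorite)          # tea
```

All three numbers are a legitimate "average" of the same ten incomes, which is exactly why a report that does not say which one it used can be made to tell very different stories.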

2) The Democrat-Republican divide we see today in American politics, with the lower middle-class and poor being the captive votebank of the Democrats, and the upper middle-class and the rich elites being the captive votebank of the Republicans, seemingly stretches back to the 30s and the Roosevelt vs Landon vote. In fact, Literary Digest's famous flop-show due to biased sampling (Linky) beats the Dewey-Truman spectacle (Linky) by a mile in terms of remarkable statistical lessons from the 20th century.
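
The mechanics of a biased sample are easy to simulate. The following is only a sketch under invented assumptions: the 30% "wealthy" share and the 70/30 voting preferences are made-up numbers, not the actual 1936 figures, and `make_voter` and `landon_share` are hypothetical helper names.

```python
import random

random.seed(1936)  # fixed seed so the sketch is repeatable

def make_voter():
    """Return (is_wealthy, votes_landon) under made-up proportions."""
    is_wealthy = random.random() < 0.30       # assumption: 30% are wealthy
    p_landon = 0.70 if is_wealthy else 0.30   # assumption: preferences by group
    return is_wealthy, random.random() < p_landon

population = [make_voter() for _ in range(200_000)]

def landon_share(voters):
    """Fraction of the given voters who pick Landon."""
    return sum(votes_landon for _, votes_landon in voters) / len(voters)

# Fair poll: every voter is equally likely to be asked.
fair_sample = random.sample(population, 10_000)

# Biased poll: only voters who appear on a "wealthy" list are asked.
wealthy_only = [v for v in population if v[0]]
biased_sample = random.sample(wealthy_only, 10_000)

true_share = landon_share(population)
fair_share = landon_share(fair_sample)
biased_share = landon_share(biased_sample)

print(f"true: {true_share:.3f}  fair poll: {fair_share:.3f}  "
      f"biased poll: {biased_share:.3f}")
```

Under these assumptions the fair poll lands close to the true share, while the biased poll confidently predicts a Landon landslide. Making the biased sample larger does not help; it only produces the wrong answer with more apparent precision.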

3) The book stresses the simple statistical intuition that excessive precision in reported statistical quantities often goes hand-in-hand with "cooking the books." This is a very important lesson for today's big data applications that are over-hyped, oversold, and oversimplified. Notwithstanding the fact that data mining and analytics can indeed bring in some benefits to a priori (non-quantitative) formulations, expecting precise answers from high-dimensional data mining and model fitting is just that: a big fat joke sold in search of Series B or C funding from gullible VCs or grant funds from well-meaning philanthropists or agenda-driven organizations.

4) In terms of medical treatments, we hear nuggets of truism perpetuated by the foibles of the irrational human mind under suffering, pain and angst: "The guilt does not always lie with the medical profession alone. Public pressure and hasty journalism often launch a treatment that is unproved, particularly when the demand is great and the statistical background hazy." "As Henry G. Felsen, a humorist and no medical authority, pointed out quite a while ago, proper treatment will cure a cold in seven days, but left to itself a cold will hang on for a week." This reminds me of the Tamiflu scam and the associated famous Cochrane Review (Linky). In the genre of statistical mis-statements, let me add one more: Tamiflu is not efficacious and is overpriced at $30 a cycle; that comes with the firm backing of a sample size of two! With the Cochrane Review, one could have made the sample size two thousand and not at all be surprised :).

5) In the same vein is the section on accident statistics and the remarkable irrationality of humans about them; even the educated sound statistically stupid at times. As mentioned in one of my earlier posts, traffic accidents kill more people in India than a terror attack could or a nuclear accident might. Yet, we have more educated people fighting against nuclear plants in India today than campaigning for road safety guidelines. The less said about the business of terrorism, the better. Terrorism is business not only for the terrorists, but also for the counter-terrorists. That is a truism!

6) On truisms, some simple (yet profound!) points made in the book include: i) Nearly everybody could be below average, ii) It is dangerous to mention any subject having high emotional content without hastily saying whether you are for or agin it, iii) A difference is a difference only if it makes a difference, iv) The fact is that, despite its mathematical base, statistics is as much an art as it is a science. A great many manipulations and even distortions are possible within the bounds of propriety. And so on.
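
The first of these sounds like a paradox but is plain arithmetic. A tiny Python sketch, with salary figures invented for illustration (they are not from the book):

```python
from statistics import mean

# Hypothetical firm: 99 employees earn 30 each, the owner earns 3030.
salaries = [30] * 99 + [3030]

average = mean(salaries)                         # (99 * 30 + 3030) / 100 = 60
below = sum(1 for s in salaries if s < average)  # employees under the mean
share_below = below / len(salaries)

print(average, share_below)  # 60 0.99
```

Ninety-nine of the hundred people earn less than the "average" salary, so the statement is true of the arithmetic mean whenever a few large values pull it away from the bulk of the data.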

7) Even popular best-sellers of the 50s liberally used the word "Negroes" to describe African-Americans, without any sense of compunction and/or morality. The simple fact is that treating fellow citizens as second-class for a loooong time cannot go hand-in-hand with claims to being extraordinarily exceptional. And sadly, the impact of American exceptionalism on the emerging Indian consciousness of today cannot do any overall good for the India of tomorrow except to browbeat and forget history as it happened, and replace it with a pithy 140-character land of milk, honey and dreamz unlimited-type summation.

7b) On that note, a better appreciation of the universal adult franchise that is part and parcel of the Indian Constitution can be had when one realizes that India is perhaps one of the few countries in the world that started off with universal suffrage right from the day the first (and only!) Constitution was promulgated. The world's self-appointed greatest democracy did not. Nor did the UK or much of Europe, Asia, Africa, or the Americas. In fact, women first got the right to vote in New Zealand in 1893, just before the turn of the 20th century. The sad story of discrimination against fellow citizens around the Indian neighborhood, and the associated woes, can be well understood when one puts the Indian model side-by-side with the competing ones in the neighborhood. All that does not discount the simple fact that India still needs to compare itself with the India of yesterday and not with some utopia from elsewhere. In that direction, the distance to cover is still and will remain unlimited.

8) And finally, the moral of the story, as I see it: Given that India is the home of fundamental innovations and seminal contributions in large scale sampling techniques (Linky), it is remarkable that Indians still cannot "predict" their electoral outcomes with a measure of accuracy that is acceptable under the constraints that go with the multitudinous cacophony that is Indian democracy. Instead, India apes the US not in being the land of the brave and the home of the free, but in being the land of the gruff and the home of the fluff.
