## Lies, Damn Lies and Election Statistics

Posted by Deliverator on October 21st, 2008

I’ve been trying to get a sense of the likely results in the Senate races and happened to come across FiveThirtyEight, a site which mixes polling data with statistical acumen and advanced computer modeling to predict election outcomes. Unlike many other sites replete with election graphs and maps, the site’s producers actually show you how the sausage gets made. I would be curious to hear from my brother and my friend Jason an analysis of the validity of their methodologies.

Their conclusions, broadly, are that Obama is almost certain to win the election and that it will border on a landslide in the electoral vote, with the popular vote being won by a slimmer margin. In the Senate, Democrats are quite likely to gain 6-8 seats, but have only something like one chance in three of gaining a 60-seat, filibuster-proof working majority, especially factoring in fickle, nominal Democrats like Joe Lieberman. This prospective outcome makes substantial legislative reforms under a first-term Obama administration less likely.

If you don’t want to read all of the technical jargon below, my takeaway is that it is pretty good, but likely has a bias toward the current poll leader (Barack) because of a technical inadequacy, which I discuss below. An alternative to polling data is to look at results in prediction markets such as Intrade, which run futures markets on event outcomes such as elections. Right now Intrade places the odds of Barack winning at 84.5%, which is lower than the 93.4% currently given by FiveThirtyEight.com.

I read the FAQ of the website, which would be more appropriately named “Methodology”. At face value the statistical analysis is fairly robust, and certainly significantly more robust than what is portrayed in the mainstream media, but it fails in a few respects. Simulation is a valid statistical technique for determining the probability distribution of a random variable which is a function of many other random variables. What he has done is to model the primary variable as a function of the sampling distributions of other variables (mainly the sampling distributions of individual states) and then use a technique called bootstrapping.

However, he failed to consider the covariance of sampling errors. He has implicitly assumed that the error terms of the individual observations are independently distributed. This is pure folly. Anyone with any reasonable sense of politics knows that certain states behave very similarly to each other and very differently from other groups of states. Prime example: the rustbelt states (Michigan, Ohio, Pennsylvania) move together as a group but are largely independent of the treehugger states of the Pacific Northwest. Failure to properly estimate the correlation structure of the error terms will certainly skew the results; the question is in which direction the bias runs. I believe it likely favors the leader (Obama), because it underestimates the likelihood of multiple correlated states moving together. This follows from the variance of a sum of correlated variables, Var(Z) = Var(X) + Var(Y) + 2*COV(X,Y), where Z = X + Y, which could be generalized to 50 variables (states) with a covariance matrix. Leaving out positive covariance terms understates the variance of the total, which makes the current lead look safer than it is. It is interesting that he didn’t model this given the sophistication of the rest of the analysis, in which he clearly demonstrated that he knows the statistical properties of correlation. For instance, the methodology he used to update polls for trend (since the poll was taken) uses a correlation structure.
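To make the covariance point concrete, here is a minimal sketch in Python of the kind of simulation described above. It is not the site’s actual model: the six states, their poll margins, their electoral votes, the polling error `sd`, and the correlation `rho` are all made-up numbers, and the function name `simulate_electoral_votes` is my own. It simply draws state-level polling errors from a multivariate normal, once with independent errors and once with a shared correlation, and compares the results.

```python
import numpy as np

def simulate_electoral_votes(margins, votes, sd, rho, n_sims=20000, seed=0):
    """Simulate the leader's electoral-vote total under correlated poll errors.

    margins: leader's poll lead in each state, in points (hypothetical).
    votes:   electoral votes of each state (hypothetical).
    sd:      assumed standard deviation of each state's polling error.
    rho:     assumed common correlation between any two states' errors.
    """
    rng = np.random.default_rng(seed)
    k = len(margins)
    # Equicorrelated covariance matrix: sd^2 on the diagonal, rho*sd^2 elsewhere.
    cov = sd**2 * (rho * np.ones((k, k)) + (1.0 - rho) * np.eye(k))
    errors = rng.multivariate_normal(np.zeros(k), cov, size=n_sims)
    leader_wins = (margins + errors) > 0          # shape (n_sims, k)
    return leader_wins.astype(int) @ votes        # total votes per simulation

# Made-up example: six states where the leader is ahead by a few points.
margins = np.array([4.0, 3.0, 2.5, 2.0, 3.5, 1.5])
votes = np.array([17, 20, 21, 11, 10, 15])
needed = votes.sum() // 2 + 1                     # majority of these states' votes

independent = simulate_electoral_votes(margins, votes, sd=4.0, rho=0.0)
correlated = simulate_electoral_votes(margins, votes, sd=4.0, rho=0.8)

if __name__ == "__main__":
    for name, totals in [("independent", independent), ("correlated", correlated)]:
        print(name, "variance:", totals.var(),
              "P(leader wins):", (totals >= needed).mean())
```

Under these assumptions the correlated run should show a wider spread of totals and a lower win probability for the leader than the independent run, which is the direction of bias argued above. How large the effect is depends entirely on the assumed `sd` and `rho`.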

He also states that he believes the election is a mean-reverting stochastic process and presents some compelling evidence that it is. In plain English, the current leader’s margin deteriorates as election day approaches. He does make an adjustment for this, but I believe there is a better way to estimate the mean-reversion tendency. I would use panel data from the polling of past Presidential elections in a regression to estimate the stochastic differential equation dX(t) = lambda*[0.5 - X(t)]dt + sigma*dZ(t), where lambda is the speed of mean reversion, sigma is the volatility (the scale of the random shocks), and Z(t) is a Brownian motion. This point is more nitpicking than a flaw with a definite bias.
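As a rough illustration of that regression, the sketch below discretizes the equation as X(t+dt) - X(t) = lambda*[0.5 - X(t)]*dt + noise and fits lambda by pooled least squares through the origin. I don’t have the historical polling panel to hand, so the “panel” here is simulated with made-up parameters (a true lambda of 2.0 per year, a true sigma of 0.05, daily steps, a leader starting at 56%); the function names are mine and hypothetical, and a real analysis would swap in actual poll series.

```python
import numpy as np

def simulate_paths(lam, sigma, x0, dt, n_steps, n_paths, seed=0):
    """Euler-Maruyama paths of dX = lam*(0.5 - X)dt + sigma*dZ (stand-in data)."""
    rng = np.random.default_rng(seed)
    x = np.empty((n_paths, n_steps + 1))
    x[:, 0] = x0
    for t in range(n_steps):
        shock = rng.normal(0.0, np.sqrt(dt), size=n_paths)  # Brownian increment
        x[:, t + 1] = x[:, t] + lam * (0.5 - x[:, t]) * dt + sigma * shock
    return x

def estimate_mean_reversion(paths, dt):
    """Pooled regression of dX on (0.5 - X), with no intercept.

    Returns (lambda_hat, sigma_hat).
    """
    gap = (0.5 - paths[:, :-1]).ravel()      # distance from 50% before each step
    dx = np.diff(paths, axis=1).ravel()      # change over each step
    slope = (gap @ dx) / (gap @ gap)         # OLS slope, equals lambda * dt
    resid = dx - slope * gap
    return slope / dt, resid.std() / np.sqrt(dt)

# Simulated "panel": 200 hypothetical races, 200 daily observations each.
dt = 1.0 / 365.0
paths = simulate_paths(lam=2.0, sigma=0.05, x0=0.56, dt=dt,
                       n_steps=200, n_paths=200)
lam_hat, sigma_hat = estimate_mean_reversion(paths, dt)

if __name__ == "__main__":
    print("estimated lambda:", lam_hat, "estimated sigma:", sigma_hat)
```

Because the data are generated from the same discretized equation that is being fitted, the estimates should land close to the true values; real polls would add sampling noise on top of the process, which this sketch ignores.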

For more on these topics read:

Bootstrapping