
November 18, 2004

The UCal Berkeley Report

First, one point I should have made more clearly in previous posts: The absence of significant evidence of fraud in exit polls does not prove the absence of fraud. When Warren Mitofsky says he sees no greater deviations for any particular type of voting equipment, he means he sees no differences big enough or widespread enough to be statistically meaningful. If vote fraud occurred in just a few counties in one state, the exit polls may have lacked the statistical power to detect it. Failing to detect a real effect because of that lack of power is what statisticians call a “Type II error.”

Which brings me to the U.Cal Berkeley report (now available here; you need to scroll down to the link, “The Effect of Electronic Voting…”). The first thing to understand about the report, in the context of my recent posts, is that it has nothing to say about exit polls. It depends instead on a statistical analysis of county-level voting patterns.

“Observer,” a commenter on my last post, made that point and also did a nice summary of the report's findings:

They are using multivariate linear regressions to explain voting patterns in Florida, and are finding a very statistically significant correlation between the presence of electronic voting and a higher percentage for Bush.

The paper has undergone some peer review prior to publication. It mentions two concerns that were raised about the methodology, and shows that when those concerns were addressed the findings did not change substantially.

Of course, now that the paper has been released on the Internet, it will be subject to a much wider review by others with far more expertise in statistical modeling than I can offer.

For now, just keep in mind that it is possible that the Berkeley report detected a discrepancy that the Florida exit poll missed, given the size of the discrepancy and the number of precincts sampled. It is also possible that the full report on the Florida exit poll will contradict the Berkeley finding. Once again, without more specific data from NEP, we really cannot say for sure.
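
To put rough numbers on the statistical-power point, here is a minimal sketch in Python. It uses a textbook normal approximation for a simple random sample, which ignores the clustered design of real exit polls. The function name detection_power and every input value are illustrative assumptions, not NEP figures.

```python
import math
from statistics import NormalDist


def detection_power(p, shift, n, alpha=0.05):
    """Approximate chance that a two-sided z-test flags a shift in a proportion.

    p     : expected share under "no discrepancy" (hypothetical)
    shift : true discrepancy in the share (hypothetical)
    n     : number of respondents, treated as a simple random sample
    """
    std_normal = NormalDist()
    z_crit = std_normal.inv_cdf(1 - alpha / 2)
    se = math.sqrt(p * (1 - p) / n)  # same standard error used under both hypotheses
    z_shift = shift / se
    # Probability the test statistic lands in either rejection tail.
    return std_normal.cdf(z_shift - z_crit) + std_normal.cdf(-z_shift - z_crit)


# Hypothetical 2-point discrepancy around a 50% share, at two sample sizes.
for n in (500, 5000):
    power = detection_power(0.5, 0.02, n)
    print(f"n={n}: power={power:.2f}, chance of a Type II error={1 - power:.2f}")
```

Under these assumptions a small shift is easy to miss with a few hundred interviews and much harder to miss with several thousand, which is the sense in which a localized discrepancy could slip past an exit poll.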

Corrected misspellings of Berkeley 11/19.


Posted by Mark Blumenthal on November 18, 2004 at 06:20 PM in Exit Polls | Permalink


I'm missing something here -- why is it "possible that the full report on the Florida exit poll will contradict the Berkeley finding"?

If the exit poll study shows that there is no correlation between an increase in people who say they voted for Bush in 2004 and in 2000 and the type of machine they voted on (the constructs of the Berkeley study), why does that disprove the Berkeley study?

Isn't the Berkeley study, which used actual votes, more accurate than an exit pollster's estimate of the votes?

Posted by: Anon | Nov 18, 2004 6:34:47 PM


"Electronic voting raised President Bush's advantage from the tiny edge he held in 2000 to a clearer margin of victory in 2004. The impact of the e-voting was not uniform, however. Its impact was proportional to the Democratic support in the county, i.e., it was especially large in Broward, Palm Beach, and Miami-Dade."

Isn't there a fairly obvious plausible reason that Bush showed especially large improvements in these counties? Namely, that they have a large number of Jewish voters who like Bush's support of Israel?
(Since the Census Bureau does not ask about religion it is hard to get exact figures to use in a regression analysis of this factor.)

Posted by: David T | Nov 18, 2004 9:51:11 PM

One possible omitted variable from the Berkeley study that immediately suggests itself is the lack of any control variable for urban versus rural counties. We know, if we can believe the exit polls, that nationally Bush substantially increased his vote share among urban voters relative to rural voters. Since Florida touchscreen voting is concentrated in urban areas, it appears the Berkeley folks have a potential omitted variable problem.

Now before throwing the whole Berkeley study out the window, it may be that the national urban/rural numbers are driven by Bush's improved performance in the non-swing Northeastern states. We know Bush did much better against Kerry than against Gore in New Jersey, but in Ohio we know Bush's victory had a distinctly rural flavor and the urban areas came out big for Kerry.

So until the Berkeley authors take a stab at the urban/rural issue, I would reserve judgement on their findings.

Posted by: ftm | Nov 18, 2004 10:21:25 PM

It's possible that there is some rural/urban issue, but the variables used by the Berkeley group include county population, which in principle could take care of this.

I have found in my own analysis that pre-election polls in Florida, but not 22 other battleground states, failed to predict the outcome. The discrepancy I found was about 270,000 votes, which is within range of what the Berkeley group is claiming.


Posted by: Sam Wang | Nov 18, 2004 11:07:23 PM

thanks sam,

I totally glossed over the county size variable. I don't know florida but one would assume county population size is a good proxy for urbanness.

The next question is how are the coefficient standard errors affected by the strong correlation between size and the voting technology variable? I don't recall any discussion of this in the paper.

Posted by: ftm | Nov 18, 2004 11:59:20 PM

Another factor that (unless I missed something) they didn't take into account: percentage of retirees, a group that polls elsewhere showed tilting considerably toward Bush compared with 2000, and a group that is certainly well-represented in those three counties (Miami-Dade, Broward, Palm Beach).

Posted by: David T | Nov 19, 2004 2:30:23 AM

I think we should cast doubt on the main premise of this discussion. In fact, Bush's relative gain in electronic voting counties is less than in optical scanning counties, however you measure it. Using UCAL's own dependent variable (Bush percentage in 2004 minus Bush percentage in 2000), its mean equals 3.02 in "electronic counties" and 5.07 in "optical counties". Again, if we take Bush's relative gain (Bush vote 2004 / Bush vote 2000) and divide it by Kerry's relative gain (Kerry vote 2004 / Gore vote 2000), we get equivalent results: a mean of 1.11 in "electronic counties" and 1.21 in "optical counties". There is, then, no mystery to be solved.

So, the whole exercise seems to be rather futile, and UCAL's result merely the product of a statistical "artifact" (the variable electronic vote only correlates positively with their dependent variable when you include the variable "Bush percentage 2000 * electronic vote").

Posted by: Wonka | Nov 19, 2004 9:40:21 AM
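
For anyone who wants to try Wonka's second measure, a minimal Python sketch follows. The four counties and their vote totals are invented purely for illustration (they are not rows from the Berkeley spreadsheet), and the function names are my own.

```python
def relative_gain_ratio(bush04, bush00, kerry04, gore00):
    """Bush's relative gain divided by the Democratic candidate's relative gain."""
    return (bush04 / bush00) / (kerry04 / gore00)


def group_mean_ratio(rows, touchscreen):
    """Unweighted county average of the ratio for one type of equipment."""
    ratios = [
        relative_gain_ratio(b04, b00, k04, g00)
        for uses_touch, b00, b04, g00, k04 in rows
        if uses_touch == touchscreen
    ]
    return sum(ratios) / len(ratios)


# Invented rows: (touchscreen?, Bush 2000, Bush 2004, Gore 2000, Kerry 2004)
counties = [
    (True, 100_000, 120_000, 150_000, 170_000),
    (True, 80_000, 90_000, 60_000, 65_000),
    (False, 20_000, 26_000, 15_000, 17_000),
    (False, 10_000, 13_000, 9_000, 10_000),
]

print("electronic:", round(group_mean_ratio(counties, True), 2))
print("optical:   ", round(group_mean_ratio(counties, False), 2))
```

A ratio above 1 means Bush's vote grew faster than the Democratic vote in that county. Note that this averages counties without weighting them by size, so it need not match figures computed from statewide aggregates.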

I forgot. If you repeat UCAL's exercise on Kerry's data, you get the same positive influence of electronic voting (the more electronic voting, the bigger Kerry's gain). Remarkable.

Posted by: Wonka | Nov 19, 2004 10:36:59 AM

Bush's supposed gain among urban/suburban as opposed to rural voters that appeared in the exit polls seems highly, highly dubious when you look at actual results from urban/suburban counties and compare them to results from rural/exurban counties. Bush gains a few points in some urban/suburban areas, loses a few points in some, and never really seems to increase anywhere near the extent that the exit polls claim he does.

Posted by: John | Nov 19, 2004 12:05:07 PM

Buyer beware. Just because it's peer-reviewed doesn't mean it has any basis in reality.

I took their spreadsheet and broke down Bush's gains in e-voting vs. non-e-voting counties. Using aggregate numbers, Bush increased his support 2.54% in optical counties and 2.25% in e-voting counties. Wonka is right -- this seems to be a case of applying overly complicated modeling to a simple data set to achieve a convenient result.

More here: http://www.patrickruffini.com/archives/2004/11/fisking_the_ber.php

Posted by: Patrick Ruffini | Nov 19, 2004 12:13:19 PM

The doubts being cast on the Berkeley study are fascinating. We shall see.

But would it make sense for the group to study all states as well for these patterns, not just Florida and Ohio? Considering it was the unexpected Blue Counties that showed discrepancies, analysing all states may reveal similar surprises.

Posted by: Alex in Los Angeles | Nov 19, 2004 12:34:17 PM

You guys inspired me for my latest blog entry: "[T]his seems to be a case of applying overly complicated modeling to a simple data set to achieve a convenient result."

So I thought "Why not use a really simple model and see what happens?"

I ran the data using the percent that Bush's vote share improved as the "y" variable and a 1/0 dummy "x" variable for whether a county used electronic voting or optical scanning. The result: an R-square of .02, a p-value of .207, and a t-stat of -1.27. In plainer English, optical versus electronic voting explains a very small percentage of the variance in Bush's performance between counties (the R-square stat), and doesn't explain it to a statistically significant degree (p-value, t-stat).

I think what you have here is a simple case of multicollinearity. Some of their variables probably correlate highly with each other -- e.g., Bush in 2000 with Dole in 1996, while both of those variables probably correlate with Hispanic population, median income, and county size. Indeed, county size correlates much more precisely with the type of voting machine used than does the degree of improvement in Bush's share of the vote (R-square of .28, p-value of 4.98x10^-6, t-stat of 4.97).

Why does this matter? Because independent variables have to be independent of each other for regression to work. If you use two correlated variables, they essentially resonate with each other, making it seem as if they explain a much greater amount of the variance than they actually do.

What seems to be afoot here is what we see nationwide: Bush increased his share in urban counties with respect to rural counties. It just happens that in Florida, the urban counties use electronic voting machines.

Incidentally, the claim that the greatest increases for Bush were in Democratic-leaning counties is partially true: within the e-voting counties, Bush saw the greatest increase in his vote share in Miami (2nd), Broward (4th), and Palm Beach (6th) counties. But it is a very small dataset, and the 1st, 3rd, and 5th counties are counties where Bush received 56, 50, and 70 percent of the vote in 2000.

Posted by: Sean | Nov 19, 2004 1:54:16 PM
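
A quick consistency check on the figures in Sean's comment, as a Python sketch. In a regression with a single predictor, the R-square and the slope's t-statistic are tied together by R^2 = t^2 / (t^2 + n - 2). The count of 67 counties is an assumption here (it is the number of data points mentioned elsewhere in this thread), and the function name is my own.

```python
def r_squared_from_t(t_stat, n_obs):
    """R-square implied by the slope's t-statistic in a one-predictor regression."""
    df = n_obs - 2  # residual degrees of freedom: n minus intercept and slope
    return t_stat ** 2 / (t_stat ** 2 + df)


N_COUNTIES = 67  # assumption: every Florida county was included

# Vote-share improvement regressed on machine type (comment reports R-square .02)
print(round(r_squared_from_t(-1.27, N_COUNTIES), 2))
# Machine type against county size (comment reports R-square .28)
print(round(r_squared_from_t(4.97, N_COUNTIES), 2))
```

Under that assumption the reported t-statistics and R-squares agree with each other, so the numbers at least hang together as the output of two simple one-variable regressions.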

Ruffini's comments seem to be completely irrelevant. The direct correlation between change in vote and type of voting machine is not a particularly interesting or useful quantity. The whole point of studies like this is that you have to correct for various demographic issues. That's why you do these large multivariate regressions.

Now, I haven't read the Berkeley study and won't comment on its validity, but Ruffini's comments certainly do not represent a legitimate attack on it.

As for what Sean has to say, if I remember my stats, isn't there a covariance matrix output by multivariate regressions that would detect such an effect?

Posted by: Aaron | Nov 19, 2004 4:24:37 PM

It seems to me that comments pointing out variables that are not captured in the model are interesting on their own, but are not really criticisms of the results in the Berkeley study itself, unless those omitted variables are likely to be correlated with the E_vote dummy variable used in the study. If this were true (e.g. suppose all - and only - the e_vote counties had a recent influx of conservative-leaning retirees or something like that), then the E_vote dummy would actually be standing in for something else, which one could argue is why it is significant. But unless one can make a reasonable case that the e_vote dummy actually counts as a proxy for something else, then all we are really talking about with omitted variables are efforts to increase the model's "explained variance", not rejecting the value of any one particular beta coefficient. If new variables added to the model aren't thought to capture the same variance as the E_vote variable, then their omission cannot stand as a criticism of the individual result. Right?

Posted by: galt_m | Nov 19, 2004 5:55:58 PM
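
galt_m's point about omitted variables can be illustrated with a small simulation. This is only a sketch on made-up data: the dummy has no true effect at all, the outcome depends solely on an omitted variable z, and the only thing that changes between runs is whether z is related to the dummy. All names and parameter values are assumptions chosen for illustration.

```python
import random


def ols_slope(x, y):
    """Slope from a one-predictor least-squares fit of y on x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    return sxy / sxx


def simulate(omitted_tracks_dummy, n=20_000, seed=0):
    """Regress y on a 0/1 dummy whose true effect is zero, leaving z out."""
    rng = random.Random(seed)
    dummy, y = [], []
    for _ in range(n):
        d = 1.0 if rng.random() < 0.3 else 0.0
        # z is the omitted variable: it either shifts with the dummy or ignores it.
        z = rng.gauss(d if omitted_tracks_dummy else 0.0, 1.0)
        dummy.append(d)
        y.append(z + rng.gauss(0.0, 0.5))  # y depends on z only, never on d
    return ols_slope(dummy, y)


slope_uncorrelated = simulate(omitted_tracks_dummy=False)
slope_correlated = simulate(omitted_tracks_dummy=True)
print(f"omitted variable unrelated to dummy: {slope_uncorrelated:+.3f}")
print(f"omitted variable tracks the dummy:   {slope_correlated:+.3f}")
```

When z is unrelated to the dummy, leaving it out costs explained variance but the dummy's coefficient stays near zero. When z moves with the dummy, the dummy picks up z's effect and looks important even though it does nothing.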

The interesting issue posed by the U Cal study was captured by your original title for the subject: Fraud in Florida?

Having looked at the U Cal study, your write-up and the comments to it, I don't see anything remotely indicating fraud. The legal standard to prove fraud is clear and convincing evidence, which requires more rigorous proof than an ordinary civil case (preponderance of the evidence) but less rigorous proof than a criminal case (beyond a reasonable doubt). The burden of proof is on the party claiming that a fraud occurred. I realize that partisans throw around accusations of fraud with abandon these days, relying on no standard and no evidence but what suits their immediate advantage. But there is no point even discussing serious stuff with partisans of that stripe. Since the suggestion in the title of your blog was that there may have been an effort to influence a national election by fraud, it's obvious that we are discussing a very serious subject indeed.

Having only a rudimentary understanding of the statistics, I may be missing the significance of the U Cal study. But, in essence, it seems to be looking for correlations between Bush's increased vote and e-voting equipment, while accounting for such other variables as county size, demographics, etc. That's all fine, and may be of interest to specialists interested in such things. But it is a far cry from being evidence of fraud.

To show fraud, there needs to be some proof that votes counted as having been cast for Bush were not in fact cast for him by a qualified voter. What does the U Cal study have to say about that? Basically nothing, as far as I can tell.

Once all the correlations are made, and the probabilities calculated, certain basic facts stand in the way of drawing the essential inference needed to show fraud. We know, for example, that in 2000 Bush got about 50,000,000 votes, while in 2004 he got about 60,000,000. Some of that is new voters, and some of that reflects an improvement over his prior performance among repeat voters. The only data that suggests how he did among subgroups (whichever ones you want to consider) are the exit polls. All the exit polls indicate that some voters who rejected Bush in 2000 voted for him in 2004.

One of the commenters suggests that a switch by a significant number of Jewish voters in Broward, Palm Beach and Miami-Dade may explain the difference in some of the Florida counties that figure in the U Cal study. Others on the web and the press have speculated, with some basis in the exit polling data, that Bush did better among various other groups (blacks, black veterans, Hispanics, etc.) than he did in 2000. Whether there were significant regional or even localized variations in those effects in parts of the country or individual states would require lots more data than I am aware exists.

The key point, in terms of the ability of the U Cal study to show fraud, is that all kinds of explanations such as that can be posited which, if true, account for Bush's superior performance.

Unless someone can eliminate those kinds of explanations -- and I see nothing in the U Cal study or anywhere else that comes close -- trying to use this type of study to show fraud is worse than useless. As it relates to the issue of fraud, the U Cal study shows nothing of significance. The only conclusion of relevance to your initial question, Fraud in Florida?, is that, without a paper backup of some kind, the results from an electronic voting system cannot be verified after the fact. Everyone knew that going into the election, and only the deeply partisan loony fringe would try to spin that fact or the contents of the U Cal study into a claim that a fraud of enormous proportions occurred.

Posted by: Richard | Nov 19, 2004 6:41:14 PM

It is conceivable that collinearity is a problem. Hypothetically, this might be suggested if the correlation between two predictor variables is high, leading to a result where one variable is significant (e.g. e_vote) and another is not (e.g. Hispanic), but in actuality the two are confounded. Practically speaking, collinearity is usually only a problem where the correlation is greater than .7. Of the variables included in the model (and of course not comparing to the interaction terms), the correlations are relatively small, though many are significant. The use of E-touch machines is significantly correlated with income (r = .33), size (r = .41), Bush 2000% (r = -.18), and v_change (r = .30) but not Hispanic (r = .20) or Dole’s 1996 Percent (r = -.10). None of these correlations is so high as to suggest that collinearity in the model is a problem. Of course, there are other issues of multicollinearity, but it's Friday night.

Posted by: galt_m | Nov 19, 2004 6:42:33 PM

The Berkeley study does not show fraud (the authors don’t profess to have ‘proof’ either. This is a word that statisticians don’t tend to use since they only deal in probabilities. The U. Penn paper that came out earlier this week did the same thing, as in 1:250 Million probability or something) but it is suggestive of an anomaly, as in “look deeper here and you might find something.” The authors of the Berkeley study worded their results and conclusions so carefully, and not only provided their data but explained how they reached their result (ahem – polling companies?). But I’d like to comment on this: “Unless someone can eliminate those kinds of explanations -- and I see nothing in the U Cal study or anywhere else that comes close -- trying to use this type of study to show fraud is worse than useless.” I don’t think it is that simple. If one makes a convincing argument that any of the omitted variables is highly correlated with the E_vote variable, then there is a potentially gaping weakness to the Berkeley study. But if none of these omitted variables is correlated with E_vote (i.e. E_vote captures and explains ‘unique’ variance), then those theories are likely not up to much. I’m not saying none of them are correlated with E_vote, just that I haven’t heard those kinds of arguments yet.

Posted by: galt_m | Nov 19, 2004 7:02:48 PM

"Incidentally, the claim that the greatest increases for Bush were in Democratic-leaning counties is partially true: within the e-voting counties, Bush saw the greatest increase in his vote share in Miami (2nd), Broward (4th), and Palm Beach (6th) counties. But it is a very small dataset, and the 1st, 3rd, and 5th counties are counties where Bush receieved 56, 50, and 70 percent of the vote in 2000."

Um, I'm not sure where the study's authors are getting this. In the case of Miami-Dade, almost the exact opposite is true. Miami-Dade turned in the fourth-worst improvement for Bush statewide, and one of only two counties where Bush lost vote share. Incidentally, Broward and Palm Beach were Bush's 34th and 38th most improved counties (sort on "b_change" in the Excel spreadsheet) with Bush vote gains that were slightly above the statewide average (3.5% and 3.0% vs. 2.5%). There was nothing remarkable or "unexpected" about these returns -- even within the context of the very small subset (15) of e-voting counties. Palm Beach and Broward rank 3rd and 5th, behind Sumter (Bush 63%) and Pasco (Bush 54%). Miami-Dade is second to last on this list. If you combine the three urban Democratic counties that are the focus of this study, their average percentage vote gain for Bush is very close to the statewide average. What is the basis for claiming that there was any outsized or unexplained gain for the President in these counties?

Demographic factors do come into play -- mostly in explaining the richness and variability in these voting populations. Why do urban/e-voting counties like Miami-Dade and Broward/Palm Beach behave so differently, when a study like this assumes they should behave the same? In Miami-Dade, I suspect that the Cuban vote was down somewhat from four years ago, when the Elian Gonzalez affair helped produce a 27% swing to Bush in Hispanic precincts. In Broward and Palm Beach, I suspect that gains in the Jewish vote and the senior vote helped Bush. None of these factors is contemplated in the "study"; there are no columns for Cuban-Hispanics, Jewish voters, seniors, and literally hundreds of other demographic factors that could drive the vote in the data set.

A simple regression analysis of 67 diverse and internally complex data points, focusing on a subset of just 15, depending heavily on just 2 or 3 for its conclusions, will always fall hopelessly short -- even when it doesn't state a conclusion that's flatly contradicted on its face by the data itself.

Posted by: Patrick Ruffini | Nov 19, 2004 7:29:51 PM


My understanding from the press release that I read was that the authors claimed that Bush's greatest improvement came in the most heavily Democratic counties in the state. If I misread, I apologize.

The other thing I don't understand is why you even need the other variables in this model. If your question is just "how much did the type of voting machine affect Bush's vote share?" wouldn't you just do a straight-up simple regression analysis of vote share improvement on type of machine? The other variables are simply fluff that expands the miserably low portion of the variance that machine type by itself explains.

Posted by: Sean | Nov 19, 2004 8:41:23 PM

An analysis of the Hout study can be found here


which refers to


Posted by: Aaron Bergman | Nov 19, 2004 9:25:57 PM

Sociologists of Hout's stature can be trusted with the math, I assume. However, Political Scientists/Pollsters would probably be the ones to trust to expose the e-vote as a dummy variable for something else. I'm staying right here to see what comes up.

Isn't it easier to plug in other factors than try to argue they are correlated to e-vote? It looks to me the study looks mostly at demographic factors rather than "political factors" such as:

3rd party vote?
# of registered independents?
# of Jewish residents?
# of spoiled votes? granted these would be 0 in e-vote counties in 2004, but non-0 in 2000.

And what of Wonka's point? Assuming he is right, would it mean anything that:
"If you repeat UCAL's exercise on Kerry's data, you get the same positive influence of electronic voting (the more electronic voting, the bigger Kerry's gain)."

Posted by: Alex in Los Angeles | Nov 20, 2004 2:37:39 AM

There's a good analysis of the data at Crooked Timber (http://www.crookedtimber.org/archives/002890.html) that gets right at the results, namely that two large counties (Broward and Palm Beach) have enormous influence. When controlling for their effect, the coefficient for E-Vote becomes insignificant.

Posted by: galt_m | Nov 20, 2004 8:42:04 AM

I replicated the Berkeley study and concluded that the results are completely worthless. The source of the problem, which is consistent with many of the observations posted here, is multicollinearity. This problem arises when two or more of the independent variables are highly correlated with one another. The full correlation matrix of the electronic voting "etouch" variable (using the Berkeley variable naming) plus the two interaction terms of etouch with Bush vote in 2000 (b00pc_e) and its square (b00pcsq_) reveals that these variables are highly correlated with one another:

         |    etouch   b00pc_e  b00pcsq_
etouch   |    1.0000
b00pc_e  |    0.9777    1.0000
b00pcsq_ |    0.9272    0.9845    1.0000

In cases like these, statistical software breaks down and essentially produces garbage results. Many of the other comments in this thread, such as omitted variable bias, etc. are also relevant, but the multicollinearity issue throws off the analysis so much that it alone dominates all other explanations of the odd findings in the study.

I am scheduled to discuss the Berkeley voting study with Dr. Hout on Monday, Nov. 22 at 7:30am on San Francisco radio station KPFA. I have sent Dr. Hout detailed analysis and will be interested to hear his response. I posted this detailed analysis on the election law list, linked at the top of this thread.

Posted by: Michael McDonald | Nov 20, 2004 7:05:29 PM
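
One common way to gauge how much trouble correlations like these cause is the variance inflation factor (VIF), which says how many times larger a coefficient's variance is than it would be with uncorrelated predictors. The Python sketch below plugs in correlations quoted in this thread. It relies on a simplification: with only two predictors the VIF equals 1 / (1 - r^2), while with more predictors the pairwise value is only a lower bound on the true VIF. The function name is my own.

```python
def pairwise_vif(r):
    """Variance inflation factor implied by a single pairwise correlation r."""
    return 1.0 / (1.0 - r ** 2)


# Correlations as quoted in the comments above.
reported = {
    "etouch vs size (r = .41)": 0.41,
    "rule-of-thumb threshold (r = .70)": 0.70,
    "etouch vs b00pc_e (r = .9777)": 0.9777,
    "b00pc_e vs b00pcsq_ (r = .9845)": 0.9845,
}

for label, r in reported.items():
    print(f"{label}: VIF of at least {pairwise_vif(r):.1f}")
```

On this rough measure the correlation with county size inflates variance by a factor of about 1.2, while the correlations among etouch and its interaction terms imply factors above 20, which is the region where coefficient estimates and their standard errors become very unstable.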

In response to the comment from Alex in Los Angeles: yes, the number of Jewish residents seems to be very important, especially if entered interactively in the way that electronic voting was entered. If entered that way, the coefficients on electronic voting become essentially zero. See my blog entry, http://newmarksdoor.typepad.com/mainblog/2004/11/more_on_the_stu.html, for details.

Posted by: Craig Newmark | Nov 20, 2004 9:51:25 PM

The standard way to address the problem of multicollinearity in regression is to center the collinear variables (such that their means are zero). It is meaningless to center the E_touch variable because it is a dummy (you get the same result as below anyway), but one can center b00pc, using the result to calculate both the revised b00pc_sq and the interactions with E_touch.

Using the newly centered data, one finds the same correlation between E_touch and b00pc_cen (centered variable) of .18 (as expected, unchanged from using non-centered data). One also finds the correlation between E_touch and the squared term (b00pc_censq) is .14 (the correlation with the un-centered term is -.16). Using this centered variable to calculate the interaction terms, one finds the correlation between E_touch and the E_touch*b00pc_cen interaction is -.33, and the correlation between E_touch and the E_touch*b00pc_censq interaction is .53. Thus, the correlations between E_touch and the interactions have been reduced drastically from their previous levels of above .9.

These new correlation values are not in the realm where one would normally worry about multicollinearity. Still, the only way to test is to look at the model using the newly calculated variables. The variables which have no significant effect on the outcome measure are size (St. B = -.22, t = -.587, p =.56), v_change (St. B = 0.00, t = .00, p = 1), income (St. B = -.18, t = -1.08), D96pc (St. B = -.44, t = -1.30, p =.20) and E_touch (St. B = -.17, t = -1.43, p =.16). This means that by centering the variables that are otherwise so highly correlated as to make collinearity a threat, the E_touch variable main effect becomes insignificant. The interaction of E_touch with the centered B00pc variable is negative and significant though (St. B = -.36, t = -2.05, p = .045). The interaction of E_touch with the centered B00pcsq is positive (St. B = .34, t = 1.81) and nearly significant (p = .076).

What this means is that a procedure used to address multicollinearity (habitually confronted in regressions that employ interactions), when used, makes the ‘interesting’ result disappear. Now, based on other analyses, it is clear that two counties (Palm Beach and Broward) also had an enormous influence on the results. So, taking these two counties out of the analysis and running the analysis with the centered variables, the new coefficient for E_touch has an even smaller effect (St. B = -.12, t = -.92, p = .36). In fact, in this analysis, the only variable that is a significant predictor is B00pc [centered] (St. B. = .67, t = 2.22, p = .03).

It seems that these two analyses put the Berkeley paper’s results somewhat into question.

Posted by: galt_m | Nov 20, 2004 10:58:44 PM
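
The centering step galt_m describes can be seen on toy data. The Python sketch below uses invented numbers, not the Berkeley spreadsheet: 67 made-up counties, 15 flagged as touchscreen, and random stand-in values for the 2000 Bush percentage. Variable names echo the ones in the comment, but the helper function and the data are assumptions for illustration only.

```python
import math
import random


def pearson(x, y):
    """Pearson correlation between two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


rng = random.Random(0)
# Invented stand-in data: 67 counties, 15 of them flagged as touchscreen.
e_touch = [1.0] * 15 + [0.0] * 52
b00pc = [rng.uniform(30.0, 75.0) for _ in e_touch]  # made-up Bush 2000 shares

# Center the continuous variable on its mean, then rebuild the interaction.
mean_b00 = sum(b00pc) / len(b00pc)
b00pc_cen = [v - mean_b00 for v in b00pc]

raw_interaction = [d * v for d, v in zip(e_touch, b00pc)]
cen_interaction = [d * v for d, v in zip(e_touch, b00pc_cen)]

r_raw = pearson(e_touch, raw_interaction)
r_cen = pearson(e_touch, cen_interaction)
print(f"corr(E_touch, E_touch*b00pc)     = {r_raw:.2f}")
print(f"corr(E_touch, E_touch*b00pc_cen) = {r_cen:.2f}")
```

The uncentered interaction is zero for every optical county and a large positive number for every touchscreen county, so it is nearly a copy of the dummy itself. After centering, the interaction takes both signs within the touchscreen group and the correlation drops sharply. One caution worth keeping in mind: centering does not change the model's fitted values; it changes what the E_touch main effect measures, which becomes the estimated gap at the average 2000 Bush share rather than at a 0% share.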
