This is a simple illustration of the use of statistics in everyday life. The example here is inspired by the 2008 presidential election. In the run-up to the election, polls were taken daily by many news agencies to help predict the outcome. Various media made statements such as "from a poll of 1500 likely voters, we predict Obama is likely to receive 54% of the vote; the margin of error is 4%". But what does this mean? Why do only 1500 people out of 100,000,000 likely voters need to be polled to make such a bold prediction?
Suppose all voters have decided who they will vote for and assume, for simplicity, that the choice is limited to 2 candidates. If all likely voters are polled, we will know the outcome exactly. If we poll only a few of the likely voters, the vote percentage we get will probably differ from the actual percentage. Nevertheless, if we poll many times, the average vote percentage should be a good estimate of the actual percentage. To see this, we build a Matlab program that creates voters, assigns votes to these voters, and creates a histogram of the frequency of vote percentages over numerous samples taken from the collection of voters.
Important features of the Matlab program are as follows:
build a sample:
find the percentage of 1s in the sample:
accumulate the result in the histogram:
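The code for these three steps is not reproduced on this page. As a rough sketch only, and not the contents of stats_exp.m, the steps might look like the following in Matlab. The variable names voters, counts, idx, and s are made up for illustration, and drawing each sample without replacement (via randperm) is an assumption.

```matlab
% Illustrative sketch; names and details are assumptions, not taken from stats_exp.m.
p = 0.52; n = 1000; nvoters = 1000000; nruns = 1000;

% Create the voters: 1 = votes for the candidate, 0 = votes for the opponent.
voters = zeros(1, nvoters);
voters(1:round(p * nvoters)) = 1;

% counts(j+1) holds the number of samples whose rounded percentage is j (0..100).
counts = zeros(1, 101);

for r = 1:nruns
    % build a sample: n distinct voters chosen uniformly at random (assumed)
    idx = randperm(nvoters, n);
    s = voters(idx);

    % find the percentage of 1s in the sample
    pct = 100 * sum(s) / n;

    % accumulate the result in the histogram
    bin = round(pct) + 1;
    counts(bin) = counts(bin) + 1;
end

% Show the fraction of runs that landed on each percentage.
bar(0:100, counts / nruns);
xlabel('sample percentage');
ylabel('fraction of runs');
```

Because the samples are random, the histogram differs slightly from run to run, but it should be concentrated around 100*p.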
A complete program, including output statements, is stats_exp.m. With input parameters p = .52, n = 1000, nvoters = 1000000, and nruns = 1000 (run as stats_exp(.52,1000,1000000,1000)) we get:
error:
  range     error <
  -------   -------
  51-53     0.399
  50-54     0.146
  49-55     0.032
We can get these answers mathematically as well. A simple description is in lec.pdf where the probability that a sample of size m contains k Obama voters is derived. The code for computing this probability is in geomdist.m. Using this code, confidence intervals may be accurately obtained and displayed using the matlab program t.m. For the same parameters as above, the output is:
guess: 52.00%
between 51 and 53 with 49.37% confidence
between 50 and 54 with 80.56% confidence
between 49 and 55 with 94.65% confidence
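For readers without lec.pdf at hand, the probability in question is presumably the hypergeometric one: with N voters, K of them for Obama, a sample of size m drawn without replacement contains exactly k Obama voters with probability C(K,k) C(N-K,m-k) / C(N,m). The following is a minimal sketch of how confidence values of this kind could be computed from that formula. It is not geomdist.m or t.m; the names logC, hyp, lo, and hi are hypothetical, and treating each interval as inclusive of its endpoints is an assumption.

```matlab
% Illustrative sketch; not the code in geomdist.m or t.m.
N = 1000000;   % total voters
K = 520000;    % voters for the candidate
m = 1000;      % sample size

% Log of the binomial coefficient C(a,b), computed with gammaln so that
% large arguments do not overflow.
logC = @(a, b) gammaln(a + 1) - gammaln(b + 1) - gammaln(a - b + 1);

% Hypergeometric probability that the sample contains exactly k supporters.
hyp = @(k) exp(logC(K, k) + logC(N - K, m - k) - logC(N, m));

guess = 100 * K / N;   % true percentage, used as the center of each interval
fprintf('guess: %.2f%%\n', guess);

for half = 1:3
    % Counts of supporters whose percentage lies in [guess-half, guess+half].
    lo = ceil((guess - half) * m / 100);
    hi = floor((guess + half) * m / 100);

    % Clamp to the counts that are actually possible for this population.
    lo = max(lo, max(0, m - (N - K)));
    hi = min(hi, min(m, K));

    conf = sum(hyp(lo:hi));
    fprintf('between %g and %g with %.2f%% confidence\n', ...
            guess - half, guess + half, 100 * conf);
end
```

Working with logarithms is a design choice: the binomial coefficients themselves are far too large for double precision at these population sizes, while their logarithms are not.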
Also of interest is the following histogram
which is obtained from the program stats.m, run as stats(52000,100000): that is, 100000 total voters, 52000 of whom will vote for Obama. The vertical axis is the level of confidence and the horizontal axis is the sample size needed to get that confidence. The blue line represents an interval of 6% (±3%), the green line 4% (±2%), and the red line 2% (±1%). Clearly, if we are interested in 90% or 95% confidence, the sample size needed to obtain that confidence is far smaller for an interval of 4% than for an interval of 2%. This is probably why so-called "error margins" are set at 4% by the pollsters. Observe also that for an interval of 4%, a sample of size 1500 gives about a 90% level of confidence.
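Curves of this kind can be approximated by evaluating the hypergeometric confidence over a range of sample sizes. The sketch below is one possible way to do that and is not the code in stats.m. The grid of sample sizes, the names sizes, halfwidths, and logC, and the inclusive treatment of interval endpoints are all assumptions.

```matlab
% Illustrative sketch; not the code in stats.m.
N = 100000;              % total voters, as in stats(52000,100000)
K = 52000;               % voters for the candidate
sizes = 50:50:3000;      % sample sizes to evaluate (arbitrary grid)
halfwidths = [3 2 1];    % +/-3%, +/-2%, +/-1%: intervals of 6%, 4%, 2%
colors = {'b', 'g', 'r'};

% Log of the binomial coefficient C(a,b) via gammaln to avoid overflow.
logC = @(a, b) gammaln(a + 1) - gammaln(b + 1) - gammaln(a - b + 1);

guess = 100 * K / N;     % true percentage, center of every interval
conf = zeros(numel(halfwidths), numel(sizes));

for i = 1:numel(halfwidths)
    for j = 1:numel(sizes)
        m = sizes(j);
        % Supporter counts whose percentage falls inside the interval,
        % clamped to counts that are possible for this population.
        lo = max(ceil((guess - halfwidths(i)) * m / 100), max(0, m - (N - K)));
        hi = min(floor((guess + halfwidths(i)) * m / 100), min(m, K));
        k = lo:hi;
        % Sum the hypergeometric probabilities over those counts.
        conf(i, j) = sum(exp(logC(K, k) + logC(N - K, m - k) - logC(N, m)));
    end
end

figure;
hold on;
for i = 1:numel(halfwidths)
    plot(sizes, conf(i, :), colors{i});
end
xlabel('sample size');
ylabel('confidence');
legend('6% interval', '4% interval', '2% interval');
```

The curves are slightly jagged because the number of whole voters that fits inside an interval changes in steps as the sample size grows.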
The changing relationship between the actual number of voters and the sample size is evident in the figure below:
which is obtained by running stats(1040,2000). Observe that in all cases a sample size of 2000 results in complete confidence, since the sample is then the entire population. Getting 90% confidence at ±2% (a 4% interval) now requires a sample size of about 850, which is smaller than the 1500 needed for 100000 voters. As a fraction of the population, however, it is far larger: 850/2000 = 0.425, whereas 1500/100000 = 0.015.