Statistics

This is a simple illustration of the use of statistics in everyday life. The example here is inspired by the 2008 presidential election. In the run-up to the election, polls were taken daily by many news agencies to help predict the outcome. Various media made statements such as "from a poll of 1500 likely voters, we predict Obama is likely to receive 54% of the vote - the margin of error is 4%". But what does this mean? Why do only 1,500 people out of 100,000,000 likely voters need to be polled to make such a bold prediction?

Suppose all voters have decided who they will vote for and assume, for simplicity, that the choice is limited to two candidates. If all likely voters are polled, we will know the outcome exactly. If we poll only a few of the likely voters, the vote percentage we get will probably differ from the true one. Nevertheless, if we poll many times, the average vote percentage should be a good estimate of the actual percentage. To see this we build a Matlab program that creates voters, assigns votes to these voters, and creates a histogram of the frequency of vote percentages over numerous samples taken from the collection of voters.

Important features of the Matlab program are as follows:

input variables:

1. p: the probability that a voter selects Obama
2. n: the sample size
3. nvoters: the number of voters
4. nruns: number of samples taken

initialization:

1. sample = zeros(1,n); %%% vector that holds sample results
2. voters = ceil(rand(1,nvoters)-1+p); %%% voter preferences
3. results = zeros(1,101); %%% the histogram - 0% to 100% by 1%
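
The initialization can be tried on its own. The following is a minimal sketch, assuming illustrative values for p, n, and nvoters (the program itself takes them as inputs); the variable frac is introduced here only to check that about a fraction p of the voters end up with a 1.

```matlab
% Illustrative input values; in the program these are arguments.
p = 0.52;            % probability that a voter selects Obama
n = 1000;            % sample size
nvoters = 1000000;   % number of voters

sample  = zeros(1,n);                  % vector that holds sample results
voters  = ceil(rand(1,nvoters)-1+p);   % voter preferences: 1 or 0
results = zeros(1,101);                % the histogram, 0% to 100% by 1%

% rand is uniform on (0,1), so rand-1+p is positive with probability p;
% ceil maps the positive values to 1 and the others to 0.
frac = mean(voters);                   % hypothetical check, not in the source
fprintf('fraction of 1s among voters: %.4f\n', frac);
```

With a million voters the printed fraction should be very close to p, although it will vary slightly from run to run.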

build a sample:

1. set i = 1 and repeat the steps below until i exceeds n
2. pick a random voter
3. check that we have not chosen that voter before in sample
4. if not, sample(i) is that voter and set i=i+1
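
One way to write the sampling steps is sketched below. It assumes the loop starts at i = 1 and stops once n distinct voters have been chosen; the variable v and the parameter values are illustrative and not taken from stats_exp.m.

```matlab
n = 1000;            % sample size (illustrative)
nvoters = 1000000;   % number of voters (illustrative)

sample = zeros(1,n);
i = 1;
while i <= n
    v = randi(nvoters);               % pick a random voter index in 1..nvoters
    if ~any(sample(1:i-1) == v)       % has this voter been chosen before?
        sample(i) = v;                % no: record the voter
        i = i + 1;                    % and move to the next slot
    end                               % yes: try again with a new voter
end
```

Because n is tiny compared with nvoters, repeated picks are rare and the loop ends after close to n iterations. The built-in call randperm(nvoters,n) should give an equivalent sample of n distinct voters in one step.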

find the percentage of 1s in the sample:

1. set count = 0
2. for all i from 1 to n do the following:
3. if voters(sample(i)) equals 1, increment count by 1
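
The counting loop can be checked on a toy example. The voters and sample vectors below are made up for illustration, and pct is a name introduced here for the resulting percentage.

```matlab
voters = [1 0 1 1 0 1 0 0 1 1];   % toy preferences, made up for illustration
sample = [2 5 9 10];              % indices of the sampled voters
n = numel(sample);

count = 0;
for i = 1:n
    if voters(sample(i)) == 1     % sampled voter favors Obama
        count = count + 1;
    end
end
pct = 100*count/n;                % percentage of 1s in the sample
fprintf('%d of %d sampled voters, %.0f%%\n', count, n, pct);
```

Since the preferences are stored as 0s and 1s, sum(voters(sample)) gives the same count without a loop.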

accumulate the result in the histogram:

1. determine which histogram bar is to be incremented
use: idx = floor((count/n)*100)+1;
2. increment the histogram bar
use: results(idx) = results(idx)+1;
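
To see how a count maps to a histogram bar, here is a small sketch with a made-up count of 523 out of 1000; idx_min and idx_max are names introduced here to show the two ends of the histogram.

```matlab
n = 1000;                          % sample size (illustrative)
results = zeros(1,101);            % bars for 0%, 1%, ..., 100%

count = 523;                       % made-up number of 1s in one sample
idx = floor((count/n)*100)+1;      % 52.3% falls in bar 53, the bar for 52%
results(idx) = results(idx)+1;

idx_min = floor((0/n)*100)+1;      % a sample with no 1s lands in bar 1
idx_max = floor((n/n)*100)+1;      % a sample of all 1s lands in bar 101
```

The +1 is needed because Matlab indices start at 1 while percentages start at 0. Note that count/n is rounded before the multiplication, so a count whose percentage is a whole number (290 out of 1000, say) may land one bar low; floor(100*count/n)+1 is one possible guard, though it departs from the formula quoted above.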

A complete program, including output statements, is stats_exp.m. With input parameters p = .52, n = 1000, nvoters = 1000000, and nruns = 1000 (run as stats_exp(.52,1000,1000000,1000)) we get:

and

```
           error:
range     error <
-------   -------
51-53      0.399
50-54      0.146
49-55      0.032
```
This says that with about 97% confidence we can say the percentage of voters favoring Obama is between 49 and 55, with about 85% confidence we can say the percentage favoring Obama is between 50 and 54, and with about 60% confidence we can say it is between 51 and 53.
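
The whole experiment can be sketched in a few lines. This is not stats_exp.m itself: it uses smaller illustrative values for nvoters and nruns so that it runs quickly, it replaces the pick-and-check loop with randperm, and the way the error column is computed here (the fraction of runs falling outside each range) is an assumption. The names counts, center, and conf are introduced for this sketch.

```matlab
p = 0.52; n = 1000; nvoters = 100000; nruns = 500;   % illustrative values

voters  = ceil(rand(1,nvoters)-1+p);   % voter preferences: 1 or 0
results = zeros(1,101);                % histogram of sample percentages
counts  = zeros(1,nruns);              % number of 1s in each sample

for r = 1:nruns
    sample = randperm(nvoters, n);     % n distinct voters (stand-in for the loop)
    count  = sum(voters(sample));      % number of 1s in this sample
    idx    = floor((count/n)*100)+1;   % histogram bar to increment
    results(idx) = results(idx)+1;
    counts(r) = count;
end

pct    = 100*counts/n;                 % sample percentages
center = round(100*p);                 % 52
conf   = zeros(1,3);
for w = 1:3                            % ranges 51-53, 50-54, 49-55
    conf(w) = mean(abs(pct - center) <= w);
    fprintf('%d-%d   error < %.3f\n', center-w, center+w, 1-conf(w));
end
```

The printed errors change from run to run and need not match the table above exactly, since both the voters and the samples are random.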

We can get these answers mathematically as well. A simple description is in lec.pdf, where the probability that a sample of size m contains k Obama voters is derived. The code for computing this probability is in geomdist.m. Using this code, confidence intervals may be accurately obtained and displayed using the Matlab program t.m. For the same parameters as above, the output is:

and

```
guess: 52.00%
between 51 and 53 with 49.37% confidence
between 50 and 54 with 80.56% confidence
between 49 and 55 with 94.65% confidence
```
which is similar to the results obtained by experiment; increasing the number of runs of the experiment would bring the two sets of results into closer agreement.
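
For readers without lec.pdf at hand, a calculation of this kind can be sketched as follows. It assumes the probability in question is the standard one for sampling without replacement: if K of the N voters favor Obama, a sample of size m contains exactly k of them with probability C(K,k) C(N-K,m-k) / C(N,m). The names N, K, lognck, and conf are introduced here, and this is not the code in geomdist.m or t.m.

```matlab
N = 1000000;   % total number of voters
K = 520000;    % voters favoring Obama (52%)
m = 1000;      % sample size

% log of "a choose b", computed with gammaln to avoid overflow
lognck = @(a,b) gammaln(a+1) - gammaln(b+1) - gammaln(a-b+1);

k   = 0:m;                                                  % possible counts
pmf = exp(lognck(K,k) + lognck(N-K,m-k) - lognck(N,m));     % P(count = k)

conf = zeros(1,3);
for w = 1:3                                                 % 51-53, 50-54, 49-55
    inside  = (k >= m*(52-w)/100) & (k <= m*(52+w)/100);    % endpoints included
    conf(w) = sum(pmf(inside));
    fprintf('between %d and %d with %.2f%% confidence\n', 52-w, 52+w, 100*conf(w));
end
```

Under these assumptions the printed confidences should land close to the figures quoted above.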

Also of interest is the following histogram

which is obtained from the program stats.m (run as stats(52000,100000), that is, 100000 total voters, 52000 of whom will vote for Obama). The vertical axis is the level of confidence and the horizontal axis is the sample size needed to get that confidence. The blue line represents a confidence interval of 6%, the green line 4%, and the red line 2%. Clearly, if we are interested in 90% or 95% confidence, the sample size needed to obtain that confidence is far less for an interval of 4% than for an interval of 2%. This is probably why so-called "error margins" are set at 4% by the pollsters. Observe also that if the error margin is 4%, a sample of size 1500 will give us roughly a 90% level of confidence.
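
Curves of this kind can be approximated with the sketch below. It is not stats.m: it assumes the confidence for a given sample size is the hypergeometric probability that the sample percentage falls within the interval centered on the true 52%, endpoints included, and the grid of sample sizes is chosen for illustration. The names sizes, halfw, and conf are introduced here.

```matlab
N = 100000; K = 52000;                 % as in stats(52000,100000)
lognck = @(a,b) gammaln(a+1) - gammaln(b+1) - gammaln(a-b+1);

pc    = 100*K/N;                       % true percentage, 52
sizes = 100:100:3000;                  % sample sizes to try (illustrative grid)
halfw = [1 2 3];                       % half-widths: 2%, 4%, and 6% intervals
conf  = zeros(numel(halfw), numel(sizes));

for j = 1:numel(sizes)
    n   = sizes(j);
    k   = 0:n;                                                % possible counts
    pmf = exp(lognck(K,k) + lognck(N-K,n-k) - lognck(N,n));   % P(count = k)
    for h = 1:numel(halfw)
        inside    = (k >= n*(pc-halfw(h))/100) & (k <= n*(pc+halfw(h))/100);
        conf(h,j) = sum(pmf(inside));
    end
end

fprintf('n = 1500, 4%% interval: %.3f\n', conf(2, sizes == 1500));
```

Calling plot(sizes, conf') afterwards draws one curve per interval width. The curves are slightly jagged because only whole numbers of voters can fall inside an interval.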

The changing relationship between the actual number of voters and the sample size is evident in the figure below:

which is obtained by running stats(1040,2000). Observe that in all cases a sample size of 2000, which is the entire population, results in complete confidence. To get 90% confidence at ±2% (4% interval) now requires a sample size of 850. This is smaller than the 1500 needed for 100000 voters, but it is a far larger fraction of the population: 850/2000 = 0.425, whereas 1500/100000 = 0.015.
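
The small-population case can be checked with the same kind of calculation. The sketch below again assumes a hypergeometric model with interval endpoints included and is not stats.m; the range of k is clipped to the feasible counts because a large sample from a small population cannot contain arbitrarily few Obama voters. The names frac_small and frac_large are introduced here.

```matlab
N = 2000; K = 1040;                    % as in stats(1040,2000)
lognck = @(a,b) gammaln(a+1) - gammaln(b+1) - gammaln(a-b+1);

pc    = 100*K/N;                       % true percentage, 52
w     = 2;                             % half-width: the 4% interval
sizes = [850 2000];                    % sample sizes discussed above
conf  = zeros(size(sizes));

for j = 1:numel(sizes)
    n   = sizes(j);
    k   = max(0, n-(N-K)) : min(n, K);                        % feasible counts only
    pmf = exp(lognck(K,k) + lognck(N-K,n-k) - lognck(N,n));   % P(count = k)
    inside  = (k >= n*(pc-w)/100) & (k <= n*(pc+w)/100);
    conf(j) = sum(pmf(inside));
    fprintf('n = %4d: confidence %.3f\n', n, conf(j));
end

frac_small = 850/2000;                 % sampling fraction, small population
frac_large = 1500/100000;              % sampling fraction, large population
```

When n equals N the only feasible count is K itself, which is why sampling everyone gives complete confidence.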