High School Advanced Topics - The Central Limit Theorem



     The Central Limit Theorem


An exploration of the underlying concept of the Central Limit Theorem.

[Directions: Execute the Code section (section 0) first. Apart from a warning message it produces no output, but the definitions it creates are used later in this worksheet.]

  0. Code

>    restart;

>    with(plots): with(stats):

Warning, the name changecoords has been redefined

>    #-------------------------------------------
data  :=[3,3,4,8,8,8, 10,13,15, 16,16,18,21,23,24]:

two_extremes_data :=[3,3,3,3,3,3,3,3,3,3,25,25,25,25,25,25,25,25,25,25]:

evenly_distributed_data :=
[2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26]:

big_data_set :=
[42,42,42,42,42,42,42,43,43,43,43,43,43,43,45,45,45,45,45,45,45,45,
46,46,46,47,47,47,47,48,48,48,48,48,49,49,49,51,51,51,51,52,52,55,
55,55,58,58,58,59,61,61,62,62,62,63,63,63,64,64,65,65,65,66,68,69,
69,71,71,71,73,73,75,79 ]:

>    #-------------------------------------------
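Samples := proc(dataset, size)
   local Sam, k, m;
   # Return a list of `size` values drawn at random, with replacement, from dataset.
   m := rand(1..nops(dataset)):    # m() returns a random index into dataset
   Sam := []:
   for k from 1 to size do
      Sam :=  [op(Sam),dataset[m()] ]:
   od:
   Sam;
end proc :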

>    #-------------------------------------------
SampleMean := proc(dataset, size)
   local Su, k, x, m;
   # Draw `size` values at random (with replacement) and return their mean.
   Su := 0;
   m := rand(1..nops(dataset)):    # m() returns a random index into dataset
   for k from 1 to size do
      x   :=  dataset[m()];
      Su  :=  Su + x;
   od:
   evalf(Su/size,6);
end proc :

>    #-------------------------------------------
SampleMeans := proc(dataset, numsamples)
   local SaMn,j;
   # Return a list of numsamples sample means, each taken from a sample of size 10.
      SaMn := []:
       for j from 1 to numsamples do
         SaMn := [op(SaMn),SampleMean(dataset, 10) ]:
       od:
      SaMn;
end proc :

>    #-------------------------------------------
#  Display a data distribution
DataDist := proc(dataset)
   local meo,sdo,me,sd, MM,mn,mx,x,y,i,j,k,h,h2,h3,
      LE,RE,MNo,MN2o,PP,Ln,CG,CB,SDO,Tmn,Tmx,Tsdl,Tsdr;
   #-------------------------------------------------
   h := 2;        h2 := .7;   h3 := .75;  
   CG := 'color = COLOR(RGB, .05,.5,.05),symbolsize = 15,symbol=BOX':
   # ------- Original DATA ----------------------------------------
   meo := evalf( describe[mean](dataset));
   sdo := evalf(describe[standarddeviation](dataset));
  
   MM := describe[range](dataset);
   mn := op(1,MM);
   mx := op(2,MM);
   #--------- Plots on Original Data -----------------------------
   LE    := plot( [[mn ,-h3],[mn+h2,-h3],[mn+h2,h3],[mn,h3],[mn,-h3]],
                color = green, style=patchnogrid, filled = true):
   RE    := plot( [[mx ,-h3],[mx-h2,-h3],[mx-h2,h3],[mx,h3],[mx,-h3]],
                color = green, style=patchnogrid, filled = true):
   MNo   := plot( [[meo ,-h2],[meo-h2,0],[meo+h2,0],[meo,-h2]],
                CG, style=patchnogrid, filled = true):
   MN2o  := plot( [[meo , -h-h2],[meo,0]], CG, linestyle = 3):
   Ln    := plot( [[mn,0],[mx+.5,0]], color = blue, thickness = 4):
   SDO   := plot( [[meo-sdo,-h-h2],[meo-sdo,-h],
                   [meo+sdo,-h],[meo+sdo,-h-h2]],
                   CG, linestyle = 3):
    #------------ TEXT ------------------
   Tmn  := plots[textplot]( [mn, -h, mn],
             align={BOTTOM,RIGHT},font=[TIMES,ROMAN,12],color = black):
   Tmx  := plots[textplot]( [mx, -h, mx],
             align={BOTTOM,LEFT},font=[TIMES,ROMAN,12], color=black):
   Tsdl  := plots[textplot]( [meo-sdo, -h+h2, evalf(meo-sdo,3)],
             align={BOTTOM,RIGHT},font=[TIMES,ROMAN,12], color=black):
   Tsdr  := plots[textplot]( [meo+sdo, -h+h2, evalf(meo+sdo,3)],
             align={BOTTOM,LEFT},font=[TIMES,ROMAN,12], color=black):

   #------------ Original Distribution Points ------------------
   PP||1 := pointplot( [dataset[1],-3-h3],  CG):
   y := h3;
   for i from 2 to nops(dataset) do
       if (dataset[i]=dataset[i-1]) then y := y+h3; else y:= h3; fi;
       PP||i := pointplot( [dataset[i],-3-y ], symbolsize = 15, CG ):  
   od:

  
   #-----------------------------------------------------------------
   display([ MNo,MN2o,Ln,LE,RE, SDO, Tmn, Tmx, Tsdl,Tsdr,      
          seq(PP||i, i = 1..nops(dataset))
         ], scaling = constrained, axes = none );
   #-----------------------------------------------------------------
end proc :

#DataDist(data);

>    #-------------------------------------------
SampleMeanDist := proc(dataset, numsamples, samplesize)
   local meo,sdo,me,sd, MM,mn,mx,x,y,i,j,k,h,h2,h3,
         LE,RE,MN,MN2,MNo,MN2o,SaMn,DS,SP,PP,Ln,CG,CB,SS, SDS,SDO,
         Tmn, Tmx, Tsdl,Tsdr,Tsdlo,Tsdro;
   #-------------------------------------------------
   h := 2;        h2 := .7;   h3 := .45;  
   CG := 'color = COLOR(RGB, .05,.5,.05)':
   CB := 'color = COLOR(RGB, .2, .1,.5 )':
   SS := 'symbolsize = 15,symbol=BOX':
   # ------- Original DATA ----------------------------------------
   meo := evalf( describe[mean](dataset));
   sdo := evalf(describe[standarddeviation](dataset));
  
   MM := describe[range](dataset);
   mn := op(1,MM);
   mx := op(2,MM);
   #--------- Plots on Original Data -----------------------------
   LE    := plot( [[mn ,-h3],[mn+h2,-h3],[mn+h2,h3],[mn,h3],[mn,-h3]],
                color = green, style=patchnogrid, filled = true):
   RE    := plot( [[mx ,-h3],[mx-h2,-h3],[mx-h2,h3],[mx,h3],[mx,-h3]],
                color = green, style=patchnogrid, filled = true):
   MNo   := plot( [[meo ,-h2],[meo-h2,0],[meo+h2,0],[meo,-h2]],
                CG, style=patchnogrid, filled = true):
   MN2o  := plot( [[meo , -h-h2],[meo,0]], CG, linestyle = 3):
   Ln    := plot( [[mn,0],[mx+.5,0]], color = blue, thickness = 4):
   SDO   := plot( [[meo-sdo,-h-h2],[meo-sdo,-h],
                   [meo+sdo,-h],[meo+sdo,-h-h2]],
                   CG, linestyle = 3):
   #------------ Original Distribution Points ------------------
   PP||1 := pointplot( [dataset[1],-3-h3], SS,CG):
   y := h3;
   for i from 2 to nops(dataset) do
       if (dataset[i]=dataset[i-1]) then y := y+h3; else y:= h3; fi;

       PP||i := pointplot( [dataset[i],-3-y ], SS,CG ):  
   od:

   # ------- Create Sample Mean Dist -------------------------------
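   SaMn := []:
   for j from 1 to numsamples do
      SaMn := [op(SaMn),SampleMean(dataset, samplesize) ]:    od:
   # Sort so that neighbouring entries are close in value; the stacking
   # test further down compares each sample mean with the one before it.
   SaMn := sort(SaMn):
   
   # ------- Sample Mean Statistics --------------------------------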
   me   := evalf( describe[mean](SaMn));
   sd   := evalf(describe[standarddeviation](SaMn));
   MN   := plot( [[me ,h2],[me-h2,0],[me+h2,0],[me,h2]],
             color = red, style=patchnogrid, filled = true):
   MN2  := plot( [[me , h+h2],[me,0]], color = red, linestyle = 3):
   SDS  := plot( [[me-sd,h+h2],[me-sd,h],[me+sd,h],[me+sd,h+h2]],
             color = red, linestyle = 3):

   #------------ Sample Mean Distribution Points ------------------
   SP||1 := pointplot( [SaMn[1], 3+h3], SS, CB):
   y     := h3;
   for i from 2 to nops(SaMn) do  
      if (SaMn[i]-SaMn[i-1]<.3) then y := y+h3; else y:= h3; fi;
       SP||i := pointplot( [SaMn[i],3+y ], SS, CB ): od:

   #------------ TEXT ------------------
   Tmn  := plots[textplot]( [mn, -h, mn],
             align={BOTTOM,RIGHT},font=[TIMES,ROMAN,12],color = black):
   Tmx  := plots[textplot]( [mx, -h, mx],
             align={BOTTOM,LEFT},font=[TIMES,ROMAN,12], color=black):
   Tsdlo  := plots[textplot]( [meo-sdo, -h+h2, evalf(meo-sdo,2)],
             align={BOTTOM,RIGHT},font=[TIMES,ROMAN,12], CG):
   Tsdro  := plots[textplot]( [meo+sdo, -h+h2, evalf(meo+sdo,2)],
             align={BOTTOM,LEFT},font=[TIMES,ROMAN,12], CG):
   Tsdl  := plots[textplot]( [me-sd*1.5, h-h2, evalf(me-sd,2)],
             align={BOTTOM,RIGHT},font=[TIMES,ROMAN,12], color=red):
   Tsdr  := plots[textplot]( [me+sd*1.5, h-h2, evalf(me+sd,2)],
             align={BOTTOM,LEFT},font=[TIMES,ROMAN,12], color=red):


   #-----------------------------------------------------------------
   display([ MN,MN2,MNo,MN2o,Ln,LE,RE, SDS,SDO,
             Tmn, Tmx, Tsdl,Tsdr ,Tsdlo,Tsdro,        
          seq(PP||i, i = 1..nops(dataset)),
          seq(SP||i, i = 1..nops(SaMn))
         ], scaling = constrained, axes = none );
   #-----------------------------------------------------------------
end proc :


#SampleMeanDist(big_data_set, 10, 12);

>    #-------------------------------------------
Centralplot := proc(dataset)
   local meo,sdo,me,sd, MM,mn,mx,x,y,i,j,k,h,h2,h3,
      LE,RE,MNo,MN2o,PP,Ln,CG,CB,SDO,Tmn,Tmx,f,NP;
   #-------------------------------------------------
   h := 2;        h2 := .7;   h3 := .75;  
   CG := 'color = COLOR(RGB, .05,.5,.05),symbolsize = 15,symbol=BOX':
   # ------- Original DATA ----------------------------------------
   meo := evalf( describe[mean](dataset));
   sdo := evalf(describe[standarddeviation](dataset));
  
   MM := describe[range](dataset);
   mn := op(1,MM);
   mx := op(2,MM);
   #--------- Plots on Original Data -----------------------------
   LE    := plot( [[mn ,-h3],[mn+h2,-h3],[mn+h2,h3],[mn,h3],[mn,-h3]],
                color = green, style=patchnogrid, filled = true):
   RE    := plot( [[mx ,-h3],[mx-h2,-h3],[mx-h2,h3],[mx,h3],[mx,-h3]],
                color = green, style=patchnogrid, filled = true):
   MNo   := plot( [[meo ,-h2],[meo-h2,0],[meo+h2,0],[meo,-h2]],
                CG, style=patchnogrid, filled = true):
   MN2o  := plot( [[meo , -h-h2],[meo,0]], CG, linestyle = 3):
   Ln    := plot( [[mn,0],[mx+.5,0]], color = blue, thickness = 4):
   SDO   := plot( [[meo-sdo,-h-h2],[meo-sdo,-h],
                   [meo+sdo,-h],[meo+sdo,-h-h2]],
                   CG, linestyle = 3):
    #------------ TEXT ------------------
   Tmn  := plots[textplot]( [mn, -h, mn],
             align={BOTTOM,RIGHT},font=[TIMES,ROMAN,12],color = black):
   Tmx  := plots[textplot]( [mx, -h, mx],
             align={BOTTOM,LEFT},font=[TIMES,ROMAN,12], color=black):
  
   #------------ Original Distribution Points ------------------
   PP||1 := pointplot( [dataset[1],-3-h3],  CG):
   y := h3;
   for i from 2 to nops(dataset) do
       if (dataset[i]=dataset[i-1]) then y := y+h3; else y:= h3; fi;
       PP||i := pointplot( [dataset[i],-3-y ], symbolsize = 15, CG ):  
   od:
   #------------ Normal Curves ------------------
   #left := evalf( me-2*sd);
   #right := evalf( me+2*sd);
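   # Normal density with mean meo and standard deviation sdo/sqrt(n),
   # scaled by 40 so the curves are visible against the data plot.
   f := (x,n) -> 40*exp( -((x-meo)^2)/ (2*(sdo^2)/n) )
                  /((sdo/sqrt(n))*sqrt(2*Pi));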
   NP := plot([ f(x,5*k) $ k = 1..7], x =mn..mx, color = black);
   #-----------------------------------------------------------------
   display([ MNo,MN2o,Ln,LE,RE, SDO, Tmn, Tmx, NP,       
          seq(PP||i, i = 1..nops(dataset))
         ], scaling = constrained, axes = none );
   #-----------------------------------------------------------------
end proc :

>   

  1. Arbitrary Data Distributions


Any collection of data values can be expressed graphically by drawing one cell for each occurrence of a particular data value at its location on the x-axis, stacking the cells when the same value occurs more than once.

>    data;    

[3, 3, 4, 8, 8, 8, 10, 13, 15, 16, 16, 18, 21, 23, 24]
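
The height of each stack is simply the number of times that value occurs. Here is a minimal sketch of that count, using only core Maple commands. The names vals and freq are introduced here for illustration and are not part of the Code section.

```maple
data := [3,3,4,8,8,8,10,13,15,16,16,18,21,23,24]:

# Distinct values in increasing order: convert to a set to drop repeats, then sort.
vals := sort([op({op(data)})]):

# For each distinct value, count how many entries of data equal it.
# select passes vals[i] to the test as its second argument.
freq := [seq([vals[i], nops(select((x, v) -> evalb(x = v), data, vals[i]))],
             i = 1..nops(vals))];
# [[3, 2], [4, 1], [8, 3], [10, 1], [13, 1], [15, 1], [16, 2], [18, 1], [21, 1], [23, 1], [24, 1]]
```

The value 8 occurs three times, so its stack in the plot is three boxes tall.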

>    DataDist(data);

[Maple Plot]


There is a box for each data value. The minimum is 3 and the maximum is 24. The downward-facing green triangle marks the location of the mean. The dashed line, and the numbers above it, show one standard deviation above the mean, 19.5, and one standard deviation below the mean, 5.86. This is a visual representation of the original data distribution.
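
The two quantities drawn on the plot can be reproduced by hand. The following is a minimal sketch using only core Maple commands rather than the stats package. It assumes the population form of the standard deviation (dividing by the number of values), which we believe matches the default of describe[standarddeviation]. The names n, mu and sigma are chosen here for illustration.

```maple
data := [3,3,4,8,8,8,10,13,15,16,16,18,21,23,24]:
n := nops(data):                                   # 15 data values

# Mean: add up the values and divide by how many there are.
mu := add(data[i], i = 1..n) / n:                  # exact value 38/3

# Standard deviation (population form): square root of the
# average squared distance from the mean.
sigma := sqrt(add((data[i] - mu)^2, i = 1..n) / n):

evalf(mu);                                         # about 12.67
evalf(mu - sigma), evalf(mu + sigma);              # the two ends of the dashed bracket
```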


Notice that these distributions can be quite different. The next distribution is evenly distributed. Each data value has a frequency of 1, and every data value in the range is covered.

>    evenly_distributed_data;

[2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26]

>    DataDist(evenly_distributed_data);

[Maple Plot]


This next data set is another extreme - where all of the values are one value or another, with nothing in between.

>    two_extremes_data;

[3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25]

>    DataDist(two_extremes_data);

[Maple Plot]


Here is another set, which is larger.

>    big_data_set;

[42, 42, 42, 42, 42, 42, 42, 43, 43, 43, 43, 43, 43, 43, 45, 45, 45, 45, 45, 45, 45, 45, 46, 46, 46, 47, 47, 47, 47, 48, 48, 48, 48, 48, 49, 49, 49, 51, 51, 51, 51, 52, 52, 55, 55, 55, 58, 58, 58, 59, ...

>    DataDist(big_data_set);

[Maple Plot]


 

  2. Sample from a Distribution


Given a "population" which has the data values of any of the distributions we saw above, we can randomly choose a sample from that population. Each time we do it, we will most likely get a different sample.


Here is the first data population, and a number of samples of five taken from this population.

>    data;

[3, 3, 4, 8, 8, 8, 10, 13, 15, 16, 16, 18, 21, 23, 24]

>    Samples(data, 5);

[10, 8, 21, 15, 3]

>    Samples(data, 5);

[15, 8, 15, 3, 8]

>    Samples(data, 5);

[16, 8, 21, 8, 24]

>    Samples(data, 5);

[16, 10, 10, 21, 24]

>    Samples(data, 5);

[10, 8, 10, 8, 8]

>    Samples(data, 5);

[3, 16, 18, 15, 4]

>    Samples(data, 5);

[18, 8, 8, 18, 8]

Notice that each sample is different.

We can also take larger samples: 15, or even 30, which is more than the number of elements in the original data set! This works because each value is drawn with replacement, so the same value can be picked again. It is like rolling a six-sided die 8 times, or 20 times, or 200 times: the sample may be larger than the population.
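
The die analogy can be tried directly. This is only a sketch, with roll and rolls as names introduced here for illustration. The Samples procedure in the Code section draws its values in the same way, using rand(1..nops(dataset)) to pick a random position in the data set.

```maple
roll := rand(1..6):                      # roll() returns a random integer from 1 to 6

# Twenty rolls of a six-sided die: a "sample" of size 20
# drawn from a "population" of only six values.
rolls := [seq(roll(), i = 1..20)];

nops(rolls);                             # 20
```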

>    SampleMean(data, 15);

12.6667

>    Samples(data, 30);  

[18, 8, 23, 23, 3, 23, 16, 21, 13, 24, 16, 8, 3, 3, 13, 10, 21, 21, 23, 4, 21, 15, 21, 21, 24, 8, 4, 13, 24, 21]


Let's look at a few samples from some other data sets. Here is the evenly distributed set.

>    evenly_distributed_data;

[2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26]

>    Samples(evenly_distributed_data,7);

[6, 12, 22, 25, 15, 3, 2]

>    Samples(evenly_distributed_data,7);

[26, 25, 2, 21, 3, 26, 24]

>    Samples(evenly_distributed_data,7);

[15, 17, 21, 13, 13, 11, 18]

>    Samples(evenly_distributed_data,7);

[8, 22, 25, 26, 20, 21, 3]


And here are some samples from the data set with only two values. Every member of every sample must be one of those two values, but how many of each appear still varies from sample to sample.

>    two_extremes_data;

[3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25]

>    Samples(two_extremes_data,7);

[25, 3, 25, 25, 3, 25, 25]

>    Samples(two_extremes_data,7);

[3, 3, 3, 25, 3, 25, 25]

>    Samples(two_extremes_data,7);

[3, 3, 3, 3, 3, 3, 25]

>    Samples(two_extremes_data,7);

[25, 3, 25, 25, 25, 3, 3]
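
Because only two values occur, a sample of 7 is fully described by how many of its entries are 25. Here is a rough sketch using the first sample shown above. The names s, k and sampleMean are introduced here for illustration.

```maple
s := [25, 3, 25, 25, 3, 25, 25]:         # one sample of size 7

# Count the entries equal to 25; the remaining entries must all be 3.
k := nops(select(x -> evalb(x = 25), s));          # 5

# The sample mean is determined by k alone.
sampleMean := (25*k + 3*(nops(s) - k)) / nops(s);  # 131/7
evalf(sampleMean);
```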


 

  3. Sample Means


Now the next step is to take these samples that we pluck from a population, and compute the mean for each sample. You are no doubt an expert in computing means by now .... add up the numbers and divide by how many numbers there are.

>    Samples(data, 5);
%[1] + %[2] + %[3] + %[4] +%[5];
`sample mean for this sample` = evalf(%/5);

[21, 18, 16, 23, 16]

94

`sample mean for this sample` = 18.80000000

>    Samples(data, 5);
%[1] + %[2] + %[3] + %[4] +%[5];
`sample mean for this sample` = evalf(%/5);

[15, 10, 10, 8, 16]

59

`sample mean for this sample` = 11.80000000

>    Samples(data, 5);
%[1] + %[2] + %[3] + %[4] +%[5];
`sample mean for this sample` = evalf(%/5);

[21, 8, 8, 3, 10]

50

`sample mean for this sample` = 10.
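
The same add-and-divide recipe can be written once so that it works for a sample of any length, which avoids retyping the sum and the divisor each time. This is a minimal sketch; Mean is a helper name introduced here, not part of the Code section.

```maple
# Mean of a list: total of the entries divided by the number of entries.
Mean := L -> add(L[i], i = 1..nops(L)) / nops(L):

Mean([21, 18, 16, 23, 16]);              # 94/5
evalf(Mean([21, 18, 16, 23, 16]));       # 18.80000000
```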



We can also do this in an automated way: snag a sample and compute its mean in one step. These means will differ because the underlying samples differ.

>    SampleMean(data, 5);

13.

>    SampleMean(data, 5);

10.8000

>    SampleMean(data, 5);

6.80000

>    SampleMean(data, 5);

11.2000

>    SampleMean(data, 5);

13.2000

>    SampleMean(data, 5);

12.8000


What can we say about the collection of sample means we are creating?
 

  4. Distribution of Sample Means


We started with a population of data values, took samples from it, and computed the mean of each sample. The next step is to look at what happens to this new data set: the set of sample means.

Let's start out with the evenly distributed data set.

>    evenly_distributed_data;

[2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26]

>    DataDist(evenly_distributed_data);

[Maple Plot]

To review: there is a box for each data value. The minimum is 2 and the maximum is 26. The downward-facing green triangle marks the location of the mean. The dashed line, and the numbers above it, show one standard deviation above the mean and one standard deviation below the mean. This is a visual representation of the original data distribution.

Let's take 10 samples of sample size 4 from this distribution. We'll plot the original data on the bottom as before, but we'll also plot the distribution of sample means above it.

>    SampleMeanDist(evenly_distributed_data, 10, 4);

[Maple Plot]


There is a blue box for each sample mean. The mean of these sample means is indicated by the small upward-facing red triangle. The dashed red line and numbers indicate one standard deviation above and below this mean. Here "one standard deviation" refers to the distribution of sample means, not to the original data.

What do you notice? Here are a few observations:

    1. The mean of the sample means is relatively close to the mean of the original distribution. We will see
        that it tends to get closer the more samples we take and the larger the sample size we use (see below).

    2. The standard deviation of the sample means is quite a bit smaller than the standard deviation of the
        original data. You may also notice that the range of the sample means is much smaller than the range
        of the original data. Both are indications that the sample means cluster around the mean
        much more closely than the original data does.
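
Both observations can be checked numerically without the plot. The sketch below is an illustration only: it rebuilds the evenly distributed data under the name pop, draws 200 samples of size 4 with replacement, and compares the mean and standard deviation of the sample means with those of the original data. All names (pop, pick, Mean, SD, OneMean, means) are introduced here and are not part of the Code section, and the numbers will differ from run to run.

```maple
pop := [seq(i, i = 2..26)]:               # the evenly distributed data set
pick := rand(1..nops(pop)):               # pick() returns a random index into pop

# Mean and population standard deviation of a list of numbers.
Mean := L -> add(L[i], i = 1..nops(L)) / nops(L):
SD := proc(L) local m, i;
   m := Mean(L);
   sqrt(add((L[i] - m)^2, i = 1..nops(L)) / nops(L));
end proc:

# One sample mean: the average of `size` values drawn with replacement.
OneMean := size -> add(pop[pick()], i = 1..size) / size:

means := [seq(OneMean(4), j = 1..200)]:   # 200 sample means, sample size 4

evalf(Mean(pop)), evalf(Mean(means));     # observation 1: these should be close
evalf(SD(pop)),   evalf(SD(means));       # observation 2: the second should be clearly smaller
```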

Let's see how this changes if we take twice as many samples, 20, of the same size, 4.

>    SampleMeanDist(evenly_distributed_data, 20, 4);

[Maple Plot]


Or the same number of samples as before, 10, but now with sample size of 12 instead of 4.

>    SampleMeanDist(evenly_distributed_data,  10, 12);

[Maple Plot]



Hopefully, in both cases you saw "improvements" to the sample mean distribution. Taking more samples tends to bring the mean of the sample means closer to the mean of the original data, while using a larger sample size is what makes the standard deviation of the sample means smaller. We'll get even better results if we increase both.

>    SampleMeanDist(evenly_distributed_data,  20, 12);

[Maple Plot]



       Another data set ....


What would happen if we tried the same thing but with the data set of only two extreme values?

>    two_extremes_data;

[3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25]

Samples, and sample means look like this :

>    Samples(two_extremes_data,7);
%[1] + %[2] + %[3] + %[4] + %[5]+ %[6] +%[7];
`sample mean for this sample` = evalf(%/7);

[25, 25, 25, 3, 3, 25, 25]

131

`sample mean for this sample` = 18.71428571

>    Samples(two_extremes_data,7);
%[1] + %[2] + %[3] + %[4] + %[5]+ %[6] +%[7];
`sample mean for this sample` = evalf(%/7);

[3, 25, 3, 3, 3, 3, 3]

43

`sample mean for this sample` = 6.142857143


Let's take 10 samples of size 3, then increase the number of samples and the sample size. We should see a progression. The bottom chart (in green) will not change because it refers to the original data distribution. However, we should see the data on the top (in blue) clustering more and more toward the center, with the red lines of the standard deviation shrinking.

>    SampleMeanDist(two_extremes_data, 10, 3);

[Maple Plot]

>    SampleMeanDist(two_extremes_data, 15, 6);

[Maple Plot]

>    SampleMeanDist(two_extremes_data, 20, 9);

[Maple Plot]

>    SampleMeanDist(two_extremes_data, 3, 12);

[Maple Plot]



      Other data sets ....

>    SampleMeanDist(data, 10, 5);

[Maple Plot]

>    SampleMeanDist(data, 20, 10);

[Maple Plot]

>    SampleMeanDist( big_data_set, 10, 5);

[Maple Plot]

>    SampleMeanDist( big_data_set, 20, 10);

[Maple Plot]


 

  5. The Central Limit Theorem


The Central Limit Theorem says that the distribution of sample means is approximately normal once the sample size is large enough, no matter what the distribution of the original data is! This is particularly handy when we don't know the true mean (which we did know above). If we can take samples, find their means, and then take the mean of the sample means, this value will be a good approximation to the true mean. The key to this is the sample size. The standard deviation of the sample means is the standard deviation of the original data divided by the square root of n, the sample size. Thus as the sample size increases, the standard deviation of the sample means decreases.
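
To see how quickly the spread shrinks, the predicted standard deviation of the sample means (the standard deviation of the original data divided by the square root of n) can be tabulated for several sample sizes. This is a minimal sketch for the first data set, assuming the population form of the standard deviation. The names N, mu, sigma and se are introduced here for illustration.

```maple
data := [3,3,4,8,8,8,10,13,15,16,16,18,21,23,24]:
N  := nops(data):
mu := add(data[i], i = 1..N) / N:
sigma := sqrt(add((data[i] - mu)^2, i = 1..N) / N):   # population standard deviation

# Predicted standard deviation of the sample means for sample size n.
se := n -> evalf(sigma / sqrt(n), 4):

[seq([5*k, se(5*k)], k = 1..7)];     # pairs [n, sigma/sqrt(n)] for n = 5, 10, ..., 35
```

Going from n = 5 to n = 20 multiplies the sample size by four, and so cuts the predicted spread in half.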

Here are some plots which show the original data along with the normal curves for the sample means for n = 5, 10, 15, 20, 25, 30, 35.

>    Centralplot(data);

[Maple Plot]

>    Centralplot(two_extremes_data);

[Maple Plot]

>    Centralplot(evenly_distributed_data);

[Maple Plot]

>    Centralplot(big_data_set);

[Maple Plot]


 


          2002 Waterloo Maple Inc & Gregory Moore, all rights reserved.