 Statistics - Maple Programming Help

 Statistics

The updates for Statistics in Maple 2015 include several new commands, as well as added support in the context menu for matrix data sets, and new and improved visualizations.

Lowess

Lowess (locally weighted scatterplot smoothing) is used for plotting a smoothed curve or surface and has been an available option for both ScatterPlot and ScatterPlot3D for several releases. In Maple 2015, the new Lowess command returns the function whose graph is the lowess smoothed curve or surface. Returning a function rather than a plot also means that the Lowess command is capable of handling data points in any finite dimension. The lowess algorithm has also been improved to produce better plots and to achieve lower computation times in routines like ScatterPlot and ScatterPlot3D.
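To make the idea of "returning a function" concrete, the following is a minimal Python sketch of a lowess-style smoother for one independent variable. It is not Maple's implementation: the tricube weights, the degree-1 local fits, the absence of robustness iterations, and the names `lowess`, `fit`, and `smooth` are all assumptions made for illustration.

```python
import math

def lowess(xs, ys, bandwidth=0.3):
    """Return a function f(x) that evaluates a locally weighted linear fit.

    Assumptions of this sketch: one independent variable, tricube weights,
    degree-1 local fits, and no robustness iterations.
    """
    n = len(xs)
    k = max(2, math.ceil(bandwidth * n))  # number of neighbours per local fit

    def fit(x0):
        # The distance to the k-th nearest point sets the window width.
        # It is inflated slightly so that this point keeps a nonzero weight.
        dists = sorted(abs(x - x0) for x in xs)
        h = dists[k - 1] * 1.0001 or 1e-12
        # Tricube weights: 1 at x0, falling to 0 at distance h.
        w = [max(0.0, 1 - (abs(x - x0) / h) ** 3) ** 3 for x in xs]
        sw = sum(w)
        mx = sum(wi * x for wi, x in zip(w, xs)) / sw
        my = sum(wi * y for wi, y in zip(w, ys)) / sw
        sxx = sum(wi * (x - mx) ** 2 for wi, x in zip(w, xs))
        sxy = sum(wi * (x - mx) * (y - my) for wi, x, y in zip(w, xs, ys))
        slope = sxy / sxx if sxx > 0 else 0.0
        return my + slope * (x0 - mx)

    return fit

# Noise-free linear data: the smoother should reproduce the line.
xs = [i / 10 for i in range(101)]
ys = [2 * x + 1 for x in xs]
smooth = lowess(xs, ys, bandwidth=0.3)
print(smooth(3.3))
```

Because `smooth` is an ordinary callable, it can be plotted, integrated, or optimized like any other function, which is the design point the paragraph above makes about the Lowess command.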

Following is an example of the use of Lowess together with ScatterPlot3D. First, generate 300 data points and store them in a 300x3 Matrix, $M$.

$\mathrm{with}\left(\mathrm{Statistics}\right):$

$Y≔\mathrm{Sample}\left(\mathrm{Uniform}\left(-50,50\right),300\right):$

$\mathrm{Zerror}≔\mathrm{Sample}\left(\mathrm{Normal}\left(0,100\right),300\right):$

$Z≔\mathrm{Vector}\left['\mathrm{row}'\right]\left(300,i\to -\left(\mathrm{sin}\left(\frac{{Y}_{i}}{20}\right){\left({X}_{i}-6\right)}^{2}+{\left({Y}_{i}-7\right)}^{2}+{\mathrm{Zerror}}_{i}\right)\right):$

 $\left[\begin{array}{c}\mathrm{300\ x\ 3\ Matrix}\\ \mathrm{Data\ Type:\ anything}\\ \mathrm{Storage:\ rectangular}\\ \mathrm{Order:\ Fortran\_order}\end{array}\right]$ (1.1)

Now, compute the lowess model, $L$, and plot $L$.

$L≔\mathrm{Lowess}\left(M\right):$

Finally, plot the data points by themselves, and display the two plots together.

$Q≔\mathrm{ScatterPlot3D}\left(M\right):$

$\mathrm{plots}:-\mathrm{display}\left(P,Q,\mathrm{lightmodel}=\mathrm{none},\mathrm{orientation}=\left[20,70,0\right],\mathrm{view}=\left[-50..50,-50..50,-4000..1500\right]\right);$

The lowess model can also be used in most other contexts where you can use a procedure. For example, you can numerically integrate the volume between the lowess surface and the plane $z=0$ (this turns out to be negative - there is more volume below zero than above):

 ${-}{918.771690686256}$ (1.2)

In the following example, the data points are two-dimensional (there is one independent and one dependent variable).

$X≔\mathrm{Sample}\left(\mathrm{Uniform}\left(0,50\right),500\right):$

$\mathrm{Zerror}≔\mathrm{Sample}\left(\mathrm{Normal}\left(0,0.0005\right),500\right):$

$L≔\mathrm{Lowess}\left(X,Z,\mathrm{bandwidth}=0.3\right):$

$P≔\mathrm{ScatterPlot}\left(X,Z\right):$

$Q≔\mathrm{plot}\left(L,0..50\right):$

$\mathrm{plots}:-\mathrm{display}\left(P,Q\right);$

If these data points represent the response to a certain stimulus, you can expect the highest response where the model assumes the highest value. You can find this point with the Maximize command from the Optimization package.

 $\left[{0.0126721516373243}{,}\left[\begin{array}{c}{22.3264562269840}\end{array}\right]\right]$ (1.3)

The maximum, about 0.013, is attained when the stimulus is approximately 22. (The numbers vary slightly depending on the random samples chosen previously.)
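As a rough illustration of maximizing a fitted model that is only available as a function, here is a small Python sketch. It is not the algorithm used by Maple's Maximize command: it combines a coarse grid scan with a golden-section search, and the function `response` is a made-up stand-in for a fitted model, not the model computed above.

```python
import math

def maximize_on_interval(f, lo, hi, samples=200, tol=1e-8):
    """Approximate the maximum of f on [lo, hi].

    A coarse grid scan picks the most promising bracket, then a
    golden-section search refines it. This assumes f is unimodal inside
    the bracket; it is not a global optimizer.
    """
    step = (hi - lo) / samples
    grid = [lo + i * step for i in range(samples + 1)]
    best = max(range(samples + 1), key=lambda i: f(grid[i]))
    a = grid[max(best - 1, 0)]
    b = grid[min(best + 1, samples)]
    inv_phi = (math.sqrt(5) - 1) / 2
    while b - a > tol:
        c = b - inv_phi * (b - a)
        d = a + inv_phi * (b - a)
        if f(c) > f(d):
            b = d  # the maximum lies in [a, d]
        else:
            a = c  # the maximum lies in [c, b]
    x = (a + b) / 2
    return f(x), [x]

def response(x):
    # Hypothetical smooth response curve with a single peak at x = 22.
    return 0.0127 * math.exp(-((x - 22.0) / 15.0) ** 2)

value, point = maximize_on_interval(response, 0.0, 50.0)
print(value, point)
```

The return value pairs the maximum with the location where it is attained, mirroring the shape of the result in (1.3).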

Robust regression

Robust statistics are statistical procedures that give reliable results in the presence of noise. The procedures embodied by the Maple commands Median and HodgesLehmann give robust measures for the location of a data set; those embodied by MedianDeviation, RousseeuwCrouxSn, and RousseeuwCrouxQn give robust measures for the dispersion of a data set. In Maple 2015, there is a new command that performs robust linear regression: RepeatedMedianEstimator.

This is useful when you have data points in the plane that include outliers and you want to perform linear regression, that is, to find an affine-linear expression in the independent variable that describes the dependent variable reasonably well.
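The effect of outliers on the two kinds of regression can be sketched in a few lines of Python. The code below implements the textbook repeated median estimator (the slope is the median over all points of the median pairwise slope through that point) next to ordinary least squares. Whether Maple's RepeatedMedianEstimator follows exactly this definition is an assumption, and the function names and the small data set are invented for illustration.

```python
from statistics import median

def repeated_median_line(xs, ys):
    """Repeated median estimate of (slope, intercept).

    Requires at least two distinct x values.
    """
    n = len(xs)
    per_point = []
    for i in range(n):
        # Slopes of all lines through point i and another point.
        slopes = [(ys[j] - ys[i]) / (xs[j] - xs[i])
                  for j in range(n) if j != i and xs[j] != xs[i]]
        per_point.append(median(slopes))
    slope = median(per_point)
    intercept = median(y - slope * x for x, y in zip(xs, ys))
    return slope, intercept

def least_squares_line(xs, ys):
    """Ordinary least-squares estimate of (slope, intercept)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    return slope, my - slope * mx

# Points on the line y = 2x + 1, with two values replaced by low outliers.
xs = list(range(10))
ys = [2.0 * x + 1.0 for x in xs]
ys[3] = -20.0
ys[7] = -35.0

rm_slope, rm_intercept = repeated_median_line(xs, ys)
ls_slope, ls_intercept = least_squares_line(xs, ys)
print(rm_slope, rm_intercept)
print(ls_slope, ls_intercept)
```

On this data the repeated median recovers the line through the eight undisturbed points, while the least-squares slope is pulled well away from 2 by the two outliers.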

 $\left[\begin{array}{c}\mathrm{1\ ..\ 500\ }{\mathrm{Vector}}_{\mathrm{column}}\\ \mathrm{Data\ Type:\ anything}\\ \mathrm{Storage:\ rectangular}\\ \mathrm{Order:\ Fortran\_order}\end{array}\right]$ (2.1)

You see that most points are close to the upper boundary given by the curve, but there are some points substantially lower than that. The standard, least-squares regression gives those lower points a substantial weight (their distance from the regression line is squared). The repeated median estimator, however, treats the outliers as "abnormal" values that get less weight; it lies closer to the majority of points.

 ${0.0965727031076460}{}{x}{+}{5.58041733925551}$ (2.2)

 ${5.53282063219485}{+}{0.127503338071007}{}{x}$ (2.3)

Scale

The Scale command is used to center and scale a sample list or matrix. By default, when a sample is centered, the mean of the list is subtracted from each of the observations. When a sample is scaled, all of the observations are divided by the standard deviation of the sample. This has useful applications in statistics, such as to compute standard scores.

For example, say that you have a list of percent grades in decimal form which are scored out of a possible maximum value of 1.

 $\left[{0.34}{,}{0.55}{,}{0.61}{,}{0.75}{,}{0.80}{,}{0.91}\right]$ (3.1)

$\mathrm{with}\left(\mathrm{Statistics}\right):$

A consequence of centering and scaling a sample of data is that the resulting sample has mean 0 and standard deviation 1 (as a standard normal distribution does). Centering and scaling this list shows, for example, that a score of 0.80 is about 0.69 standard deviations above the mean:

$\mathrm{Scale}\left(\mathrm{Grades}\right)$

 $\left[\begin{array}{c}{-}{1.57195498378372}\\ {-}{0.540359525675653}\\ {-}{0.245617966216206}\\ {0.442112339189171}\\ {0.687730305405377}\\ {1.22808983108103}\end{array}\right]$ (3.2)
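The numbers in (3.2) can be checked by hand. The following Python sketch subtracts the mean and divides by the sample standard deviation (with the n - 1 denominator, which is what reproduces the values above). The function `scale_sample` and its `center` and `scale` flags are hypothetical names for this illustration and are not the option names of the Maple command.

```python
from statistics import mean, stdev

def scale_sample(data, center=True, scale=True):
    """Center and/or scale a sample.

    Centering subtracts the mean; scaling divides by the sample standard
    deviation (n - 1 denominator) of the original data.
    """
    m = mean(data) if center else 0.0
    s = stdev(data) if scale else 1.0
    return [(x - m) / s for x in data]

grades = [0.34, 0.55, 0.61, 0.75, 0.80, 0.91]

z_scores = scale_sample(grades)                    # centered and scaled
centered_only = scale_sample(grades, scale=False)  # data minus the mean
scaled_only = scale_sample(grades, center=False)   # data divided by the std. dev.

for grade, z in zip(grades, z_scores):
    print(grade, z)
```

The mean of these grades is 0.66 and the sample standard deviation is about 0.2036, so the grade 0.80 maps to (0.80 - 0.66) / 0.2036, which is the fifth entry of (3.2).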

If you assume that the population of all grades is normally distributed, and that you have a representative sample of them here, then these data suggest that the score corresponding to 0.80 is in the top 25% of the grades, as you can see by using a standard normal distribution table:

 ${0.754188683982649}$ (3.3)
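The table lookup can also be done numerically. As a small cross-check in Python (using only the standard library, and taking the standard score of the grade 0.80 from (3.2)), the standard normal cumulative distribution function gives the share of grades expected to fall below that score:

```python
from statistics import NormalDist

z = 0.687730305405377  # standard score of the grade 0.80, taken from (3.2)

share_below = NormalDist().cdf(z)  # standard normal CDF, mean 0 and sd 1
share_above = 1 - share_below

print(share_below)
print(share_above)
```

About 75.4% of the distribution lies below this score, so slightly less than 25% lies above it, which is what places the grade in the top 25%.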

The following graph illustrates how various options for the Scale command scale and center samples of data:

The red line shows the original data and the orange line shows the data minus the center value. The dark blue line shows the data divided by the scale value (the standard deviation of the sample) and the light blue line shows the data minus the center then divided by the scale value (the result of the command used above to compute the standard scores).

Visualizations

There are several notable updates to visualizations in Statistics, including a change to the default axes style for all statistics visualizations to boxed axes. The new dataplot command merges many statistics plots into one convenient place, making it easier than ever to create plots using samples of data. An example of the dataplot command can be seen in the preceding section.

TreeMap is a new visualization that creates a unique view on the structure of data samples. A tree map is a method of data visualization using nested rectangles, in which the area of a rectangle corresponds to the magnitude of the corresponding datum.
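The area rule can be illustrated with a deliberately simple layout. The Python sketch below cuts a unit rectangle into vertical strips whose areas are proportional to the data values. This "slice" layout is an assumption chosen for brevity; it does not nest rectangles and is not claimed to be the layout Maple's TreeMap uses.

```python
def slice_layout(values, width=1.0, height=1.0):
    """Split a width-by-height rectangle into vertical strips.

    Returns one (x, y, w, h) tuple per value; each strip's area is
    proportional to its value. Practical tree maps usually use more
    elaborate layouts (for example squarified ones) to avoid thin strips.
    """
    total = sum(values)
    rects, x = [], 0.0
    for v in values:
        w = width * v / total
        rects.append((x, 0.0, w, height))
        x += w
    return rects

values = [10, 5, 15, 20, 25]
rects = slice_layout(values)
areas = [w * h for _, _, w, h in rects]

for v, a in zip(values, areas):
    print(v, a)
```

With these values the largest datum (25) receives five times the area of the smallest (5).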

$\mathrm{with}\left(\mathrm{Statistics}\right):$

$\mathrm{TreeMap}\left(\left[1=10,2=5,3=15,4=20,5=25\right]\right)$

The BubblePlot command has also been updated to handle TimeSeries objects. When given a time series, BubblePlot will create an animation that uses the corresponding time series to define the position and size of the bubbles over time. The following plot, taken from the BubblePlot help page, shows the 'Services Share of GDP' vs 'Industry Share of GDP' for several countries over a one year time period as an animation. The bubble size corresponds to the GDP (PPP) value. To view the animation, right-click and choose Animation - Play.

Code Generation for R

Maple 2015 introduces the CodeGeneration[R] command, making it easy to translate Maple code to R. In addition to being able to translate fundamental programming structures, CodeGeneration[R] can also translate many common unevaluated commands from statistics.

$\mathrm{with}\left(\mathrm{CodeGeneration}\right):$

$R\left('\mathrm{Statistics}:-\mathrm{Mean}\left(\mathrm{Matrix}\left(\left[\left[2,4,8,21\right]\right]\right)\right)'\right)$

 cg <- mean(matrix(c(2,4,8,21),nrow=1,ncol=4))

$R\left('\mathrm{Statistics}:-\mathrm{FivePointSummary}\left(\left[1,3,5,7,9\right]\right)'\right):$

 cg0 <- fivenum(c(1,3,5,7,9))

For more details, see the CodeGeneration updates page.

Several Statistics commands, including DataSummary, FivePointSummary, and FrequencyTable, have been updated. DataSummary and FivePointSummary now return a column vector rather than a list, making the results of each command much easier to read:

 > $\mathrm{with}\left(\mathrm{Statistics}\right):$
 > $\mathrm{DataSummary}\left(\mathrm{Sample}\left(\mathrm{Normal}\left(10,5\right),100\right)\right)$
 $\left[\begin{array}{c}{\mathrm{mean}}{=}{10.0147293017982}\\ {\mathrm{standarddeviation}}{=}{4.60247520722689}\\ {\mathrm{skewness}}{=}{0.475553379489542}\\ {\mathrm{kurtosis}}{=}{2.45784639636036}\\ {\mathrm{minimum}}{=}{1.82571622828048}\\ {\mathrm{maximum}}{=}{21.0668979469518}\\ {\mathrm{cumulativeweight}}{=}{100.}\end{array}\right]$ (5.2.1)
 > $\mathrm{FivePointSummary}\left(\mathrm{Sample}\left(\mathrm{Rayleigh}\left(3\right),100\right)\right)$
 $\left[\begin{array}{c}{\mathrm{minimum}}{=}{0.640201262477202}\\ {\mathrm{lowerhinge}}{=}{2.36864919072803}\\ {\mathrm{median}}{=}{3.46043611600668}\\ {\mathrm{upperhinge}}{=}{4.63617294297510}\\ {\mathrm{maximum}}{=}{11.2182079331748}\end{array}\right]$ (5.2.2)
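For readers unfamiliar with hinges, the following Python sketch computes a five-number summary using Tukey's hinge rule, the convention followed by R's `fivenum`. Whether Maple's FivePointSummary uses exactly the same rule for every sample size is an assumption here, and the function name is invented for illustration.

```python
import math

def five_number_summary(data):
    """Minimum, lower hinge, median, upper hinge, maximum (Tukey's hinges)."""
    x = sorted(data)
    n = len(x)
    n4 = ((n + 3) // 2) / 2
    # 1-based positions in the sorted data; a fractional position means
    # the average of the two neighbouring observations.
    positions = [1, n4, (n + 1) / 2, n + 1 - n4, n]
    return [0.5 * (x[math.floor(p) - 1] + x[math.ceil(p) - 1])
            for p in positions]

summary = five_number_summary([1, 3, 5, 7, 9])
print(summary)
```

For the list used in the code generation example, the hinges are simply the second and fourth observations.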

The FrequencyTable command has a new option, headers, which controls the display of a header row of information:

 >
 $\left[\begin{array}{ccccc}{\mathrm{Range}}& {\mathrm{Absolute Frequency}}& {\mathrm{Percentage}}& {\mathrm{Cumulative Frequency}}& {\mathrm{Cumulative Percentage}}\\ {2.}{..}{35.}& {30.}& {30.00000000}& {30.}& {30.00000000}\\ {35.}{..}{68.}& {36.}& {36.00000000}& {66.}& {66.00000000}\\ {68.}{..}{101.}& {34.}& {34.00000000}& {100.}& {100.0000000}\end{array}\right]$ (5.2.3)
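The columns of such a table are straightforward to compute. As a sketch only, the Python code below bins a sample into equal-width ranges and derives the absolute frequency, percentage, cumulative frequency, and cumulative percentage for each range. The binning convention (left-closed bins, with the maximum placed in the last bin), the function name, and the sample data are assumptions and may differ from what FrequencyTable does.

```python
def frequency_table(data, bins=3):
    """Return rows of (lower, upper, count, percent, cum_count, cum_percent).

    Assumptions of this sketch: equal-width bins spanning min..max, each
    bin closed on the left, and the last bin also containing the maximum.
    The data must contain at least two distinct values.
    """
    lo, hi = min(data), max(data)
    width = (hi - lo) / bins
    counts = [0] * bins
    for x in data:
        k = min(int((x - lo) / width), bins - 1)
        counts[k] += 1
    n = len(data)
    rows, running = [], 0
    for k, c in enumerate(counts):
        running += c
        rows.append((lo + k * width, lo + (k + 1) * width,
                     c, 100 * c / n, running, 100 * running / n))
    return rows

rows = frequency_table([2, 10, 35, 40, 50, 68, 70, 90, 101, 101])
for row in rows:
    print(row)
```

The cumulative columns are running totals of the frequency and percentage columns, so the last row always ends at the sample size and at 100 percent.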

Statistics Education

Maple 2015 includes numerous updates geared toward the classroom. These include a new palette for quick creation of random variables, new commands and tutors for working with common probability distribution tables and tables of critical values, and many new MathApps:

 Chi-Square Distribution
 Confidence Intervals
 Normal Approximation of Binomial Distribution
 Rolling Two Dice
 Z-Tests
 And many more...