Statistics - Maple Programming Help

Statistics

 PCA
 principal component analysis on data

Calling Sequence

 PCA( dataset )
 PCA( dataset, options )
 PrincipalComponentAnalysis( dataset )
 PrincipalComponentAnalysis( dataset, options )

Parameters

 dataset - data set or DataFrame; Matrix or DataFrame of values with 2 or more columns
 options - (optional) equation(s) of the form option=value

Options

 • method : one of the names eigenvector or svd; controls if the principal component analysis uses the eigenvector method (on either the covariance or correlation matrix) or the singular values method. By default this is set to svd.
 • columns : non-negative integer; controls the number of dimensions of data to retain. The default is set to the same number of dimensions as the original dataset. Columns of data are discarded based on their effect on the variability in the data with the columns accounting for the most variability being discarded last.
 • tolerance : float; controls the number of dimensions of data to discard. The default is set to 0.0. Columns are discarded if their standard deviations are less than or equal to the value for tolerance multiplied by the standard deviation of the first component.
 • correlation : truefalse; controls the choice of the transformation matrix in the computation. By default, correlation is set to false, and the covariance matrix is used.
 • center : truefalse, numeric, procedure; controls if the returned list of values is centered or not. The default is set to true, which uses the Statistics[Mean] command to compute the center. If this is set to false, the list is not centered and the list of values will have the same central mean value as before. If a numeric value is entered, the list is centered using that value as its center. If a procedure is entered, the list is centered using the value returned from the procedure.  If a procedure is entered as the first value of a list, subsequent arguments in the list are passed to the procedure.
 • scale : truefalse, numeric, procedure, or identical to the name auto; controls if the returned list of values is scaled or not. The default is set to auto. When set to auto, if the correlation matrix is used in the computation, scale is automatically set to true. If the covariance matrix is used, scale is set to false. If scale is set to false, each column is not scaled and therefore has the same standard deviation as before. When set to true, the Statistics[StandardDeviation] command is used to compute the standard deviation. If a numeric value is entered, the list is scaled using that value as its standard deviation. If a procedure is entered, the list is scaled using the value returned from the procedure.  If a procedure is entered as the first value of a list, subsequent arguments in the list are passed to the procedure.
 • output : list; controls the form of the returned solution. The default is record. Output options include:
 – record : a record which contains all of the following output options. Each of the outputs can be queried using :-outputname.
 – values : the singular values or eigenvalues
 – varianceproportion : the proportion of the total variance for each value
 – stdev : the square root of the singular values or eigenvalues
 – rotation : a matrix whose columns correspond to the eigenvectors
 – principalcomponents : a matrix whose columns correspond to the resulting principal components
 • ignore : truefalse; controls how missing data is handled. Missing items are represented by undefined or Float(undefined). If ignore=false and the dataset contains missing data, the PCA command will return undefined. If ignore = true, all rows with any missing items in dataset will be removed. The default value is false. The ignore option is not passed to any procedures entered for either of the center or scale options.
 • summarize : truefalse; controls whether a summary of the singular values or eigenvalues is printed. The default is false.
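The options above can be combined in a single call. The following is a minimal sketch that uses only the option names listed in this section. The variable names (data, r, vals, pcs, gappy) are illustrative, and the handling of the undefined entry should be checked against your Maple version.

```maple
with(Statistics):

# Small illustrative data set (same values as the first example on this page).
data := Matrix([[2., -2.], [4., -1.], [5., 0.], [6., 1.], [8., 2.], [9., 4.]]):

# Eigenvector method on the correlation matrix. With scale left at its
# default of auto, the text above says scaling is turned on in this case.
r := PCA(data, method = eigenvector, correlation = true):
r:-values;

# Keep a single dimension and request two named outputs; a list-valued
# output option returns the requested items as a sequence.
vals, pcs := PCA(data, columns = 1, output = [values, principalcomponents]):

# A data set with one missing item; ignore = true removes that row
# before the analysis, and summarize = true prints the summary.
gappy := Matrix([[2., -2.], [4., undefined], [5., 0.], [6., 1.], [8., 2.], [9., 4.]]):
PCA(gappy, ignore = true, summarize = true):
```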

Description

 • The PCA command is used to perform a principal component analysis on a set of data. Principal Component Analysis transforms a multi-dimensional data set to a new set of perpendicular axes (or components) that describe decreasing amounts of variance in the data.
 • The PCA command returns a record with the fields values, varianceproportion, stdev, rotation and principalcomponents by default. The output option can be used to return only specific elements of the principal component analysis.
 • The default method is svd, which is generally the recommended method for numerical accuracy.
 • The PrincipalComponentAnalysis command is provided as an alias.
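In standard notation, the transformation described above can be written as follows. The symbols are introduced here for illustration only and are not part of the Maple interface: $X$ is the $n \times p$ data matrix, $\bar{x}$ the vector of column means, $V$ the matrix of eigenvectors, and $\Lambda$ the diagonal matrix of eigenvalues. This is a sketch of the usual covariance-based formulation, not a statement of how the command is implemented.

 $X_c = X - \mathbf{1}\,\bar{x}^{T}, \qquad C = \frac{1}{n-1}\,X_c^{T} X_c, \qquad C = V \Lambda V^{T}, \qquad Z = X_c V$

With the singular value decomposition of the centered data, the same quantities appear as

 $X_c = U \Sigma V^{T} \;\Rightarrow\; \lambda_i = \frac{\sigma_i^{2}}{n-1}$

Under the default settings, the numbers in the first example on this page are consistent with rotation holding $V$, values holding the diagonal of $\Lambda$, and principalcomponents holding $Z$.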

Notes

 • To print out the results of the principal component analysis, set infolevel[Statistics] to 1, or use the summarize option.

Examples

 > $\mathrm{with}\left(\mathrm{Statistics}\right):$

If infolevel[Statistics] is set to 1, the PCA command prints a summary of the results.

 > ${\mathrm{infolevel}}_{\mathrm{Statistics}}≔1:$
 > $\mathrm{data}≔⟨⟨2,4,5,6,8,9⟩|⟨-2,-1,0,1,2,4⟩⟩$
 $\left[\begin{array}{rr}2& -2\\ 4& -1\\ 5& 0\\ 6& 1\\ 8& 2\\ 9& 4\end{array}\right]$ (1)
 > $\mathrm{PCAnalysis}≔\mathrm{PCA}\left(\mathrm{data}\right):$
 summary:
 Values      proportion of variance   St. Deviation
 11.2240     0.9904                   3.3502
 0.1093      0.0096                   0.3306
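As a quick check of how the three summary columns relate, the proportions and standard deviations can be recomputed by hand from the printed values (rounded to four decimals, as in the summary):

 $\frac{11.2240}{11.2240 + 0.1093} = \frac{11.2240}{11.3333} \approx 0.9904, \qquad \frac{0.1093}{11.3333} \approx 0.0096$

 $\sqrt{11.2240} \approx 3.3502, \qquad \sqrt{0.1093} \approx 0.3306$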

The rotation matrix from the principal component analysis:

 > $\mathrm{PCAnalysis}:-\mathrm{rotation}$
 $\left[\begin{array}{cc}0.7680953681125183& -0.6403354632410225\\ 0.6403354632410225& 0.7680953681125183\end{array}\right]$ (2)

The principal components can be returned using :-principalcomponents.

 > $\mathrm{PCAnalysis}:-\mathrm{principalcomponents}$
 $\left[\begin{array}{cc}-4.523910918388628& 0.29964238358370077\\ -2.3473847189225685& -0.21293317478582607\\ -0.9389538875690274& -0.08517326991433033\\ 0.46947694378451343& 0.0425866349571655\\ 2.6460031432505726& -0.4699889234123611\\ 4.694769437845136& 0.4258663495716526\end{array}\right]$ (3)
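The first row of this matrix can be reproduced by hand, assuming the default centering by the column means, which are $\left(\tfrac{17}{3}, \tfrac{2}{3}\right)$ for this data. The first observation $(2, -2)$ centers to $\left(-\tfrac{11}{3}, -\tfrac{8}{3}\right)$, and multiplying by the columns of the rotation matrix (entries rounded to five decimals here) gives:

 $\left(-\tfrac{11}{3}\right)(0.76810) + \left(-\tfrac{8}{3}\right)(0.64034) \approx -4.5239$

 $\left(-\tfrac{11}{3}\right)(-0.64034) + \left(-\tfrac{8}{3}\right)(0.76810) \approx 0.2996$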

The following plot shows the original data set (in red) and the results from the principal component analysis.

 > $\mathrm{plots}:-\mathrm{display}\left(\mathrm{dataplot}\left(\mathrm{data},'\mathrm{points}',\mathrm{color}="Red"\right),\mathrm{dataplot}\left(\mathrm{PCAnalysis}:-\mathrm{principalcomponents},'\mathrm{points}'\right)\right)$

One use for principal component analysis is to eliminate dimensions from the data. The following plot shows the original two dimensions of data, X and Y, as well as the two resulting principal components. It can be observed that from the principal component analysis, the 2nd component has the least effect on the variance, suggesting that it can be removed.

 > $\mathrm{plotopts}≔\mathrm{size}=\left[600,100\right],\mathrm{view}=\left[-10..10,\mathrm{default}\right],\mathrm{symbol}=\mathrm{solidcircle},\mathrm{symbolsize}=20:$
 > $\mathrm{splots}≔\left[\left[\mathrm{Statistics}:-\mathrm{ScatterPlot}\left({\mathrm{data}}_{\left(\right)..\left(\right),1},\mathrm{color}="Red",\mathrm{plotopts},\mathrm{title}="X"\right),\mathrm{Statistics}:-\mathrm{ScatterPlot}\left({\mathrm{PCAnalysis}:-\mathrm{principalcomponents}}_{\left(\right)..\left(\right),1},\mathrm{color}="Orange",\mathrm{plotopts},\mathrm{title}="1st Component"\right)\right],\left[\mathrm{Statistics}:-\mathrm{ScatterPlot}\left({\mathrm{data}}_{\left(\right)..\left(\right),2},\mathrm{color}="DarkBlue",\mathrm{plotopts},\mathrm{title}="Y"\right),\mathrm{Statistics}:-\mathrm{ScatterPlot}\left({\mathrm{PCAnalysis}:-\mathrm{principalcomponents}}_{\left(\right)..\left(\right),2},\mathrm{color}="RoyalBlue",\mathrm{plotopts},\mathrm{title}="2nd Component"\right)\right]\right]:$
 > $\mathrm{DocumentTools}:-\mathrm{Tabulate}\left(\mathrm{splots}\right)$
 ${"Tabulate"}$ (4)

 > ${\mathrm{infolevel}}_{\mathrm{Statistics}}≔0:$

The following example performs a principal component analysis on multi-dimensional data. The components that have the least impact on the variance are discarded, and the simplified data is reconstructed from the remaining components.

 > $\mathrm{data2}≔⟨⟨2.5|2.4|10.5⟩,⟨0.5|0.7|0.785⟩,⟨2.2|2.9|1.286⟩,⟨1.9|2.2|2.35⟩,⟨3.1|3.0|2.202⟩,⟨2.3|2.7|1.351⟩,⟨2.0|1.6|2.021⟩,⟨1.0|1.1|1.247⟩,⟨1.5|1.6|2.503⟩,⟨1.1|0.9|1.214⟩⟩$
 $\left[\begin{array}{ccc}2.5& 2.4& 10.5\\ 0.5& 0.7& 0.785\\ 2.2& 2.9& 1.286\\ 1.9& 2.2& 2.35\\ 3.1& 3.0& 2.202\\ 2.3& 2.7& 1.351\\ 2.0& 1.6& 2.021\\ 1.0& 1.1& 1.247\\ 1.5& 1.6& 2.503\\ 1.1& 0.9& 1.214\end{array}\right]$ (5)

The columns option keeps the columns of data with the greatest variance. Here we discard the component with the least amount of impact on the variability of the dataset.

 > $\mathrm{PCAnalysis2}≔\mathrm{PCA}\left(\mathrm{data2},\mathrm{columns}=2,\mathrm{summarize}=\mathrm{true}\right):$
 summary:
 Values      proportion of variance   St. Deviation
 8.3208      0.8782                   2.8846
 1.1119      0.1173                   1.0545
 0.0425      0.0045                   0.2061
 > $\mathrm{PCAnalysis2}:-\mathrm{principalcomponents}$
 $\left[\begin{array}{cc}7.988187958806683& 0.4132208576658364\\ -2.0180755842989333& 1.4796982783347263\\ -1.1005453745704648& -1.1860396370662412\\ -0.15437575221897967& -0.3050805395865079\\ -0.07464758865229935& -1.701349118990769\\ -1.0432107273760673& -1.091188102113229\\ -0.5246774108110052& 0.02801275103728596\\ -1.4612653606551966& 0.9287708944738897\\ -0.11067536617282643& 0.4254297822691999\\ -1.5007147940509051& 1.008524833975807\end{array}\right]$ (6)

The data can be reconstructed using the principal components:

 > $\mathrm{EVectors}≔{\mathrm{PCAnalysis2}:-\mathrm{rotation}}_{1..3,1..2}$
 $\left[\begin{array}{cc}0.12403523407937907& -0.6463230214959147\\ 0.09631199526958398& -0.7473506291701019\\ 0.9875926590827139& 0.15405709643974397\end{array}\right]$ (7)
 > $\mathrm{TempMatrix}≔\mathrm{.}\left(\mathrm{EVectors},{\mathrm{PCAnalysis2}:-\mathrm{principalcomponents}}^{\mathrm{%T}}\right):$
 > $\mathrm{RecData}≔\mathrm{Matrix}\left(\mathrm{upperbound}\left(\mathrm{data2}\right),\left(i,j\right)→{\mathrm{TempMatrix}}_{j,i}+{\mathrm{Mean}\left(\mathrm{data2}\right)}_{j}\right)$
 $\left[\begin{array}{ccc}2.5337426100689475& 2.3705374529383647& 10.498635393010748\\ 0.6033244603559988& 0.6097816745759637& 0.7808213878394505\\ 2.4400583186927456& 2.6903917480725412& 1.2762916443379002\\ 1.9880325435824946& 2.123133896490078& 2.3464398182591957\\ 2.900362172073577& 3.174314876310583& 2.210073684126728\\ 2.3858651044140053& 2.625026408017028& 1.3475274730858982\\ 1.726816228643063& 1.8385319245794116& 2.0320480037846194\\ 1.0284655981452167& 1.0751451051570298& 1.2458488041054643\\ 1.521307292739354& 1.581395419210117& 2.5021382978218103\\ 0.9720256712845996& 1.0117414946488823& 1.2191754936281867\end{array}\right]$ (8)
 > $\mathrm{plots}:-\mathrm{display}\left(\mathrm{dataplot}\left(\mathrm{data2},'\mathrm{points}',\mathrm{color}="Red"\right),\mathrm{dataplot}\left(\mathrm{convert}\left(\mathrm{RecData},\mathrm{Matrix}\right),'\mathrm{points}'\right)\right)$
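The reconstruction steps above amount to the following formula. The symbols are illustrative: $Z_k$ is the matrix of retained principal components, $V_k$ the corresponding columns of the rotation matrix (EVectors), and $\bar{x}$ the vector of column means of data2.

 $\hat{X} = Z_k V_k^{T} + \mathbf{1}\,\bar{x}^{T}$

As a spot check on the first entry, the mean of the first column of data2 is $1.81$, so with the printed values rounded to six decimals:

 $\hat{x}_{11} \approx 1.81 + (7.988188)(0.124035) + (0.413221)(-0.646323) \approx 1.81 + 0.9908 - 0.2671 = 2.5337$

This agrees with the first entry of RecData and is close to the original value of $2.5$; the difference is the contribution of the discarded third component.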

The correlation option is used to compute the principal components using the correlation matrix instead of the covariance matrix. This is often done while using the eigenvector method.

 > $\mathrm{PCA}\left(\mathrm{data2},\mathrm{method}=\mathrm{eigenvector},\mathrm{correlation}=\mathrm{true},\mathrm{output}=\left[\mathrm{stdev},\mathrm{principalcomponents}\right]\right)$
 $\left[\begin{array}{c}{1.45983908817652}\\ {0.897336215225643}\\ {0.252304485644745}\end{array}\right]{,}\left[\begin{array}{ccc}{-2.04541064435793}& {2.16534111053998}& {0.0729540898479645}\\ {2.25951639743765}& {0.275289463146514}& {0.162010188630925}\\ {-0.901882664152099}& {-0.919228152471263}& {0.388677138699832}\\ {-0.267541733683770}& {-0.207824745166884}& {0.142997286023826}\\ {-1.86275965676923}& {-0.892908941712997}& {-0.323207958436754}\\ {-0.844508446664937}& {-0.838741654702193}& {0.138096255933807}\\ {0.145431463540588}& {-0.0841358358830445}& {-0.445280563739075}\\ {1.47259141480503}& {0.129163959227000}& {0.0420925855261004}\\ {0.501121236599885}& {0.194747203264653}& {0.0340118771919902}\\ {1.54344263324481}& {0.178297593758235}& {-0.212350899678612}\end{array}\right]$ (9)
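For reference, the correlation matrix is the covariance matrix of the standardized columns. In illustrative notation, with $C$ the covariance matrix and $D = \mathrm{diag}(s_1, \ldots, s_p)$ the column standard deviations:

 $R = D^{-1} C\, D^{-1}, \qquad \mathrm{tr}(R) = p = \sum_{i=1}^{p} \lambda_i$

Since the diagonal of $R$ is all ones, the eigenvalues sum to the number of columns. The standard deviations returned above are consistent with this for $p = 3$:

 $1.45984^{2} + 0.89734^{2} + 0.25230^{2} \approx 2.1311 + 0.8052 + 0.0637 = 3.0000$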

A Scree Plot is often used to visually determine which principal components explain the majority of the variance.

 > $\mathrm{data3}≔\mathrm{DataFrame}\left(⟨⟨2.5|2.4|10.5|0.1|0.5⟩,⟨0.5|0.7|0.785|4.3|2.0⟩,⟨2.2|2.9|1.286|5.4|7.0⟩,⟨1.9|2.2|2.35|6.7|3.1⟩,⟨3.1|3.0|2.202|8.1|12⟩,⟨0.1|0.4|0.5|0.6|0.9⟩⟩,\mathrm{columns}=\left[a,b,c,d,e\right]\right)$
 ${\mathrm{DataFrame}}{}\left(\left[\begin{array}{ccccc}2.5& 2.4& 10.5& 0.1& 0.5\\ 0.5& 0.7& 0.785& 4.3& 2.0\\ 2.2& 2.9& 1.286& 5.4& 7.0\\ 1.9& 2.2& 2.35& 6.7& 3.1\\ 3.1& 3.0& 2.202& 8.1& 12\\ 0.1& 0.4& 0.5& 0.6& 0.9\end{array}\right]{,}{\mathrm{rows}}{=}\left[{1}{,}{2}{,}{3}{,}{4}{,}{5}{,}{6}\right]{,}{\mathrm{columns}}{=}\left[{a}{,}{b}{,}{c}{,}{d}{,}{e}\right]\right)$ (10)
 > $\mathrm{PCAnalysis3}≔\mathrm{PCA}\left(\mathrm{data3},\mathrm{summarize}=\mathrm{true}\right):$
 summary:
 Values      proportion of variance   St. Deviation
 31.4897     0.6658                   5.6116
 13.2270     0.2797                   3.6369
 2.3575      0.0498                   1.5354
 0.2180      0.0046                   0.4669
 0.0031      0.0001                   0.0558

The following plot indicates that the first three components account for approximately 99.5% of the variance.

 > $\mathrm{ScreePlot}\left(\mathrm{PCAnalysis3}\right)$
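The 99.5% figure can be confirmed from the printed summary, either by adding the first three proportions or by dividing the first three values by the total of all five:

 $0.6658 + 0.2797 + 0.0498 = 0.9953$

 $\frac{31.4897 + 13.2270 + 2.3575}{31.4897 + 13.2270 + 2.3575 + 0.2180 + 0.0031} = \frac{47.0742}{47.2953} \approx 0.9953$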

The tolerance or columns options can be used to remove the components with the least effect on the overall variance. Using tolerance = 0.01 removes any principal component whose value is at most 0.01 times the value of the first principal component (0.01 × 31.4897 ≈ 0.3149), namely the last two, with values 0.2180 and 0.0031.

 > $\mathrm{PCA}\left(\mathrm{data3},\mathrm{tolerance}=0.01\right):-\mathrm{principalcomponents}$
 ${\mathrm{DataFrame}}{}\left(\left[\begin{array}{ccc}7.688417521905891& -5.455950371125237& -0.13955048794036515\\ 1.0628690464303712& 3.2743946304282785& 0.7602540190998637\\ -3.453639245808518& 0.0717574356381341& -0.6081049363711044\\ -0.7358311947032412& 0.7474518990400768& 2.6224958407502235\\ -8.377331436947987& -2.7940268252582605& -0.8323449393950316\\ 3.815515309123484& 4.156373231277006& -1.8027494961435864\end{array}\right]{,}{\mathrm{rows}}{=}\left[{1}{,}{2}{,}{3}{,}{4}{,}{5}{,}{6}\right]{,}{\mathrm{columns}}{=}\left[{1}{,}{2}{,}{3}\right]\right)$ (11)
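The number of retained components can also be requested directly with the columns option documented in the Options section. The sketch below rebuilds data3 so that it runs on its own; the names byCount and byTol are illustrative. For this data set, asking for three columns would be expected to keep the same three components as the tolerance-based call, but this has not been verified here.

```maple
with(Statistics):

data3 := DataFrame(<<2.5|2.4|10.5|0.1|0.5>, <0.5|0.7|0.785|4.3|2.0>,
                    <2.2|2.9|1.286|5.4|7.0>, <1.9|2.2|2.35|6.7|3.1>,
                    <3.1|3.0|2.202|8.1|12>, <0.1|0.4|0.5|0.6|0.9>>,
                   columns = [a, b, c, d, e]):

# Ask for exactly three retained dimensions instead of giving a tolerance.
byCount := PCA(data3, columns = 3):
byCount:-principalcomponents;

# Tolerance-based call from the example above, kept for comparison.
byTol := PCA(data3, tolerance = 0.01):
byTol:-principalcomponents;
```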

A Biplot can also be used to show the first two components and the observations on the same set of axes. The first principal component is plotted on the x-axis and the second on the y-axis.

 > $\mathrm{Biplot}\left(\mathrm{PCAnalysis3},\mathrm{pointlabels}=\mathrm{true},\mathrm{points}=\mathrm{false}\right)$

Compatibility

 • The Statistics[PCA] command was introduced in Maple 2016.