Iris Data - Maple Programming Help

Iris Data

The Iris data set contains measurements, in centimeters, of four variables (sepal length, sepal width, petal length, and petal width) for 150 flowers from three species of iris: Iris setosa, Iris versicolor, and Iris virginica. The data was collected over several years by Edgar Anderson, who used it to show that the measurements could differentiate between species of irises.

The following example discusses techniques for analyzing the Iris data set.

Getting Started

While any command in the Statistics package can be referred to using its long form, for example, Statistics:-PCA, it is often easier to load the package and then use the short-form command names.

 > restart;
 > with(Statistics):
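
As a small illustration of the two forms, the calls below compute the same mean. The sample list is made up for this sketch and is not part of the Iris data.

 > Statistics:-Mean( [1, 2, 3, 4] );
 > Mean( [1, 2, 3, 4] );

The long form works whether or not the package has been loaded, while the short form relies on the preceding with(Statistics) call.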

Importing and summarizing the data

The "Iris" dataset is available in the datasets directory of Maple's data directory. By default, the Import command returns a DataFrame object when importing CSV files.

 > IrisData := Import("datasets/iris.csv", base = datadir);
 ${\mathrm{IrisData}}{≔}\left[\begin{array}{cccccc}{}& {\mathrm{Sepal Length}}& {\mathrm{Sepal Width}}& {\mathrm{Petal Length}}& {\mathrm{Petal Width}}& {\mathrm{Species}}\\ {1}& {5.1}& {3.5}& {1.4}& {0.2}& {"setosa"}\\ {2}& {4.9}& {3}& {1.4}& {0.2}& {"setosa"}\\ {3}& {4.7}& {3.2}& {1.3}& {0.2}& {"setosa"}\\ {4}& {4.6}& {3.1}& {1.5}& {0.2}& {"setosa"}\\ {5}& {5}& {3.6}& {1.4}& {0.2}& {"setosa"}\\ {6}& {5.4}& {3.9}& {1.7}& {0.4}& {"setosa"}\\ {7}& {4.6}& {3.4}& {1.4}& {0.3}& {"setosa"}\\ {8}& {5}& {3.4}& {1.5}& {0.2}& {"setosa"}\\ {9}& {4.4}& {2.9}& {1.4}& {0.2}& {"setosa"}\\ {10}& {4.9}& {3.1}& {1.5}& {0.1}& {"setosa"}\\ {\mathrm{:}}& {\mathrm{:}}& {\mathrm{:}}& {\mathrm{:}}& {\mathrm{:}}& {"150 x 5 DataFrame"}\end{array}\right]$ (1)

The Import command displays the first 10 rows of the dataset along with the row and column labels. This DataFrame contains four columns of floating-point data and one column of strings for the plant "Species".

The Describe command prints a brief description of the structure of the imported data:

 > Describe( IrisData );
 IrisData :: DataFrame: 150 observations for 5 variables
 Sepal Length:  Type: anything  Min: 4.300000  Max: 7.900000
 Sepal Width:   Type: anything  Min: 2.000000  Max: 4.400000
 Petal Length:  Type: anything  Min: 1.000000  Max: 6.900000
 Petal Width:   Type: anything  Min: 0.100000  Max: 2.500000
 Species:       Type: anything  Tally: ["setosa" = 50, "versicolor" = 50, "virginica" = 50]

From the dataframe, you can see that the column labels are:

 > CLabels := ColumnLabels( IrisData );
 ${\mathrm{CLabels}}{≔}\left[{\mathrm{Sepal Length}}{,}{\mathrm{Sepal Width}}{,}{\mathrm{Petal Length}}{,}{\mathrm{Petal Width}}{,}{\mathrm{Species}}\right]$ (2)

The DataSummary command shows summary statistics for the numeric columns of the dataset:

 > interface(displayprecision=4):
 > DataSummary( IrisData[ CLabels[1 .. 4] ], summarize = embed ):

                     Sepal Length   Sepal Width   Petal Length   Petal Width
 mean                5.8433         3.0573        3.7580         1.1993
 standarddeviation   0.82807        0.43587       1.7653         0.76224
 skewness            0.31071        0.31471       -0.27122       -0.10159
 kurtosis            2.4103         3.1598        1.5938         1.6528
 minimum             4.3000         2.0000        1.0000         0.10000
 maximum             7.9000         4.4000        6.9000         2.5000
 cumulativeweight    150.0000       150.0000      150.0000       150.0000

To summarize the column of strings, you can list the distinct elements by collapsing the column into a set:

 > convert( IrisData[ Species ], set );
 $\left\{{"setosa"}{,}{"versicolor"}{,}{"virginica"}\right\}$ (3)
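
To also count how often each species occurs, one possible sketch uses the Tally command from the Statistics package. It assumes the column can be converted to a list in the same way it was converted to a set above.

 > Tally( convert( IrisData[ Species ], list ) );

The counts should agree with the tally of 50 observations per species reported by Describe.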

Note that DataSummary returns a summary for all rows of the dataframe. The Aggregate command can be used to give aggregate statistics for the three distinct levels (factors) found in the "Species" column. By default, the Aggregate command returns the mean for each factor:

 > Aggregate( IrisData, Species );
 $\left[\begin{array}{cccccc}{}& {\mathrm{Sepal Length}}& {\mathrm{Sepal Width}}& {\mathrm{Petal Length}}& {\mathrm{Petal Width}}& {\mathrm{Species}}\\ {1}& {5.0060}& {3.4280}& {1.4620}& {0.2460}& {"setosa"}\\ {2}& {5.9360}& {2.7700}& {4.2600}& {1.3260}& {"versicolor"}\\ {3}& {6.5880}& {2.9740}& {5.5520}& {2.0260}& {"virginica"}\end{array}\right]$ (4)

Aggregate can return any summary statistic and tally up the number of observations for each factor level:

 > Aggregate( IrisData, Species, function = StandardDeviation, tally );
 $\left[\begin{array}{ccccccc}{}& {\mathrm{Sepal Length}}& {\mathrm{Sepal Width}}& {\mathrm{Petal Length}}& {\mathrm{Petal Width}}& {\mathrm{Species}}& {\mathrm{Tally}}\\ {1}& {0.3525}& {0.3791}& {0.1737}& {0.1054}& {"setosa"}& {50}\\ {2}& {0.5162}& {0.3138}& {0.4699}& {0.1978}& {"versicolor"}& {50}\\ {3}& {0.6359}& {0.3225}& {0.5519}& {0.2747}& {"virginica"}& {50}\end{array}\right]$ (5)

In order to visually detect patterns between variables, the variables can be plotted against one another using the GridPlot command. Note that for the upper triangle in the grid of plots, the colorscheme option is passed to plots:-pointplot using the valuesplit option. The valuesplit option splits the "Species" column into three levels and colors points accordingly.

 > GridPlot(IrisData[ CLabels[1 .. 4] ],
       upper = [plots:-pointplot, colorscheme = ["valuesplit", IrisData[Species]], symbol = solidcircle, symbolsize = 20],
       lower = '(x) -> Statistics:-PieChart([" " = abs(x), " " = 1 - abs(x)], color = ["CornflowerBlue", "WhiteSmoke"], title = evalf(x), size = [100, 100])',
       correlation = [false, true, false], width = 600, widthmode = pixels);
 ${"Tabulate"}$ (6)

 [Figure: grid of pairwise plots with rows and columns labeled Sepal Length, Sepal Width, Petal Length, and Petal Width]

In the above grid of plots, the lower triangle contains a series of pie charts that indicate the correlation between the corresponding pair of columns. This type of plot is also known as a correlogram. It shows that the "Petal Length" and "Petal Width" columns are highly correlated.
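
The correlations drawn in the pie charts can also be inspected numerically. The following is a minimal sketch that assumes the CorrelationMatrix command from the Statistics package accepts the numeric columns of the DataFrame directly, in the same way that DataSummary and GridPlot do above.

 > CorrelationMatrix( IrisData[ CLabels[1 .. 4] ] );

If the assumption holds, the entry for the "Petal Length" and "Petal Width" pair should be close to 1, in line with the correlogram.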

Performing a principal component analysis on the data

A principal component analysis can be run on the data to determine which variables explain the majority of the variability in the data.

 > IrisPCA := PCA(IrisData[ CLabels[1 .. 4] ], summarize):
 summary:
 Values   proportion of variance   St. Deviation
 4.2282   0.9246                   2.0563
 0.2427   0.0531                   0.4926
 0.0782   0.0171                   0.2797
 0.0238   0.0052                   0.1544

The principal component analysis command returns a record, which you can query in order to return the principal components, the rotation matrix, and details on the proportion of variance explained by each component. Note that this can also be seen by using the summarize option as above.
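
As a check on the summary output above, the proportion of variance for a component is its value divided by the sum of all four values, and its standard deviation is the square root of the value. Assuming the Values column lists the eigenvalues, the first row works out as:

 $\frac{4.2282}{4.2282+0.2427+0.0782+0.0238}=\frac{4.2282}{4.5729}\approx 0.9246,\qquad \sqrt{4.2282}\approx 2.0563$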

For example, the rotation matrix, which holds the loadings for the components, can be returned by accessing the rotation field of the record:

 > IrisPCA:-rotation;
 $\left[\begin{array}{ccccc}{}& {1}& {2}& {3}& {4}\\ {\mathrm{Sepal Length}}& {0.3614}& {-}{0.6566}& {0.5820}& {0.3155}\\ {\mathrm{Sepal Width}}& {-}{0.0845}& {-}{0.7302}& {-}{0.5979}& {-}{0.3197}\\ {\mathrm{Petal Length}}& {0.8567}& {0.1734}& {-}{0.0762}& {-}{0.4798}\\ {\mathrm{Petal Width}}& {0.3583}& {0.0755}& {-}{0.5458}& {0.7537}\end{array}\right]$ (7)
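
Each column of the rotation matrix gives the weights that combine the four measurements into one component. Reading off the first column, and assuming the measurements are centered about their column means before the rotation is applied (an assumption about the defaults, not something stated above), the first principal component score of a flower is approximately:

 $\mathrm{PC}_1 \approx 0.3614\,x_{\mathrm{SL}} - 0.0845\,x_{\mathrm{SW}} + 0.8567\,x_{\mathrm{PL}} + 0.3583\,x_{\mathrm{PW}}$

The symbols $x_{\mathrm{SL}}$, $x_{\mathrm{SW}}$, $x_{\mathrm{PL}}$, and $x_{\mathrm{PW}}$ are introduced here for illustration and denote the centered sepal length, sepal width, petal length, and petal width. Petal length carries the largest weight in this component.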

A ScreePlot is useful in visualizing the variance explained by each component:

 > ScreePlot( IrisPCA );

From the ScreePlot, it can be seen that the first component accounts for 92.46% of the variance. The second component accounts for a much smaller fraction of the total variance, suggesting that only one component may be enough to summarize the data.
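
The cumulative proportions make the same point numerically. Using the proportions of variance reported in the PCA summary:

 $0.9246 + 0.0531 = 0.9777$

The first two components together account for about 97.8% of the variance, so the second component adds only about 5 percentage points to the first.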

A Biplot can also be used to show the first two components and the observations on the same diagram. The first principal component is plotted on the x-axis and the second on the y-axis.

 > Biplot(IrisPCA, colorscheme = ["valuesplit", IrisData[ Species ] ]);

From the Biplot, it can be observed that petal width and petal length are highly correlated and that their variability can be attributed primarily to the first component. Likewise, the first component explains a large part of the variability in sepal length. The variability in sepal width is attributed mainly to the second component.

References

Fisher, R.A. (1936). The use of multiple measurements in taxonomic problems. Annals of Eugenics, 7, Part II, 179-188.