Subsets of DataFrames - Maple Help

Subsets of DataFrames

Subsetting is an important component of data manipulation. The DataFrame has a powerful indexable structure that makes it possible to access subsets of data that meet given criteria.

The following example worksheet gives several examples for subsetting data frames.

 > restart;
 > interface(rtablesize = 15):

Removing Variables or Observations by Indexing

A common operation when subsetting data frames is simply to remove certain rows or columns.

To begin, we load the canada_crimes data set. This data set features six variables and 13 rows of observations on aggregated crime statistics per 100,000 people collected in 2014.

  (1)

The variables for this DataFrame are:

 > ColumnLabels( data );
 $\left[{\mathrm{Violent Crime}}{,}{\mathrm{Property Crime}}{,}{\mathrm{Other Criminal Code}}{,}{\mathrm{Criminal Code Traffic}}{,}{\mathrm{Federal Statute}}\right]$ (2)

In order to return subsets of the DataFrame, we can simply index into the DataFrame. For example, to return the subset of data contained in the "Violent Crime" DataSeries:

 > data[ Violent Crime ];
  (3)

It is also possible to index the columns of the DataFrame using the integer index values. The following returns the second and third columns:

 > data[ 2..3 ];
  (4)

The following returns the second, third, and fifth columns:

 > data[ [ 2, 3, 5 ] ];
  (5)

It is also possible to subset the data by indexing the DataFrame by certain rows. The following returns the row of observations corresponding for "Ontario":

 > data[ Ontario, .. ];
 $\left[\begin{array}{cc}{\mathrm{Violent Crime}}& {786.62}\\ {\mathrm{Property Crime}}& {2292.66}\\ {\mathrm{Other Criminal Code}}& {476.48}\\ {\mathrm{Criminal Code Traffic}}& {211.57}\\ {\mathrm{Federal Statute}}& {258.15}\end{array}\right]$ (6)

Similar to indexing by columns, it is also possible to use the row numbers:

 > data[ 2, .. ];
 $\left[\begin{array}{cc}{\mathrm{Violent Crime}}& {824.43}\\ {\mathrm{Property Crime}}& {3294.3}\\ {\mathrm{Other Criminal Code}}& {572.18}\\ {\mathrm{Criminal Code Traffic}}& {348.64}\\ {\mathrm{Federal Statute}}& {215.34}\end{array}\right]$ (7)

Filtering Observations

While the index notation for subsetting data frames is powerful for retrieving observations in known rows or columns, it is often more desirable to return rows corresponding to observations that meet a given criteria. For example, say we wanted to known which Canadian province or territory has a "Criminal Code traffic" rate that is greater than 500 per 100,000.

To begin, we will return the "Criminal Code Traffic" column and simply read off the corresponding rows:

 > data[ Criminal Code Traffic ];
  (8)

Now this approach is fine for smaller data frames, but it is much easier to simply query a DataFrame using an element-wise logical operator to first see which (if any) observations match the criteria:

 > data[ Criminal Code Traffic ] >~ 500;
  (9)

This returns a truth table, whose entries return a true, false, or FAIL result depending on if the given observation meets the criteria. In addition, if the DataFrame is indexed by a truth table, a filtered subset is returned:

 > data[ data[ Criminal Code Traffic ] >~ 500 ];
 $\left[\begin{array}{cccccc}{}& {\mathrm{Violent Crime}}& {\mathrm{Property Crime}}& {\mathrm{Other Criminal Code}}& {\mathrm{Criminal Code Traffic}}& {\mathrm{Federal Statute}}\\ {\mathrm{Quebec}}& {940.52}& {2100.84}& {450.29}& {511.18}& {314.74}\\ {\mathrm{Saskatchewan}}& {1963.46}& {5627.55}& {2913.78}& {886.34}& {692.9}\\ {\mathrm{Yukon}}& {4546.7}& {9353.6}& {10019.17}& {1689.95}& {1013.42}\\ {\mathrm{Northwest Territories}}& {6911.49}& {23171.26}& {13834.45}& {1535.89}& {1331.87}\\ {\mathrm{Nunavut}}& {7934.95}& {13778.87}& {8902.56}& {639.61}& {759.87}\end{array}\right]$ (10)

The with command is useful for simplifying the syntax for querying DataFrames. with creates named variables corresponding to each of the column labels in a given DataFrame.

 > with( data );
 $\left[{\mathrm{Violent Crime}}{,}{\mathrm{Property Crime}}{,}{\mathrm{Other Criminal Code}}{,}{\mathrm{Criminal Code Traffic}}{,}{\mathrm{Federal Statute}}\right]$ (11)

Each column of the DataFrame can be called using its variable name:

 > Criminal Code Traffic;
  (12)

With bound labels, the following returns the rows where the "Federal Statute" rate is less than or equal to 300 per 100,000:

 > Federal Statute <=~ 300;
  (13)
 > data[ Federal Statute <=~ 300 ];
 $\left[\begin{array}{cccccc}{}& {\mathrm{Violent Crime}}& {\mathrm{Property Crime}}& {\mathrm{Other Criminal Code}}& {\mathrm{Criminal Code Traffic}}& {\mathrm{Federal Statute}}\\ {\mathrm{Newfoundland and Labrador}}& {1276.15}& {3317.03}& {1010.67}& {348.97}& {267.94}\\ {\mathrm{Prince Edward Island}}& {824.43}& {3294.3}& {572.18}& {348.64}& {215.34}\\ {\mathrm{New Brunswick}}& {1164.32}& {2611.17}& {712.02}& {298.71}& {283.45}\\ {\mathrm{Ontario}}& {786.62}& {2292.66}& {476.48}& {211.57}& {258.15}\end{array}\right]$ (14)

It is also possible to filter the DataFrame using multiple queries. When combining queries, the logical operators and and or are used to find either the intersection or union of truth tables, respectively. For example, the following returns the province or territory with "Violent Crime" less than 1000 and "Property Crime" greater than 3000.

 > Violent Crime <~ 1000 and Property Crime >~ 3000;
  (15)

From the truth table, only Prince Edward Island matches this criteria.

 > data [ Violent Crime <~ 1000 and Property Crime >~ 3000 ];
 $\left[\begin{array}{cccccc}{}& {\mathrm{Violent Crime}}& {\mathrm{Property Crime}}& {\mathrm{Other Criminal Code}}& {\mathrm{Criminal Code Traffic}}& {\mathrm{Federal Statute}}\\ {\mathrm{Prince Edward Island}}& {824.43}& {3294.3}& {572.18}& {348.64}& {215.34}\end{array}\right]$ (16)

It can be useful to find the union of queries by using the or logical operator. For example, the following returns observations for which the "Other Criminal Code" rate is greater than 2500 per 100,000 or the observations for which the "Criminal Code Traffic" rate is greater than 500 per 100,000:

 > data [ Other Criminal Code >~ 2500 or Criminal Code Traffic >~ 500 ];
 $\left[\begin{array}{cccccc}{}& {\mathrm{Violent Crime}}& {\mathrm{Property Crime}}& {\mathrm{Other Criminal Code}}& {\mathrm{Criminal Code Traffic}}& {\mathrm{Federal Statute}}\\ {\mathrm{Quebec}}& {940.52}& {2100.84}& {450.29}& {511.18}& {314.74}\\ {\mathrm{Saskatchewan}}& {1963.46}& {5627.55}& {2913.78}& {886.34}& {692.9}\\ {\mathrm{Yukon}}& {4546.7}& {9353.6}& {10019.17}& {1689.95}& {1013.42}\\ {\mathrm{Northwest Territories}}& {6911.49}& {23171.26}& {13834.45}& {1535.89}& {1331.87}\\ {\mathrm{Nunavut}}& {7934.95}& {13778.87}& {8902.56}& {639.61}& {759.87}\end{array}\right]$ (17)