A Guide to Data Frames - Maple Programming Help

Home : Support : Online Help : Programming : Data Types : DataFrames and DataSeries : DataFrame/Guide

A Guide to Data Frames

A DataFrame is one of the basic data structures in Maple. Data frames are a list of variables, known as DataSeries, which are displayed in a rectangular grid. Every column (variable) in a DataFrame has the same length, however, each variable can have a different type, such as integer, float, string, name, truefalse, etc., which makes data frames an ideal storage device for heterogeneous data.

When printed, Data frames resemble matrices in that they are viewed as a rectangular grid, but a key difference is that the first row corresponds to the column (variable) names, and the first column corresponds to the row (individual) names. These row and columns are treated as header meta-information and are not a part of the data. Moreover, the data stored in a DataFrame can be accessed using these header names, as well as by the standard numbered index.

The following example page shows a number of common tasks when working with data frames.

Creating a new DataFrame

Data frames are rich containers for information that can store multiple types of data. For example, the following vectors contain information on three berries: Raspberries, Grapes and Strawberries respectively.

Variables:

Genus

 Type: string
 > genus := < "Rubus", "Vitis", "Fragaria" >:

Nutrition: Energy per 100g in kJ

 Type: integer
 > energy := < 220, 288, 136 >:

Nutrition: Carbohydrates per 100g in g

 Type: float
 > carbohydrates := < 11.94, 18.1, 7.68 >:

Tons produced worldwide in 2011

 Type: integer
 > total_tons := < 543421, 58500118, 4594539 >:

Top producing country in 2011

 Type: name
 > top_producer := < Russia, China, USA >:

A DataFrame can combine all of these different types of data into one structure.

 > DF := DataFrame( < energy | carbohydrates | total_tons | top_producer | genus > );
 ${\mathrm{DF}}{≔}\left[\begin{array}{cccccc}{}& {1}& {2}& {3}& {4}& {5}\\ {1}& {220}& {11.94}& {543421}& {\mathrm{Russia}}& {"Rubus"}\\ {2}& {288}& {18.1}& {58500118}& {\mathrm{China}}& {"Vitis"}\\ {3}& {136}& {7.68}& {4594539}& {\mathrm{USA}}& {"Fragaria"}\end{array}\right]$ (1)

It can be observed that in the above DataFrame, the row and column names are equivalent to the index value. The row and column names can be specified using the rows and columns options respectively. It is also useful to specify the types of data in each DataSeries using the datatypes option.

 > DF := DataFrame( < energy | carbohydrates | total_tons | top_producer | genus >,             columns = [ Energy, Carbohydrates, Total Tons, Top Producer, Genus ],             rows = [ Raspberry, Grape, Strawberry ],             datatypes = [ integer, float, integer, anything, string ] );
 ${\mathrm{DF}}{≔}\left[\begin{array}{cccccc}{}& {\mathrm{Energy}}& {\mathrm{Carbohydrates}}& {\mathrm{Total Tons}}& {\mathrm{Top Producer}}& {\mathrm{Genus}}\\ {\mathrm{Raspberry}}& {220}& {11.9400000000000}& {543421}& {\mathrm{Russia}}& {"Rubus"}\\ {\mathrm{Grape}}& {288}& {18.1000000000000}& {58500118}& {\mathrm{China}}& {"Vitis"}\\ {\mathrm{Strawberry}}& {136}& {7.68000000000000}& {4594539}& {\mathrm{USA}}& {"Fragaria"}\end{array}\right]$ (2)

Note: It is important to remember that any vectors included in the DataFrame should have the same length. If the vectors are of different length, an error is returned.

Describing a DataFrame

The Describe command prints a description of the structure of the DataFrame including the number of observations ( rows ), the number of variables ( columns ), as well as the type of each column (if specified). In addition, for numeric columns, the minimum and maximum values are displayed, and for truefalse, string or name columns, the distinct levels are given.

 > Describe( DF );
 DF :: DataFrame: 3 observations for 5 variables Energy:         Type: integer   Min: 136.000000  Max: 288.000000 Carbohydrates:  Type: float[8]  Min: 7.680000  Max: 18.100000 Total Tons:     Type: integer   Min: 543421.000000  Max: 58500118.000000 Top Producer:   Type: anything  Tally: [Russia = 1, USA = 1, China = 1] Genus:          Type: string    Tally: ["Rubus" = 1, "Vitis" = 1, "Fragaria" = 1]

Indexing entries in data frames

Unlike matrices, data frames can be indexed using the column or row names. For example, to view the energy in kJ for raspberries:

 > DF[Raspberry,Energy];
 ${220}$ (3)
 > DF[1,1];
 ${220}$ (4)

Note that above, the index follows the convention [ row, column ] and a single value is retrieved. If a DataFrame is indexed by a single argument, this indexes by [ column ] and the respective column is returned. For example, to retrieve values for all berries for the Energy variable:

 > DF[Energy];
 $\left[\begin{array}{cc}{\mathrm{Raspberry}}& {220}\\ {\mathrm{Grape}}& {288}\\ {\mathrm{Strawberry}}& {136}\end{array}\right]$ (5)
 > DF[1];
 $\left[\begin{array}{cc}{\mathrm{Raspberry}}& {220}\\ {\mathrm{Grape}}& {288}\\ {\mathrm{Strawberry}}& {136}\end{array}\right]$ (6)

To return an entire row, for example, the entire row for Raspberries as a DataSeries:

 > DF[Raspberry,..];
 $\left[\begin{array}{cc}{\mathrm{Energy}}& {220}\\ {\mathrm{Carbohydrates}}& {11.9400000000000}\\ {\mathrm{Total Tons}}& {543421}\\ {\mathrm{Top Producer}}& {\mathrm{Russia}}\\ {\mathrm{Genus}}& {"Rubus"}\end{array}\right]$ (7)
 > DF[1,..];
 $\left[\begin{array}{cc}{\mathrm{Energy}}& {220}\\ {\mathrm{Carbohydrates}}& {11.9400000000000}\\ {\mathrm{Total Tons}}& {543421}\\ {\mathrm{Top Producer}}& {\mathrm{Russia}}\\ {\mathrm{Genus}}& {"Rubus"}\end{array}\right]$ (8)

For more examples, see Indexing a DataFrame or the Subsets of DataFrames example worksheet.

Changing values in a DataFrame

Changing values in an existing DataFrame can be done in a similar way to other data structures in Maple; simply assign to the index position that you want to change. For example, to change the value for Carbohydrates for Grape:

 > DF[Grape, Carbohydrates] := 20:
 > DF;
 $\left[\begin{array}{cccccc}{}& {\mathrm{Energy}}& {\mathrm{Carbohydrates}}& {\mathrm{Total Tons}}& {\mathrm{Top Producer}}& {\mathrm{Genus}}\\ {\mathrm{Raspberry}}& {220}& {11.9400000000000}& {543421}& {\mathrm{Russia}}& {"Rubus"}\\ {\mathrm{Grape}}& {288}& {20.}& {58500118}& {\mathrm{China}}& {"Vitis"}\\ {\mathrm{Strawberry}}& {136}& {7.68000000000000}& {4594539}& {\mathrm{USA}}& {"Fragaria"}\end{array}\right]$ (9)

Before moving on, we can set this back to its original value:

 > DF[2,2] := 18.1:
 > DF;
 $\left[\begin{array}{cccccc}{}& {\mathrm{Energy}}& {\mathrm{Carbohydrates}}& {\mathrm{Total Tons}}& {\mathrm{Top Producer}}& {\mathrm{Genus}}\\ {\mathrm{Raspberry}}& {220}& {11.9400000000000}& {543421}& {\mathrm{Russia}}& {"Rubus"}\\ {\mathrm{Grape}}& {288}& {18.1000000000000}& {58500118}& {\mathrm{China}}& {"Vitis"}\\ {\mathrm{Strawberry}}& {136}& {7.68000000000000}& {4594539}& {\mathrm{USA}}& {"Fragaria"}\end{array}\right]$ (10)

Details on DataFrame structure

There are many useful commands for querying information about the structure of data frames. For example, numelems returned the total number of elements in the DataFrame:

 > numelems( DF );
 ${15}$ (11)

The upperbound command returns the dimensions of the DataFrame.

 > upperbound( DF );
 ${3}{,}{5}$ (12)

To get the number of rows in the DataFrame, put 1 as the second argument to upperbound:

 > upperbound( DF, 1 );
 ${3}$ (13)

To get the number of columns in the DataFrame, put 2 as the second argument to upperbound:

 > upperbound( DF, 2 );
 ${5}$ (14)

The Datatypes command returns a list of the data type of each of the columns of data:

 > Datatypes( DF );
 $\left[{\mathrm{integer}}{,}{{\mathrm{float}}}_{{8}}{,}{\mathrm{integer}}{,}{\mathrm{anything}}{,}{\mathrm{string}}\right]$ (15)

How to change column names

Data frames can have both row and column names. In the above example, these were added to the DataFrame using the columns and rows options. To retrieve the current list of column names, the ColumnLabels command can be used:

 > ColumnLabels( DF );
 $\left[{\mathrm{Energy}}{,}{\mathrm{Carbohydrates}}{,}{\mathrm{Total Tons}}{,}{\mathrm{Top Producer}}{,}{\mathrm{Genus}}\right]$ (16)

The SubsColumnLabel command returns a DataFrame with a modified column label:

 > DF := SubsColumnLabel( DF, 4, Country );
 ${\mathrm{DF}}{≔}\left[\begin{array}{cccccc}{}& {\mathrm{Energy}}& {\mathrm{Carbohydrates}}& {\mathrm{Total Tons}}& {\mathrm{Country}}& {\mathrm{Genus}}\\ {\mathrm{Raspberry}}& {220}& {11.9400000000000}& {543421}& {\mathrm{Russia}}& {"Rubus"}\\ {\mathrm{Grape}}& {288}& {18.1000000000000}& {58500118}& {\mathrm{China}}& {"Vitis"}\\ {\mathrm{Strawberry}}& {136}& {7.68000000000000}& {4594539}& {\mathrm{USA}}& {"Fragaria"}\end{array}\right]$ (17)

Adding a new column or row

The Append command appends another column or row to an existing DataFrame. The arguments for the Append command specify the DataFrame to append to, the data to append, and optionally, the orientation and the name for the new data (which can be a DataFrame or DataSeries). The following adds a new column to the DataFrame corresponding to truefalse values for if the given berry is a botanical berry or not:

 > DF := Append( DF, < false, true, false >, label = Botanical Berry );
 ${\mathrm{DF}}{≔}\left[\begin{array}{ccccccc}{}& {\mathrm{Energy}}& {\mathrm{Carbohydrates}}& {\mathrm{Total Tons}}& {\mathrm{Country}}& {\mathrm{Genus}}& {\mathrm{Botanical Berry}}\\ {\mathrm{Raspberry}}& {220}& {11.9400000000000}& {543421}& {\mathrm{Russia}}& {"Rubus"}& {\mathrm{false}}\\ {\mathrm{Grape}}& {288}& {18.1000000000000}& {58500118}& {\mathrm{China}}& {"Vitis"}& {\mathrm{true}}\\ {\mathrm{Strawberry}}& {136}& {7.68000000000000}& {4594539}& {\mathrm{USA}}& {"Fragaria"}& {\mathrm{false}}\end{array}\right]$ (18)

If we have a new row of information on blackberries such as:

 > NewRow := DataSeries( < 180 | 9.61 | Mexico | "Rubus" | false >,             labels = [ Energy, Carbohydrates, Country, Genus, Botanical Berry ] );
 ${\mathrm{NewRow}}{≔}\left[\begin{array}{cc}{\mathrm{Energy}}& {180}\\ {\mathrm{Carbohydrates}}& {9.61}\\ {\mathrm{Country}}& {\mathrm{Mexico}}\\ {\mathrm{Genus}}& {"Rubus"}\\ {\mathrm{Botanical Berry}}& {\mathrm{false}}\end{array}\right]$ (19)

This row can be added to the existing DataFrame using the Append command.

 > DF := Append( DF, NewRow, mode = row, label = Blackberry );
 ${\mathrm{DF}}{≔}\left[\begin{array}{ccccccc}{}& {\mathrm{Energy}}& {\mathrm{Carbohydrates}}& {\mathrm{Total Tons}}& {\mathrm{Country}}& {\mathrm{Genus}}& {\mathrm{Botanical Berry}}\\ {\mathrm{Raspberry}}& {220}& {11.9400000000000}& {543421}& {\mathrm{Russia}}& {"Rubus"}& {\mathrm{false}}\\ {\mathrm{Grape}}& {288}& {18.1000000000000}& {58500118}& {\mathrm{China}}& {"Vitis"}& {\mathrm{true}}\\ {\mathrm{Strawberry}}& {136}& {7.68000000000000}& {4594539}& {\mathrm{USA}}& {"Fragaria"}& {\mathrm{false}}\\ {\mathrm{Blackberry}}& {180}& {9.61}& {\mathrm{undefined}}& {\mathrm{Mexico}}& {"Rubus"}& {\mathrm{false}}\end{array}\right]$ (20)

Note: The value of the Total tons produced in 2011 is unknown for this DataSeries. When this is appended to the original DataFrame, since there is no value for Total Tons, the value undefined is added where the data is unknown. For more on missing values, see the Dealing with missing values section.

Removing a column or row

The Remove command removes a selected column or row from the DataFrame. This command is very useful when doing further analysis using commands that require the arguments to be purely numeric, such as visualizations in Statistics, etc.

For example, to remove the non-numeric Genus, Country and Botanical Berry columns:

 > Remove( DF, [ Genus, Country, Botanical Berry ] );
 $\left[\begin{array}{ccc}{5}& {4}& {6}\end{array}\right]$
 $\left[\begin{array}{cccc}{}& {\mathrm{Energy}}& {\mathrm{Carbohydrates}}& {\mathrm{Total Tons}}\\ {\mathrm{Raspberry}}& {220}& {11.9400000000000}& {543421}\\ {\mathrm{Grape}}& {288}& {18.1000000000000}& {58500118}\\ {\mathrm{Strawberry}}& {136}& {7.68000000000000}& {4594539}\\ {\mathrm{Blackberry}}& {180}& {9.61}& {\mathrm{undefined}}\end{array}\right]$ (21)

To remove a row, use the mode option:

 > Remove( DF, Grape, mode = row );
 ${2}$
 $\left[\begin{array}{ccccccc}{}& {\mathrm{Energy}}& {\mathrm{Carbohydrates}}& {\mathrm{Total Tons}}& {\mathrm{Country}}& {\mathrm{Genus}}& {\mathrm{Botanical Berry}}\\ {\mathrm{Raspberry}}& {220}& {11.9400000000000}& {543421}& {\mathrm{Russia}}& {"Rubus"}& {\mathrm{false}}\\ {\mathrm{Strawberry}}& {136}& {7.68000000000000}& {4594539}& {\mathrm{USA}}& {"Fragaria"}& {\mathrm{false}}\\ {\mathrm{Blackberry}}& {180}& {9.61}& {\mathrm{undefined}}& {\mathrm{Mexico}}& {"Rubus"}& {\mathrm{false}}\end{array}\right]$ (22)

Selecting or Removing values that match a criteria

The remove, select, and selectremove commands are useful when removing (or selecting) rows that contain values that match a given criteria.

For example, the selectremove command returns two results - one that matches the given criteria and one that does not. With the given criteria that the rows contain even values for Total Tons:

 > selectremove( x->type(x,even), DF, Total Tons);
 $\left[\begin{array}{ccccccc}{}& {\mathrm{Energy}}& {\mathrm{Carbohydrates}}& {\mathrm{Total Tons}}& {\mathrm{Country}}& {\mathrm{Genus}}& {\mathrm{Botanical Berry}}\\ {\mathrm{Grape}}& {288}& {18.1000000000000}& {58500118}& {\mathrm{China}}& {"Vitis"}& {\mathrm{true}}\end{array}\right]{,}\left[\begin{array}{ccccccc}{}& {\mathrm{Energy}}& {\mathrm{Carbohydrates}}& {\mathrm{Total Tons}}& {\mathrm{Country}}& {\mathrm{Genus}}& {\mathrm{Botanical Berry}}\\ {\mathrm{Raspberry}}& {220}& {11.9400000000000}& {543421}& {\mathrm{Russia}}& {"Rubus"}& {\mathrm{false}}\\ {\mathrm{Strawberry}}& {136}& {7.68000000000000}& {4594539}& {\mathrm{USA}}& {"Fragaria"}& {\mathrm{false}}\\ {\mathrm{Blackberry}}& {180}& {9.61}& {\mathrm{undefined}}& {\mathrm{Mexico}}& {"Rubus"}& {\mathrm{false}}\end{array}\right]$ (23)

Dealing with missing values

Some datasets may have missing values. In the DataFrame for blackberries, it can be observed that the value for Total Tons is undefined:

 > DF;
 $\left[\begin{array}{ccccccc}{}& {\mathrm{Energy}}& {\mathrm{Carbohydrates}}& {\mathrm{Total Tons}}& {\mathrm{Country}}& {\mathrm{Genus}}& {\mathrm{Botanical Berry}}\\ {\mathrm{Raspberry}}& {220}& {11.9400000000000}& {543421}& {\mathrm{Russia}}& {"Rubus"}& {\mathrm{false}}\\ {\mathrm{Grape}}& {288}& {18.1000000000000}& {58500118}& {\mathrm{China}}& {"Vitis"}& {\mathrm{true}}\\ {\mathrm{Strawberry}}& {136}& {7.68000000000000}& {4594539}& {\mathrm{USA}}& {"Fragaria"}& {\mathrm{false}}\\ {\mathrm{Blackberry}}& {180}& {9.61}& {\mathrm{undefined}}& {\mathrm{Mexico}}& {"Rubus"}& {\mathrm{false}}\end{array}\right]$ (24)

There are several commands for dealing with missing values in data frames including DropMissing and FillMissing.

FillMissing can be used to fill missing values with another value. For example, here we can fill the undefined value in the DataFrame with 0:

 > FillMissing( DF, 0 );
 $\left[\begin{array}{ccccccc}{}& {\mathrm{Energy}}& {\mathrm{Carbohydrates}}& {\mathrm{Total Tons}}& {\mathrm{Country}}& {\mathrm{Genus}}& {\mathrm{Botanical Berry}}\\ {\mathrm{Raspberry}}& {220}& {11.9400000000000}& {543421}& {\mathrm{Russia}}& {"Rubus"}& {\mathrm{false}}\\ {\mathrm{Grape}}& {288}& {18.1000000000000}& {58500118}& {\mathrm{China}}& {"Vitis"}& {\mathrm{true}}\\ {\mathrm{Strawberry}}& {136}& {7.68000000000000}& {4594539}& {\mathrm{USA}}& {"Fragaria"}& {\mathrm{false}}\\ {\mathrm{Blackberry}}& {180}& {9.61}& {0}& {\mathrm{Mexico}}& {"Rubus"}& {\mathrm{false}}\end{array}\right]$ (25)

Note: In order to permanently change the value for any missing values, the result of the FillMissing command must be assigned to the DataFrame.

Removing duplicates

Duplicate rows (or duplicate entries in a given column) do occur in many datasets. Say one was to mistakenly add another row to our DataFrame that contains the same information about blackberries, but the row is added under a misspelled label:

 > DF := Append( DF, DataFrame( << 180 | 9.61 | Mexico | "Rubus" | false >>,            columns = [ Energy, Carbohydrates, Country, Genus, Botanical Berry ],            rows = [ Blackbery ] ) );
 ${\mathrm{DF}}{≔}\left[\begin{array}{ccccccc}{}& {\mathrm{Energy}}& {\mathrm{Carbohydrates}}& {\mathrm{Total Tons}}& {\mathrm{Country}}& {\mathrm{Genus}}& {\mathrm{Botanical Berry}}\\ {\mathrm{Raspberry}}& {220}& {11.9400000000000}& {543421}& {\mathrm{Russia}}& {"Rubus"}& {\mathrm{false}}\\ {\mathrm{Grape}}& {288}& {18.1000000000000}& {58500118}& {\mathrm{China}}& {"Vitis"}& {\mathrm{true}}\\ {\mathrm{Strawberry}}& {136}& {7.68000000000000}& {4594539}& {\mathrm{USA}}& {"Fragaria"}& {\mathrm{false}}\\ {\mathrm{Blackberry}}& {180}& {9.61}& {\mathrm{undefined}}& {\mathrm{Mexico}}& {"Rubus"}& {\mathrm{false}}\\ {\mathrm{Blackbery}}& {180}& {9.61}& {\mathrm{undefined}}& {\mathrm{Mexico}}& {"Rubus"}& {\mathrm{false}}\end{array}\right]$ (26)

By inspecting the DataFrame above, it is easy to see the duplicate entry, but duplicates may not be easy to find by inspection when dealing with larger DataFrames.

The AreDuplicate command returns a DataSeries of truefalse values that indicate if a row has a duplicate or not.

 > AreDuplicate( DF );
 $\left[\begin{array}{cc}{\mathrm{Raspberry}}& {\mathrm{false}}\\ {\mathrm{Grape}}& {\mathrm{false}}\\ {\mathrm{Strawberry}}& {\mathrm{false}}\\ {\mathrm{Blackberry}}& {\mathrm{true}}\\ {\mathrm{Blackbery}}& {\mathrm{true}}\end{array}\right]$ (27)

This result can be used to index a DataFrame to show just the duplicate rows:

 > DF[ AreDuplicate( DF ) ];
 $\left[\begin{array}{ccccccc}{}& {\mathrm{Energy}}& {\mathrm{Carbohydrates}}& {\mathrm{Total Tons}}& {\mathrm{Country}}& {\mathrm{Genus}}& {\mathrm{Botanical Berry}}\\ {\mathrm{Blackberry}}& {180}& {9.61}& {\mathrm{undefined}}& {\mathrm{Mexico}}& {"Rubus"}& {\mathrm{false}}\\ {\mathrm{Blackbery}}& {180}& {9.61}& {\mathrm{undefined}}& {\mathrm{Mexico}}& {"Rubus"}& {\mathrm{false}}\end{array}\right]$ (28)

The AreUnique command can be used to show the unique rows in a DataFrame. If a DataFrame is indexed by the results, the duplicate rows are removed. By default, the AreUnique command marks the first instance of a row as unique and any subsequent instance of a matching row as duplicate.

 > DF[ AreUnique( DF ) ];
 $\left[\begin{array}{ccccccc}{}& {\mathrm{Energy}}& {\mathrm{Carbohydrates}}& {\mathrm{Total Tons}}& {\mathrm{Country}}& {\mathrm{Genus}}& {\mathrm{Botanical Berry}}\\ {\mathrm{Raspberry}}& {220}& {11.9400000000000}& {543421}& {\mathrm{Russia}}& {"Rubus"}& {\mathrm{false}}\\ {\mathrm{Grape}}& {288}& {18.1000000000000}& {58500118}& {\mathrm{China}}& {"Vitis"}& {\mathrm{true}}\\ {\mathrm{Strawberry}}& {136}& {7.68000000000000}& {4594539}& {\mathrm{USA}}& {"Fragaria"}& {\mathrm{false}}\\ {\mathrm{Blackberry}}& {180}& {9.61}& {\mathrm{undefined}}& {\mathrm{Mexico}}& {"Rubus"}& {\mathrm{false}}\end{array}\right]$ (29)

The keep option controls if the first, last, or none of the duplicates is marked as unique.

 > DF[ AreUnique( DF, keep = last ) ];
 $\left[\begin{array}{ccccccc}{}& {\mathrm{Energy}}& {\mathrm{Carbohydrates}}& {\mathrm{Total Tons}}& {\mathrm{Country}}& {\mathrm{Genus}}& {\mathrm{Botanical Berry}}\\ {\mathrm{Raspberry}}& {220}& {11.9400000000000}& {543421}& {\mathrm{Russia}}& {"Rubus"}& {\mathrm{false}}\\ {\mathrm{Grape}}& {288}& {18.1000000000000}& {58500118}& {\mathrm{China}}& {"Vitis"}& {\mathrm{true}}\\ {\mathrm{Strawberry}}& {136}& {7.68000000000000}& {4594539}& {\mathrm{USA}}& {"Fragaria"}& {\mathrm{false}}\\ {\mathrm{Blackbery}}& {180}& {9.61}& {\mathrm{undefined}}& {\mathrm{Mexico}}& {"Rubus"}& {\mathrm{false}}\end{array}\right]$ (30)
 > DF := DF[ AreUnique( DF, keep = first ) ];
 ${\mathrm{DF}}{≔}\left[\begin{array}{ccccccc}{}& {\mathrm{Energy}}& {\mathrm{Carbohydrates}}& {\mathrm{Total Tons}}& {\mathrm{Country}}& {\mathrm{Genus}}& {\mathrm{Botanical Berry}}\\ {\mathrm{Raspberry}}& {220}& {11.9400000000000}& {543421}& {\mathrm{Russia}}& {"Rubus"}& {\mathrm{false}}\\ {\mathrm{Grape}}& {288}& {18.1000000000000}& {58500118}& {\mathrm{China}}& {"Vitis"}& {\mathrm{true}}\\ {\mathrm{Strawberry}}& {136}& {7.68000000000000}& {4594539}& {\mathrm{USA}}& {"Fragaria"}& {\mathrm{false}}\\ {\mathrm{Blackberry}}& {180}& {9.61}& {\mathrm{undefined}}& {\mathrm{Mexico}}& {"Rubus"}& {\mathrm{false}}\end{array}\right]$ (31)

Reordering columns

In the How to change column names section, the ColumnLabels command was used to return the list of variable names. The ColumnLabels command also returns the current order of the columns:

 > ColumnLabels( DF );
 $\left[{\mathrm{Energy}}{,}{\mathrm{Carbohydrates}}{,}{\mathrm{Total Tons}}{,}{\mathrm{Country}}{,}{\mathrm{Genus}}{,}{\mathrm{Botanical Berry}}\right]$ (32)

To reorder columns in a DataFrame, use the DataFrame index notation to specify the new order for the columns and reassign this to the existing DataFrame variable.

 > DF := DF[ [ Genus, Carbohydrates, Energy, Country, Total Tons, Botanical Berry ] ];
 ${\mathrm{DF}}{≔}\left[\begin{array}{ccccccc}{}& {\mathrm{Genus}}& {\mathrm{Carbohydrates}}& {\mathrm{Energy}}& {\mathrm{Country}}& {\mathrm{Total Tons}}& {\mathrm{Botanical Berry}}\\ {\mathrm{Raspberry}}& {"Rubus"}& {11.9400000000000}& {220}& {\mathrm{Russia}}& {543421}& {\mathrm{false}}\\ {\mathrm{Grape}}& {"Vitis"}& {18.1000000000000}& {288}& {\mathrm{China}}& {58500118}& {\mathrm{true}}\\ {\mathrm{Strawberry}}& {"Fragaria"}& {7.68000000000000}& {136}& {\mathrm{USA}}& {4594539}& {\mathrm{false}}\\ {\mathrm{Blackberry}}& {"Rubus"}& {9.61}& {180}& {\mathrm{Mexico}}& {\mathrm{undefined}}& {\mathrm{false}}\end{array}\right]$ (33)

Note: It is not recommended that DataFrames with strictly numeric indexed column labels are reordered. This may lead to a conflict between the column label and the index position.

Reordering rows

In the Reordering columns section, the ColumnLabels command was used to return the current order of the columns. The RowLabels command returns the current order of the rows:

 > RowLabels( DF );
 $\left[{\mathrm{Raspberry}}{,}{\mathrm{Grape}}{,}{\mathrm{Strawberry}}{,}{\mathrm{Blackberry}}\right]$ (34)

To reorder rows in a DataFrame, use the DataFrame index notation to specify the new order for the rows and reassign this to the existing DataFrame variable.

 > DF := DF[ [ Blackberry, Grape, Raspberry, Strawberry ], .. ];
 ${\mathrm{DF}}{≔}\left[\begin{array}{ccccccc}{}& {\mathrm{Genus}}& {\mathrm{Carbohydrates}}& {\mathrm{Energy}}& {\mathrm{Country}}& {\mathrm{Total Tons}}& {\mathrm{Botanical Berry}}\\ {\mathrm{Blackberry}}& {"Rubus"}& {9.61}& {180}& {\mathrm{Mexico}}& {\mathrm{undefined}}& {\mathrm{false}}\\ {\mathrm{Grape}}& {"Vitis"}& {18.1000000000000}& {288}& {\mathrm{China}}& {58500118}& {\mathrm{true}}\\ {\mathrm{Raspberry}}& {"Rubus"}& {11.9400000000000}& {220}& {\mathrm{Russia}}& {543421}& {\mathrm{false}}\\ {\mathrm{Strawberry}}& {"Fragaria"}& {7.68000000000000}& {136}& {\mathrm{USA}}& {4594539}& {\mathrm{false}}\end{array}\right]$ (35)

Note: It is not recommended that DataFrames with strictly numeric indexed row labels are reordered. This may lead to a conflict between the row label and the index position.

Sorting A DataFrame

One of the most common tasks when working with data is to order the data by ascending or descending numeric value, alphabetical order, or using some other ordering. The sort command orders a DataFrame by values in a selected column (DataSeries).

For example, to sort the DataFrame in order of ascending carbohydrate value:

 > sort( DF, Carbohydrates );
 $\left[\begin{array}{ccccccc}{}& {\mathrm{Genus}}& {\mathrm{Carbohydrates}}& {\mathrm{Energy}}& {\mathrm{Country}}& {\mathrm{Total Tons}}& {\mathrm{Botanical Berry}}\\ {\mathrm{Strawberry}}& {"Fragaria"}& {7.68000000000000}& {136}& {\mathrm{USA}}& {4594539}& {\mathrm{false}}\\ {\mathrm{Blackberry}}& {"Rubus"}& {9.61}& {180}& {\mathrm{Mexico}}& {\mathrm{undefined}}& {\mathrm{false}}\\ {\mathrm{Raspberry}}& {"Rubus"}& {11.9400000000000}& {220}& {\mathrm{Russia}}& {543421}& {\mathrm{false}}\\ {\mathrm{Grape}}& {"Vitis"}& {18.1000000000000}& {288}& {\mathrm{China}}& {58500118}& {\mathrm{true}}\end{array}\right]$ (36)

To sort the DataFrame by descending energy value:

 > sort( DF, Energy, > );
 $\left[\begin{array}{ccccccc}{}& {\mathrm{Genus}}& {\mathrm{Carbohydrates}}& {\mathrm{Energy}}& {\mathrm{Country}}& {\mathrm{Total Tons}}& {\mathrm{Botanical Berry}}\\ {\mathrm{Grape}}& {"Vitis"}& {18.1000000000000}& {288}& {\mathrm{China}}& {58500118}& {\mathrm{true}}\\ {\mathrm{Raspberry}}& {"Rubus"}& {11.9400000000000}& {220}& {\mathrm{Russia}}& {543421}& {\mathrm{false}}\\ {\mathrm{Blackberry}}& {"Rubus"}& {9.61}& {180}& {\mathrm{Mexico}}& {\mathrm{undefined}}& {\mathrm{false}}\\ {\mathrm{Strawberry}}& {"Fragaria"}& {7.68000000000000}& {136}& {\mathrm{USA}}& {4594539}& {\mathrm{false}}\end{array}\right]$ (37)

It is also possible to sort using a DataSeries containing strings:

 > sort( DF, Country );
 $\left[\begin{array}{ccccccc}{}& {\mathrm{Genus}}& {\mathrm{Carbohydrates}}& {\mathrm{Energy}}& {\mathrm{Country}}& {\mathrm{Total Tons}}& {\mathrm{Botanical Berry}}\\ {\mathrm{Grape}}& {"Vitis"}& {18.1000000000000}& {288}& {\mathrm{China}}& {58500118}& {\mathrm{true}}\\ {\mathrm{Blackberry}}& {"Rubus"}& {9.61}& {180}& {\mathrm{Mexico}}& {\mathrm{undefined}}& {\mathrm{false}}\\ {\mathrm{Raspberry}}& {"Rubus"}& {11.9400000000000}& {220}& {\mathrm{Russia}}& {543421}& {\mathrm{false}}\\ {\mathrm{Strawberry}}& {"Fragaria"}& {7.68000000000000}& {136}& {\mathrm{USA}}& {4594539}& {\mathrm{false}}\end{array}\right]$ (38)

Filtering data in a DataFrame

A very useful aspect of data frames is that they can be queried for subsets of the DataFrame that match a given query. Queries on data frames return a truth table whose entries are either true, false or FAIL depending on if a given element meets a given criteria.

For example, to return all of the berries that have greater than 10 g of carbohydrates per 100g:

 > DF[ Carbohydrates ] >~ 10;
 $\left[\begin{array}{cc}{\mathrm{Blackberry}}& {\mathrm{false}}\\ {\mathrm{Grape}}& {\mathrm{true}}\\ {\mathrm{Raspberry}}& {\mathrm{true}}\\ {\mathrm{Strawberry}}& {\mathrm{false}}\end{array}\right]$ (39)

The truth table is a DataSeries that can be used to index the main DataFrame. If the DataFrame is indexed by a DataSeries of type truefalse, it returns a filtered DataFrame:

 > DF[ DF[ Carbohydrates ] >~ 10 ];
 $\left[\begin{array}{ccccccc}{}& {\mathrm{Genus}}& {\mathrm{Carbohydrates}}& {\mathrm{Energy}}& {\mathrm{Country}}& {\mathrm{Total Tons}}& {\mathrm{Botanical Berry}}\\ {\mathrm{Grape}}& {"Vitis"}& {18.1000000000000}& {288}& {\mathrm{China}}& {58500118}& {\mathrm{true}}\\ {\mathrm{Raspberry}}& {"Rubus"}& {11.9400000000000}& {220}& {\mathrm{Russia}}& {543421}& {\mathrm{false}}\end{array}\right]$ (40)

It is possible to use the add command to tally the values of any column of a DataFrame, including the truth table generated using the above query:

 > add( DF[ Carbohydrates ] >~ 10 );
 ${2}{}{\mathrm{false}}{+}{2}{}{\mathrm{true}}$ (41)

This means that for the 4 individuals in the DataFrame, 2 out of 4 have greater then 10 g of carbohydrates per 100 g and 2 out of 4 have less than or equal to 10g of carbohydrates.

More advanced queries can be created by combining several queries into one statement using logical operators such as and, or, etc.

For example, which of the berries have less than 10 g of carbohydrates and greater than 150 kJ of energy?

 > DF[ DF[ Carbohydrates ] <~ 10 and DF[ Energy ] >~ 150 ];
 $\left[\begin{array}{ccccccc}{}& {\mathrm{Genus}}& {\mathrm{Carbohydrates}}& {\mathrm{Energy}}& {\mathrm{Country}}& {\mathrm{Total Tons}}& {\mathrm{Botanical Berry}}\\ {\mathrm{Blackberry}}& {"Rubus"}& {9.61}& {180}& {\mathrm{Mexico}}& {\mathrm{undefined}}& {\mathrm{false}}\end{array}\right]$ (42)

For more examples, see the Subsets of DataFrames example page.

What does with do?

The with command creates newly assigned variables that correspond to each of the column names.

 > with( DF );
 $\left[{\mathrm{Genus}}{,}{\mathrm{Carbohydrates}}{,}{\mathrm{Energy}}{,}{\mathrm{Country}}{,}{\mathrm{Total Tons}}{,}{\mathrm{Botanical Berry}}\right]$ (43)

These variables are each a DataSeries that has the same row labels as the parent DataFrame.

 > Energy;
 $\left[\begin{array}{cc}{\mathrm{Blackberry}}& {180}\\ {\mathrm{Grape}}& {288}\\ {\mathrm{Raspberry}}& {220}\\ {\mathrm{Strawberry}}& {136}\end{array}\right]$ (44)

Binding labels is useful for simplifying the query syntax:

 > Energy >~ 200;
 $\left[\begin{array}{cc}{\mathrm{Blackberry}}& {\mathrm{false}}\\ {\mathrm{Grape}}& {\mathrm{true}}\\ {\mathrm{Raspberry}}& {\mathrm{true}}\\ {\mathrm{Strawberry}}& {\mathrm{false}}\end{array}\right]$ (45)

The unwith command unassigns the column names:

 > unwith( DF );
 > Energy;
 ${\mathrm{Energy}}$ (46)

Reshaping data frames

The Transpose command returns the transpose of a DataFrame. This turns the variable columns into individual rows and the individual rows into variable columns.

 > Transpose( DF );
 $\left[\begin{array}{ccccc}{}& {\mathrm{Blackberry}}& {\mathrm{Grape}}& {\mathrm{Raspberry}}& {\mathrm{Strawberry}}\\ {\mathrm{Genus}}& {"Rubus"}& {"Vitis"}& {"Rubus"}& {"Fragaria"}\\ {\mathrm{Carbohydrates}}& {9.61}& {18.1000000000000}& {11.9400000000000}& {7.68000000000000}\\ {\mathrm{Energy}}& {180}& {288}& {220}& {136}\\ {\mathrm{Country}}& {\mathrm{Mexico}}& {\mathrm{China}}& {\mathrm{Russia}}& {\mathrm{USA}}\\ {\mathrm{Total Tons}}& {\mathrm{undefined}}& {58500118}& {543421}& {4594539}\\ {\mathrm{Botanical Berry}}& {\mathrm{false}}& {\mathrm{true}}& {\mathrm{false}}& {\mathrm{false}}\end{array}\right]$ (47)

Applying a function to columns of a DataFrame

Many top-level commands as well as commands in the Statistics package can be applied directly to data frames, however many commands do not work with data frames. For many functions, there simply is not a uniform application that can be sensibly applied to all the DataSeries in a DataFrame. In cases where one wants to apply a function to only a single DataSeries of a DataFrame, the ~ (element-wise) operator can be useful.

For example, say we wanted to round the values in the Carbohydrates column to the nearest integer value, the round command can be applied in the following way:

 > round~(DF[Carbohydrates]);
 $\left[\begin{array}{cc}{\mathrm{Blackberry}}& {10}\\ {\mathrm{Grape}}& {18}\\ {\mathrm{Raspberry}}& {12}\\ {\mathrm{Strawberry}}& {8}\end{array}\right]$ (48)

To change the values for this column in place, reassign the Carbohydrates column:

 > DF[Carbohydrates] := round~(DF[Carbohydrates]);
 ${{\mathrm{DF}}}_{{\mathrm{Carbohydrates}}}{≔}\left[\begin{array}{cc}{\mathrm{Blackberry}}& {10}\\ {\mathrm{Grape}}& {18}\\ {\mathrm{Raspberry}}& {12}\\ {\mathrm{Strawberry}}& {8}\end{array}\right]$ (49)

It may also be useful to apply a custom function to a DataSeries. For example, the following converts the values in the Energy column to Joules from KJoules.

 > (x -> x * 1000)~(DF[Energy]);
 $\left[\begin{array}{cc}{\mathrm{Blackberry}}& {180000}\\ {\mathrm{Grape}}& {288000}\\ {\mathrm{Raspberry}}& {220000}\\ {\mathrm{Strawberry}}& {136000}\end{array}\right]$ (50)

The following converts all the country names to strings:

 > (x -> convert( x, string) )~(DF[Country]);
 $\left[\begin{array}{cc}{\mathrm{Blackberry}}& {"Mexico"}\\ {\mathrm{Grape}}& {"China"}\\ {\mathrm{Raspberry}}& {"Russia"}\\ {\mathrm{Strawberry}}& {"USA"}\end{array}\right]$ (51)

Changing the datatype of a column

The datatypes option of the DataFrame constructor specifies the datatypes for each column of the DataFrame. The SubsDatatype command changes the value of the specified datatype and attempts to coerce the data contained in the column to that new datatype.

 > Datatypes( DF );
 $\left[{\mathrm{anything}}{,}{\mathrm{anything}}{,}{\mathrm{anything}}{,}{\mathrm{anything}}{,}{\mathrm{integer}}{,}{\mathrm{anything}}\right]$ (52)
 > SubsDatatype( DF, Energy, float );
 $\left[\begin{array}{ccccccc}{}& {\mathrm{Genus}}& {\mathrm{Carbohydrates}}& {\mathrm{Energy}}& {\mathrm{Country}}& {\mathrm{Total Tons}}& {\mathrm{Botanical Berry}}\\ {\mathrm{Blackberry}}& {"Rubus"}& {10}& {180.}& {\mathrm{Mexico}}& {\mathrm{undefined}}& {\mathrm{false}}\\ {\mathrm{Grape}}& {"Vitis"}& {18}& {288.}& {\mathrm{China}}& {58500118}& {\mathrm{true}}\\ {\mathrm{Raspberry}}& {"Rubus"}& {12}& {220.}& {\mathrm{Russia}}& {543421}& {\mathrm{false}}\\ {\mathrm{Strawberry}}& {"Fragaria"}& {8}& {136.}& {\mathrm{USA}}& {4594539}& {\mathrm{false}}\end{array}\right]$ (53)

It is important to note that this is not done in-place; in order to have a permanent change, reassignment is required.

 > DF := SubsDatatype( DF, Energy, float );
 ${\mathrm{DF}}{≔}\left[\begin{array}{ccccccc}{}& {\mathrm{Genus}}& {\mathrm{Carbohydrates}}& {\mathrm{Energy}}& {\mathrm{Country}}& {\mathrm{Total Tons}}& {\mathrm{Botanical Berry}}\\ {\mathrm{Blackberry}}& {"Rubus"}& {10}& {180.}& {\mathrm{Mexico}}& {\mathrm{undefined}}& {\mathrm{false}}\\ {\mathrm{Grape}}& {"Vitis"}& {18}& {288.}& {\mathrm{China}}& {58500118}& {\mathrm{true}}\\ {\mathrm{Raspberry}}& {"Rubus"}& {12}& {220.}& {\mathrm{Russia}}& {543421}& {\mathrm{false}}\\ {\mathrm{Strawberry}}& {"Fragaria"}& {8}& {136.}& {\mathrm{USA}}& {4594539}& {\mathrm{false}}\end{array}\right]$ (54)
 > Datatypes( DF );
 $\left[{\mathrm{anything}}{,}{\mathrm{anything}}{,}{{\mathrm{float}}}_{{8}}{,}{\mathrm{anything}}{,}{\mathrm{integer}}{,}{\mathrm{anything}}\right]$ (55)

Converting a DataFrame to other data types

It may be beneficial from time to time to convert all columns or some columns of a DataFrame to other data types in order to interact with other commands in the Maple language. The convert command can convert a DataFrame to a Matrix, table, Array or a nested list (by supplying the option nested to a conversion to list).

For example, let us convert the numeric columns of the DataFrame to a Matrix:

 > convert( DF[ [ Carbohydrates, Energy, Total Tons ] ], Matrix );
 $\left[\begin{array}{ccc}{10}& {180.}& {\mathrm{undefined}}\\ {18}& {288.}& {58500118}\\ {12}& {220.}& {543421}\\ {8}& {136.}& {4594539}\end{array}\right]$ (56)

The following converts the columns with strings or names data into a nested list:

 > convert( DF[ [ Genus, Country ] ], list, nested );
 $\left[\left[{"Rubus"}{,}{"Vitis"}{,}{"Rubus"}{,}{"Fragaria"}\right]{,}\left[{\mathrm{Mexico}}{,}{\mathrm{China}}{,}{\mathrm{Russia}}{,}{\mathrm{USA}}\right]\right]$ (57)

Data frames can also be converted to Arrays:

 > convert( DF[ Botanical Berry ], Array );
 $\left[\begin{array}{cccc}{\mathrm{false}}& {\mathrm{true}}& {\mathrm{false}}& {\mathrm{false}}\end{array}\right]$ (58)

In most of these cases, the conversion is lossy; the column names and row names are discarded. The only case that attempts to keep the meta-data is the conversion to table, where each table index is the row name for the entry value:

 > DFtable := convert( DF[ Genus ], table );
 ${\mathrm{DFtable}}{≔}{table}{}\left(\left[{\mathrm{Blackberry}}{=}{"Rubus"}{,}{\mathrm{Strawberry}}{=}{"Fragaria"}{,}{\mathrm{Grape}}{=}{"Vitis"}{,}{\mathrm{Raspberry}}{=}{"Rubus"}\right]\right)$ (59)
 > indices( DFtable );
 $\left[{\mathrm{Blackberry}}\right]{,}\left[{\mathrm{Strawberry}}\right]{,}\left[{\mathrm{Grape}}\right]{,}\left[{\mathrm{Raspberry}}\right]$ (60)
 > entries( DFtable );
 $\left[{"Rubus"}\right]{,}\left[{"Fragaria"}\right]{,}\left[{"Vitis"}\right]{,}\left[{"Rubus"}\right]$ (61)

Creating an empty DataFrame

In order to create an empty DataFrame, supply an empty list of lists as the first argument:

 > NewDF := DataFrame( [[]], columns = [Column1] );
 ${\mathrm{NewDF}}{≔}\left[\begin{array}{cc}{}& {\mathrm{Column1}}\end{array}\right]$ (62)

Observations can then be added to the DataFrame using the Append command. For more details, see the Adding a new column or row section.

 > NewDF := Append( NewDF, DataFrame( <<1>>, columns = [Column1] ) );
 ${\mathrm{NewDF}}{≔}\left[\begin{array}{cc}{}& {\mathrm{Column1}}\\ {1}& {1}\end{array}\right]$ (63)
 > NewDF := Append( NewDF, [2], label = Column2 );
 ${\mathrm{NewDF}}{≔}\left[\begin{array}{ccc}{}& {\mathrm{Column1}}& {\mathrm{Column2}}\\ {1}& {1}& {2}\end{array}\right]$ (64)

Importing data into a DataFrame

By default, the Import command returns data frames when importing many tabular file formats such as .csv, .tsv, .xls, or .xlsx files.

  (65)