Chapter 10 Subsetting and filtering R objects

(Subsetting with the [], [[]], and $ operators, filtering)

10.1 Overview

10.1.1 Abstract:

Subsetting and filtering are among the most important operations on data. R provides powerful syntax for these operations. Learn about and practice them in this unit.

10.1.2 Objectives:

This unit will:

  • introduce subsetting principles;
  • practice them on data.

10.1.3 Outcomes:

After working through this unit you:

  • can subset and filter data according to six different principles.

10.1.4 Deliverables:

Time management: Before you begin, estimate how long it will take you to complete this unit. Then, record in your course journal: the number of hours you estimated, the number of hours you worked on the unit, and the amount of time that passed between start and completion of this unit.

Journal: Document your progress in your Course Journal. Some tasks may ask you to include specific items in your journal. Don't overlook these.

Insights: If you find something particularly noteworthy about this unit, make a note in your insights! page.

10.1.5 Prerequisites

RPR-Objects-Lists (R "Lists")

10.2 Subsetting

10.3 Task 20

Load the R-Exercise_BasicSetup project in RStudio if you don't already have it open. Type init() as instructed after the project has loaded. Continue below.

We have encountered "subsetting" before, but we really need to discuss it in more detail. It is one of the most important topics in R because it is indispensable for selecting, transforming, and otherwise modifying data to prepare it for analysis. You have seen that we use square brackets to indicate individual elements in vectors and matrices. These square brackets are actually "operators", and you can find more information about them in the help pages:

?"["     # Note that you need quotation marks around the operator for this.

Note especially:

  • [ ] "extracts" one or more elements defined within the brackets;
  • [[ ]] "extracts" a single element defined within the brackets;
  • $ "extracts" a single named element.
  • An "element" is not necessarily a scalar: it can be a row, a column, or a more complex data structure. But a "single element" cannot be a range or a collection of elements.
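The difference between the three operators is easiest to see on a list. The following is a minimal sketch; the list pl and its contents are hypothetical and were made up only for this illustration:

```r
# Hypothetical list, used only to illustrate the three operators.
pl <- list(name = "pUC19", size = 2686, sites = c("EcoRI", "SacI", "SmaI"))

a <- pl["sites"]      # [ ] keeps the container: a list of length 1
b <- pl[["sites"]]    # [[ ]] returns the element itself: a character vector
d <- pl$sites         # $ returns the same single element, selected by name

class(a)              # "list"
class(b)              # "character"
identical(b, d)       # TRUE

# [ ] can take several names at once; [[ ]] and $ select only one element.
two <- pl[c("name", "size")]
length(two)           # 2
```

Note that the "single element" returned by [[ ]] is here a vector of three strings: single refers to one element of the list, not to a scalar.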

Here are some examples of subsetting data from the plasmidData data frame we constructed previously. For the most part, this is review:

plasmidData[1, ]
##    Name Size Marker   Ori                                         Sites
## 1 pUC19 2686    Amp ColE1 EcoRI, SacI, SmaI, BamHI, XbaI, PstI, HindIII
plasmidData[2, ]
##     Name Size   Marker   Ori                Sites
## 2 pBR322 4361 Amp, Tet ColE1 EcoRI, ClaI, HindIII

We can extract more than one row by specifying the rows we want in a vector ...

plasmidData[c(1, 2), ]
##     Name Size   Marker   Ori                                         Sites
## 1  pUC19 2686      Amp ColE1 EcoRI, SacI, SmaI, BamHI, XbaI, PstI, HindIII
## 2 pBR322 4361 Amp, Tet ColE1                          EcoRI, ClaI, HindIII

... this works in any order ...

plasmidData[c(3, 1), ]
##       Name Size   Marker   Ori                                         Sites
## 3 pACYC184 4245 Tet, Cam  p15A                                 ClaI, HindIII
## 1    pUC19 2686      Amp ColE1 EcoRI, SacI, SmaI, BamHI, XbaI, PstI, HindIII

... and for any number of rows, including repeated ones (note how R makes the row names unique) ...

plasmidData[c(1, 2, 1, 2, 1, 2), ]
##       Name Size   Marker   Ori                                         Sites
## 1    pUC19 2686      Amp ColE1 EcoRI, SacI, SmaI, BamHI, XbaI, PstI, HindIII
## 2   pBR322 4361 Amp, Tet ColE1                          EcoRI, ClaI, HindIII
## 1.1  pUC19 2686      Amp ColE1 EcoRI, SacI, SmaI, BamHI, XbaI, PstI, HindIII
## 2.1 pBR322 4361 Amp, Tet ColE1                          EcoRI, ClaI, HindIII
## 1.2  pUC19 2686      Amp ColE1 EcoRI, SacI, SmaI, BamHI, XbaI, PstI, HindIII
## 2.2 pBR322 4361 Amp, Tet ColE1                          EcoRI, ClaI, HindIII

The same works for columns. Note that selecting a single column returns a vector, not a data frame:

plasmidData[ , 2 ]
## [1] 2686 4361 4245

We can select rows and columns by name if a name has been defined...

plasmidData[, "Name"]
## [1] "pUC19"    "pBR322"   "pACYC184"
plasmidData$Name      # different syntax, same thing. This is the syntax I use most frequently.
## [1] "pUC19"    "pBR322"   "pACYC184"
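Rows can be selected by name in the same way once row names have been set. The sketch below rebuilds a reduced copy of the data frame (called pd here, with only three of the columns, typed in from the output shown above) so that it runs on its own. Using the plasmid names as row names is an assumption made for this example, not something the course project necessarily does:

```r
# Reduced, hand-typed copy of the plasmid data (assumption: three columns only).
pd <- data.frame(Name = c("pUC19", "pBR322", "pACYC184"),
                 Size = c(2686, 4361, 4245),
                 Ori  = c("ColE1", "ColE1", "p15A"),
                 stringsAsFactors = FALSE)

rownames(pd) <- pd$Name          # define row names so rows can be selected by name

pd["pBR322", "Size"]             # one row, one column, both by name
pd[c("pACYC184", "pUC19"), c("Name", "Ori")]   # several rows and columns by name

# A single column is returned as a vector unless we ask R not to
# drop the data frame structure.
sizes <- pd[ , "Size", drop = FALSE]
class(sizes)                     # "data.frame"
```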

Watch this!

plasmidData$Name[plasmidData$Ori != "ColE1"]
## [1] "pACYC184"

What happened here? plasmidData$Ori != "ColE1" is a logical expression. It evaluates to a vector of TRUE/FALSE values, one for each element of plasmidData$Ori:

plasmidData$Ori != "ColE1"
## [1] FALSE FALSE  TRUE

We insert this logical vector into the square brackets. R then returns all elements for which the vector is TRUE (applied to the row index of a data frame, it returns the matching rows). In this way we can "filter" for values:

plasmidData$Size > 3000
## [1] FALSE  TRUE  TRUE
plasmidData$Name[plasmidData$Size > 3000]
## [1] "pBR322"   "pACYC184"
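Logical vectors can be combined with & (and), | (or) and ! (not) before they are used as a filter, and which() converts them to indices. A minimal sketch, again on a reduced, hand-typed copy of the plasmid data (the name pd is an assumption of this example):

```r
# Reduced, hand-typed copy of the plasmid data (assumption: three columns only).
pd <- data.frame(Name = c("pUC19", "pBR322", "pACYC184"),
                 Size = c(2686, 4361, 4245),
                 Ori  = c("ColE1", "ColE1", "p15A"),
                 stringsAsFactors = FALSE)

# Both conditions must hold: larger than 3000 bp AND a ColE1 origin.
sel <- pd$Size > 3000 & pd$Ori == "ColE1"
sel              # FALSE  TRUE FALSE

pd$Name[sel]     # filter a vector:      "pBR322"
pd[sel, ]        # filter rows of the data frame (note the comma)
which(sel)       # the indices of the TRUE values: 2
sum(sel)         # TRUE counts as 1, so this counts the matches: 1
```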

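Index vectors are also what we use when we want to "sort" an object by some value. The function order() does not return the sorted values themselves; it returns the indices that would put the values into sorted order. Using these indices to subset the rows returns the data frame in sorted order. Remember this: not sort() but order().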

order(plasmidData$Size)
## [1] 1 3 2
plasmidData[order(plasmidData$Size), ]
##       Name Size   Marker   Ori                                         Sites
## 1    pUC19 2686      Amp ColE1 EcoRI, SacI, SmaI, BamHI, XbaI, PstI, HindIII
## 3 pACYC184 4245 Tet, Cam  p15A                                 ClaI, HindIII
## 2   pBR322 4361 Amp, Tet ColE1                          EcoRI, ClaI, HindIII
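To see why order() and not sort() is the right tool for data frames, compare the two on the Size values alone. This is a small sketch; the vector sizes is typed in from the output above:

```r
sizes <- c(2686, 4361, 4245)     # the Size column, typed in by hand

sort(sizes)                      # the sorted values:        2686 4245 4361
order(sizes)                     # the indices that sort it: 1 3 2
sizes[order(sizes)]              # subsetting with those indices gives sort(sizes)

# sort() loses the link to the other columns of a data frame;
# order() gives row indices that can rearrange every column at once.
idx <- order(sizes, decreasing = TRUE)
idx                              # largest first: 2 3 1
```

The same idx could be placed in the row position of the square brackets, as in plasmidData[idx, ], to list the plasmids from largest to smallest.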

grep() matches a pattern (a regular expression) against the elements of a character vector and returns the indices of the elements that match:

grep("Tet", plasmidData$Marker)
## [1] 2 3
plasmidData[grep("Tet", plasmidData$Marker), ]
##       Name Size   Marker   Ori                Sites
## 2   pBR322 4361 Amp, Tet ColE1 EcoRI, ClaI, HindIII
## 3 pACYC184 4245 Tet, Cam  p15A        ClaI, HindIII
plasmidData[grep("Tet", plasmidData$Marker), "Ori"]
## [1] "ColE1" "p15A"
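grepl() is the logical counterpart of grep(): it returns TRUE/FALSE for every element instead of indices, which makes it easy to combine or negate patterns. The sketch below uses the Marker values typed in from the output above; the pattern "Kan" is a hypothetical example of a marker that does not occur in the data:

```r
marker <- c("Amp", "Amp, Tet", "Tet, Cam")    # the Marker column, typed in by hand

grep("Tet", marker)                           # indices:        2 3
grepl("Tet", marker)                          # logical vector: FALSE TRUE TRUE

both <- grepl("Tet", marker) & grepl("Amp", marker)   # Tet AND Amp
marker[both]                                  # "Amp, Tet"

# Excluding matches: negate the logical vector.
noKan <- marker[!grepl("Kan", marker)]        # nothing matches, so all three remain

# Caution: negative indices from grep() fail silently when nothing matches,
# because grep() then returns an empty vector and the result is empty too.
oops <- marker[-grep("Kan", marker)]          # character(0)
```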

Elements that can be extracted from an object also can be replaced. Simply assign the new value to the element.

( x <- sample(1:10) )
##  [1]  1  3  9  7  5  8  2 10  4  6
x[4] <- 99
x
##  [1]  1  3  9 99  5  8  2 10  4  6
( x <- x[order(x)] )
##  [1]  1  2  3  4  5  6  8  9 10 99
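Replacement works with every kind of subsetting shown in this unit, not only with a single index. A minimal sketch with a fixed vector (chosen by hand here so that the result does not depend on sample()):

```r
x <- c(1, 3, 9, 99, 5, 8)

x[x > 10] <- NA            # replace by logical filter:  1  3  9 NA  5  8
x[is.na(x)] <- 0           # replace the missing value:  1  3  9  0  5  8
x[c(1, 2)] <- c(10, 20)    # replace several positions: 10 20  9  0  5  8
x
```

The same pattern applies to a column of a data frame, for example an assignment of the form plasmidData$Size[plasmidData$Name == "pUC19"] <- 2686.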

Try your own subsetting ideas. Play with this. I find that even seasoned investigators have problems with subsetting their data, and if you become comfortable with the many ways of subsetting, you will be ahead of the game right away.

10.4 Task 21

  • The R-Exercise_BasicSetup project contains a file subsettingPractice.R
  • Open the file and work through it.

10.5 Self-evaluation