4 ::: Subsetting in R
An example
Boolean variables
Subsetting
Expanding the example
Stacking conditions
Returning certain indexes
Consider the Bumpus dataset:
> bumpus <- read.delim("http://www.stat.ucla.edu/~david/bumpus.txt", header = TRUE)
> attach(bumpus)
In this data set there are three columns of particular interest to this example: Length_mm, Survival, and Survived. To look at the first 5 bird lengths, use Length_mm[1:5]. Likewise, to look at the first five elements of Survival, add the [1:5] to the end:
> Length_mm[1:5]
[1] 154 165 160 160 155
> Survival[1:5]
[1] TRUE FALSE FALSE TRUE TRUE
Here, Survival is a vector where TRUE means the bird survived. Because it is a TRUE/FALSE vector, it can be used directly as a condition for subsetting:
> Length_mm[Survival] # same as Length_mm[Survival == TRUE]
 [1] 154 160 155 154 156 161 157 159 158 158 160 162 161 160 159 158 159 166 159
[20] 160 161 163 156 165 160 158 160 157 159 160 158 161 160 160 153 156 156 163
[39] 163 160 145 162 163 164 163 160 160 158 158 158 155 156 154 153 153 155 163
[58] 157 155 164 158 158 160 161 157 157 156 158 153 155 163 159
The only lengths returned are those that correspond to a TRUE value in Survival. (Note that the vector begins with 154 160 155, which are the 3 values in Length_mm[1:5] that correspond to TRUE in Survival[1:5].) Likewise, a vector of the lengths of birds that did not survive may be obtained by using
> Length_mm[Survival == FALSE] # same as using Length_mm[!Survival]
 [1] 165 160 161 162 163 162 163 161 160 162 160 161 162 165 161 161 162 164 158
[20] 162 156 166 165 166 160 156 158 166 165 157 164 166 167 161 166 161 155 156
[39] 160 152 160 155 157 165 153 162 162 159 159 155 162 152 159 155 163 163 156
[58] 159 161 155 162 153 162 164
The exclamation point in Length_mm[!Survival] negates the vector: every TRUE value becomes FALSE and every FALSE value becomes TRUE.
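The effect of ! can be checked on a small scale. The sketch below rebuilds only the first five values of Length_mm and Survival shown earlier, so it runs without downloading the data set; the names len5 and surv5 are made up for this illustration.

```r
# First five values of Length_mm and Survival, copied from the output above.
# len5 and surv5 are illustrative names, not columns of the Bumpus data set.
len5  <- c(154, 165, 160, 160, 155)
surv5 <- c(TRUE, FALSE, FALSE, TRUE, TRUE)

not_surv5 <- !surv5         # flips every TRUE to FALSE and every FALSE to TRUE
survivors <- len5[surv5]    # keeps the lengths where surv5 is TRUE
died      <- len5[!surv5]   # keeps the lengths where surv5 is FALSE

print(not_surv5)
print(survivors)
print(died)
```

Together, survivors and died account for every element of len5 exactly once, which is why the two full outputs above contain 72 and 64 values out of 136.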
Above, Survival is a vector of Boolean values, which just means the values are either TRUE or FALSE. For subsetting in R, it is often most convenient to have a Boolean vector where each TRUE marks a value that should be returned and each FALSE marks a value that should be dropped. This concept is central to obtaining subsets of data quickly and easily in R, and it is what produced the last two Length_mm outputs in the example above. Boolean vectors may also be created from other vectors by using ==. The vector Survived is similar to Survival except that it is not Boolean; it is a factor with levels instead. (In R 4.0.0 and later, read.delim returns such a column as a character vector unless stringsAsFactors = TRUE is passed, so the Levels line may not appear; the comparison with == works either way.)
> Survived[1:5] # the first 5 elements of Survived
[1] Survived Died Died Survived Survived
Levels: Died Survived
To make Survived into a Boolean vector, use ==:
> Survived == "Survived"
[1] TRUE FALSE FALSE TRUE TRUE FALSE TRUE FALSE TRUE FALSE TRUE FALSE ...
(Only the first line of the output has been shown.) The vector returned is identical to Survival.
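The same comparison can be tried on the first five entries alone. This is a minimal sketch, assuming a factor built by hand from the values printed above; surv_factor and surv_logical are illustrative names.

```r
# First five entries of Survived and Survival, copied from the output above.
# surv_factor and surv_logical are illustrative names.
surv_factor  <- factor(c("Survived", "Died", "Died", "Survived", "Survived"))
surv_logical <- c(TRUE, FALSE, FALSE, TRUE, TRUE)

# == compares each element against the label and returns a Boolean vector.
from_factor <- surv_factor == "Survived"

# identical() checks that the two Boolean vectors match exactly.
same <- identical(from_factor, surv_logical)

print(from_factor)
print(same)
```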
As discussed above, a Boolean vector is convenient for subsetting. Look back at the example. There were different ways to create a Boolean vector from the vector Survival. The first line of output from each condition is listed below:
> Survival
[1] TRUE FALSE FALSE TRUE TRUE FALSE TRUE FALSE TRUE FALSE TRUE FALSE ...
> Survival == FALSE
[1] FALSE TRUE TRUE FALSE FALSE TRUE FALSE TRUE FALSE TRUE FALSE TRUE ...
> !Survival
[1] FALSE TRUE TRUE FALSE FALSE TRUE FALSE TRUE FALSE TRUE FALSE TRUE ...
Each of these vectors has the original length, 136 values. Each value corresponds to a value in Length_mm, which is, not by coincidence, 136 entries long. If a Boolean vector used for subsetting differs in length from the vector holding the values of interest, R does not stop with an error, so the result can be silently wrong.
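What happens with mismatched lengths can be seen on a small vector. In this sketch, len5 holds the first five lengths from above, while short_idx and long_idx are made-up index vectors: a logical index shorter than the data is recycled, and one longer than the data yields NA for the extra positions.

```r
# First five lengths from the example; the index vectors are hypothetical.
len5 <- c(154, 165, 160, 160, 155)

# Shorter index: R silently recycles it to TRUE FALSE TRUE FALSE TRUE.
short_idx <- c(TRUE, FALSE)
recycled  <- len5[short_idx]

# Longer index: the sixth TRUE points past the end of len5, so NA is returned.
long_idx <- c(TRUE, FALSE, FALSE, TRUE, TRUE, TRUE)
padded   <- len5[long_idx]

print(recycled)
print(padded)

# A cheap guard before subsetting: confirm the lengths agree.
lengths_match <- length(short_idx) == length(len5)
print(lengths_match)
```

Checking length() on both vectors before subsetting, as in the last lines, is one simple way to catch this.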
Think back to Length_mm[Survival] and its first 5 elements as an example. Because the first element of Survival was TRUE, the first element of Length_mm, 154, is included in the vector returned. In the opposite case, since element 2 in Survival is FALSE, element 2 of Length_mm is not returned.
This time, Survived will be used instead of Survival. Using the method from the 'Boolean variables' section:
> Length_mm[Survived == "Survived"] # returns lengths of surviving birds
 [1] 154 160 155 154 156 161 157 159 158 158 160 162 161 160 159 158 159 166 159
[20] 160 161 163 156 165 160 158 160 157 159 160 158 161 160 160 153 156 156 163
[39] 163 160 145 162 163 164 163 160 160 158 158 158 155 156 154 153 153 155 163
[58] 157 155 164 158 158 160 161 157 157 156 158 153 155 163 159
> Length_mm[Survived == "Died"] # returns lengths of birds that died
 [1] 165 160 161 162 163 162 163 161 160 162 160 161 162 165 161 161 162 164 158
[20] 162 156 166 165 166 160 156 158 166 165 157 164 166 167 161 166 161 155 156
[39] 160 152 160 155 157 165 153 162 162 159 159 155 162 152 159 155 163 163 156
[58] 159 161 155 162 153 162 164
The two commands above are equivalent to Length_mm[Survival] and Length_mm[Survival == FALSE], respectively.
Each result above is itself a vector, so a subset of it can be taken in turn; that is, conditions can be stacked if this is done carefully. To see only part of the output, add a second set of brackets:
> Length_mm[Survival]
 [1] 154 160 155 154 156 161 157 159 158 158 160 162 161 160 159 158 159 166 159
[20] 160 161 163 156 165 160 158 160 157 159 160 158 161 160 160 153 156 156 163
[39] 163 160 145 162 163 164 163 160 160 158 158 158 155 156 154 153 153 155 163
[58] 157 155 164 158 158 160 161 157 157 156 158 153 155 163 159
> Length_mm[Survival][1:10]
 [1] 154 160 155 154 156 161 157 159 158 158
In the second command, only the first 10 values of the output are returned. This stacking may be useful for viewing small amounts of output.
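Stacking can be sketched on the five-value example as well. The names len5 and surv5 are illustrative, and head() is shown only as a base R alternative to the second pair of brackets, not as something the example above uses.

```r
# First five values of Length_mm and Survival from the example.
len5  <- c(154, 165, 160, 160, 155)
surv5 <- c(TRUE, FALSE, FALSE, TRUE, TRUE)

survivors <- len5[surv5]         # first subset: lengths of survivors
first_two <- len5[surv5][1:2]    # second subset applied to that result
via_head  <- head(len5[surv5], 2)  # head() returns the same first two values

print(survivors)
print(first_two)
print(via_head)
```

The second index refers to positions in the already-subsetted vector, not in len5: position 2 of the result is 160, which was element 4 of len5.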
Besides Boolean vectors, a numerical vector (i.e., a vector of integers) can also be used for subsetting. If a vector of positive integers is put in the brackets, then only the corresponding entries will be returned:
> even <- c(2,4,6,8,10,12,14,16,18,20) # same as seq(2,20,2)
> even[c(1,3,6)]
[1] 2 6 12
In this example, even is a vector of even integers. Because c(1,3,6) is a vector of the numbers 1 3 6, even[c(1, 3, 6)] returns the 1st, 3rd, and 6th values of even. It is also okay to use a negative sign to remove specific elements:
> even[-c(1,3,6)]
[1] 4 8 10 14 16 18 20
As can be seen, the elements that were returned in even[c(1,3,6)] are no longer included. Although this use of negative integers is slightly more complicated, it can be very useful.
Note: Negative and positive integers may not be combined in the brackets.
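The note can be verified directly. In the sketch below, tryCatch() is used only so the script keeps running after the failed subset; the string "error" is a made-up marker, and the exact wording of R's own error message may vary between versions.

```r
even <- c(2, 4, 6, 8, 10, 12, 14, 16, 18, 20)

kept    <- even[c(1, 3, 6)]    # positive integers: keep these positions
dropped <- even[-c(1, 3, 6)]   # negative integers: drop these positions

# Mixing signs in one index is an error; tryCatch() captures it here
# and returns the placeholder string "error" instead of stopping.
mixed <- tryCatch(even[c(-1, 3)], error = function(e) "error")

print(kept)
print(dropped)
print(mixed)
```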