One of the things I often do in R is sample rows from a data frame. This is especially useful when you have too much data or unbalanced data. There are plenty of tutorials on the internet on how to sample rows from a data frame using dplyr, but I personally like to keep my dependencies to a minimum to make my work as replicable as possible. Therefore, here is how to sample rows from a data frame using only base R functions.
First, let us assume that we have a data frame called df. I will show two ways of sampling rows from this data frame.
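To follow along without your own data, a small stand-in data frame is enough. The sketch below builds one; the column names (id, group, value) and the 15,000 rows are assumptions made purely for illustration.

```r
# Hypothetical stand-in for df; column names and size are illustrative only.
set.seed(42)  # fix the random number generator so the example is repeatable
n_rows <- 15000
df <- data.frame(
  id    = seq_len(n_rows),
  group = sample(c("a", "b", "c"), size = n_rows, replace = TRUE),
  value = rnorm(n_rows)
)

nrow(df)  # number of rows available for sampling
```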
Sample a fixed number of rows
In the first situation, I will sample a fixed number of rows, 10 in this case:
N_SAMPLE <- 10
Now, let us look at the actual sampling. To do this, we will use the base R function sample. This function is actually used to sample elements from a vector, so we will have to come up with a clever way to instead use it to sample data frame rows. Take a look at the following snippet:
sample_row_indices <- sample(nrow(df),
                             size = N_SAMPLE,
                             replace = FALSE)
The arguments of sample are used as follows:
- nrow(df): we tell sample to sample from the row numbers 1 to nrow(df), a vector with the same length as the number of rows in our data frame
- size = N_SAMPLE: we tell sample that we want to sample 10 numbers from this vector of numbers
- replace = FALSE: we tell sample to make sure that sampled numbers cannot be sampled again
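The trick rests on a documented behaviour of sample: when its first argument is a single number n (at least 1), it draws from 1:n. A quick check of that behaviour, using a made-up n of 5 in place of nrow(df):

```r
# When x is a single number n >= 1, sample() draws from 1:n.
set.seed(1)
n <- 5  # illustrative value, standing in for nrow(df)
draw <- sample(n, size = 3, replace = FALSE)

all(draw %in% seq_len(n))  # every drawn value is a valid index between 1 and n
anyDuplicated(draw) == 0   # no index is drawn twice, since replace = FALSE
```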
If we look at the output of the sample function, we see that, indeed, it returned a vector of 10 row indices of our data frame:
> sample_row_indices
[1] 2287 13695 7677 7453 12082 10487 8592 8567 13740 2598
To get the rows corresponding to these indices, we simply index the original data frame using our newly sampled indices:
df_sample <- df[sample_row_indices,]
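Because sample is random, each run yields different rows. If you want the same sample every time, which fits the replicability goal mentioned at the start, you can fix the seed with set.seed before sampling. A minimal sketch, assuming a stand-in data frame (its columns and the seed value 123 are arbitrary choices for illustration):

```r
# Stand-in data; replace with your own df.
df <- data.frame(id = 1:1000, value = sqrt(1:1000))
N_SAMPLE <- 10

set.seed(123)  # arbitrary seed; any fixed integer works
first_draw <- sample(nrow(df), size = N_SAMPLE, replace = FALSE)

set.seed(123)  # resetting the seed reproduces the same indices
second_draw <- sample(nrow(df), size = N_SAMPLE, replace = FALSE)

identical(first_draw, second_draw)  # same seed, same sample
df_sample <- df[first_draw, ]
nrow(df_sample)                     # equals N_SAMPLE
```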
Sample a percentage of the total number of rows
In the second situation, I will sample a percentage of the total number of rows in the data frame. In this case, I will sample roughly 70% of the rows.
SAMPLE_SHARE <- 0.7
Now, let us look at the actual sampling. Again, we will use the base R function sample.
sample_row_included <- sample(c(TRUE, FALSE),
                              size=nrow(df),
                              replace=TRUE,
                              prob=c(SAMPLE_SHARE, 1 - SAMPLE_SHARE))
The arguments of sample are used as follows:
- c(TRUE, FALSE): we tell sample to sample from a vector with only two elements: TRUE or FALSE (in our sample, not in our sample)
- size=nrow(df): we tell sample that we want to draw as many values as there are rows in the data frame
  - We do this because we want a TRUE/FALSE result for all rows of the data frame
- replace=TRUE: we tell sample that values can be sampled again, which is necessary because we draw many more values than the two elements in the vector
- prob=c(SAMPLE_SHARE, 1 - SAMPLE_SHARE): we tell sample that we want TRUE (in the sample) to have a probability of SAMPLE_SHARE (= 0.7), while FALSE (not in the sample) gets a probability of 1 - SAMPLE_SHARE (= 0.3)
If we look at the output of the sample function, we see that it returned a vector specifying for each row in our data frame whether it should be included in the sample or not:
> sample_row_included
[1] TRUE TRUE TRUE TRUE TRUE TRUE TRUE FALSE FALSE FALSE ...
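Note that each row is included independently with probability SAMPLE_SHARE, so the realised share is close to, but usually not exactly, 70%. One way to check this, assuming a stand-in data frame of 10,000 rows and an arbitrary seed:

```r
# Stand-in data; replace with your own df.
df <- data.frame(id = 1:10000)
SAMPLE_SHARE <- 0.7

set.seed(2024)  # arbitrary seed for repeatability
sample_row_included <- sample(c(TRUE, FALSE),
                              size = nrow(df),
                              replace = TRUE,
                              prob = c(SAMPLE_SHARE, 1 - SAMPLE_SHARE))

table(sample_row_included)  # counts of FALSE and TRUE
mean(sample_row_included)   # realised share of TRUE, near SAMPLE_SHARE
```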
To get the rows corresponding to our sample, we simply index the original data frame using our newly created vector of TRUEs and FALSEs:
df_sample <- df[sample_row_included, ]
To get the rows NOT corresponding to our sample, we negate the vector of TRUEs and FALSEs with the ! operator, so TRUE becomes FALSE and FALSE becomes TRUE:
df_negative_sample <- df[!sample_row_included, ]
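Together, the two data frames form a partition of df, which is handy for something like a train/test split. If you need exactly 70% of the rows rather than approximately 70%, one alternative is to compute the row count first and reuse the fixed-number method. This is a variation on the approach in this section, sketched here under the assumption of a stand-in data frame; the name n_sample is introduced only for this example:

```r
# Stand-in data; replace with your own df.
df <- data.frame(id = 1:1000)
SAMPLE_SHARE <- 0.7

set.seed(7)  # arbitrary seed for repeatability
n_sample <- round(SAMPLE_SHARE * nrow(df))  # exact number of rows to keep
sample_row_indices <- sample(nrow(df), size = n_sample, replace = FALSE)

# drop = FALSE keeps the result a data frame even with a single column.
df_sample <- df[sample_row_indices, , drop = FALSE]
df_negative_sample <- df[-sample_row_indices, , drop = FALSE]  # all other rows

nrow(df_sample) + nrow(df_negative_sample) == nrow(df)  # the parts cover df
```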