anthe.sevenants

How to sample rows from a data frame in base R

2023-02-01

One of the things I often do in R is sampling rows from a data frame. This is especially useful when you have too much or unbalanced data. There are plenty of tutorials on the internet on how to sample rows from a data frame using dplyr, but I personally like to keep my dependencies to a minimum in order to guarantee maximum replicability. Therefore, here is how to sample rows from a data frame using only base R functions.

First, let us assume that we have a data frame called df. I will show two ways of sampling rows from this data frame.
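To follow along without real data, a stand-in data frame can be built as below. This is only a sketch: the column names, the 15,000 rows and the seed are assumptions made for illustration, not the data used in this post.

```r
# Hypothetical stand-in for df; names, size and seed are illustrative assumptions
set.seed(42)  # fix the random number generator so the example is repeatable

n_rows <- 15000
df <- data.frame(
  id    = seq_len(n_rows),
  group = sample(c("a", "b", "c"), size = n_rows, replace = TRUE),
  value = rnorm(n_rows)
)

nrow(df)  # number of rows available for sampling
```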

Sample a fixed number of rows

In the first situation, I will sample a fixed number of rows, 10 in this case:

N_SAMPLE <- 10

Now, let us look at the actual sampling. To do this, we will use the base R function sample. This function is actually used to sample elements from a vector, so we will have to come up with a clever way to instead use it to sample data frame rows. Take a look at the following snippet:

sample_row_indices <- sample(nrow(df),
                             size = N_SAMPLE,
                             replace = FALSE)

The arguments of sample are used as follows:

  1. nrow(df): when sample receives a single number n instead of a vector, it samples from the numbers 1 to n, so here we tell sample to sample from the row numbers of our data frame
  2. size = N_SAMPLE: we tell sample that we want to sample 10 numbers from the vector of numbers
  3. replace = FALSE: we tell sample to make sure that sampled numbers cannot be sampled again
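The first argument relies on a documented convenience of sample: a single number n of at least 1 is treated as the vector 1:n. A small sketch, with a made-up n and seed, suggests that the shorthand and the explicit form pick the same indices:

```r
n <- 100  # hypothetical number of rows

set.seed(1)
shorthand <- sample(n, size = 10, replace = FALSE)

set.seed(1)
explicit <- sample(seq_len(n), size = 10, replace = FALSE)

identical(shorthand, explicit)  # compares the two sets of indices
anyDuplicated(shorthand)        # 0 means no index was drawn twice
```

Calling set.seed before sample is also what makes a sample repeatable between runs.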

If we look at the output of the sample function, we see that, indeed, it returned a vector of 10 row indices of our data frame:

> sample_row_indices

[1]  2287 13695  7677  7453 12082 10487  8592  8567 13740  2598

To get the rows corresponding to these indices, we simply index the original data frame using our newly sampled indices:

df_sample <- df[sample_row_indices, ]
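The two steps can be wrapped in a small helper. The function below is a hypothetical sketch (sample_n_rows is not a base R function, and the toy data is made up). It adds drop = FALSE, because indexing a data frame with a single column would otherwise return a plain vector.

```r
# Hypothetical helper (not part of base R): sample n rows without replacement
sample_n_rows <- function(data, n) {
  # Without replacement we cannot draw more rows than the data frame has
  stopifnot(is.data.frame(data), n >= 0, n <= nrow(data))
  row_indices <- sample(nrow(data), size = n, replace = FALSE)
  # drop = FALSE keeps the result a data frame, even with one column
  data[row_indices, , drop = FALSE]
}

toy <- data.frame(x = 1:20, y = letters[1:20])  # made-up data
set.seed(123)
toy_sample <- sample_n_rows(toy, 5)
nrow(toy_sample)
```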

Sample a percentage of the total number of rows

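In the second situation, I will sample a percentage of the total number of rows in the data frame. In this case, I will sample roughly 70% of the rows: each row is included with a probability of 0.7, so the exact number of sampled rows varies slightly from run to run.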

SAMPLE_SHARE <- 0.7

Now, let us look at the actual sampling. Again, we will use the base R function sample.

sample_row_included <- sample(c(TRUE, FALSE),
                              size=nrow(df),
                              replace=TRUE,
                              prob=c(SAMPLE_SHARE, 1 - SAMPLE_SHARE))

The arguments of sample are used as follows:

  1. c(TRUE, FALSE): we tell sample to sample from a vector with only two elements: TRUE (in our sample) and FALSE (not in our sample)
  2. size=nrow(df): we tell sample that we want as many draws as there are rows in the data frame
    • We do this because we want a TRUE/FALSE result for every row of the data frame
  3. replace=TRUE: we tell sample that values can be drawn again; this is required because we draw far more values than the two elements in the vector
  4. prob=c(SAMPLE_SHARE, 1 - SAMPLE_SHARE): we tell sample that we want TRUE (in the sample) to have a probability of SAMPLE_SHARE (= 0.7), while FALSE (not in the sample) gets a probability of 1 - SAMPLE_SHARE (= 0.3)

If we look at the output of the sample function, we see that it returned a vector specifying for each row in our data frame whether it should be included in the sample or not:

> sample_row_included

[1]  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE FALSE FALSE FALSE ...
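Because every row is decided by its own independent draw, the share of TRUE values will be close to, but usually not exactly, SAMPLE_SHARE. A quick check, sketched here with a made-up row count and seed:

```r
SAMPLE_SHARE <- 0.7
n_rows <- 10000  # hypothetical stand-in for nrow(df)

set.seed(2023)
sample_row_included <- sample(c(TRUE, FALSE),
                              size = n_rows,
                              replace = TRUE,
                              prob = c(SAMPLE_SHARE, 1 - SAMPLE_SHARE))

mean(sample_row_included)   # realised share of rows in the sample
table(sample_row_included)  # counts of FALSE and TRUE
```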

To get the rows corresponding to our sample, we simply index the original data frame using our newly created vector of TRUEs and FALSEs:

df_sample <- df[sample_row_included, ]

To get the rows NOT in our sample, we negate the vector of TRUEs and FALSEs with the ! operator, so TRUE becomes FALSE and FALSE becomes TRUE:

df_negative_sample <- df[!sample_row_included, ]
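If the sample has to contain exactly the requested share of rows, one alternative is to combine the two approaches: compute the number of rows first, then sample row indices. The sketch below uses made-up data. Rounding the row count with round is an assumption, and other rounding rules are possible.

```r
SAMPLE_SHARE <- 0.7

set.seed(7)
toy <- data.frame(x = 1:50, y = rnorm(50))  # made-up data

# Turn the share into a fixed number of rows
n_sample <- round(SAMPLE_SHARE * nrow(toy))
sample_row_indices <- sample(nrow(toy), size = n_sample, replace = FALSE)

toy_sample <- toy[sample_row_indices, ]
# Negative indices drop the sampled rows and keep the rest
# (this assumes at least one row was sampled)
toy_negative_sample <- toy[-sample_row_indices, ]
```

Unlike the TRUE/FALSE approach, this always returns the same number of rows for a given data frame and share.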