anthe.sevenants

How to filter in dplyr using your own function as a conditional

2022-07-13

When you want to filter data from, for example, a data frame in R using dplyr, you usually do something like this:

library(dplyr)
df <- filter(df, filter_column > 50)

This command keeps only the rows of the data frame whose value in filter_column is greater than 50; all other rows are dropped.
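As a quick illustration, here is a minimal sketch on a made-up data frame. The id column and the numbers are hypothetical sample data, chosen only to show which rows survive:

```r
library(dplyr)

# Hypothetical sample data
df <- data.frame(
  id = 1:4,
  filter_column = c(10, 50, 75, 120)
)

# Keep only the rows where filter_column is strictly greater than 50
df <- filter(df, filter_column > 50)

df$filter_column   # 75 120 (the row with exactly 50 is dropped)
```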

If your filtering logic is a little more intricate, you might want to define it as a separate function. For example, the following function returns TRUE if the string passed to it contains "dog" or "cat" anywhere in it:

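is_woof_meow <- function(column_value) {
  # Substrings that count as a match
  allowed_values <- list("dog", "cat")

  # Stays FALSE unless one of the allowed values is found
  return_flag <- FALSE

  for (allowed_value in allowed_values) {
    # grepl returns TRUE if allowed_value occurs anywhere in column_value
    if (grepl(allowed_value, column_value)) {
      return_flag <- TRUE
    }
  }

  return(return_flag)
}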

You would think that filtering a data frame using this function would be as easy as calling that function inside filter, like this:

df <- filter(df, is_woof_meow(filter_column))

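If we run this code, however, it does not work as intended. R emits warnings such as the one below, in this case for the grepl function. From R 4.2.0 on, the if statement stops with an error instead, because its condition has a length greater than one: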

Warning messages:
1: In grepl(allowed_value, column_value) :
  argument 'pattern' has length > 1 and only the first element will be used

Indeed, dplyr does not pass each column value to our is_woof_meow function one by one. Instead, it passes the entire column at once as a vector, which is not what our code is designed for: the if statement inside the loop expects a single TRUE or FALSE, not one value per row.
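To see the problem outside of dplyr, here is a minimal base R sketch. The animals vector is made-up sample data. Because the outcome depends on the R version (a warning in older versions, an error from R 4.2.0 on), the call on the whole vector is wrapped in tryCatch:

```r
is_woof_meow <- function(column_value) {
  allowed_values <- list("dog", "cat")
  return_flag <- FALSE
  for (allowed_value in allowed_values) {
    if (grepl(allowed_value, column_value)) {
      return_flag <- TRUE
    }
  }
  return(return_flag)
}

# Hypothetical sample column
animals <- c("hot dog", "bird", "cat nap")

# One string at a time works as intended
is_woof_meow("hot dog")   # TRUE
is_woof_meow("bird")      # FALSE

# grepl returns one result per element of its second argument...
matches <- grepl("dog", animals)
length(matches)           # 3

# ...so inside the function, the if statement receives three values at once.
# We only record whether R reacted with a warning or an error.
outcome <- tryCatch(
  {
    is_woof_meow(animals)
    "ok"
  },
  warning = function(w) "warning",
  error = function(e) "error"
)
outcome
```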

Now, we could rewrite our function to handle whole vectors, but it is often easier to vectorise is_woof_meow as it stands. To do this, we call base R's Vectorize on it:

is_woof_meow_vec <- Vectorize(is_woof_meow)
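To get a feel for what the vectorised version returns, here is a small base R sketch on a made-up animals vector. Vectorize builds a wrapper around mapply, so the original function is still called once per element. This is a convenience rather than a speed-up:

```r
is_woof_meow <- function(column_value) {
  allowed_values <- list("dog", "cat")
  return_flag <- FALSE
  for (allowed_value in allowed_values) {
    if (grepl(allowed_value, column_value)) {
      return_flag <- TRUE
    }
  }
  return(return_flag)
}

is_woof_meow_vec <- Vectorize(is_woof_meow)

# Hypothetical sample column
animals <- c("hot dog", "bird", "cat nap")

# One TRUE/FALSE per element; unname() drops the names Vectorize attaches
flags <- is_woof_meow_vec(animals)
unname(flags)   # TRUE FALSE TRUE

# Calling the original function element by element gives the same values
same_as_sapply <- identical(
  unname(flags),
  unname(sapply(animals, is_woof_meow))
)
same_as_sapply
```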

Now, we can pass this vectorised function in the filter call, and our filtering will work as expected:

df <- filter(df, is_woof_meow_vec(filter_column))
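For comparison, here is a sketch of the other route mentioned above: rewriting the function so that it handles a whole column itself. The name is_woof_meow_native and the sample data frame are made up for this illustration. It relies on grepl already accepting a vector as its second argument, and it joins the allowed values into a single regular expression:

```r
library(dplyr)

# Hypothetical natively vectorised variant of is_woof_meow
is_woof_meow_native <- function(column_value) {
  allowed_values <- c("dog", "cat")
  # Join the allowed values into one regular expression: "dog|cat"
  pattern <- paste(allowed_values, collapse = "|")
  # grepl returns one TRUE/FALSE per element of column_value
  grepl(pattern, column_value)
}

# Hypothetical sample data
df <- data.frame(
  id = 1:4,
  filter_column = c("hot dog", "bird", "cat nap", "goldfish"),
  stringsAsFactors = FALSE
)

result <- filter(df, is_woof_meow_native(filter_column))
result$id   # 1 3
```

Joining values with "|" only works as intended if the allowed values contain no regular expression metacharacters. For the plain words used here, that is not a concern.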