When you want to filter data from, for example, a data frame in R using dplyr, you usually do something like this:
df <- filter(df, filter_column > 50)
This command keeps only the rows of the data frame whose filter_column value is greater than 50.
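To make this concrete, here is a minimal sketch on a small made-up data frame. The id column and the sample values are assumptions for illustration only; filter_column is the column name used above.

```r
library(dplyr)

# Hypothetical sample data, not from the original example
df <- data.frame(
  id = 1:4,
  filter_column = c(10, 75, 50, 120)
)

# Keep only the rows where filter_column is strictly greater than 50
filtered <- filter(df, filter_column > 50)

print(filtered$id)
# [1] 2 4
```

Note that the row with exactly 50 is dropped, because the comparison is strict.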
If your filtering logic is a little more intricate, you might want to define it as a separate function. For example, in the next function, we return TRUE if the string passed contains the word "dog" or "cat":
is_woof_meow <- function(column_value) {
  allowed_values <- c("dog", "cat")
  return_flag <- FALSE
  for (allowed_value in allowed_values) {
    if (grepl(allowed_value, column_value)) {
      return_flag <- TRUE
    }
  }
  return(return_flag)
}
You would think that filtering a data frame using this function would be as easy as calling it inside filter, like this:
df <- filter(df, is_woof_meow(filter_column))
If we run this code, however, we will get some warnings, in this case for the grepl function:
Warning messages:
1: In grepl(allowed_value, column_value) :
argument 'pattern' has length > 1 and only the first element will be used
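Indeed, filter does not pass each column value to our is_woof_meow function one by one. Instead, it passes the entire column at once as a vector, whereas our code expects a single string and produces a single TRUE or FALSE for the if condition. (From R 4.2.0 onwards, an if condition with more than one element is an error rather than a warning.)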
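The mismatch can be seen in base R without dplyr. The sketch below uses a made-up character vector (column_value here is a hypothetical stand-in for a whole column) and shows that grepl returns one logical per element, which is more than a single if condition can take.

```r
# Hypothetical stand-in for a whole data frame column
column_value <- c("hot dog", "bird", "catfish")

# grepl checks every element and returns one logical value per element
matches <- grepl("dog", column_value)

print(matches)
# expected: TRUE FALSE FALSE

# An if statement wants exactly one TRUE or FALSE, but we have three
print(length(matches))
# expected: 3
```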
Now, we could rewrite our code to account for this, but it is much easier to vectorise our is_woof_meow function. To do this, we simply call Vectorize on it:
is_woof_meow_vec <- Vectorize(is_woof_meow)
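To see what the wrapper does outside of filter, here is a small self-contained sketch. The animals vector is made up for illustration, and the function is repeated so the block runs on its own.

```r
# Scalar helper: TRUE if the string contains "dog" or "cat"
is_woof_meow <- function(column_value) {
  allowed_values <- c("dog", "cat")
  return_flag <- FALSE
  for (allowed_value in allowed_values) {
    if (grepl(allowed_value, column_value)) {
      return_flag <- TRUE
    }
  }
  return(return_flag)
}

# Vectorize wraps the helper so it is called once per element
is_woof_meow_vec <- Vectorize(is_woof_meow)

# Hypothetical sample values
animals <- c("hot dog", "bird", "catfish")

# For character input the result is named after the inputs;
# unname() strips the names to leave a plain logical vector
result <- unname(is_woof_meow_vec(animals))

print(result)
# expected: TRUE FALSE TRUE
```

The result has one element per input string, which is the shape filter needs.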
Now, we can pass this vectorised function to the filter call, and our filtering will work as expected:
df <- filter(df, is_woof_meow_vec(filter_column))
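For comparison, here is one possible sketch of the rewrite mentioned earlier, where the function handles a whole vector itself instead of being wrapped. The name is_woof_meow_native and the sample values are hypothetical. It relies on grepl already accepting a vector as its second argument, and on the regular expression "dog|cat" matching either word.

```r
# Hypothetical vector-aware variant: one grepl call covers the whole column
is_woof_meow_native <- function(column_values) {
  # "dog|cat" matches strings containing "dog" or "cat"
  grepl("dog|cat", column_values)
}

# Hypothetical sample values
animals <- c("hot dog", "bird", "catfish")

native_result <- is_woof_meow_native(animals)

print(native_result)
# expected: TRUE FALSE TRUE
```

Vectorize is a wrapper around mapply, so it still calls the original function once per element. On large columns a variant like this may be faster, while Vectorize remains the smaller change when the scalar function already exists.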