anthe.sevenants

How to remove all rows with column value frequency below n in R

2024-10-28

Imagine the following situation: you have a data frame in R, with observations as rows. One column identifies where these observations come from. You could, for example, have a list of mispronunciations, with a distinct column identifying the speaker of those mispronunciations:

mispronunciation_id speaker
1 Monica
2 Monica
3 Erica
4 Erica
5 Erica
6 Erica
7 Rita
8 Rita
9 Rita

If we want to build a model of mispronunciations per speaker, we ideally want the model to have enough data for each speaker. We could therefore decide to eliminate the observations of speakers for whom there are too few rows. Or, more generally: we want to remove all rows for which a column value appears fewer than 𝑛 times.

To do this in R, we first create a frequency table of all the values found in the column:

# Build a frequency table so we know the speaker counts
speaker_counts <- table(df$speaker)
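
To make the next step easier to follow, here is a small sketch of what that frequency table contains. It rebuilds the example data frame from the table above (using `rep()` only for brevity) and looks up one speaker's count by name:

```r
# Example data, matching the table at the top of this post
df <- data.frame(
  mispronunciation_id = 1:9,
  speaker = c(rep("Monica", 2), rep("Erica", 4), rep("Rita", 3))
)

# Build a frequency table so we know the speaker counts
speaker_counts <- table(df$speaker)

# The table behaves like a named vector: names are speakers, values are counts
print(speaker_counts)

# Look up the count of a single speaker by name
monica_count <- speaker_counts[["Monica"]]
print(monica_count)
```

Because the counts are stored under the speaker names, we can later filter the table on its values and use the surviving names to select rows.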

Then, we remove all rows whose speaker falls below a certain frequency. We do this by checking, for each row, whether the value for 'speaker' has a count in the frequency table that is equal to or greater than the minimum threshold:

# Set the minimum frequency and keep only rows whose speaker reaches it
MINIMUM_FREQUENCY <- 3
df <- subset(df,
             speaker %in%
               names(speaker_counts[speaker_counts >= MINIMUM_FREQUENCY]))

This leaves us with the following data frame (sorry Monica!):

mispronunciation_id speaker
3 Erica
4 Erica
5 Erica
6 Erica
7 Rita
8 Rita
9 Rita
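
One caveat that may matter if you go on to fit a model: if the speaker column is a factor rather than a character vector (an assumption; the example in this post uses plain strings), the removed speaker remains as an empty factor level after filtering. A minimal sketch of cleaning that up with base R's `droplevels()`:

```r
# Same example data, but with speaker stored as a factor (assumption)
df <- data.frame(
  mispronunciation_id = 1:9,
  speaker = factor(c(rep("Monica", 2), rep("Erica", 4), rep("Rita", 3)))
)

speaker_counts <- table(df$speaker)
MINIMUM_FREQUENCY <- 3
df <- subset(df,
             speaker %in%
               names(speaker_counts[speaker_counts >= MINIMUM_FREQUENCY]))

# "Monica" has no rows left, but is still listed as a level
levels_before <- levels(df$speaker)

# droplevels() removes factor levels that no longer occur in the data
df <- droplevels(df)
levels_after <- levels(df$speaker)

print(levels_before)
print(levels_after)
```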

Full snippet:

df <- data.frame(
  mispronunciation_id = 1:9,
  speaker = c(
    "Monica",
    "Monica",
    "Erica",
    "Erica",
    "Erica",
    "Erica",
    "Rita",
    "Rita",
    "Rita"
  )
)

# Build a frequency table so we know the speaker counts
speaker_counts <- table(df$speaker)

# Set the minimum frequency and keep only rows whose speaker reaches it
MINIMUM_FREQUENCY <- 3
df <- subset(df,
             speaker %in%
               names(speaker_counts[speaker_counts >= MINIMUM_FREQUENCY]))
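
If you need this for more than one column or data frame, the same two steps can be wrapped in a small helper. This is only a sketch: the function name `filter_by_min_frequency` and its arguments are made up for illustration, and it assumes the column holds character or factor values.

```r
# Hypothetical helper: keep only rows whose value in `column`
# occurs at least `min_frequency` times in `data`
filter_by_min_frequency <- function(data, column, min_frequency) {
  values <- data[[column]]

  # Frequency table of the column values (NA values are not counted)
  counts <- table(values)

  # Values that are frequent enough to keep
  frequent_values <- names(counts[counts >= min_frequency])

  # Keep matching rows; drop = FALSE keeps the result a data frame
  data[values %in% frequent_values, , drop = FALSE]
}

# Usage with the example data from this post
df <- data.frame(
  mispronunciation_id = 1:9,
  speaker = c(rep("Monica", 2), rep("Erica", 4), rep("Rita", 3))
)
filtered <- filter_by_min_frequency(df, "speaker", 3)
print(filtered)
```

The column is passed as a string here so that the helper does not depend on the non-standard evaluation that `subset()` uses, which tends to be easier to reason about inside functions.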

Image credit: pngimg.com