Elegant way to drop rare factor levels from data frame


I want to subset a dataframe by factor. I only want to retain factor levels above a certain frequency.

df <- data.frame(factor = c(rep("a",5),rep("b",5),rep("c",2)), variable = rnorm(12), stringsAsFactors = TRUE)

This code creates the following data frame:

   factor    variable
1       a -1.55902013
2       a  0.22355431
3       a -1.52195456
4       a -0.32842689
5       a  0.85650212
6       b  0.00962240
7       b -0.06621508
8       b -1.41347823
9       b  0.08969098
10      b  1.31565582
11      c -1.26141417
12      c -0.33364069

I want to drop the factor levels that occur fewer than 5 times. I wrote a for-loop, and it works:

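counts <- table(df$factor)
df.new <- df
for (i in seq_along(counts)) {
  if (counts[i] < 5) {
    # filter df.new (not df) so that every rare level is removed, not only the last one found
    df.new <- df.new[df.new$factor != names(counts)[i], ]
  }
}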

Is there a quicker and prettier solution?

What about

df.new <- df[!(as.numeric(df$factor) %in% which(table(df$factor)<5)),]
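One caveat worth checking: subsetting rows does not, by itself, remove the now-empty level from the factor. The sketch below rebuilds the question's toy data (with `stringsAsFactors = TRUE` added as an assumption, so the column is a factor on R 4.0 and later), applies the one-liner above, and then calls base R's `droplevels()` to discard levels that no longer occur.

```r
# Toy data from the question; stringsAsFactors = TRUE is an added assumption
# so that `factor` is a factor column on R >= 4.0 as well.
df <- data.frame(factor = c(rep("a", 5), rep("b", 5), rep("c", 2)),
                 variable = rnorm(12), stringsAsFactors = TRUE)

# Keep only rows whose level occurs at least 5 times
df.new <- df[!(as.numeric(df$factor) %in% which(table(df$factor) < 5)), ]
levels(df.new$factor)    # level "c" is still listed, although it has no rows

# droplevels() removes levels that no longer occur in the data
df.new <- droplevels(df.new)
levels(df.new$factor)
```

Whether the extra `droplevels()` call matters depends on what comes next; plotting and modelling functions that iterate over `levels()` are the usual place where an empty level shows up.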


require(dplyr)

df %>% group_by(factor) %>% filter(n() >= 5)
#factor   variable
#1       a  2.0769363
#2       a  0.6187513
#3       a  0.2426108
#4       a -0.4279296
#5       a  0.2270024
#6       b -0.6839748
#7       b -0.3285610
#8       b  0.2625743
#9       b -0.9532957
#10      b  1.4526317


library(data.table)
setDT(df)[, variable[.N >= 5], by = factor]

##    factor         V1
## 1:      a -0.8204684
## 2:      a  0.4874291
## 3:      a  0.7383247
## 4:      a  0.5757814
## 5:      a -0.3053884
## 6:      b  1.5117812
## 7:      b  0.3898432
## 8:      b -0.6212406
## 9:      b -2.2146999
## 10:     b  1.1249309


Maybe join with a filtered count of the factors:

library(dplyr)
# %.% was removed from dplyr; %>% is its replacement
common.factors <- df %>% group_by(factor) %>% tally() %>% filter(n >= 5)
df.1 <- semi_join(df, common.factors, by = "factor")

BiXiC, Elegant way to drop rare factor levels from data frame, Jun 17 '14.

This worked for me:

df <- df[df$factor %in% names(table(df$factor))[table(df$factor) >= 5], ]
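If the same filter is needed for several columns or thresholds, the `table()` approach can be wrapped in a small function. This is only a sketch: `drop_rare_levels`, `col`, and `min_n` are hypothetical names that do not appear in the answers above, and the final `droplevels()` call is an addition that also discards the emptied levels.

```r
# Hypothetical helper (not from the original answers): keep only the rows whose
# value in column `col` occurs at least `min_n` times, then drop emptied levels.
drop_rare_levels <- function(data, col, min_n) {
  counts <- table(data[[col]])                 # frequency of each level
  keep <- names(counts)[counts >= min_n]       # levels that are frequent enough
  droplevels(data[data[[col]] %in% keep, , drop = FALSE])
}

# Usage on the question's toy data (stringsAsFactors = TRUE is an assumption
# so that the column is a factor on R >= 4.0)
df <- data.frame(factor = c(rep("a", 5), rep("b", 5), rep("c", 2)),
                 variable = rnorm(12), stringsAsFactors = TRUE)
res <- drop_rare_levels(df, "factor", 5)
table(res$factor)
```

Returning a new data frame rather than overwriting `df` keeps the original available for comparison.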


Comments
  • +1 and Wow, data.table continues to impress. My only criticism would be it's difficult to read.
  • @Hugh, how is it more difficult than dplyr :)?
  • @DavidArenburg check out the dplyr solution from @beginneR. I also find the dplyr grammar much easier to read than data.table.
  • You probably want a semi join