Filter each column of a data.frame based on a specific value

Consider the following data frame:

df <- data.frame(replicate(5,sample(1:10,10,rep=TRUE)))

#   X1 X2 X3 X4 X5
#1   7  9  8  4 10
#2   2  4  9  4  9
#3   2  7  8  8  6
#4   8  9  6  6  4
#5   5  2  1  4  6
#6   8  2  2  1  7
#7   3  8  6  1  6
#8   3  8  5  9  8
#9   6  2  3 10  7
#10  2  7  4  2  9

Using dplyr, how can I filter on each column (without explicitly naming them) for all values greater than or equal to 2?

Something that would mimic a hypothetical filter_each(funs(. >= 2)).

Right now I'm doing:

df %>% filter(X1 >= 2, X2 >= 2, X3 >= 2, X4 >= 2, X5 >= 2)

Which is equivalent to:

df %>% filter(!rowSums(. < 2))
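
To see why these are equivalent: rowSums(. < 2) counts, per row, how many values fall below 2, and negating that count maps a zero count to TRUE (keep the row) and any positive count to FALSE. A minimal illustration (the intermediate object m is only for exposition):

m <- df < 2   # logical matrix: TRUE wherever a value is below 2
rowSums(m)    # per-row count of values below 2
!rowSums(m)   # TRUE only where that count is 0, i.e. the rows to keep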

Note: let's say I wanted to filter only on the first 4 columns. I would do:

df %>% filter(X1 >= 2, X2 >= 2, X3 >= 2, X4 >= 2) 

or

df %>% filter(!rowSums(.[-5] < 2))
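
Here .[-5] drops the fifth column by position; a name-based equivalent (a sketch on my part, not from the original question) would be:

df %>% filter(!rowSums(.[names(.) != "X5"] < 2))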

Would there be a more efficient alternative?

Edit: sub-question

How to specify a column name and mimic a hypothetical filter_each(funs(. >= 2), -X5)?

Benchmark sub-question

Since I have to run this on a large dataset, I benchmarked the suggestions.

library(dplyr)
library(microbenchmark)

df <- data.frame(replicate(5,sample(1:10,10e6,rep=TRUE)))

mbm <- microbenchmark(
  Marat = df %>% filter(!rowSums(.[,!colnames(.) %in% "X5", drop = FALSE] < 2)),
  Richard = filter_(df, .dots = lapply(names(df)[names(df) != "X5"], function(x, y) { call(">=", as.name(x), y) }, 2)),
  Docendo = df %>% slice(which(!rowSums(select(., -matches("X5")) < 2L))),
  times = 50
)

Here are the results:

#Unit: milliseconds
#    expr       min        lq      mean    median       uq      max neval
#   Marat 1209.1235 1320.3233 1358.7994 1362.0590 1390.342 1448.458    50
# Richard 1151.7691 1196.3060 1222.9900 1216.3936 1256.191 1266.669    50
# Docendo  874.0247  933.1399  983.5435  985.3697 1026.901 1053.407    50


Here's another option with slice(), which can be used similarly to filter() in this case. The main difference is that you supply an integer vector to slice(), whereas filter() takes a logical vector.

df %>% slice(which(!rowSums(select(., -matches("X5")) < 2L)))

What I like about this approach is that, because we use select() inside rowSums(), you can make use of all the special functions that select() supplies, matches() for example.
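
For instance, the same pattern can exclude several columns at once by handing matches() a regular expression; a quick sketch of a variation (not part of the benchmark below):

df %>% slice(which(!rowSums(select(., -matches("X3|X5")) < 2L)))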


Let's see how it compares to the other answers:

df <- data.frame(replicate(5,sample(1:10,10e6,rep=TRUE)))

mbm <- microbenchmark(
    Marat = df %>% filter(!rowSums(.[,!colnames(.) %in% "X5", drop = FALSE] < 2)),
    Richard = filter_(df, .dots = lapply(names(df)[names(df) != "X5"], function(x, y) { call(">=", as.name(x), y) }, 2)),
    dd_slice = df %>% slice(which(!rowSums(select(., -matches("X5")) < 2L))),
    times = 50L,
    unit = "relative"
)

#Unit: relative
#     expr      min       lq   median       uq      max neval
#    Marat 1.304216 1.290695 1.290127 1.288473 1.290609    50
#  Richard 1.139796 1.146942 1.124295 1.159715 1.160689    50
# dd_slice 1.000000 1.000000 1.000000 1.000000 1.000000    50

Edit note: updated with a more reliable benchmark of 50 repetitions (times = 50L).


Following a comment that base R would be as fast as the slice approach (without specifying which base R approach was meant exactly), I decided to update my answer with a comparison to base R, using almost the same approach as in my answer. For base R I used:

base = df[!rowSums(df[-5L] < 2L), ],
base_which = df[which(!rowSums(df[-5L] < 2L)), ]
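
As a quick sanity check (my addition, not in the original answer), the two expressions should select exactly the same rows here, since which() merely converts the logical index into an integer one and df contains no NA values:

identical(df[!rowSums(df[-5L] < 2L), ],
          df[which(!rowSums(df[-5L] < 2L)), ])
# [1] TRUE  (no NAs in df, so logical and integer indexing agree)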

Benchmark:

df <- data.frame(replicate(5,sample(1:10,10e6,rep=TRUE)))

mbm <- microbenchmark(
  Marat = df %>% filter(!rowSums(.[,!colnames(.) %in% "X5", drop = FALSE] < 2)),
  Richard = filter_(df, .dots = lapply(names(df)[names(df) != "X5"], function(x, y) { call(">=", as.name(x), y) }, 2)),
  dd_slice = df %>% slice(which(!rowSums(select(., -matches("X5")) < 2L))),
  base = df[!rowSums(df[-5L] < 2L), ],
  base_which = df[which(!rowSums(df[-5L] < 2L)), ],
  times = 50L,
  unit = "relative"
)

#Unit: relative
#       expr      min       lq   median       uq      max neval
#      Marat 1.265692 1.279057 1.298513 1.279167 1.203794    50
#    Richard 1.124045 1.160075 1.163240 1.169573 1.076267    50
#   dd_slice 1.000000 1.000000 1.000000 1.000000 1.000000    50
#       base 2.784058 2.769062 2.710305 2.669699 2.576825    50
# base_which 1.458339 1.477679 1.451617 1.419686 1.412090    50

So neither base R approach offers better, or even comparable, performance here; note, though, that the which() variant (integer subsetting) is markedly faster than plain logical subsetting.

Edit note #2: added benchmark with base R options.


Here's an idea that makes it fairly simple to choose the names. You can set up a list of calls to send to the .dots argument of filter_(). First, a function that creates an unevaluated call:

Call <- function(x, value, fun = ">=") call(fun, as.name(x), value)

Now we use filter_(), passing a list of calls into the .dots argument using lapply(), choosing any name and value you want.

nm <- names(df) != "X5"
filter_(df, .dots = lapply(names(df)[nm], Call, 2L))
#   X1 X2 X3 X4 X5
# 1  6  5  7  3  1
# 2  8 10  3  6  5
# 3  5  7 10  2  5
# 4  3  4  2  9  9
# 5  8  3  5  6  2
# 6  9  3  4 10  9
# 7  2  9  7  9  8

You can have a look at the unevaluated calls created by Call(), here for X4 and X5, with:

lapply(names(df)[4:5], Call, 2L)
# [[1]]
# X4 >= 2L
#
# [[2]]
# X5 >= 2L

So if you adjust the names() in the X argument of lapply(), you should be fine.
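
For example (a sketch of my own, not from the original answer), restricting the filter to just the first two columns is a matter of subsetting names() differently:

filter_(df, .dots = lapply(names(df)[1:2], Call, 2L))

Note that filter_() and the .dots interface have since been deprecated (dplyr 0.7.0 replaced them with tidy evaluation); this answer predates that release.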


How to specify a column name and mimic a hypothetical filter_each(funs(. >= 2), -X5)?

It might not be the most elegant solution, but it gets the job done:

df %>% filter(!rowSums(.[, !colnames(.) %in% 'X5', drop = FALSE] < 2))

In case of several excluded columns (e.g. X3 and X5), one can use:

df %>% filter(!rowSums(.[, !colnames(.) %in% c('X3', 'X5'), drop = FALSE] < 2))
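
An equivalent formulation (my sketch, not part of the original answer) that may read better with many exclusions computes the kept columns first with setdiff():

keep <- setdiff(colnames(df), c('X3', 'X5'))   # columns that must pass the test
df %>% filter(!rowSums(.[, keep, drop = FALSE] < 2))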


If you wanted to filter only on the first four columns, as in:

df %>% filter(X1 >= 2, X2 >= 2, X3 >= 2, X4 >= 2) 

...try this:

df %>% 
  filter_at(vars(X1:X4),      #<Select columns to filter
            all_vars(. >= 2)) #<Scope with all_vars (or any_vars)

An alternative is to exclude the columns you don't want to filter on, as:

df %>% 
  filter_at(vars(-X5),        #<Exclude column X5
            all_vars(. >= 2))
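
For readers on dplyr 1.0.0 or later: filter_at() and the all_vars()/any_vars() helpers are superseded there, and if_all()/if_any() cover the same cases. A sketch of the equivalents (my addition, not part of the original answer):

df %>% filter(if_all(X1:X4, ~ .x >= 2))   #<First four columns
df %>% filter(if_all(-X5, ~ .x >= 2))     #<Every column except X5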
