R dplyr filtering data with values greater than +N and lesser than -N : abs() function?

r dplyr filter multiple values
r filter data frame multiple conditions
dplyr filter contains
dplyr filter not in
dplyr filter(between)
r select rows containing string
dplyr conditional filter
filter columns in r

I am using the dplyr package in R for filtering my data of gene expressions. I have calculated fold changes and would like to filter the genes (rows) in which at least one sample (columns) has a value greater than +0.584963 OR less than -0.584963.An example data:

       X SAMPLE_1_FC SAMPLE_2_FC SAMPLE_3_FC SAMPLE_4_FC SAMPLE_5_FC
GENE_1      0.6780      0.4050      0.8870      0.3300      0.2230
GENE_2      0.2340     -0.6670      0.0020      0.1240      0.3560
GENE_3      0.0170      0.1560      0.1120      0.0080     -0.1230
GENE_4     -0.0944     -0.1372     -0.1800     -0.2228     -0.2656
GENE_5     -0.8080     -0.7800     -0.5560      0.0340      0.4450
GENE_6      0.2091      0.1106      0.0121     -0.0864     -0.1849
GENE_7      0.5980      0.7680      0.9970      0.4670     -0.7760

I am currently using the following script

det.cols<- colnames(my.data)[which(grepl("fc",tolower(colnames(my.data))))]
filt <- gsub(","," | ",toString(paste("`",det.cols,"`",">abs(0.584963)", sep = "")))
my.datasub<- my.data %>% filter_(filt)

but this returns only the genes greater than +0.584963 and not the negative ones. In the case of the example, what I want is a subsetted list with Genes 1, 2, 5 and 7. But instead it gives me only Genes 1 and 7. How can I change this?

I am expecting the answer to be in this format:

 X SAMPLE_1_FC SAMPLE_2_FC SAMPLE_3_FC SAMPLE_4_FC SAMPLE_5_FC
GENE_1      0.6780      0.4050      0.8870      0.3300      0.2230
GENE_2      0.2340     -0.6670      0.0020      0.1240      0.3560
GENE_5     -0.8080     -0.7800     -0.5560      0.0340      0.4450
GENE_7      0.5980      0.7680      0.9970      0.4670     -0.7760

Thanks.

Long story short, you had the abs() in the wrong place in your code.

I fixed it here:

det.cols<- colnames(my.data)[which(grepl("fc",tolower(colnames(my.data))))]
filt <- gsub(","," | ",toString(paste("abs(`",det.cols,"`)",">0.584963", sep = "")))
my.datasub<- my.data %>% filter_(filt)

For further flexibility, @ha_pu provided a great filter_at solution building off my previous solution (before I identified the error in your code).

filter rows when all columns greater than a value, We can create a logical matrix my comparing the entire data frame with 2 and then do rowSums over it and select only those rows whose value is equal to number of columns in df A dplyr approach using filter_all and all_vars @ coldspeed In R, apply is actually very slow most of the times, especially� That's not the only way we can use dplyr to filter our data frame, however. We can use a number of different relational operators to filter in R. Relational operators are used to compare values. In R generally (and in dplyr specifically), those are: == (Equal to)!= (Not equal to) < (Less than) <= (Less than or equal to) > (Greater than)

Here's a solution that is flexible to the number of samples and data rows. It involves transforming the data into long format and then filters for the gene and specific sample. I tested it on 50k genes and 35 samples, and it ran in < 1 second.

library(tidyverse)

# set up sample data with 50000 rows
mydata <- data.frame(stringsAsFactors=FALSE,
                     X = c("GENE_1", "GENE_2", "GENE_3", "GENE_4", "GENE_5", "GENE_6", "GENE_7", 1:50000),
                     SAMPLE_1_FC = c(0.678, 0.234, 0.017, -0.0944, -0.808, 0.2091, 0.598, rnorm(50000, 0, 1)),
                     SAMPLE_2_FC = c(0.405, -0.667, 0.156, -0.1372, -0.78, 0.1106, 0.768, rnorm(50000, 0, 1)),
                     SAMPLE_3_FC = c(0.887, 0.002, 0.112, -0.18, -0.556, 0.0121, 0.997, rnorm(50000, 0, 1)),
                     SAMPLE_4_FC = c(0.33, 0.124, 0.008, -0.2228, 0.034, -0.0864, 0.467, rnorm(50000, 0, 1)),
                     SAMPLE_5_FC = c(0.223, 0.356, -0.123, -0.2656, 0.445, -0.1849, -0.776, rnorm(50000, 0, 1)))

# duplicate 30 more columns
mydata2 <- bind_cols(mydata, mydata[2:6], mydata[2:6], mydata[2:6], mydata[2:6], mydata[2:6], mydata[2:6])

(mydata3 <- mydata2 %>% gather(key = "sample_num", value = "fc", 2:length(mydata)) %>%
  filter(fc > 0.584963 | fc < -0.584963) %>%
  select(X) %>%
  arrange(desc(X)) %>%
  unique() %>%
  head())
#>         X
#> 1  GENE_7
#> 5  GENE_5
#> 7  GENE_2
#> 8  GENE_1
#> 10   9999
#> 14   9998

Created on 2019-03-01 by the reprex package (v0.2.1)

Some tricks on dplyr::filter – Sebastian Sauer Stats Blog, The R package dplyr has some attractive features; some say, this packkage revolutionized their workflow. At any rate, I like it Note that always a data frame tibble is returned. equality != inequality; < or > greater than/ smaller than; <= less or equal To get values ending with, say, “L”, we use $ in Regex:. Filter by single value in R; Filter by multiple values in R; Part 1. Installing dplyr package. As R doesn’t have this command built in, we will need to install an additional package in order to filter a dataset by value in R. You can learn more about dplyr package here.

Using filter_at from dplyr might be an even more flexible approach...

# set up sample data with 50000 rows [as proposed by Arthur Yip above]
mydata <- tibble(X = c("GENE_1", "GENE_2", "GENE_3", "GENE_4", "GENE_5", "GENE_6", "GENE_7", 1:50000),
                     SAMPLE_1_FC = c(0.678, 0.234, 0.017, -0.0944, -0.808, 0.2091, 0.598, rnorm(50000, 0, 1)),
                     SAMPLE_2_FC = c(0.405, -0.667, 0.156, -0.1372, -0.78, 0.1106, 0.768, rnorm(50000, 0, 1)),
                     SAMPLE_3_FC = c(0.887, 0.002, 0.112, -0.18, -0.556, 0.0121, 0.997, rnorm(50000, 0, 1)),
                     SAMPLE_4_FC = c(0.33, 0.124, 0.008, -0.2228, 0.034, -0.0864, 0.467, rnorm(50000, 0, 1)),
                     SAMPLE_5_FC = c(0.223, 0.356, -0.123, -0.2656, 0.445, -0.1849, -0.776, rnorm(50000, 0, 1)))

# duplicate 30 more columns [as proposed by Arthur Yip above]
mydata2 <- bind_cols(mydata, mydata[2:6], mydata[2:6], mydata[2:6], mydata[2:6], mydata[2:6], mydata[2:6])

mydata2 %>%
  filter_at(vars(contains("fc")), .vars_predicate =  any_vars(abs(.) > 0.584963))

In the vars() you can define your list of variables to which you want to apply the filtering. Following .vars_predicate you can define the filter criterion (any_vars is equal to |, all_vars is equal to &).

Subset rows using column values — filter • dplyr, Filtering data is one of the very basic operation when you work with data. Of course, dplyr has 'filter()' function to do such filtering, but there is even more. you get used to it especially if you're coming from outside of R world, but you When you look closer you'd notice that there are some NA values in� That’s not the only way we can use dplyr to filter our data frame, however. We can use a number of different relational operators to filter in R. Relational operators are used to compare values. In R generally (and in dplyr specifically), those are: == (Equal to)!= (Not equal to) (Less than) = (Less than or equal to) > (Greater than)

Filtering Data with dplyr. Filtering data is one of the very basic…, View source: R/filter.R data-masking > Expressions that return a logical value, and are defined in terms of the variables in .data . The following filters rows where `mass` is greater than the # global average: starwars %>% filter(mass >� In the ungrouped version, filter () compares the value of mass in each row to the global average (taken over the whole data set), keeping only the rows with mass greater than this global average.

filter: Subset rows using column values in dplyr: A Grammar of Data , In this tutorial, you will learn the following R functions from the dplyr package: Filter rows within a selection of variables; Remove missing values; Select random for greater than; <=: for less than or equal to; >=: for greater than or equal to� Details. Note that dplyr is not yet smart enough to optimise filtering optimisation on grouped datasets that don't need grouped calculations. For this reason, filtering is often considerably faster on ungroup()ed data.

Subset Data Frame Rows in R, Learn to to filter by value in R using filter() command from dplyr package. not equal to; > : greater than; < : less than; >= : greater or equal to� Since we need to use logic to specify how to filter our data, let’s briefly review how logic works in R. In R, we can make logic statements that are evaluated as true or false. Remember that R has special values for true and false: True or False. Ok. Here’s a quick example of a logic statement that you can type into R: 10 > 1 Is 10 greater

Comments
  • please provide some sample data for a fully reproducible example. thanks!
  • Thank you Arthur Yip. I have edited the question. Hope it helps.
  • Oh I finally spotted the minor error in your code. I fixed it in the answer below.
  • Looks like the custom filter code came from stackoverflow.com/questions/43981187/… - would be helpful next time to link to source material
  • This gives the result but the format of the result would still be hard to work with. I am expecting a result with a table having the same columns but just the genes narrowed down. I did this for narrowing down genes with pval<0.05 but this was just one requirement so it was fine, unlike here for fc where I have two requirements <-0.584963 and >0.584963.
  • right now the code produces a column with all the gene names. What do you need/expect?
  • I have just re-edited the question with the expected format as I was unsure on how to put it in the comment
  • Thank you ha_pu for the answer and sorry I was not clear on the question. I have edited my question now and added an example data as well. I have multiple columns and want to extract a row in which at least one of the columns has >0.584963 or <-0,584963.
  • Thanks for the clarifications, I have adapted my response accordingly.
  • Sayan wants the columns with "fc" so we can say filter_at(vars(contains("fc")), ... which is basically a much simpler version of the grepl and custom filt that was in the opening post.
  • Depeding on your needs, you can also use matches() in the vars() expression, allowing to use regex, which makes your variable selection even more flexible...