Filtering by numerical values in R where the dataset is text-based

I'm trying to keep the rows that contain a value greater than 5, but the values in my column are stored as text, like so:

View(vardata)

C1    Variation
DNA   GT=00.15,TT=08.11,TA=00.05,GA=00.00
RNA   GAU=00.00,GGU=00.90
DNA   TGGTTA=00.45,TTGATAA=21.8
DNA   ATGG=11.5
RNA   GUG=00.05,UGG=00.00
DNA   ATA=00.15,ATG=00.95

I have no idea how to make R interpret the values embedded in that text as numbers so that I can filter on them.

I don't need to know which letter code has a value greater than a given number, so I've been trying to filter the values with:

selectedvalues = subset(vardata, c(Variation) > 5)

This should keep only the rows where the Variation column contains a numeric value greater than 5, giving something like:

View(selectedvalues)

C1    Variation
DNA   GT=00.15,TT=08.11,TA=00.05,GA=00.00
DNA   TGGTTA=00.45,TTGATAA=21.8
DNA   ATGG=11.5

These are the only rows in which a value greater than 5 appears.

But, as I said, I can't find a way to make R read these values as numbers rather than as text or characters.

Here is an option using str_extract_all from stringr:

library(stringr)
df1[sapply(str_extract_all(df1$Variation, "[0-9]+\\.[0-9]+"), 
         function(x) any(as.numeric(x) > 5)), ]
#   C1                           Variation
#1 DNA GT=00.15,TT=08.11,TA=00.05,GA=00.00
#3 DNA           TGGTTA=00.45,TTGATAA=21.8
#4 DNA                           ATGG=11.5
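
To see what the pattern picks up, here is a minimal base R sketch (no stringr needed) that applies the same regular expression to a single string. The variable names and the second test string are made up for illustration; note that a pattern requiring a decimal point would skip a whole number such as 7.

```r
# One value from the Variation column in the question
x <- "TGGTTA=00.45,TTGATAA=21.8"

# gregexpr() finds every match; regmatches() extracts the matched text.
# The result is a list with one character vector per input string.
matched <- regmatches(x, gregexpr("[0-9]+\\.[0-9]+", x))[[1]]
matched

# Convert the matched text to numbers and test against the threshold
values <- as.numeric(matched)
any(values > 5)

# Hypothetical input with a whole number: "7" has no decimal point,
# so this pattern only picks up "08.11"
y <- "AA=7,TT=08.11"
matched_y <- regmatches(y, gregexpr("[0-9]+\\.[0-9]+", y))[[1]]
matched_y
```

If whole numbers can occur in the real data, a looser pattern such as "[0-9]+\\.?[0-9]*" may be a safer choice.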

Here is a base R approach using sapply along with strsplit:

keep <- sapply(vardata$Variation, function(x) {
    sum(sapply(strsplit(x, ",\\s*")[[1]], function(y) {
        as.numeric(strsplit(y, "=")[[1]][2]) > 5
    })) > 0
})
vardata[keep, ]

C1                           Variation
1 DNA GT=00.15,TT=08.11,TA=00.05,GA=00.00
3 DNA           TGGTTA=00.45,TTGATAA=21.8
4 DNA                           ATGG=11.5

The idea behind this approach is to split first by comma:

[TGGTTA=00.45, TTGATAA=21.8]

Then, we split each of the two terms above a second time on =, to extract the actual number. If a given row has even a single number greater than 5, then we retain it.
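
The two splitting steps can be traced on a single row like this. It is only a sketch of the idea described above, with variable names chosen for illustration:

```r
row <- "TGGTTA=00.45,TTGATAA=21.8"

# Step 1: split on commas (allowing optional whitespace after each comma)
pairs <- strsplit(row, ",\\s*")[[1]]
pairs

# Step 2: split each "name=value" pair on "=" and keep the second piece
values <- as.numeric(sapply(strsplit(pairs, "="), `[`, 2))
values

# Step 3: keep the row if at least one value exceeds the threshold
keep_row <- any(values > 5)
keep_row
```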

Here is another option, using dplyr and stringr:

library(dplyr)
library(stringr)
# \\d* 0 or more digits, \\.? 0 or 1 dot, \\d+ 1 or more digits
df %>%
  mutate(digits = str_match_all(Variation, '\\d*\\.?\\d+'),
         flag = sapply(digits, function(x) sum(as.numeric(x) > 5))) %>%
  filter(flag > 0)

     C1                           Variation                     digits flag
  1 DNA GT=00.15,TT=08.11,TA=00.05,GA=00.00 00.15, 08.11, 00.05, 00.00    1
  2 DNA           TGGTTA=00.45,TTGATAA=21.8                00.45, 21.8    1
  3 DNA                           ATGG=11.5                       11.5    1

Data

df <- read.table(text = "
C1    Variation
DNA   'GT=00.15,TT=08.11,TA=00.05,GA=00.00'
RNA   'GAU=00.00,GGU=00.90'
DNA   'TGGTTA=00.45,TTGATAA=21.8'
DNA   'ATGG=11.5'
RNA   'GUG=00.05,UGG=00.00'
DNA   'ATA=00.15,ATG=00.95'
", header=TRUE)
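
Since the question started from subset(), one way to keep that call shape is to wrap the extraction in a small helper. The sketch below uses base R only; has_value_above is a hypothetical name introduced here, not a function from any package, and the looser pattern "[0-9]+\\.?[0-9]*" is an assumption so that whole numbers would also be picked up.

```r
# Sample data from the question, rebuilt as a data frame
vardata <- data.frame(
  C1 = c("DNA", "RNA", "DNA", "DNA", "RNA", "DNA"),
  Variation = c("GT=00.15,TT=08.11,TA=00.05,GA=00.00",
                "GAU=00.00,GGU=00.90",
                "TGGTTA=00.45,TTGATAA=21.8",
                "ATGG=11.5",
                "GUG=00.05,UGG=00.00",
                "ATA=00.15,ATG=00.95"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: returns TRUE for each string that contains
# at least one number above `threshold`
has_value_above <- function(x, threshold) {
  x <- as.character(x)  # guard against factor columns
  matches <- regmatches(x, gregexpr("[0-9]+\\.?[0-9]*", x))
  vapply(matches, function(m) any(as.numeric(m) > threshold), logical(1))
}

selectedvalues <- subset(vardata, has_value_above(Variation, 5))
selectedvalues
```

With the sample data this keeps rows 1, 3 and 4, matching the output the question asks for. The threshold is a parameter, so the same helper can be reused with a different cutoff.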

Comments
  • OMG this worked like a charm, though I don't understand the "[0-9]+\\.[0-9]+" part
  • @CamilaEigner Thanks. [0-9]+ means one or more digits (0-9), followed by a dot (\\.), followed by one or more digits ([0-9]+)
  • Perhaps you meant stringr?
  • You're a wizard!