Subset a dataframe into two dataframes by values in two columns of another dataframe

I have two dataframes. df1 looks like this (showing only the column I am interested in):

position
2
6
12
18
25
31

and df2 looks like:

start   end
2       17
24      29

I want to keep only the positions in df1 that fall between the start and end coordinates of any row of df2 (inclusive, i.e. position >= start and position <= end), so that df1 looks like this after filtering:

position
2
6
12
25

Then I want to keep the filtered out "leftover" values of df1 in another dataframe, let's call it df4.

df4 would look like:

position
18
31

I could do this the Perl way with a for loop (I am coming from Perl and currently learning R), but I am fairly sure filter or some other combination of dplyr or base R functions can achieve this.

Any help would be appreciated!

EDIT: I added the df4 calculation because my question was marked as a duplicate, and this part is not covered in the other similar threads. I am interested in doing this to make my code faster!


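We can full_join these two data frames and then filter for the rows where position lies within the start and end columns. The Flag column exists only to give the join a common key, so that every position is paired with every interval. Finally, we can use distinct to remove duplicated rows, which would appear if a position fell inside more than one interval.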

library(dplyr)

df3 <- df1 %>%
  mutate(Flag = 1) %>%
  full_join(df2 %>% mutate(Flag = 1), by = "Flag") %>%
  filter(position >= start, position <= end) %>%
  distinct(position)
df3
#   position
# 1        3
# 2        6
# 3       12
# 4       25

DATA

df1 <- read.table(text = "position
3
6
12
18
25
31", header = TRUE)

df2 <- read.table(text = "start   end
2       17
24      29", header = TRUE)
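
The pipeline above yields the filtered positions but not the leftover frame (df4) that the question also asks for. A minimal sketch, assuming the same df1 and df2 as in the DATA section: once df3 exists, dplyr's anti_join() keeps the rows of df1 that have no match in df3. Newer dplyr versions may warn about a many-to-many join on the Flag key; the result should not be affected.

```r
library(dplyr)

# Same data as in the DATA section
df1 <- data.frame(position = c(3L, 6L, 12L, 18L, 25L, 31L))
df2 <- data.frame(start = c(2L, 24L), end = c(17L, 29L))

# Positions inside at least one interval (same pipeline as above)
df3 <- df1 %>%
  mutate(Flag = 1) %>%
  full_join(df2 %>% mutate(Flag = 1), by = "Flag") %>%
  filter(position >= start, position <= end) %>%
  distinct(position)

# Leftover positions: rows of df1 that have no match in df3
df4 <- anti_join(df1, df3, by = "position")
df4$position
# 18 31
```

Because anti_join() works from the already filtered df3, the interval comparison runs only once.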


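Single-line base R solution:

df1[df1$position %in% unlist(apply(df2, 1, function(x) x["start"]:x["end"])), , drop = FALSE]

The apply call expands each start:end range into its integers, so the combined vector holds every position covered by an interval. drop = FALSE keeps the result a data frame even when df1 has a single column.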
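
One caveat with enumerating the ranges: the lookup only matches whole numbers, and the vector grows with the width of the intervals. The sketch below illustrates this under the assumption that some positions could be fractional (the pos vector is made up for illustration and is not part of the question's data). It compares the lookup with a direct bounds check.

```r
df2 <- data.frame(start = c(2L, 24L), end = c(17L, 29L))

# Every integer covered by an interval: 2..17 and 24..29
covered <- unlist(Map(seq, df2$start, df2$end))

# Hypothetical positions, one of them fractional
pos <- c(3, 12.5, 18)

# Lookup against the enumerated integers misses 12.5
by_lookup <- pos %in% covered
by_lookup
# TRUE FALSE FALSE

# Comparing against the bounds directly handles it
by_compare <- vapply(pos, function(p) any(p >= df2$start & p <= df2$end),
                     logical(1))
by_compare
# TRUE TRUE FALSE
```

For integer coordinates with modest interval widths, as in the question, both give the same answer.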


Here is a base R option

do.call(rbind, Map(function(i, j)
  df1[df1$position >= i & df1$position <= j, , drop = FALSE],
  df2$start, df2$end))
#    position
#1        3
#2        6
#3       12
#5       25

Or using fuzzy_join

library(fuzzyjoin)
library(dplyr)
fuzzy_inner_join(df1, df2, by = c('position' = 'start', 'position' = 'end'), 
        match_fun = list(`>=`, `<=`)) %>%
    select(position)
#  position
#1        3
#2        6
#3       12
#4       25

Or use a non-equi join from data.table

library(data.table)
setDT(df2)[df1, on = .(start <= position, end >= position), .(position), nomatch = 0]
#   position
#1:        3
#2:        6
#3:       12
#4:       25

DATA

df1 <- structure(list(position = c(3L, 6L, 12L, 18L, 25L, 31L)), row.names = c(NA, 
 -6L), class = "data.frame")

df2 <- structure(list(start = c(2L, 24L), end = c(17L, 29L)), 
 class = "data.frame", row.names = c(NA, -2L))
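
The options above return only the positions inside an interval. As a rough sketch of how both frames from the question could come out of a single comparison, the block below builds a position-by-interval matrix with outer() and splits df1 on it. The names hits, inside and parts are only illustrative, and the bounds are treated as inclusive, as the question describes.

```r
df1 <- data.frame(position = c(3L, 6L, 12L, 18L, 25L, 31L))
df2 <- data.frame(start = c(2L, 24L), end = c(17L, 29L))

# hits[i, j] is TRUE when position i lies inside interval j (bounds inclusive)
hits <- outer(df1$position, df2$start, ">=") &
  outer(df1$position, df2$end, "<=")

# A position is kept if it falls inside at least one interval
inside <- rowSums(hits) > 0

# Fixed factor levels guarantee both pieces exist, even if one is empty
parts <- split(df1, factor(inside, levels = c(TRUE, FALSE),
                           labels = c("inside", "outside")))
df3 <- parts$inside   # positions 3, 6, 12, 25
df4 <- parts$outside  # positions 18, 31
```

The matrix has nrow(df1) * nrow(df2) cells, so this is convenient for small inputs but may use a lot of memory for large ones.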


Here is another take that starts with df2 (I am not claiming it is better than Andre's approach):

subset(df1, apply(apply(df2, 1, function (x) {dplyr::between(df1$position, x["start"], x["end"])}), 1, any))

You should probably run some benchmarks on the proposed approaches before making a decision.
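
A minimal benchmarking sketch using only base R's system.time(). The data are synthetic and the function names mask_loop and mask_sorted are hypothetical. The second function uses findInterval(), which is a different technique from the answers here and assumes the intervals are sorted by start and do not overlap. Timings will vary by machine, so none are shown.

```r
set.seed(42)

# Synthetic data: 5000 positions, 100 sorted, non-overlapping intervals
big1 <- data.frame(position = sample.int(1e5, 5000))
big2 <- data.frame(start = seq(1L, 99001L, by = 1000L))
big2$end <- big2$start + 499L

# Approach 1: check every position against every interval
mask_loop <- function(pos, start, end) {
  vapply(pos, function(p) any(p >= start & p <= end), logical(1))
}

# Approach 2: binary search, valid only for sorted, non-overlapping intervals
mask_sorted <- function(pos, start, end) {
  idx <- findInterval(pos, start)      # last interval starting at or before pos
  idx > 0 & pos <= end[pmax(idx, 1L)]  # idx == 0 means pos precedes all starts
}

t_loop   <- system.time(m1 <- mask_loop(big1$position, big2$start, big2$end))
t_sorted <- system.time(m2 <- mask_sorted(big1$position, big2$start, big2$end))

# Check that both approaches agree before comparing their timings
stopifnot(identical(m1, m2))
print(t_loop)
print(t_sorted)

# Either mask gives both frames from the question
kept     <- big1[m1, , drop = FALSE]
leftover <- big1[!m1, , drop = FALSE]
```

Checking that the masks are identical before comparing timings guards against benchmarking an approach that gives a different answer.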
