Filter dataframe using a second dataframe value by value

pandas dataframe filter by column value like
pandas dataframe filter multiple columns
pandas dataframe filter multiple conditions
pandas filter function
pandas find value in any column
pandas series filter by value
pandas filter dataframe by another dataframe
pandas lookup value in another dataframe

I always have had trouble with loops so I am asking here. 2 dataframes. 1 very large and 1 much smaller. Sample versions below.

Dataframe 1

ID        Value
1         apples
1         apples
1         bananas
1         grapes
1         mangoes
1         oranges
1         grapes
1         apples
1         grapes
2         apples
2         apples
2         passionfruits
2         bananas
2         apples
2         apples
2         passionfruits
2         grapes
2         mangoes
2         apples
3         apples
3         bananas
3         oranges
3         apples
3         grapes
3         grapes
3         passionfruits
3         passionfruits
3         oranges
4         apples
4         oranges
4         mangoes
4         bananas
4         grapes
4         grapes
4         grapes
4         apples
4         oranges
4         grapes
4         mangoes
4         mangoes
4         apples
4         oranges
5         passionfruits
5         apples
5         oranges
5         oranges
5         mangoes
5         grapes
5         apples
5         bananas

Dataframe 2

Value
apples
apples
bananas
grapes
mangoes
mangoes
grapes
apples
apples

The different IDs in dataframe 1 are considered as sets. The dataframe 2 in its entirety will be an approximate or exact match to one of the sets. I know there is plenty of code to filter using the entire dataframe 2 to match with 1. But that is not what I require. I require it to filter sequentially value by value with conditions attached. The condition should be whether the previous value matches.

So in this example with the first value nothing happens because all IDs have 'apples'. The second value = 'apples' given that previous value='apples' filters out ID = 4 because it doesnt contain 'apples' occurring twice in a row. Now in the filtered dataframe 1 we search for the third value and so on. It stops only when 1 ID set remains in Dataframe 1. So in this case after the 3rd iteration. Result should be

Dataframe 1

ID        Value
1         apples
1         apples
1         bananas
1         grapes
1         mangoes
1         oranges
1         grapes
1         apples
1         grapes

A possible approach with data.table (an adaptation from my answer here):

# load packages
library(data.table)

# create a function which calculates match-score with 'df2$Value'
maxscore <- function(x, y) {
  m <- mapply('==', shift(x, type = 'lead', n = 0:(length(y) - 1)), y)
  max(rowSums(m, na.rm = TRUE))
}

# calculate the match-score for each group
# and filter out the other groups
setDT(df1)[, score := maxscore(Value, df2$Value), by = ID
           ][score == max(score)][, score := NULL][]

which gives:

   ID   Value
1:  1  apples
2:  1  apples
3:  1 bananas
4:  1  grapes
5:  1 mangoes
6:  1 oranges
7:  1  grapes
8:  1  apples
9:  1  grapes

You can use that function in a dplyr-chain as well (but you will still need the data.table-package for the shift-function):

library(dplyr)
df1 %>% 
  group_by(ID) %>% 
  mutate(m = maxscore(Value, df2$Value)) %>% 
  ungroup() %>% 
  filter(m == max(m)) %>% 
  select(-m)

An alternative implementation of the maxscore-function (inspired by @doscendo's answer here):

maxscore2 <- function(x, y) {
  w <- which(x == y[1])
  v <- sapply(w, function(i) sum(x[i:(i+(length(y)-1))] == y, na.rm = TRUE))
  max(v)
}

How To Filter Pandas Dataframe By Values of Column?, In the above example, we used two steps, 1) create boolean variable satisfying the filtering condition 2) use boolean variable to filter rows. We can then use this boolean variable to filter the dataframe. After subsetting we can see that new dataframe is much smaller in size. # filter rows for year 2002 using the boolean variable >gapminder_2002 = gapminder[is_2002] >print(gapminder_2002.shape) (142, 6) We have successfully filtered pandas dataframe based on values of a column.

We can merge Value for each ID using a token separator (say #) and then write a custom function that compare how many sequential tokens were matched. Finally, select data for ID that has got the maximum match.

library(dplyr)

# This function matches and count tokens separated by `#`
# matched_count ("a#b#c","a#e#c#d")  will return 1
matched_count <- function(x, y){
  x_v <- strsplit(x, split = "#")[[1]]
  y_v <- strsplit(y, split = "#")[[1]]
  max_len <- max(length(x_v), length(y_v))
  length(x_v) <- max_len
  length(y_v) <- max_len
  sum(x_v==y_v,na.rm = TRUE)
}    


Dataframe1 %>% group_by(ID) %>%
  mutate(CompStr = paste0(Value, collapse="#")) %>% #collapse values for ID
  mutate(CompStrdf2 = paste0(Dataframe2$Value, collapse="#")) %>% 
  mutate(max_match = matched_count(CompStr, CompStrdf2)) %>%
  ungroup() %>%
  filter(max_match == max(max_match)) %>%
  select(ID, Value)

# ID Value  
# <int> <chr>  
# 1     1 apples 
# 2     1 apples 
# 3     1 bananas
# 4     1 grapes 
# 5     1 mangoes
# 6     1 oranges
# 7     1 grapes 
# 8     1 apples 
# 9     1 grapes 

How to Filter Rows of a Pandas DataFrame by Column Value, We may be presented with a Table, and want to perform custom filtering operations. How to Filter Rows of a Pandas DataFrame by Column Value Pandas introduces two data types to Python: the Series and DataFrame. This tutorial will focus on two easy ways to filter a Dataframe by column value. The following example is the result of a BLAST search. In bioinformatics, BLAST (basic local alignment search tool) is an algorithm for comparing primary biological sequence information, such as the amino-acid sequences of proteins or the nucleotides of DNA and/or

I suggest turning the Values in each group into a string and comparing their string edit distance. adist - Compute the approximate string distance between character vectors. The distance is a generalized Levenshtein (edit) distance, giving the minimal possibly weighted number of insertions, deletions and substitutions needed to transform one string into another.

string_edit_dist <- function(vec1, vec2) {
    c(adist(paste0(vec1, collapse=""), paste0(vec2, collapse="")))
}    

ind <- which.min(sapply(seq_along(unique(df1$ID)), function(i) string_edit_dist(df1$Value[df1$ID==i], df2$Value)))
df1[df1$ID==ind, ]

  # ID   Value
# 1  1  apples
# 2  1  apples
# 3  1 bananas
# 4  1  grapes
# 5  1 mangoes
# 6  1 oranges
# 7  1  grapes
# 8  1  apples
# 9  1  grapes

Here is the string_edit_distance for each group

sapply(seq_along(unique(df1$ID)), function(i) string_edit_dist(df1$Value[df1$ID==i], df2$Value))
# 7 35 45 46 27

Multiple filtering pandas columns based on values in another column, import pandas as pd df1 = pd.DataFrame({"Random numbers 1": pd.np.random.​randn(6), "Campaign": ["A"] * 5 + ["B"], "Merchant": [1, 1, 1, 2, 3,  Filter Pandas Dataframe by Row and Column Position Suppose you want to select specific rows by their position (let's say from second through fifth row). We can use df.iloc[ ] function for the same.

pandas.DataFrame.filter, Subset the dataframe rows or columns according to the specified index labels. Note that this axis defaults to the info axis that is used when indexing with [] . It is a standrad way to select the subset of data using the values in the dataframe and applying conditions on it We are using the same multiple conditions here also to filter the rows from pur original dataframe with salary >= 100 and Football team starts with alphabet ‘S’ and Age is less than 60

7 Ways To Filter A Pandas Dataframe, What pandas dataframe filtering options are available and how to use them effectively to We want to view rows where the reports value is in our numbers list: Index, Select and Filter dataframe in pandas python – In this tutorial we will learn how to index the dataframe in pandas python with example, How to select and filter the dataframe in pandas python with column name and column index using .ix(), .iloc() and .loc()

Filter dataframe rows if value in column is in a set list of values , To filter dataframe rows if the value in the column is in a setlist of values you can use the isin method which is shown how to do in the  Show rows where year value is null (aka. missing): df[df['year'].isnull()] Wrapping Up. Okay so we covered some functions that we can use to filter our dataframe. Sure, there are more ways to filter stuff out but these are the ones that I find the most useful and easiest to use.

Comments
  • For ID 1, it is mangoes followed by oranges, but in data.frame it is mangoes, then mangoes