Searching through columns for a specific pattern

I have a pandas column named "A" which has values like:

0
0
1
0
0
0
0

Now I want to search through this column for the pattern 0 1 0 and identify the row in column 'B' corresponding to the 1 in column 'A'.

For example

'B'  'A'
 12   0
 14   0
 6    0
 3    1
 6    0
 8    0 

Now I want it to return 3 from column 'B'. Is there any solution other than applying nested if/else?

You can use numpy to improve performance - a slightly modified version of the solution from this:

pat = [0,1,0]
N = len(pat)
df = pd.DataFrame({'B':range(4, 14), 'A':[0,0,1,0,0,1,0,0,1,0]})
print (df)
    B  A
0   4  0
1   5  0
2   6  1
3   7  0
4   8  0
5   9  1
6  10  0
7  11  0
8  12  1
9  13  0

def rolling_window(a, window):
    shape = a.shape[:-1] + (a.shape[-1] - window + 1, window)
    strides = a.strides + (a.strides[-1],)
    c = np.lib.stride_tricks.as_strided(a, shape=shape, strides=strides)
    return c

arr = df['A'].values
b = np.all(rolling_window(arr, N) == pat, axis=1)

print (rolling_window(arr, N))

[[0 0 1]
 [0 1 0]
 [1 0 0]
 [0 0 1]
 [0 1 0]
 [1 0 0]
 [0 0 1]
 [0 1 0]]

#create indices of matched pattern
c = np.mgrid[0:len(b)][b]
print (c)
[1 4 7]

#rolling windows of column B indexed by the matched positions
d = rolling_window(df['B'].values, N)[c]
print (d)
[[ 5  6  7]
 [ 8  9 10]
 [11 12 13]]

#select second 'column'
e = d[:, 1].tolist()
print (e)
[6, 9, 12]
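
For reference, running the same approach on the data from the question itself (a quick re-check; np.flatnonzero is used here as a shorter equivalent of the np.mgrid indexing above):

```python
import numpy as np
import pandas as pd

def rolling_window(a, window):
    # build a strided 2D view: one row per length-`window` slice of `a`
    shape = a.shape[:-1] + (a.shape[-1] - window + 1, window)
    strides = a.strides + (a.strides[-1],)
    return np.lib.stride_tricks.as_strided(a, shape=shape, strides=strides)

pat = [0, 1, 0]
N = len(pat)
df = pd.DataFrame({'B': [12, 14, 6, 3, 6, 8], 'A': [0, 0, 0, 1, 0, 0]})

b = np.all(rolling_window(df['A'].values, N) == pat, axis=1)
c = np.flatnonzero(b)                      # start indices of matched windows
e = rolling_window(df['B'].values, N)[c][:, 1].tolist()
print(e)  # [3]
```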

The following code starts by specifying the pattern you want to match. In your case, this was 0 1 0. You also specify which position in that pattern should correspond to the index you pull from column B. You wanted the middle element, which is index 1 in a 0-based indexing scheme.

From there, we're taking column A and shifting it with Series.shift(). By default, this includes NaN values for missing coordinates. The NaN won't match with 0 or 1 or any other value of interest, so we can directly compare that shifted column to whatever we're supposed to be matching and get the exact kind of True or False values that we want.

In order to match your entire pattern, we need to combine those values with a logical AND. To do that, we reduce each shifted column pairwise with s1 & s2. This returns a new column which is coordinate-wise the logical AND of the originals.

Finally, we use this boolean result which is a series with as many rows as the original DataFrame df, and we select from df['B'] using it. This returns a new series with just the values from df['B'] at the intended coordinates.

from functools import reduce

matching_values = (0, 1, 0)
matching_index = 1

df['B'][reduce(
    lambda s1, s2: s1 & s2,
    (df['A'].shift(i)==v for i, v in zip(
        range(matching_index, matching_index-len(matching_values), -1),
        matching_values)))]

If using Python 2.x, you don't need to import reduce() first since it's a builtin; in Python 3.x it must be imported from functools, and in return zip() no longer builds an intermediate list, saving on CPU and RAM resources.

Depending on what you're doing, this could easily be extracted into a function exposing the relevant parameters. The hard-coded column names A and B probably aren't ideal and would be appropriate parameters; matching_values and matching_index are other likely candidates.
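
A minimal sketch of such a function (the name match_pattern and its parameter names are illustrative, not from any library):

```python
from functools import reduce
import pandas as pd

def match_pattern(df, pattern_col, value_col, matching_values, matching_index):
    """Return values from `value_col` at rows where the window of
    `pattern_col` around each row equals `matching_values`, with the row
    aligned to `matching_values[matching_index]`."""
    mask = reduce(
        lambda s1, s2: s1 & s2,
        (df[pattern_col].shift(i) == v
         for i, v in zip(range(matching_index,
                               matching_index - len(matching_values), -1),
                         matching_values)))
    return df[value_col][mask]

df = pd.DataFrame({'B': [12, 14, 6, 3, 6, 8], 'A': [0, 0, 0, 1, 0, 0]})
print(match_pattern(df, 'A', 'B', (0, 1, 0), 1).tolist())  # [3]
```

Boundary rows fall out naturally: the NaN introduced by shift() compares unequal to every pattern value, so partial windows at the edges are excluded, matching the behaviour discussed in the comments.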

import pandas as pd
from scipy.signal import convolve

pat = [0,1,0]
df = pd.DataFrame({'B':range(4, 14), 'A':[0,0,1,0,0,1,0,0,1,0]})
# 'valid'-mode convolution with [0,1,0] picks out the middle element of each window
s2 = convolve(df['A'], pat, mode = 'valid')
s2 = pd.Series(s2)
df.B.iloc[s2[s2==1].index + 1].values

Output:

array([ 6,  9, 12])

Output for your given example:

array([3])
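
If the adjacent-1s case matters (see the comments below), one hedged variant of the same convolution idea is to weight the window so that only an exact 0 1 0 can reach the target sum: with weights [1, 4, 1] on 0/1 data, a window sums to 4 only when it equals [0, 1, 0]. This weighting is my own addition, not part of the original answer:

```python
import pandas as pd
from scipy.signal import convolve

# the A column from the comment thread, where 1s sit next to each other
df = pd.DataFrame({'B': range(4, 10), 'A': [0, 1, 0, 1, 1, 1]})

# for 0/1 data, a length-3 window dotted with [1, 4, 1] equals 4 only for [0, 1, 0]
weights = [1, 4, 1]
s2 = pd.Series(convolve(df['A'], weights, mode='valid'))
res = df.B.iloc[s2[s2 == 4].index + 1].values
print(res)  # [5]
```

Checking only the centre element would also report the 1s inside the 1 1 1 run; the weighted sum rejects them because their neighbours contribute to the total.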

Tweaking your original data slightly so that it contains more than one match:

import pandas as pd
o = pd.DataFrame({'A': [0, 1, 0, 1, 0, 0], 'B': [12, 14, 6, 3, 6, 8]})
b = o["A"]
# row index of the middle element of every '010' window
m = [i + 1 for i in range(len(b) - 2)
     if str(b[i]) + str(b[i+1]) + str(b[i+2]) == '010']
print(o.loc[m]['B'].tolist())

So, for the following input:

    A   B
0   0   12
1   1   14
2   0   6
3   1   3
4   0   6
5   0   8

Will output:

[14, 3]
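
The string trick also generalizes: join the column once and scan for every (possibly overlapping) occurrence with str.find. A sketch, assuming column A only ever holds single-digit values:

```python
import pandas as pd

o = pd.DataFrame({'A': [0, 1, 0, 1, 0, 0], 'B': [12, 14, 6, 3, 6, 8]})
s = ''.join(map(str, o['A']))          # '010100'
pat, offset = '010', 1                 # offset of the 1 inside the pattern

hits, start = [], s.find(pat)
while start != -1:
    hits.append(start + offset)        # row aligned with the pattern's 1
    start = s.find(pat, start + 1)     # restart one past the last hit to allow overlaps

print(o.loc[hits, 'B'].tolist())  # [14, 3]
```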

Comments
  • What do you want done for the first and last elements of A? They don't have both of the 0 values you're looking for, but arguably they match the pattern. Are they included or excluded?
  • @HansMusgrave it should match the pattern 0 1 0... so in the case of the first and last, since they don't, we exclude them
  • Got it. That kind of information should probably be included in the question btw. Is there guaranteed to only be a single row in B matching that pattern? If not, do you want all answers?
  • No there could be multiple rows matching the pattern
  • @user3483203 - Thank you.
  • Can I get the index of the column B also where the number was found?
  • @ubuntu_noob You can replace d = rolling_window(df['B'].values, N)[c] with d = rolling_window(df.index.values, N)[c], and then e = d[:, 1].tolist() gives the matched indices
  • I doubt its correctness. If anyone finds any issue, please comment.
  • df = pd.DataFrame({'B':range(4, 10), 'A':[0,1,0,1,1,1]}), seems issue?
  • @atline what is the issue with df = pd.DataFrame({'B':range(4, 10), 'A':[0,1,0,1,1,1]})?