How to do a partial search for words using regex in Python

I want to get all 'xlsx' files that have 'feedback report' somewhere in their name. I want this filter to be robust, so variants like 'feedback_report', 'feedback report', and 'Feedback Report' should all match.

Example file names:

  1. ZSS Project_JKIAL-SA_FEEDBACK_REPORT_Jan 29th 2015.xlsx
  2. ZL-SA_feedback report_012844.xlsx
  3. ASARanem-SA_Feedback Report_012844.xlsx

My futile attempt is below:

regex = re.compile(r"[a-zA-Z0-0]*[fF][eE][eE][dD][bB][aA][cC][kK]\s[rR][eE][pP][oO][rR][tT][a-zA-Z0-0]*.xlsx")

This will work:

re.search(r"(feedback)(.*?|\s)(report)", string, re.IGNORECASE)

I tested it on the following input list:

import re
a=["ZSS Project_JKIAL-SA_FEEDBACK_REPORT_Jan 29th 2015.xlsx",
"ZL-SA_feedback report_012844.xlsx",
"ASARanem-SA_Feedback Report_012844.xlsx",
"some report",
"feedback-report"]

for i in a:
    print(re.search(r"(feedback)(.*?|\s)(report)", i, re.IGNORECASE))

The output, as the OP expects, is shown below (newer Python versions print re.Match instead of _sre.SRE_Match):

<_sre.SRE_Match object; span=(21, 36), match='FEEDBACK_REPORT'>
<_sre.SRE_Match object; span=(6, 21), match='feedback report'>
<_sre.SRE_Match object; span=(12, 27), match='Feedback Report'>
None
<_sre.SRE_Match object; span=(0, 15), match='feedback-report'>
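
To turn this search into the filter the question asks for, the same idea can be combined with an extension check. This is only a minimal sketch: the helper name is_feedback_report is made up for illustration, and it assumes the extension should be compared case-insensitively as well.

```python
import re

# Same idea as the pattern above: "feedback", anything in between, "report".
PATTERN = re.compile(r"feedback.*?report", re.IGNORECASE)


def is_feedback_report(filename):
    """Return True for .xlsx names that contain 'feedback ... report'."""
    has_words = PATTERN.search(filename) is not None
    is_xlsx = filename.lower().endswith(".xlsx")
    return has_words and is_xlsx


names = [
    "ZSS Project_JKIAL-SA_FEEDBACK_REPORT_Jan 29th 2015.xlsx",
    "ZL-SA_feedback report_012844.xlsx",
    "ASARanem-SA_Feedback Report_012844.xlsx",
    "some report",
    "feedback-report",
]

# Keep only the names that pass both checks.
matches = [name for name in names if is_feedback_report(name)]
print(matches)
```

Only the three .xlsx names pass; "feedback-report" contains both words but has no extension, so the filter drops it.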

Your regex is nearly acceptable, but the beginning and ending portions will not match correctly because you have underscores in your examples. I'm not sure how representative these are of your actual data but to match what you have here you would need:

regex = re.compile(r"[a-zA-Z0-9\_\-\s]*(feedback)[\s\_\-](report)[a-zA-Z0-9\_\-\s]*\.xlsx",
    flags = re.IGNORECASE)

Another thing you should probably be careful of is to make sure you're actually working with just the file name and not the file path because in that case you'd have to worry about \ and / characters. Also note that I'm only matching for the exact characters I noticed you were missing. You may want to try

regex = re.compile(r".*(feedback).*(report).*\.xlsx", flags = re.IGNORECASE)

but, again, I'm not sure what your data actually looks like. Hope this helps
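
The caution about paths can be handled by stripping the directory part before matching. The sketch below is one possible way to do that, not the only one: the helper name matches_file_name is hypothetical, the pattern is a variant of the first one above written with 0-9 as the digit range and an escaped dot, and PureWindowsPath is used only because it splits on both "/" and "\" without touching the file system.

```python
import re
from pathlib import PureWindowsPath

# Variant of the first pattern: digits are 0-9 and the dot is escaped.
NAME_RE = re.compile(
    r"[a-zA-Z0-9_\-\s]*feedback[\s_\-]report[a-zA-Z0-9_\-\s]*\.xlsx",
    flags=re.IGNORECASE,
)


def matches_file_name(path):
    """Match the pattern against the final path component only."""
    # PureWindowsPath treats both "/" and "\" as separators.
    file_name = PureWindowsPath(path).name
    return NAME_RE.fullmatch(file_name) is not None


print(matches_file_name(r"C:\reports\ZL-SA_feedback report_012844.xlsx"))
print(matches_file_name("feedback_report/summary.xlsx"))
```

With this approach a directory that happens to be called feedback_report does not make every file inside it match.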

First of all, ignore case (or lowercase the file names) to cut down the number of variants the pattern has to cover:

regex = re.compile(r'feedback.{0,3}report.*\.xlsx?', flags=re.IGNORECASE)

This looks for 'feedback', then up to 3 arbitrary characters, then 'report', then anything, ending with a dot and an xls or xlsx extension.

Or simply lowercase the name first:

filename = 'ZL-SA_feedback report_012844.xlsx'
matched = re.search(r'feedback.{0,3}report.*\.xlsx?', filename.lower())
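
The point of limiting the gap to three characters is to avoid matching names where the two words are unrelated. A small sketch of the difference, where the second file name is made up for illustration and the "loose" pattern is just a comparison, not something proposed above:

```python
import re

# At most three characters may sit between the two words.
tight = re.compile(r"feedback.{0,3}report.*\.xlsx?", re.IGNORECASE)
# Any number of characters may sit between the two words.
loose = re.compile(r"feedback.*report.*\.xlsx?", re.IGNORECASE)

near = "ZL-SA_feedback report_012844.xlsx"
far = "feedback on the annual report.xlsx"  # hypothetical file name

for name in (near, far):
    print(name, bool(tight.search(name)), bool(loose.search(name)))
```

Both patterns accept the first name, but only the loose one accepts the second. Note that xlsx? makes the final x optional, so .xls files are accepted too.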

You can also use Python's glob module to search for files with Unix shell-style wildcards:

import glob
glob.glob('*[fF][eE][eE][dD][bB][aA][cC][kK]*[rR][eE][pP][oO][rR][tT]*.xlsx')
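
Hand-written character classes like these are easy to mistype, so it may be safer to build them from the plain words. The sketch below does that and checks the result in memory with fnmatch, which applies the same wildcard rules as glob; the helper name any_case is hypothetical, and it assumes the words contain only plain letters.

```python
import fnmatch


def any_case(word):
    """Turn 'ab' into '[aA][bB]' so a wildcard pattern ignores letter case."""
    return "".join(
        f"[{ch.lower()}{ch.upper()}]" if ch.isalpha() else ch
        for ch in word
    )


pattern = f"*{any_case('feedback')}*{any_case('report')}*.xlsx"

names = [
    "ZSS Project_JKIAL-SA_FEEDBACK_REPORT_Jan 29th 2015.xlsx",
    "ZL-SA_feedback report_012844.xlsx",
    "ASARanem-SA_Feedback Report_012844.xlsx",
    "some report.xlsx",
]

# fnmatchcase compares case-sensitively on every platform.
matches = [name for name in names if fnmatch.fnmatchcase(name, pattern)]
print(pattern)
print(matches)
```

The same pattern string can then be passed to glob.glob to search a real directory.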

Could you use just string methods like the following?

'feedbackreport' in name.replace('_', '').replace(' ', '').lower()

And also

name.endswith('.xlsx')

Giving you something like:

fileList = [
    'ZSS Project_JKIAL-SA_FEEDBACK_REPORT_Jan 29th 2015.xlsx',
    'ZL-SA_feedback report_012844.xlsx',
    'ASARanem-SA_Feedback Report_012844.xlsx'
]

fileNames = [name for name in fileList
             if ('feedbackreport' in name.replace('_', '').replace(' ', '').lower()
                 and name.endswith('.xlsx'))]

If there are more characters that could cause problems such as - then you could also make a quick function to remove bad characters:

def remove_bad_chars(string, chars): 
    for char in chars:
        string = string.replace(char, '')
    return string

Amending the appropriate portion of the if statement to:

if 'feedbackreport' in remove_bad_chars(name, '_.,?!\'-/:;()"\\~ ').lower()
# the string of bad characters includes the underscore and a white space
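
Put together, the string-method approach could look like the sketch below. It assumes that the underscore, the space, and common punctuation are the only separators that need removing; the BAD_CHARS name and the last two file names are made up for illustration.

```python
def remove_bad_chars(string, chars):
    """Return `string` with every character in `chars` removed."""
    for char in chars:
        string = string.replace(char, '')
    return string


# Assumption: these are the separators that may sit between the two words.
BAD_CHARS = '_.,?!\'-/:;()"\\~ '

file_list = [
    'ZSS Project_JKIAL-SA_FEEDBACK_REPORT_Jan 29th 2015.xlsx',
    'ZL-SA_feedback report_012844.xlsx',
    'ASARanem-SA_Feedback Report_012844.xlsx',
    'ZL-SA_feedback-report_012844.xlsx',  # hypothetical, hyphen separator
    'ZL-SA_summary_012844.xlsx',          # hypothetical, should be skipped
]

# The extension is checked on the original name, before characters are removed.
file_names = [
    name for name in file_list
    if 'feedbackreport' in remove_bad_chars(name, BAD_CHARS).lower()
    and name.endswith('.xlsx')
]
print(file_names)
```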

I used this for my string based on all your suggestions. This works for me in 99% of the cases.

regex = re.compile(r"[a-zA-Z0-9\_\-\s]*(feedback)(\s|\_)(report)s?[a-zA-Z0-9\_\-\s]*.xlsx",flags = re.IGNORECASE)
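
A quick way to see what this pattern accepts, and what may end up in the remaining 1%, is to try it on a few variants. Both file names below are made up for illustration, and re.search is an assumption, since the post does not say which matching function is used.

```python
import re

regex = re.compile(
    r"[a-zA-Z0-9\_\-\s]*(feedback)(\s|\_)(report)s?[a-zA-Z0-9\_\-\s]*.xlsx",
    flags=re.IGNORECASE,
)

plural = "ZL-SA_feedback reports_012844.xlsx"  # hypothetical: plural "reports"
hyphen = "ZL-SA_feedback-report_012844.xlsx"   # hypothetical: hyphen separator

# The optional "s" accepts the plural form.
print(bool(regex.search(plural)))
# (\s|\_) allows only whitespace or an underscore between the words.
print(bool(regex.search(hyphen)))
```

The plural name matches and the hyphenated one does not, so adding the hyphen to the separator group would be one possible refinement.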

Comments
  • I'm not a python developer but .*feedback[\s_]report.*\.xlsx seems to be sufficient with the IGNORECASE option.
  • Yes you are absolutely correct and it lessens a lot of permutations pointed by everyone on this thread.
  • This wouldn't work because strip only removes leading and trailing whitespace