Perl regexp: how to match a substring within words without alternate patterns?

Good afternoon all,

I have a string of blank-separated words. I need to find the words in that string that match an alphanumeric pattern, whether the pattern is the whole word or only part of it. I only want words made entirely of alphanumeric characters.

To make my purpose clearer I have the string:

'foo bar quux foofoo foobar fooquux barfoo barbar barquux ' . 'quuxfoo quuxbar quuxquux [foo] (foo) {foo} foofoo barfoo ' . 'quuxfoo foo2foo foo2bar foo2quux foo2foo bar2foo quux2foo'

and I want to find all words with 'foo' inside (only once per word) but not those with special characters (non alpha) like "[foo]", "{foo}"...

I have done this with the following piece of code in Perl:

use strict;
use warnings;
use feature 'say';   # say() needs Perl 5.10 or later

my $s=
'foo bar quux foofoo foobar fooquux barfoo barbar barquux quuxfoo quuxbar quuxquux ' .
'[foo] (foo) {foo} foofoo barfoo quuxfoo foo2foo foo2bar foo2quux foo2foo bar2foo quux2foo';
# In list context with /g, the match returns every captured word.
my @m = ($s=~/(\w+foo|foo\w+|^foo|foo$)/g) ;
say "@m";
say "Number of sub-strings matching the pattern: ", scalar @m;
# Reuse @m instead of running the match again for every index.
printf("%02d: %s\n", $_, $m[$_]) for (0..$#m);

I get the result I want:

foo foofoo foobar fooquux barfoo quuxfoo foofoo barfoo quuxfoo foo2foo foo2bar foo2quux foo2foo bar2foo quux2foo
Number of sub-strings matching the pattern: 15 
00: foo
01: foofoo
02: foobar
03: fooquux
04: barfoo
05: quuxfoo
06: foofoo
07: barfoo
08: quuxfoo
09: foo2foo
10: foo2bar
11: foo2quux
12: foo2foo
13: bar2foo
14: quux2foo

But if I need (and I will) to add more patterns to search for in a more complex string, it quickly becomes messy and I get lost in the succession of alternatives ('|').

Could someone help me write a shorter, cleaner regexp that delimits the 'foo' (or any other) word or sub-word in one single pattern?

Thank you in advance.

GM

Strawberry 5.022 on W7/64, but I think it's fairly generic to any Perl above 5.016 or even 5.008.


I found dawg's solution (and steffen's too) suitable for me. It is not the most readable one (the grep version is closer to my level of Perl), but since it is purely regexp based, I think it will cope better with adding more words later, each with its own word-boundary handling.

$s=~/(?:(?<=\h)|^)(\w*foo\w*)(?=\h|$)/g


(?:(?<=\h)|^)  Assert either after a \h (horizontal space) or at the start of the string ^ (no /m modifier here)
(\w*foo\w*)    Capture a 'word' containing 'foo' and only \w characters ([a-zA-Z0-9_] for ASCII text)
(?=\h|$)       Assert before either a \h (horizontal space) or the end of the string $

I would like to write down here what I understood of it, so that you can correct me if I'm wrong before I expand it for my actual needs.

(?:         # You start a non-capturing group.
(?<=        # You start a lookbehind (so non-capturing BY NATURE, am I right? Because
            # if not, being enclosed in round brackets '()' it would capture again,
            # even inside a non-capturing group, wouldn't it?)
 \h         # In the lookbehind you look for a horizontal space (could \s have been used
            # there?)
 ^          # In the non-capturing group, but outside of the lookbehind, you look for the
            # start-of-string anchor. It must not be inside the lookbehind because a
            # lookbehind requires alternatives of the same length, and ^ has length 0
            # while \h has length 1.
\w*foo\w*   # You look for foo within an alphanumeric word. No problem having '*' rather
            # than '+' because the left bound (and the right one, see below) is already
            # well restricted.
(?=         # You start a lookahead (non-capturing by nature here again, right?),
            # to look for:
\h or $     # a horizontal space or the end-of-string anchor. The lengths differ here
            # too, as $ is still zero length (like the ^ anchor) and \h is still non
            # zero. "AND YET IT MOVES" (I tested your regexp and it worked) because
            # only the lookbehind has the 'same-size' restriction, right?

Thank you all for your help. After this last point I won't bother you any longer with my little problems, and I'll consider my question fully answered. G.

It depends: if you want to get foobar from (foobar), it's easy. You just match foo with optional word characters before and after, and put a word boundary \b on both sides (it matches at the beginning or end of input, or next to a non-word character):

my @m = ($s=~/(\b\w*foo\w*\b)/g);
# @m already holds every match, so index it rather than matching again.
printf("%02d: %s\n", $_, $m[$_]) for (0..$#m);

Output:

00: foo
01: foofoo
02: foobar
03: fooquux
04: barfoo
05: quuxfoo
06: foo
07: foo
08: foo
09: foofoo
10: barfoo
11: quuxfoo
12: foo2foo
13: foo2bar
14: foo2quux
15: foo2foo
16: bar2foo
17: quux2foo

If not, then it's a bit more difficult. Here I'd match the beginning of input or a whitespace character, then foo surrounded by optional word characters, and finally a (zero-length) lookahead assertion that requires whitespace or end of input:

my @m = ($s=~/(?:^|\s)(\w*foo\w*)(?=\s|$)/g);
# @m already holds every captured word, so index it rather than matching again.
printf("%02d: %s\n", $_, $m[$_]) for (0..$#m);

Output:

00: foo
01: foofoo
02: foobar
03: fooquux
04: barfoo
05: quuxfoo
06: foofoo
07: barfoo
08: quuxfoo
09: foo2foo
10: foo2bar
11: foo2quux
12: foo2foo
13: bar2foo
14: quux2foo
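
Since the question mentions adding more search terms later, here is a minimal sketch of how the same pattern might be extended, assuming the extra terms are plain literal substrings. The names @needles, $alt and $re are made up for illustration, and 'bar' is only an example of a second term; nothing here comes from the original answer beyond the shape of the regex.

```perl
use strict;
use warnings;

my $s =
    'foo bar quux foofoo foobar fooquux barfoo barbar barquux quuxfoo quuxbar quuxquux ' .
    '[foo] (foo) {foo} foofoo barfoo quuxfoo foo2foo foo2bar foo2quux foo2foo bar2foo quux2foo';

# Hypothetical list of substrings to look for. quotemeta escapes any
# regex metacharacters so each term is matched literally.
my @needles = ('foo', 'bar');
my $alt     = join '|', map { quotemeta } @needles;

# Same shape as the pattern above, with the literal 'foo' replaced by
# an alternation that is built once and compiled with qr//.
my $re = qr/(?:^|\s)(\w*(?:$alt)\w*)(?=\s|$)/;

# List context plus /g returns the content of the single capture group
# for every match.
my @m = $s =~ /$re/g;
printf "%02d: %s\n", $_, $m[$_] for 0 .. $#m;
```

The alternation stays confined to one small group, so the boundary handling on either side is written only once no matter how many terms are added.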

You can split your string and filter the array:

use strict;
use warnings;

my $s=
'foo bar quux foofoo foobar fooquux barfoo barbar barquux quuxfoo quuxbar quuxquux ' .
'[foo] (foo) {foo} foofoo barfoo quuxfoo foo2foo foo2bar foo2quux foo2foo bar2foo quux2foo';

my @res = grep {/foo/ && !/\W/}  split /\s/, $s;

print join(" ", @res);

Perhaps filter out the unwanted words first, then use grep against the remaining words:

use strict;
use warnings;

my $s=
'foo bar quux foofoo foobar fooquux barfoo barbar barquux quuxfoo quuxbar quuxquux ' .
'[foo] (foo) {foo} foofoo barfoo quuxfoo foo2foo foo2bar foo2quux foo2foo bar2foo quux2foo';

my @words = ( $s=~/(?:(?<=\h)|^)(\w+)(?=\h|$)/g );

my @foos = grep(/foo/, @words);

while (my ($i, $v) = each @foos) {
    printf "%02d: %s\n", $i,$v;
}

Prints:

00: foo
01: foofoo
02: foobar
03: fooquux
04: barfoo
05: quuxfoo
06: foofoo
07: barfoo
08: quuxfoo
09: foo2foo
10: foo2bar
11: foo2quux
12: foo2foo
13: bar2foo
14: quux2foo

Alternatively, split the string on horizontal whitespace and, in a single grep, test that each word both contains foo and is made only of word characters:

@foos=grep {/foo/ && /^\w+$/} split /\h/, $s;  # same result

Or,

@foos=grep {/^\w*foo\w*$/} split /\h/, $s; 

Or, in a single regex:

@foos=($s=~/(?:(?<=\h)|^)(\w*foo\w*)(?=\h|$)/g);

As requested in comments, with:

$s=~/(?:(?<=\h)|^)(\w*foo\w*)(?=\h|$)/g


(?:(?<=\h)|^)  Assert either after a \h (horizontal space) or at the start of the string ^ (no /m modifier here)
(\w*foo\w*)    Capture a 'word' containing 'foo' and only \w characters ([a-zA-Z0-9_] for ASCII text)
(?=\h|$)       Assert before either a \h (horizontal space) or the end of the string $

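The only tricky part is (?:(?<=\h)|^). In the Perl versions discussed here it is illegal to have a variable-width lookbehind such as (?<=\h|^), since ^ is zero width and \h is not. (Interestingly, the regex (?<=\h|^) is legal in the PCRE library, and Perl 5.30 and later accept limited variable-width lookbehind as an experimental feature.) So (?:(?<=\h)|^) splits the two assertions into separate alternatives inside one non-capturing group. A lookahead such as (?=\h|$) has no such width restriction.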
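
To check that the lookbehind and lookahead really contribute nothing to the captured text, a small sketch like the following may help. It assumes a shortened sample string invented for this example, and the variable names ($re, $whole, @seen) are illustrative only. It compares the whole match with $1 and reports how many capture groups the pattern has.

```perl
use strict;
use warnings;

# Shortened sample string, made up for this sketch.
my $s  = 'foo [foo] barfoo foo2bar (foo) quuxfoo';
my $re = qr/(?:(?<=\h)|^)(\w*foo\w*)(?=\h|$)/;

my @seen;
while ($s =~ /$re/g) {
    # $-[0] and $+[0] are the start and end offsets of the whole match.
    my $whole = substr($s, $-[0], $+[0] - $-[0]);
    push @seen, $1;
    # $#+ is the number of capture groups in the last successful match.
    printf "whole match: %-8s \$1: %-8s groups: %d\n", $whole, $1, $#+;
}
```

The whole match and $1 should be identical on every iteration, because lookarounds are zero width, and the group count should stay at one, because neither (?:...) nor the lookarounds create a capture group.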

Comments
  • Do you want to get foobar from (foobar) or shouldn't that match at all?
  • @steffen: no, you are right, I don't want to have (foo) nor [foo] nor {foo} nor ;foo;, etc.
  • Hey, move the second part to a new answer and accept it ;)
  • In order to exclude ( before the word or ) after (so parens as word-boundary) I'd find it clearer to make your own word boundary ([^()...]) in the first example (in a variable with qr)
  • This is the right direction. Keep things simple. Just remember that the definition of 'word character' as used by regex is letters, digits, and underscore, so you may need to construct a character class if you have more specific requirements.
  • Nice! Thanks. I did not get it at first glance. Now I think I understand it better. I won't make progress in regexps with that, but it solves my problem right away. I'm torn between a quick solution and learning. I'll try to do both. Really, as Perl gurus say, there is more than one way to do it.
  • Thank you for @foos=($s=~/(?:(?<=\h)|^)(\w*foo\w*)(?=\h|$)/g); It worked for me. However, I honestly did not understand half of your regexp... Would you be kind enough to explain it a little so that I can modify it for my own use in the future? TIA.
  • @GillesMaisonneuve: Explanation added.
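
Following up on the comment about building your own word boundary in a qr variable, one possible sketch uses negative lookarounds on \S, which treat "start of string or after whitespace" and "end of string or before whitespace" uniformly. This is an alternative idiom, not something proposed in the thread, and $left and $right are hypothetical names. Note that it treats any whitespace as a separator, not only horizontal space.

```perl
use strict;
use warnings;

my $s =
    'foo bar quux foofoo foobar fooquux barfoo barbar barquux quuxfoo quuxbar quuxquux ' .
    '[foo] (foo) {foo} foofoo barfoo quuxfoo foo2foo foo2bar foo2quux foo2foo bar2foo quux2foo';

# $left and $right are hypothetical names for this sketch.
my $left  = qr/(?<!\S)/;   # no non-space character just before the word
my $right = qr/(?!\S)/;    # no non-space character just after the word

# A negative lookbehind on a single character is fixed width, so no
# separate ^ alternative is needed for the start of the string.
my @foos = $s =~ /$left(\w*foo\w*)$right/g;
printf "%02d: %s\n", $_, $foos[$_] for 0 .. $#foos;
```

Keeping the two boundaries in their own variables means the part in the middle can be changed without touching the delimiting logic.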