How to use awk variables in regular expressions?

I have a file called domain which contains some domains. For example:

google.com
facebook.com
...
yahoo.com

I have another file called site which contains site URLs and counts. For example:

image.google.com   10
map.google.com     8
...
photo.facebook.com  22
game.facebook.com   15
..

Now I want to sum the counts for each domain. For example, google.com has 10+8 = 18. So I wrote an awk script like this:

BEGIN{
  while(getline dom < "./domain" > 0) {
    domain[dom]=0;
  }
  for(dom in domain) {
    while(getline < "./site" > 0) {
      if($1 ~/$dom$) {  # if $1 ends with $dom
        domain[dom]+=$2;
      }
    }
  }
}

But the test if($1 ~/$dom$) doesn't work the way I want, because the variable $dom inside the regular expression is interpreted literally. So, the first question is:

Is there any way to use variable $dom in a regular expression?

Then, as I'm new to writing scripts:

Is there any better way to solve the problem I have?


awk can match against a variable if you don't use the // regex markers.

if ( $0 ~ regex ){ print $0; }

In this case, build up the required regex as a string

regex = dom"$"

Then match against the regex variable

if ( $1 ~ regex ) {
  domain[dom]+=$2;
}
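
As a minimal, self-contained sketch of the steps above (not part of the original answer): the helper name sum_for_domain and its two arguments are hypothetical, and the domain is passed in with -v. Note that the dot in the domain is left unescaped here, so as a regex google.com would also match a name such as googleXcom.

```shell
# sum_for_domain DOMAIN SITE_FILE   (hypothetical helper, for illustration)
sum_for_domain() {
  awk -v dom="$1" '
    BEGIN { regex = dom "$" }      # build the regex as a string: domain, then end anchor
    $1 ~ regex { sum += $2 }       # dynamic match: the right-hand side is a variable, no slashes
    END { print sum + 0 }          # "+ 0" prints 0 instead of an empty line when nothing matched
  ' "$2"
}
```

It would be called as: sum_for_domain google.com site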


First of all, the variable is dom not $dom -- consider $ as an operator to extract the value of the column number stored in the variable dom

Secondly, awk will not interpolate variables between // -- whatever is in there is taken as literal regex text.

You want the match() function where the 2nd argument can be a string that is treated as the regular expression:

if (match($1, dom "$")) {...}

I would code a solution like:

awk '
  FNR == NR {domain[$1] = 0; next}
  {
    for (dom in domain) {
      if (match($1, dom "$")) {
        domain[dom] += $2
        break
      }
    }
  }
  END {for (dom in domain) {print dom, domain[dom]}}
' domain site 
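
Because the dots in a domain name are regex metacharacters, a plain string comparison can be a safer variation on the same loop. The following is a sketch of that variation, not the answer's own method; the function name count_by_suffix is hypothetical, and it assumes one domain per line in the first file and "site count" pairs in the second.

```shell
# count_by_suffix DOMAIN_FILE SITE_FILE   (hypothetical helper, for illustration)
count_by_suffix() {
  awk '
    FNR == NR { if (NF) domain[$1] = 0; next }   # first file: remember each domain, skip blank lines
    {
      for (dom in domain) {
        n = length($1) - length(dom)             # start position of "." dom if $1 ends with it
        # match when $1 is the domain itself or ends with "." followed by the domain
        if ($1 == dom || (n > 0 && substr($1, n) == "." dom)) {
          domain[dom] += $2
          break
        }
      }
    }
    END { for (dom in domain) print dom, domain[dom] }
  ' "$1" "$2"
}
```

The order in which the END loop visits the array is unspecified, so pipe the output through sort if a stable order is needed.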


One way using an awk script:

BEGIN {
    FS = "[. ]"
    OFS = "."
}

FNR == NR {
    domain[$1] = $0
    next
}

FNR < NR {
    if ($2 in domain) {
        for ( i = 2; i < NF; i++ ) {
            if ($i != "") {
                line = (line ? line OFS : "") $i
            }
        }
        total[line] += $NF
        line = ""
    }
}

END {
    for (i in total) {
        printf "%s\t%s\n", i, total[i]
    }
}

Run like:

awk -f script.awk domain.txt site.txt

Results:

facebook.com    37
google.com  18


You clearly want to read the site file once, not once per entry in domain. Fixing that, though, is trivial.

Equally, variables in awk (other than fields $0 .. $9, etc) are not prefixed with $. In particular, $dom is the field whose number is stored in the variable dom (typically field 0, i.e. the whole record, since domain strings convert to the number 0).
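
A small sketch to illustrate this point (the function name show_dollar_dom and the sample input line are only for illustration): because the string in dom converts to the number 0, $dom is evaluated as $0 and the whole input line is printed.

```shell
# show_dollar_dom: prints what $dom evaluates to when dom holds a domain string
show_dollar_dom() {
  echo "image.google.com 10" |
    awk '{ dom = "google.com"; print $dom }'   # "google.com" converts to 0, so this is print $0
}
```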

I think you need to find a way to get the domain from the data read from the site file. I'm not sure if you need to deal with sites with country domains such as bbc.co.uk as well as sites in the GTLDs (google.com etc). Assuming you are not dealing with country domains, you can use this:

BEGIN {
    # parentheses make the intended grouping of getline, < and > explicit
    while ((getline dom < "./domain") > 0) domain[dom] = 0
    FS = "[ .]+"
    while ((getline < "./site") > 0)
    {
        topdom = $(NF-2) "." $(NF-1)
        domain[topdom] += $NF          
    }
    for (dom in domain) print dom "  " domain[dom]
}

In the second while loop, there are NF fields; $NF contains the count, and $1 .. $(NF-1) contain components of the domain. So, topdom ends up containing the top domain name, which is then used to index into the array initialized in the first loop.

Given the data in the question (minus the lines of dots), the output is:

yahoo.com  0
facebook.com  37
google.com  18


The problem with the answers above is that you cannot use regex operators such as \< (a word boundary at the beginning of a word) if you use a string instead of a regular expression /.../. If you had a domain xyz.com and two sites ab.xyz.com and cd.prefix_xyz.com, the counts of both site entries would be added to xyz.com.

Here's a solution using awk's pipe and the sed command: ...

for(dom in domain) {
    while(getline < "./site" > 0) {
        # let sed replace an occurrence of the domain at the end of the site
        cmd = "echo '" $1 "' | sed 's/\\<'" dom "'$/NO_VALID_DOM/'"
        cmd | getline x
        close(cmd)
        if (match(x, "NO_VALID_DOM")) { 
          domain[dom]+=$2;
        }
    }
    close("./site") # this was missing in the original code
}

...
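
For comparison, here is a sketch of a different technique that stays inside awk instead of calling sed: it builds a string regex that only matches the domain at the start of the name or right after a dot. This is an alternative under stated assumptions, not the approach of the answer above; the function name count_by_label is hypothetical, and it assumes domain names contain no regex metacharacters other than dots.

```shell
# count_by_label DOMAIN_FILE SITE_FILE   (hypothetical helper, for illustration)
count_by_label() {
  awk '
    FNR == NR {
      if (NF) {
        re = $1
        gsub(/[.]/, "[.]", re)                 # make every dot in the domain literal
        pattern[$1] = "(^|[.])" re "$"         # domain must start the name or follow a dot
        total[$1] = 0
      }
      next
    }
    {
      for (dom in pattern)
        if ($1 ~ pattern[dom]) { total[dom] += $2; break }
    }
    END { for (dom in total) print dom, total[dom] }
  ' "$1" "$2"
}
```

With a domain xyz.com, this counts ab.xyz.com but not cd.prefix_xyz.com, and it reads the site file only once.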
