Using RegEx to find number followed by dot

regex python
python pattern matching
regex match float or integer
regular expression for floating point numbers in javascript
regex match number
regex to allow only numbers and single dot in javascript
regular expression for exponential number
regex match specific number

I am trying to find the index of the reference in a list of references. Let me illustrate:

This is a list of references I scraped off a website:

ref = "<p class="references" style="font-size:15px">1. Mcminn. (2003). Last's Anatomy. Elsevier Australia. ISBN:0729537528. <a href="http://books.google.com/books?vid=ISBN0729537528">Read it at Google Books</a> - <a href="http://www.amazon.com/gp/product/0729537528">Find it at Amazon</a><br>
2. Netter, F. H. (2019). Atlas of human anatomy. Philadelphia, PA: Elsevier.</p>"

I thought I could get the index of reference (i.e. "1." and "2.") by using this:

result = list(map(int, [e for e in re.split("[^0-9]", ref) if e != '']))

But I'm getting all numbers: [1, 2003, 729537528, 2, 2019]

How do I only get the list of reference index, i.e. [1, 2] One way I guess is to find numbers followed by a dot, but I don't know how.

Example: Matching Floating Point Numbers with a Regular Expression, Here is a better attempt: [-+]?([0-9]*\. [0-9]+|[0-9]+). This regular expression matches an optional sign, that is either followed by zero or more digits followed by a dot and one or more digits (a floating point number with optional integer part), or that is followed by one or more digits (an integer). [.,] means either a dot or a comma symbol . And you just need to concatenate the pieces together, so in the case where you want to represent a 1 or 2 digit number followed by either a comma or a period and followed by two more digits you would express it as follows: > [0-9]{1,2}[.,]\d{1-2}

You can try this:

import re
o = re.findall(r'[>|\s](\d{1})\.', ref)
print(o)

Will output:

['1', '2']

You might need to define a bit more structure, because just number (digit captured by \d) and dot will also capture '8.' at the end of the ISBN number: ISBN:0729537528. Here I used a few characters that (in this example) help distinguishing the two cases. One reference is preceded by a '>' the other one by a space (\s).

Python Regular Expression, To use regular expression you need to import re module. import re. Now you are \d in regular expression matches a single digit, so. \d\d\d will  Regular expressions are built from single characters, using union, concatenation, and the Kleene closure, or any-number-of, operator (Compilers: Principles, Techniques and Tools). A summary. Regular expressions are a concise way to process text data.

You have to "escape" the period so something like "[0-9]*\." should work. That's off the top of my head so it may be slightly wrong; I'll also leave it up to you to figure out why * is there.

Be aware that Regex expressions in Python are slightly different from other implementations. For definitive info see:

See : https://docs.python.org/3/library/re.html

which suggests that you should start here:

https://docs.python.org/3/howto/regex.html#regex-howto

Here's the relevant section of the library page (about 1/ 3 the way down):

The special sequences consist of '\' and a character from the list below. If the ordinary character is not an ASCII digit or an ASCII letter, then the resulting RE will match the second character. For example, \$ matches the character '$'.

For the eqivalent python 2.x page change the version selector found at the top left corner of the page.

Quantifiers +, *, ? and {n}, To find numbers from 3 to 5 digits we can put the limits into curly braces: \d{3,5} The regexp looks for character '<' followed by one or more Latin letters, and then Create a regexp to find ellipsis: 3 (or more?) dots in a row. Match a white space followed by one or more decimal digits, followed by zero or one period or comma, followed by zero or more decimal digits. This is the first capturing group. Because the replacement pattern is $1 , the call to the Regex.Replace method replaces the entire matched substring with this captured group.

How to match digits followed by a dot using sed?, Because sed is not perl -- sed regexes do not have a \d shorthand: sed 's/[[:digit:]]\​+\.//g'. sed regular expression documentation here. Regular Expression (Regex) Syntax. A Regular Expression (or Regex) is a pattern (or filter) that describes a set of strings that matches the pattern. In other words, a regex accepts a certain set of strings and rejects the rest.

Regular Expression (Regex) Tutorial, To match a character having special meaning in regex, you need to use a escape sequence (dot): ANY ONE character except newline. Take note that this regex matches number with leading zeros, such as "000" , "0123" and Begin with one letters or underscore, followed by zero or more digits, letters and underscore. Consider a simple regular expression that is intended to extract the last four digits from a string of numbers such as a credit card number. The version of the regular expression that uses the * greedy quantifier is \b.*([0-9]{4})\b. However, if a string contains two numbers, this regular expression matches the last four digits of the second

Dot (.) and backslash (\) - Analytics Help, is used to separate the whole part of a number from the fractional part. If you were to provide an IP address as a regular expression, you would get Use the backslash to escape any special character and interpret it literally; for example:.

Comments
  • Try result = list(map(int, re.findall(r"([0-9]+)\. ", p.text)))
  • Thanks for that! I made a mistake using 'p.text' instead of 'ref' as the variable. I corrected it above.