Python re.split(): how to keep some of the delimiters (instead of all of them, as capturing parentheses do)

For the sentence:

"I am very hungry,    so mum brings me a cake!\n"

I want to split it on delimiters and keep every delimiter except spaces in the result. So the expected output is:

"I"  "am"  "very"  "hungry"  ","  "so"  "mum"  "brings"  "me"  "a"  "cake"  "!"  "\n"

What I am currently doing is re.split(r'([!:''".,(\s+)\n])', text), which splits the whole sentence but also keeps a lot of space characters, which I don't want. I've also tried the regular expression \s|([!:''".,(\s+)\n]), which gives me a lot of None values.

search or findall might be more appropriate here than split:

import re

s = "I am very hungry,    so mum brings me a !#$#@  cake!"

print(re.findall(r'[^\w\s]+|\w+', s))

# ['I', 'am', 'very', 'hungry', ',', 'so', 'mum', 'brings', 'me', 'a', '!#$#@', 'cake', '!']

The pattern [^\w\s]+|\w+ means: a sequence of characters that are neither word characters nor whitespace, OR a sequence of word characters (letters, digits, and underscore; that is, a word).
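Note that this pattern drops the trailing "\n" that the question wants to keep, because a newline counts as whitespace. A minimal sketch of one way to keep it, assuming newlines are the only whitespace worth preserving, is to add \n as a third alternative (the variable name tokens is only for illustration):

```python
import re

s = "I am very hungry,    so mum brings me a cake!\n"

# \w+        a run of word characters (letters, digits, underscore)
# [^\w\s]+   a run of symbols that are neither word characters nor whitespace
# \n         a single newline, kept as its own token
tokens = re.findall(r'\w+|[^\w\s]+|\n', s)
print(tokens)
# ['I', 'am', 'very', 'hungry', ',', 'so', 'mum', 'brings', 'me', 'a', 'cake', '!', '\n']
```

Spaces and tabs match none of the three alternatives, so findall simply skips over them.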

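That is because your regular expression contains a capture group. Because of that capture group, re.split also includes the matched delimiters in the result, which is likely what you want. When the match comes from the alternative outside the parentheses, the group takes no part in it, and re.split puts None in that slot instead.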

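The only challenge is to filter out the None values (and other falsy values, such as empty strings) that appear when the capture group does not take part in a match. Note that the '' in the pattern from the question ends and reopens the string literal, so the single quote never reaches the character class; it is escaped below, and the \n is dropped because \s already covers it. We can do the filtering with:

import re

def tokenize(text):
    return filter(None, re.split(r'[ ]+|([!:\'".,\s])', text))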

For your given sample text, this produces:

>>> list(tokenize("I am very hungry,    so mum brings me a cake!\n"))
['I', 'am', 'very', 'hungry', ',', 'so', 'mum', 'brings', 'me', 'a', 'cake', '!', '\n']
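To see where the None values and empty strings come from, it may help to look at the unfiltered result. The sketch below uses a shortened, hypothetical pattern and input (not the ones from the question) so the raw list stays readable:

```python
import re

# Shortened, hypothetical pattern: runs of spaces are matched outside the
# capture group, while '!' and ',' are matched inside it.
pattern = r'[ ]+|([!,])'

raw = re.split(pattern, "so  hungry, yes!")
print(raw)
# ['so', None, 'hungry', ',', '', None, 'yes', '!', '']

# filter(None, ...) removes both None and '' because both are falsy.
print(list(filter(None, raw)))
# ['so', 'hungry', ',', 'yes', '!']
```

Each None marks a split where the spaces alternative matched and the group stayed empty; each '' is the zero-length text between two adjacent delimiters or after the last one.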

One approach is to surround the special characters (,!.\n) with spaces and then split on a single space:

import re


def tokenize(t, pattern="([,!.\n])"):
    return [e for e in re.sub(pattern, r" \1 ", t).split(' ') if e]


s = "I am very hungry,    so mum brings me a cake!\n"

print(tokenize(s))

Output

['I', 'am', 'very', 'hungry', ',', 'so', 'mum', 'brings', 'me', 'a', 'cake', '!', '\n']
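The two steps can be inspected separately. This is only an illustrative sketch on a shortened input; the names padded and tokens are not part of the answer above:

```python
import re

s = "hungry,    so cake!\n"

# Step 1: pad every kept delimiter with one space on each side.
padded = re.sub(r"([,!.\n])", r" \1 ", s)
print(repr(padded))
# 'hungry ,     so cake !  \n '

# Step 2: split on single spaces and drop the empty strings that
# consecutive spaces leave behind.
tokens = [e for e in padded.split(' ') if e]
print(tokens)
# ['hungry', ',', 'so', 'cake', '!', '\n']
```

Splitting with split(' ') rather than the argument-free split() matters here: split() treats the newline as whitespace too and would discard the '\n' token.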

Comments
  • Almost(?) a duplicate of "In Python, how do I split a string and keep the separators?"
  • Could you explain a little how the pattern is constructed? Why does [^\w\s]+ give all the words but not words with a space character (as \s suggests)? Also, why did you put the |\w+ pattern there?
  • @SoManyProblems: added an explanation
  • Why would adding [ ]+| to the regular expression generate a lot of Nones?
  • @SoManyProblems: because if the capture group (the part in the parentheses) does not match anything, it still introduces a None for the "empty" capture group. If you add multiple pairs of parentheses, this can even result in a lot of extra elements.
  • Thanks a lot for the reply. Just to confirm that I understand you correctly: do you mean that [ ]+ matches the space, so it does the splitting, and because it doesn't have a (), it returns None?
  • @SoManyProblems: the regex itself has a (...), a capture group. But since that capture group is not "activated" (it does not match anything), it captures None.