Is there a generator version of `string.split()` in Python?

string.split() returns a list instance. Is there a version that returns a generator instead? Are there any reasons against having a generator version?

It is highly probable that re.finditer uses fairly minimal memory overhead.

import re

def split_iter(string):
    return (x.group(0) for x in re.finditer(r"[A-Za-z']+", string))

Demo:

>>> list( split_iter("A programmer's RegEx test.") )
['A', "programmer's", 'RegEx', 'test']

edit: I have just confirmed that this takes constant memory in Python 3.2.1, assuming my testing methodology was correct. I created a very large string (1 GB or so), then iterated through the iterable with a for loop (NOT a list comprehension, which would have generated extra memory). This did not result in a noticeable growth of memory (that is, if there was any growth in memory, it was far, far less than the 1 GB string).
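
For anyone who wants to reproduce that check, here is a minimal sketch of one possible methodology using tracemalloc (my choice of tool; the original test doesn't say how memory was measured), with the string scaled down to roughly 50 MB so it runs quickly:

import re
import tracemalloc

def split_iter(string):
    return (x.group(0) for x in re.finditer(r"[A-Za-z']+", string))

# build the big input *before* tracing starts, so only iteration costs show up
big = "word " * (50 * 1024 * 1024 // 5)  # roughly 50 MB of text

tracemalloc.start()
for token in split_iter(big):
    pass  # consume lazily; no list of tokens is ever built
current, peak = tracemalloc.get_traced_memory()
tracemalloc.stop()

# the peak should be tiny compared to len(big)
print(f"peak traced allocations: {peak / 1024:.1f} KiB")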

The most efficient way I can think of is to write one using the offset (start) parameter of the str.find() method. This avoids lots of memory use, and avoids the overhead of a regexp when it's not needed.

[edit 2016-8-2: updated this to optionally support regex separators]

import re

def isplit(source, sep=None, regex=False):
    """
    generator version of str.split()

    :param source:
        source string (unicode or bytes)

    :param sep:
        separator to split on.

    :param regex:
        if True, will treat sep as regular expression.

    :returns:
        generator yielding elements of string.
    """
    if sep is None:
        # mimic default python behavior
        source = source.strip()
        sep = "\\s+"
        if isinstance(source, bytes):
            sep = sep.encode("ascii")
        regex = True
    if regex:
        # version using re.finditer()
        if not hasattr(sep, "finditer"):
            sep = re.compile(sep)
        start = 0
        for m in sep.finditer(source):
            idx = m.start()
            assert idx >= start
            yield source[start:idx]
            start = m.end()
        yield source[start:]
    else:
        # version using str.find(), less overhead than re.finditer()
        sepsize = len(sep)
        start = 0
        while True:
            idx = source.find(sep, start)
            if idx == -1:
                yield source[start:]
                return
            yield source[start:idx]
            start = idx + sepsize

This can be used like you want...

>>> list(isplit("abcb", "b"))
['a', 'c', '']

While there is a little bit of cost seeking within the string each time find() or slicing is performed, this should be minimal since strings are represented as contiguous arrays in memory.
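
To put rough numbers on that, a quick (and unscientific) comparison against the built-in str.split might look like this, assuming the isplit function above has already been defined in the current module (the input size is an arbitrary choice of mine):

import timeit

data = "a,b,c," * 100000  # arbitrary test input, roughly 600 kB

t_builtin = timeit.timeit('data.split(",")', globals=globals(), number=100)
t_isplit = timeit.timeit('list(isplit(data, ","))', globals=globals(), number=100)
print(f"str.split: {t_builtin:.3f}s   isplit: {t_isplit:.3f}s")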

This is a generator version of split(), implemented via re.search(), that does not have the problem of allocating too many substrings.

import re

def itersplit(s, sep=None):
    # default to splitting on runs of whitespace, like str.split()
    exp = re.compile(r'\s+' if sep is None else re.escape(sep))
    pos = 0
    while True:
        m = exp.search(s, pos)
        if not m:
            # no more separators: yield the tail, unless we are in
            # whitespace mode and there is nothing left
            if pos < len(s) or sep is not None:
                yield s[pos:]
            break
        # skip empty parts only in whitespace mode
        if pos < m.start() or sep is not None:
            yield s[pos:m.start()]
        pos = m.end()


sample1 = "Good evening, world!"
sample2 = " Good evening, world! "
sample3 = "brackets][all][][over][here"
sample4 = "][brackets][all][][over][here]["

assert list(itersplit(sample1)) == sample1.split()
assert list(itersplit(sample2)) == sample2.split()
assert list(itersplit(sample3, '][')) == sample3.split('][')
assert list(itersplit(sample4, '][')) == sample4.split('][')

EDIT: Corrected handling of surrounding whitespace if no separator chars are given.
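
A quick check of that fix (the sample string here is mine):

>>> list(itersplit("  Good evening, world!  "))
['Good', 'evening,', 'world!']
>>> "  Good evening, world!  ".split()
['Good', 'evening,', 'world!']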

Here is my implementation, which is much, much faster and more complete than the other answers here. It has 4 separate subfunctions for different cases.

I'll just copy the docstring of the main str_split function:


str_split(s, *delims, empty=None)

Split the string s by the rest of the arguments, possibly omitting empty parts (empty keyword argument is responsible for that). This is a generator function.

When only one delimiter is supplied, the string is simply split by it. empty is then True by default.

str_split('[]aaa[][]bb[c', '[]')
    -> '', 'aaa', '', 'bb[c'
str_split('[]aaa[][]bb[c', '[]', empty=False)
    -> 'aaa', 'bb[c'

When multiple delimiters are supplied, the string is split by longest possible sequences of those delimiters by default, or, if empty is set to True, empty strings between the delimiters are also included. Note that the delimiters in this case may only be single characters.

str_split('aaa, bb : c;', ' ', ',', ':', ';')
    -> 'aaa', 'bb', 'c'
str_split('aaa, bb : c;', *' ,:;', empty=True)
    -> 'aaa', '', 'bb', '', '', 'c', ''

When no delimiters are supplied, string.whitespace is used, so the effect is the same as str.split(), except this function is a generator.

str_split('aaa\\t  bb c \\n')
    -> 'aaa', 'bb', 'c'

import string

def _str_split_chars(s, delims):
    "Split the string `s` by characters contained in `delims`, including the \
    empty parts between two consecutive delimiters"
    start = 0
    for i, c in enumerate(s):
        if c in delims:
            yield s[start:i]
            start = i+1
    yield s[start:]

def _str_split_chars_ne(s, delims):
    "Split the string `s` by longest possible sequences of characters \
    contained in `delims`"
    start = 0
    in_s = False
    for i, c in enumerate(s):
        if c in delims:
            if in_s:
                yield s[start:i]
                in_s = False
        else:
            if not in_s:
                in_s = True
                start = i
    if in_s:
        yield s[start:]


def _str_split_word(s, delim):
    "Split the string `s` by the string `delim`"
    dlen = len(delim)
    start = 0
    try:
        while True:
            i = s.index(delim, start)
            yield s[start:i]
            start = i+dlen
    except ValueError:
        pass
    yield s[start:]

def _str_split_word_ne(s, delim):
    "Split the string `s` by the string `delim`, not including empty parts \
    between two consecutive delimiters"
    dlen = len(delim)
    start = 0
    try:
        while True:
            i = s.index(delim, start)
            if start!=i:
                yield s[start:i]
            start = i+dlen
    except ValueError:
        pass
    if start<len(s):
        yield s[start:]


def str_split(s, *delims, empty=None):
    """\
Split the string `s` by the rest of the arguments, possibly omitting
empty parts (`empty` keyword argument is responsible for that).
This is a generator function.

When only one delimiter is supplied, the string is simply split by it.
`empty` is then `True` by default.
    str_split('[]aaa[][]bb[c', '[]')
        -> '', 'aaa', '', 'bb[c'
    str_split('[]aaa[][]bb[c', '[]', empty=False)
        -> 'aaa', 'bb[c'

When multiple delimiters are supplied, the string is split by longest
possible sequences of those delimiters by default, or, if `empty` is set to
`True`, empty strings between the delimiters are also included. Note that
the delimiters in this case may only be single characters.
    str_split('aaa, bb : c;', ' ', ',', ':', ';')
        -> 'aaa', 'bb', 'c'
    str_split('aaa, bb : c;', *' ,:;', empty=True)
        -> 'aaa', '', 'bb', '', '', 'c', ''

When no delimiters are supplied, `string.whitespace` is used, so the effect
is the same as `str.split()`, except this function is a generator.
    str_split('aaa\\t  bb c \\n')
        -> 'aaa', 'bb', 'c'
"""
    if len(delims)==1:
        f = _str_split_word if empty is None or empty else _str_split_word_ne
        return f(s, delims[0])
    if len(delims)==0:
        delims = string.whitespace
    delims = set(delims) if len(delims)>=4 else ''.join(delims)
    if any(len(d)>1 for d in delims):
        raise ValueError("Only 1-character multiple delimiters are supported")
    f = _str_split_chars if empty else _str_split_chars_ne
    return f(s, delims)

This function works in Python 3. An easy, though quite ugly, fix can be applied to make it work in both versions 2 and 3. The first lines of the function should be changed to:

def str_split(s, *delims, **kwargs):
    """...docstring..."""
    empty = kwargs.get('empty')
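
For reference, the examples from the docstring check out interactively; note that each call must be wrapped in list() since str_split returns a generator:

>>> list(str_split('[]aaa[][]bb[c', '[]'))
['', 'aaa', '', 'bb[c']
>>> list(str_split('[]aaa[][]bb[c', '[]', empty=False))
['aaa', 'bb[c']
>>> list(str_split('aaa, bb : c;', ' ', ',', ':', ';'))
['aaa', 'bb', 'c']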

Comments
  • This question might be related.
  • The reason is that it's very hard to think of a case where it's useful. Why do you want this?
  • @Glenn: Recently I saw a question about splitting a long string into chunks of n words. One of the solutions split the string and then returned a generator working on the result of split. That got me thinking if there was a way for split to return a generator to start with.
  • There is a relevant discussion on the Python Issue tracker: bugs.python.org/issue17343
  • @GlennMaynard it can be useful for really large bare string/file parsing, but anybody can write a generator parser themselves very easily using a self-brewed DFA and yield
  • Excellent! I had forgotten about finditer. If one were interested in doing something like splitlines, I would suggest using this RE: '(.*\n|.+$)'. str.splitlines chops off the trailing newline though (something that I don't really like...); if you wanted to replicate that part of the behavior, you could use grouping: (m.group(2) or m.group(3) for m in re.finditer('((.*)\n|(.+)$)', s)). PS: I guess the outer parens in the RE are not needed; I just feel uneasy about using | without parens :P
  • What about performance? re matching should be slower than an ordinary search.
  • How would you rewrite this split_iter function to work like a_string.split("delimiter")?
  • split accepts regular expressions anyway, so it's not really faster; if you want to use the returned value in a prev/next fashion, look at my answer at the bottom...
  • str.split() does not accept regular expressions, that's re.split() you're thinking of...
  • why is this any better than re.finditer?
  • @ErikKaplun Because the regex logic for the items can be more complex than for their separators. In my case, I wanted to process each line individually, so I can report back if a line failed to match.
  • This is because memory is slower than CPU and, in this case, the list is loaded in chunks whereas all the others are loaded element by element. On the same note, many academics will tell you linked lists are faster and have less complexity, while your computer will often be faster with arrays, which it finds easier to optimise. You can't assume one option is faster than another, test it! +1 for testing.
  • The problem arises in the next steps of a processing chain. If you then want to find a specific chunk and ignore the rest when you find it, then you have the justification to use a generator-based split instead of the built-in solution.