Splitting a string into words and punctuation

I'm trying to split a string up into words and punctuation, adding the punctuation to the list produced by the split.

For instance:

>>> c = "help, me"
>>> print(c.split())
['help,', 'me']

What I really want the list to look like is:

['help', ',', 'me']

So, I want the string split at whitespace with the punctuation split from the words.

I've tried to parse the string first and then run the split:

>>> separatedPunctuation = ""
>>> for character in c:
...     if character in ".,;!?":
...         outputCharacter = " %s" % character
...     else:
...         outputCharacter = character
...     separatedPunctuation += outputCharacter
...
>>> print(separatedPunctuation)
help , me
>>> print(separatedPunctuation.split())
['help', ',', 'me']

This produces the result I want, but is painfully slow on large files.

Is there a way to do this more efficiently?

Here is a Unicode-aware version:

re.findall(r"\w+|[^\w\s]", text, re.UNICODE)

The first alternative catches sequences of word characters (as defined by Unicode, so "résumé" won't turn into ['r', 'sum']); the second catches individual non-word characters, ignoring whitespace.

Note that, unlike the top answer, this treats the single quote as separate punctuation (e.g. "I'm" -> ['I', "'", 'm']). This appears to be standard in NLP, so I consider it a feature.
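A minimal sketch of this pattern in use, assuming Python 3, where str patterns are already Unicode-aware and the re.UNICODE flag can be left out. The tokenize helper and TOKEN_RE names are made up for illustration:

```python
import re

# \w+ matches a run of word characters (letters, digits, underscore);
# [^\w\s] matches one character that is neither a word character
# nor whitespace, i.e. a single punctuation mark.
TOKEN_RE = re.compile(r"\w+|[^\w\s]")


def tokenize(text):
    """Return the words and individual punctuation marks in text."""
    return TOKEN_RE.findall(text)


print(tokenize("help, me"))            # ['help', ',', 'me']
print(tokenize("I'm"))                 # ['I', "'", 'm']
print(tokenize("r\u00e9sum\u00e9!"))   # ['résumé', '!']
```

Because each punctuation character is its own match, "!!!" comes back as three separate tokens with this pattern.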

In perl-style regular expression syntax, \b matches a word boundary. This should come in handy for doing a regex-based split.

edit: I have been informed by hop that "empty matches" do not work in the split function of Python's re module. I will leave this here as information for anyone else getting stumped by this "feature".
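For readers on newer interpreters: since Python 3.7, re.split() does split on empty matches, so the \b idea can be tried directly. The following is only a sketch under that version assumption, and split_on_boundaries is a hypothetical helper name. The pieces between boundaries still carry their surrounding whitespace, so they are stripped and empty pieces are dropped:

```python
import re


def split_on_boundaries(text):
    """Split text at word boundaries (needs Python 3.7 or later)."""
    # \b matches the empty string between a word character and a
    # non-word character, so the text is cut at every such position.
    pieces = re.split(r"\b", text)
    # Pieces such as ", " keep their whitespace; strip it and drop
    # anything that ends up empty.
    return [piece.strip() for piece in pieces if piece.strip()]


print(split_on_boundaries("help, me"))  # ['help', ',', 'me']
```

Adjacent punctuation stays together in one piece with this approach, which may or may not be what you want.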

Here's my entry.

I have my doubts as to how well this will hold up in the sense of efficiency, or if it catches all cases (note the "!!!" grouped together; this may or may not be a good thing).

>>> import re
>>> s = "Hello, my name is Joe! and i live!!! in a button; factory:"
>>> l = [item.strip() for item in re.split(r"(\W+)", s) if item.strip()]
>>> l
['Hello', ',', 'my', 'name', 'is', 'Joe', '!', 'and', 'i', 'live', '!!!', 'in', 'a', 'button', ';', 'factory', ':']

One obvious optimization would be to compile the regex beforehand (using re.compile) if you're going to be doing this on a line-by-line basis.
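A rough sketch of what that could look like, compiling the pattern once and reusing it for every line. SPLIT_RE, split_line, and split_lines are illustrative names, and the sample lines stand in for a real file:

```python
import re

# Compiled once at import time and reused for every line.
SPLIT_RE = re.compile(r"(\W+)")


def split_line(line):
    """Split one line into words and runs of punctuation."""
    # The capturing group keeps the separators in the result;
    # strip whitespace from each piece and drop the empty ones.
    pieces = (piece.strip() for piece in SPLIT_RE.split(line))
    return [piece for piece in pieces if piece]


def split_lines(lines):
    """Yield the token list for each line of an iterable of lines."""
    for line in lines:
        yield split_line(line)


sample = ["Hello, my name is Joe!", "I live!!! in a button; factory:"]
for tokens in split_lines(sample):
    print(tokens)
```

Note that a run mixing punctuation and spaces (for example ", !") is kept as a single piece by this pattern, inner space included.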

Here's a minor update to your implementation. If you're trying to do anything more detailed, I suggest looking into the NLTK that le dorfier suggested.

This might be only a little faster: it uses ''.join() in place of +=, and join is known to be the faster of the two.

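import string

d = "Hello, I'm a string!"

result = []
word = ''

for char in d:
    if char not in string.whitespace:
        if char not in string.ascii_letters + "'":
            # Punctuation: flush the pending word, then emit the mark itself.
            if word:
                result.append(word)
            result.append(char)
            word = ''
        else:
            # Letter or apostrophe: extend the current word.
            word = ''.join([word, char])
    else:
        # Whitespace ends the current word.
        if word:
            result.append(word)
            word = ''

# Flush the last word if the string ends in a letter.
if word:
    result.append(word)

print(result)
['Hello', ',', "I'm", 'a', 'string', '!']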

Comments
  • For this example (not the general case) c.replace(' ','').partition(',')
  • If you want to split at ANY punctuation, including ', try re.findall(r"[\w]+|[^\s\w]", "Hello, I'm a string!"). The result is ['Hello', ',', 'I', "'", 'm', 'a', 'string', '!'] Note also that digits are included in the word match.
  • Sorry! could you explain how exactly this is working?
  • @Curious: to be honest, no I could not. Because, where should I start? What do you know? Which part is a problem for you? What do you want to achieve?
  • Never mind! I understood this myself! Thanks for the reply :)
  • Upvoted because the \w+|[^\w\s] construct is more generic than the accepted answer but afaik in python 3 the re.UNICODE shouldn't be necessary
  • What the hell? Is that a bug in re.split? In Perl, split /\b\s*/ works without any problem.
  • it's kind of documented that re.split() won't split on empty matches... so, no, not /really/ a bug.
  • "kind of documented"? Even if it is really documented, it is still not helpful in any way, so I guess it is, in fact, a bug-redeclared-feature.
  • maybe. i don't know the rationale behind it. you should have checked whether it worked in any case! i cannot remove the downvote anymore, but please consider rewording the passive-aggressive edit -- doesn't help anyone.
  • It is not answering the question.
  • plus 1 for grouping punctuation.
  • i have not profiled this, but i guess the main problem is with the char-by-char concatenation of word. i'd instead use an index and slices.
  • With tricks i can shave 50% off the execution time of your solution. my solution with re.findall() is still twice as fast.
  • You need to call if word: result.append(word) after the loop ends, else the last word is not in result.
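One way the "index and slices" suggestion from the comments might look, as an unprofiled sketch rather than a measured improvement. The function name is hypothetical; it follows the loop-based answer in treating ASCII letters and the apostrophe as word characters, and it includes the final flush mentioned in the last comment:

```python
import string

WORD_CHARS = set(string.ascii_letters + "'")


def split_words_and_punctuation(text):
    """Split text into words and single punctuation marks using slices."""
    tokens = []
    start = None  # index where the current word began, or None
    for i, char in enumerate(text):
        if char in WORD_CHARS:
            if start is None:
                start = i
            continue
        # Any other character ends the current word.
        if start is not None:
            tokens.append(text[start:i])
            start = None
        # Keep punctuation as its own token; skip whitespace.
        if char not in string.whitespace:
            tokens.append(char)
    # Flush the last word if the text ends in a word character.
    if start is not None:
        tokens.append(text[start:])
    return tokens


print(split_words_and_punctuation("Hello, I'm a string!"))
# ['Hello', ',', "I'm", 'a', 'string', '!']
```

Each word is sliced out of the original string once instead of being rebuilt character by character.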