Splitting a string into words and punctuation
I'm trying to split a string up into words and punctuation, adding the punctuation to the list produced by the split.
For instance:
>>> c = "help, me" >>> print c.split() ['help,', 'me']
What I really want the list to look like is:
['help', ',', 'me']
So, I want the string split at whitespace with the punctuation split from the words.
I've tried to parse the string first and then run the split:
>>> separatedPunctuation = ""
>>> for character in c:
...     if character in ".,;!?":
...         outputCharacter = " %s" % character
...     else:
...         outputCharacter = character
...     separatedPunctuation += outputCharacter
...
>>> print(separatedPunctuation)
help , me
>>> print(separatedPunctuation.split())
['help', ',', 'me']
This produces the result I want, but is painfully slow on large files.
Is there a way to do this more efficiently?
Here is a Unicode-aware version:
re.findall(r"\w+|[^\w\s]", text, re.UNICODE)
The first alternative catches sequences of word characters (as defined by Unicode, so "résumé" won't turn into ['r', 'sum']); the second catches individual non-word characters, ignoring whitespace.
Note that, unlike the top answer, this treats the single quote as separate punctuation (e.g. "I'm" -> ['I', "'", 'm']). This appears to be standard in NLP, so I consider it a feature.
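A quick session to illustrate (output assumes Python 3, where patterns on str are Unicode-aware by default, making the re.UNICODE flag redundant):

>>> import re
>>> re.findall(r"\w+|[^\w\s]", "Help, I'm a résumé!")
['Help', ',', 'I', "'", 'm', 'a', 'résumé', '!']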
In Perl-style regular expression syntax, \b matches a word boundary. This should come in handy for doing a regex-based split.
edit: I have been informed by hop that "empty matches" do not work in the split function of Python's re module. I will leave this here as information for anyone else getting stumped by this "feature".
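Worth knowing: newer Python versions lift this restriction. Since Python 3.7, re.split() can split on zero-width matches such as \b, though it keeps the whitespace runs and yields empty strings at the edges:

>>> import re
>>> re.split(r"\b", "help, me")  # Python 3.7+
['', 'help', ', ', 'me', '']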
Here's my entry.
I have my doubts as to how well this will hold up in terms of efficiency, or whether it catches all cases (note that "!!!" is grouped together; this may or may not be a good thing).
>>> import re
>>> s = "Helo, my name is Joe! and i live!!! in a button; factory:"
>>> l = [item for item in map(str.strip, re.split(r"(\W+)", s)) if len(item) > 0]
>>> l
['Helo', ',', 'my', 'name', 'is', 'Joe', '!', 'and', 'i', 'live', '!!!', 'in', 'a', 'button', ';', 'factory', ':']
One obvious optimization would be to compile the regex beforehand (using re.compile) if you're going to be doing this on a line-by-line basis.
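A minimal sketch of that, assuming the same (\W+) pattern as above (the tokenize name is just illustrative):

import re

# Compile the pattern once and reuse it for every line.
splitter = re.compile(r"(\W+)")

def tokenize(line):
    # The capture group makes split() keep the delimiters; strip whitespace
    # from each piece and drop the pieces that were pure whitespace.
    return [item.strip() for item in splitter.split(line) if item.strip()]

print(tokenize("help, me"))  # ['help', ',', 'me']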
Here's a minor update to your implementation. If you're trying to do anything more detailed, I suggest looking into the NLTK that le dorfier suggested.
This might only be a little faster, since ''.join() is used in place of +=, which is known to be faster.
import string

d = "Hello, I'm a string!"

result = []
word = ''
for char in d:
    if char not in string.whitespace:
        if char not in string.ascii_letters + "'":
            if word:
                result.append(word)
            result.append(char)
            word = ''
        else:
            word = ''.join([word, char])
    else:
        if word:
            result.append(word)
        word = ''
if word:
    result.append(word)  # flush the final word when the string doesn't end in punctuation or whitespace

print(result)  # ['Hello', ',', "I'm", 'a', 'string', '!']
Comments
- For this example (not the general case): c.replace(' ', '').partition(',')
- If you want to split at ANY punctuation, including ', try re.findall(r"[\w]+|[^\s\w]", "Hello, I'm a string!"). The result is ['Hello', ',', 'I', "'", 'm', 'a', 'string', '!']. Note also that digits are included in the word match.
- Sorry! Could you explain how exactly this is working?
- @Curious: to be honest, no I could not. Because, where should I start? What do you know? Which part is a problem for you? What do you want to achieve?
- Never mind! I understood this myself! Thanks for the reply :)
- Upvoted because the \w+|[^\w\s] construct is more generic than the accepted answer, but AFAIK in Python 3 the re.UNICODE flag shouldn't be necessary.
- What the hell? Is that a bug in re.split? In Perl, split /\b\s*/ works without any problem.
- It's kind of documented that re.split() won't split on empty matches... so, no, not /really/ a bug.
- "kind of documented"? Even if it is really documented, it is still not helpful in any way, so I guess it is, in fact, a bug-redeclared-feature.
- Maybe. I don't know the rationale behind it. You should have checked whether it worked in any case! I cannot remove the downvote anymore, but please consider rewording the passive-aggressive edit -- it doesn't help anyone.
- This does not answer the question.
- plus 1 for grouping punctuation.
- I have not profiled this, but I guess the main problem is the char-by-char concatenation of word; I'd use an index and slices instead.
- With tricks I can shave 50% off the execution time of your solution; my solution with re.findall() is still twice as fast. (A minimal timing harness is sketched after these comments.)
- You need the if word: result.append(word) after the loop ends, else the last word is not in the result.
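For anyone wanting to reproduce such timing claims, here is a small harness (a sketch only; the two tokenizers are taken from the answers above, and absolute numbers will vary by machine):

import re
import timeit

text = "Hello, I'm a string! " * 1000

def by_findall():
    # One-liner approach from the Unicode-aware answer.
    return re.findall(r"\w+|[^\w\s]", text)

splitter = re.compile(r"(\W+)")

def by_split():
    # Split-based approach, keeping delimiters via the capture group.
    return [item.strip() for item in splitter.split(text) if item.strip()]

for fn in (by_findall, by_split):
    print(fn.__name__, timeit.timeit(fn, number=100))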