How to remove \xa0 from string in Python?

I am currently using Beautiful Soup to parse an HTML file and calling get_text(), but it seems like I'm being left with a lot of \xa0 Unicode representing spaces. Is there an efficient way to remove all of them in Python 2.7, and change them into spaces? I guess the more generalized question would be, is there a way to remove Unicode formatting?

I tried using: line = line.replace(u'\xa0',' '), as suggested by another thread, but that changed the \xa0's to u's, so now I have "u"s everywhere instead. ):

EDIT: The problem seems to be resolved by str.replace(u'\xa0', ' ').encode('utf-8'), but just doing .encode('utf-8') without replace() seems to cause it to spit out even weirder characters, \xc2 for instance. Can anyone explain this?

\xa0 is actually a non-breaking space in Latin-1 (ISO 8859-1), also chr(160); as a Unicode code point it is U+00A0. You should replace it with a space.

string = string.replace(u'\xa0', u' ')

When you call .encode('utf-8'), the Unicode string is encoded as UTF-8, in which every code point is represented by 1 to 4 bytes. In this case, \xa0 is represented by the 2 bytes \xc2\xa0, which is why \xc2 shows up when you encode without replacing first.
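To make the byte-level picture concrete, here is a minimal sketch. It is written for Python 3 (where str is already Unicode and no u'' prefix is needed) rather than the 2.7 of the question, and the sample value is made up for illustration.

```python
# Python 3 sketch; the sample string is an assumption, not data from the question.
text = "14\xa026"

# U+00A0 is one code point, but two bytes in UTF-8 and one byte in Latin-1.
utf8_bytes = text.encode("utf-8")
latin1_bytes = text.encode("latin-1")
print(utf8_bytes)    # b'14\xc2\xa026'
print(latin1_bytes)  # b'14\xa026'

# Replacing before encoding leaves only ASCII, so no \xc2 appears.
cleaned = text.replace("\xa0", " ")
print(cleaned.encode("utf-8"))  # b'14 26'
```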

Read up on http://docs.python.org/howto/unicode.html.

Please note: this answer is from 2012. Python has moved on, and you should be able to use unicodedata.normalize now.


There are many useful things in Python's unicodedata module. One of them is the normalize() function.

Try:

import unicodedata
new_str = unicodedata.normalize("NFKD", unicode_str)

Replace NFKD with one of the other normalization forms (NFC, NFD, NFKC) if you don't get the results you're after.
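A quick way to see which normalization forms actually touch \xa0 is to run all four over the same input. This is a Python 3 sketch with a made-up sample string; only the compatibility forms (NFKC and NFKD) map the non-breaking space to a plain space.

```python
import unicodedata

# Sample input is an assumption for illustration.
text = "Dear Parent,\xa0Thanks"

# Normalize with each form and check whether the NBSP survives.
results = {
    form: unicodedata.normalize(form, text)
    for form in ("NFC", "NFD", "NFKC", "NFKD")
}
for form, value in results.items():
    print(form, repr(value), "\xa0" in value)
```

NFC and NFD leave the string unchanged here, so they are not useful for this particular cleanup.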


Try using .strip() at the end of your line. line.strip() worked well for me (note that it only removes whitespace at the start and end of the string).
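As a small check of what strip() does and does not cover, here is a Python 3 sketch with an invented sample line. Leading and trailing non-breaking spaces are removed, but one in the middle stays.

```python
# Sample value is an assumption for illustration.
line = "\xa014\xa026\xa0"

# str.strip() treats U+00A0 as whitespace, but only at the two ends.
stripped = line.strip()
print(repr(stripped))  # '14\xa026'
```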


After trying several methods, here is a summary of how I did it. The following are two ways of avoiding or removing \xa0 characters from a parsed HTML string.

Assume we have our raw HTML as follows:

raw_html = '<p>Dear Parent,&nbsp;</p><p><span style="font-size: 1rem;">This is a test message,&nbsp;</span><span style="font-size: 1rem;">kindly ignore it.&nbsp;</span></p><p><span style="font-size: 1rem;">Thanks</span></p>'

So let's try to clean this HTML string:

from bs4 import BeautifulSoup
raw_html = '<p>Dear Parent,&nbsp;</p><p><span style="font-size: 1rem;">This is a test message,&nbsp;</span><span style="font-size: 1rem;">kindly ignore it.&nbsp;</span></p><p><span style="font-size: 1rem;">Thanks</span></p>'
text_string = BeautifulSoup(raw_html, "lxml").text
print repr(text_string)
# u'Dear Parent,\xa0This is a test message,\xa0kindly ignore it.\xa0Thanks'

The above code leaves \xa0 characters in the string. There are two ways to remove them properly.

Method # 1 (Recommended): The first one is BeautifulSoup's get_text method with the strip argument set to True. Each text fragment is stripped and the fragments are joined without a separator. So our code becomes:

clean_text = BeautifulSoup(raw_html, "lxml").get_text(strip=True)
print clean_text
# Dear Parent,This is a test message,kindly ignore it.Thanks

Method # 2: The other option is to use Python's unicodedata library, which turns each \xa0 into a regular space:

import unicodedata
text_string = BeautifulSoup(raw_html, "lxml").text
clean_text = unicodedata.normalize("NFKD", text_string)
print clean_text
# Dear Parent, This is a test message, kindly ignore it. Thanks

I have also detailed these methods on this blog, which you may want to refer to.
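Method # 1 joins the fragments with no separator ("Parent,This"), which may not be what you want. One possible alternative, shown here as a Python 3 sketch, is to split on any whitespace and rejoin with single spaces. The helper name collapse_whitespace is hypothetical and not part of BeautifulSoup or the answer above, and the input string simply mimics the parsed text from the example.

```python
def collapse_whitespace(text):
    """Hypothetical helper: split on any Unicode whitespace, rejoin with single spaces."""
    # str.split() with no arguments treats \xa0, \t and \n as separators in Python 3.
    return " ".join(text.split())


# Mimics the text extracted from raw_html above, plus a trailing newline.
text_string = "Dear Parent,\xa0This is a test message,\xa0kindly ignore it.\xa0Thanks\n"
clean_text = collapse_whitespace(text_string)
print(clean_text)
# Dear Parent, This is a test message, kindly ignore it. Thanks
```

Unlike unicodedata.normalize, this leaves every non-whitespace character untouched, at the cost of also collapsing tabs and newlines.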


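Try this if your string contains the literal text \xa0 (a backslash followed by "xa0") rather than the actual non-breaking space character:

string = string.replace('\\xa0', ' ')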
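The difference between the escaped text and the real character is easy to miss, so here is a Python 3 sketch with invented sample values. Replacing '\\xa0' only matches the four literal characters backslash, x, a, 0; it does nothing to a real U+00A0.

```python
real_nbsp = "14\xa026"      # 5 characters, one of them U+00A0
escaped_text = "14\\xa026"  # 8 characters: the escape sequence as plain text

print(len(real_nbsp), len(escaped_text))  # 5 8

# The escaped pattern does not match a real non-breaking space.
unchanged = real_nbsp.replace("\\xa0", " ")
print(unchanged == real_nbsp)  # True

# Each pattern fixes its own kind of input.
fixed_escaped = escaped_text.replace("\\xa0", " ")
fixed_real = real_nbsp.replace("\xa0", " ")
print(fixed_escaped, fixed_real)  # 14 26 14 26
```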


Comments
  • tried that already, 'ascii' codec can't decode byte 0xa0 in position 0: ordinal not in range(128)
  • embrace Unicode. Use u''s instead of ''s. :-)
  • tried using str.replace(u'\xa0', ' ') but got "u"s everywhere instead of \xa0s :/
  • If the string is the unicode one, you have to use the u' ' replacement, not the ' '. Is the original string the unicode one?
  • I don't know a huge amount about Unicode and character encodings.. but it seems like unicodedata.normalize would be more appropriate than str.replace
  • Yours is workable advice for strings, but note that all references to this string will also need to be replaced. For example, if you have a program that opens files, and one of the files has a non-breaking space in its name, you will need to rename that file in addition to doing this replacement.
  • U+00a0 is a non-breakable space Unicode character that can be encoded as b'\xa0' byte in latin1 encoding, as two bytes b'\xc2\xa0' in utf-8 encoding. It can be represented as &nbsp; in html.
  • When I try this, I get UnicodeDecodeError: 'ascii' codec can't decode byte 0xa0 in position 397: ordinal not in range(128).
  • Was stuck on this for an hour and finally solved it. Thanks a lot.
  • this is brilliant. This should be the accepted answer.
  • Totally agree. Easy, clear, short and to the point solution. Thumbs up.
  • Not so sure, you may want normalize('NFKD', '1º\xa0dia') to return '1º dia' but it returns '1o dia'
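
The side effect raised in the last comment can be reproduced with a short Python 3 sketch. NFKD rewrites every compatibility character, not just the non-breaking space, whereas a targeted replace leaves the ordinal indicator alone.

```python
import unicodedata

text = "1º\xa0dia"

# NFKD also decomposes the masculine ordinal indicator (U+00BA) into a plain "o".
normalized = unicodedata.normalize("NFKD", text)
print(normalized)  # 1o dia

# A targeted replace changes only the non-breaking space.
targeted = text.replace("\xa0", " ")
print(targeted)  # 1º dia
```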