Requests module reports a different encoding than the one declared in the HTML

The Requests module reports a different encoding than the one actually set in the HTML page.

Code:

import requests
URL = "http://www.reynamining.com/nuevositio/contacto.html"
obj = requests.get(URL, timeout=60, verify=False, allow_redirects=True)
print(obj.encoding)

Output:

ISO-8859-1

Whereas the actual encoding set in the HTML is UTF-8: content="text/html; charset=UTF-8"

My questions are:

  1. Why does requests.encoding show a different encoding than the one declared in the HTML page?

I am trying to convert the content to UTF-8 using objReq.content.decode(encodes).encode("utf-8"). Since the content is already UTF-8, decoding it as ISO-8859-1 and re-encoding as UTF-8 mangles the characters, e.g. á changes to Ã¡.

Is there any way to convert all types of encodings to UTF-8?

Requests sets the response.encoding attribute to ISO-8859-1 when you have a text/* response and no charset is specified in the Content-Type response header.

See the Encoding section of the Advanced documentation:

The only time Requests will not do this is if no explicit charset is present in the HTTP headers and the Content-Type header contains text. In this situation, RFC 2616 specifies that the default charset must be ISO-8859-1. Requests follows the specification in this case. If you require a different encoding, you can manually set the Response.encoding property, or use the raw Response.content.

Bold emphasis mine.

You can test for this by looking for a charset parameter in the Content-Type header:

resp = requests.get(....)
encoding = resp.encoding if 'charset' in resp.headers.get('content-type', '').lower() else None

Your HTML document specifies the content type in a <meta> tag, and it is this tag that is authoritative for the document:

<meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />

HTML5 also defines a <meta charset="..." /> tag; see <meta charset="utf-8"> vs <meta http-equiv="Content-Type">.
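To illustrate how both meta forms can be sniffed, here is a minimal sketch using the standard library's html.parser; the CharsetSniffer class is hypothetical (not part of Requests or the answer here), and it assumes the raw bytes were pre-decoded with an ASCII-compatible fallback such as latin-1:

```python
from html.parser import HTMLParser

class CharsetSniffer(HTMLParser):
    """Collect a charset from <meta charset=...> (HTML5) or
    <meta http-equiv="Content-Type" content="...; charset=..."> (HTML4)."""
    def __init__(self):
        super().__init__()
        self.charset = None

    def handle_starttag(self, tag, attrs):
        if tag != 'meta' or self.charset:
            return
        attrs = dict(attrs)
        if 'charset' in attrs:                                  # HTML5 form
            self.charset = attrs['charset'].lower()
        elif attrs.get('http-equiv', '').lower() == 'content-type':
            content = attrs.get('content', '').lower()          # HTML4 form
            if 'charset=' in content:
                self.charset = content.split('charset=')[1].strip()

sniffer = CharsetSniffer()
sniffer.feed('<meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />')
print(sniffer.charset)  # utf-8
```

In practice BeautifulSoup (below) does this sniffing for you, so this is only to show what "the document is authoritative" means mechanically.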

You should not recode HTML pages to UTF-8 if they contain such a header with a different codec. You must at the very least correct that header in that case.

Using BeautifulSoup:

from bs4 import BeautifulSoup

# pass in an explicit encoding only if one was set in a header
encoding = resp.encoding if 'charset' in resp.headers.get('content-type', '').lower() else None
content = resp.content
soup = BeautifulSoup(content, 'html.parser', from_encoding=encoding)
if soup.original_encoding != 'utf-8':
    meta = soup.select_one('meta[charset], meta[http-equiv="Content-Type"]')
    if meta:
        # replace the meta charset info before re-encoding
        if 'charset' in meta.attrs:
            meta['charset'] = 'utf-8'
        else:
            meta['content'] = 'text/html; charset=utf-8'
    # re-encode the document to UTF-8 bytes
    content = soup.encode('utf-8')

Similarly, other document standards may specify their own encodings; XML, for example, is UTF-8 unless declared otherwise in an <?xml encoding="..." ... ?> XML declaration, again part of the document itself.
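For XML, the standard library already honors that declaration when handed raw bytes, so no manual sniffing is needed; a small illustration (the sample document is made up):

```python
import xml.etree.ElementTree as ET

# an XML document declaring Latin-1, with an 'á' encoded accordingly
raw = "<?xml version='1.0' encoding='iso-8859-1'?><p>á</p>".encode('iso-8859-1')

root = ET.fromstring(raw)  # the parser reads the encoding declaration itself
print(root.text)           # á
```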


Requests will first check for an encoding in the HTTP header:

print(obj.headers['content-type'])

output:

text/html

The header contains no charset parameter, so Requests cannot determine the encoding and falls back to the ISO-8859-1 default.

See the Requests documentation for more details.
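If you already know the real charset (here, from the meta tag), the simplest fix is to override the guess before reading .text. A sketch using a hand-built Response to stand in for the live server (constructing requests.models.Response and setting _content directly is only for illustration; normally you would just set resp.encoding on the object returned by requests.get):

```python
import requests

resp = requests.models.Response()
resp._content = 'Contacto: á'.encode('utf-8')   # the bytes the server sent
resp.headers['Content-Type'] = 'text/html'      # no charset parameter
resp.encoding = 'ISO-8859-1'                    # the default Requests would pick
print(resp.text)                                # Contacto: Ã¡ (mojibake)

resp.encoding = 'utf-8'                         # manual override
print(resp.text)                                # Contacto: á
```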


Requests relies on the HTTP Content-Type response header and chardet. For the common case of text/html, it assumes a default of ISO-8859-1. The issue is that Requests doesn't know anything about HTML meta tags, which can specify a different text encoding, e.g. <meta charset="utf-8"> or <meta http-equiv="content-type" content="text/html; charset=UTF-8">.

A good solution is to use BeautifulSoup's "Unicode, Dammit" feature, like this:

from bs4 import UnicodeDammit
import requests


url = 'http://www.reynamining.com/nuevositio/contacto.html'
r = requests.get(url)

dammit = UnicodeDammit(r.content)
r.encoding = dammit.original_encoding

print(r.text)
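If you'd rather avoid the BeautifulSoup dependency, Requests exposes its own byte-level detector as Response.apparent_encoding (backed by chardet or charset_normalizer). It sniffs byte statistics rather than the meta tag, so it is a heuristic; a sketch with a hand-built Response standing in for the live page:

```python
import requests

r = requests.models.Response()
r._content = '<meta charset="utf-8">Ingeniería y Minería'.encode('utf-8')

r.encoding = r.apparent_encoding  # let the detector override the default guess
print(r.text)
```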


Comments
  • the given snippet produces a NoneType error for a URL like http://www.uraniumenergy.com/contact_us/contact_information; could you please say why this occurs and how to avoid it?
  • @The6thSense: no idea; I don't get any error when I try it. Do you have a traceback?
  • Sorry for the late reply. I have added the traceback to the question, and when I did dir(soup) I did not see select_one; I feel this is causing the error.
  • @The6thSense: upgrade BeautifulSoup; that method is rather new (added in 4.4.0, released July 2015).
  • @The6thSense: alternatively, use soup.select(...) then use the first element if the returned list is not empty.