Correctly reading text from a Windows-1252 (cp1252) file in Python

As the title suggests, my problem is correctly reading input from a Windows-1252 encoded file in Python and inserting that input into a MySQL table via SQLAlchemy.

The current system setup: a Windows 7 VM with "Roger Access Control System", which outputs the file; an Ubuntu 12.04 LTS VM with a shared folder to the Windows system so I can access the file, using Python 2.7.3.

Now to the actual problem. The input file lives in a VM shared folder and is generated on the Windows 7 system by Roger Access Control System (see roger.pl for more details). The file is called "PREvents.csv" and, as the name suggests, contains a ";"-separated list of data.

An example format of the data:

2013-03-19;15:58:30;100;Jānis;Dumburs;1;Uznemums1;0;Ieeja;
2013-03-19;15:58:40;100;Jānis;Dumburs;1;Uznemums1;2;Izeja;

The 4th field contains the card owner's first name, the 5th contains the owner's last name, and the 6th contains the owner's assigned group.

The issue comes from the fact that any of the three fields mentioned above can contain characters specific to the Latvian language. In the example file, the word "Jānis" contains the letter "ā", which is code point 257 (U+0101) in Unicode.

As I'm used to, I open the file as such:

try:
    f = codecs.open(file, 'rb', 'cp1252')
except IOError:
    f = codecs.open(file, 'wb', 'cp1252')

So far, everything works: it opens the file, and I move on to iterate over each line of the file (this is a continuously running script, so pardon the loop):

while True:
    line = f.readline()

    if not line:
        # Pause loop for 1 second
        time.sleep(1)
    else:
        # Split the line into list
        date, timed, userid, firstname, lastname, groupid, groupname, typed, pointname, empty = line.split(';')

And this is where the issues start. If I print repr(firstname), it prints u'J\xe2nis', which is, as far as I understand, not correct: \xe2 does not represent the Latvian character "ā". Further down the loop, depending on the event type, I assign the variables to an SQLAlchemy object and insert/update:

if typed == '0':  # Entry type
    event = Events(
        period,
        fullname,
        userid,
        groupname,
        timestamp,
        0,
        0
    )
    session.add(event)
else:  # Exit type
    event = session.query(Events).filter(
        Events.period == period,
        Events.exit == 0,
        Events.userid == userid
    ).first()
    if event is not None:
        event.exit = timestamp
        event.spent = timestamp - event.entry

# Commit changes to database
session.commit()

In my search for answers I've found how to define the default encoding to use:

import sys
reload(sys)
sys.setdefaultencoding('utf-8')

Which hasn't helped me in any way.

Basically, all of this leads to me not being able to insert the owner's correct first/last name, or the owner's assigned group name, if they contain any Latvian-specific characters. For example:

Instead of the character "ā" it inserts "â"

I'd also like to add that I cannot change the encoding of the "PREvents.csv" file, and the "RACS" system does not support writing to UTF-8 or Unicode files: if you try either, the system inserts random symbols for the Latvian-specific characters.

Please let me know if any other information is needed, I'll gladly provide it :)

Any help would be highly appreciated.


CP1252 cannot represent ā; your input contains the similar character â. repr just displays an ASCII representation of a unicode string in Python 2.x:

>>> print(repr(b'J\xe2nis'.decode('cp1252')))
u'J\xe2nis'
>>> print(b'J\xe2nis'.decode('cp1252'))
Jânis
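
One possible explanation, offered as an assumption rather than something the question confirms: the file may actually be written in Windows-1257 (the Baltic code page), where byte 0xE2 maps to ā instead of â. A minimal sketch comparing the two decodings follows; the `can_encode` helper is hypothetical and exists only for this illustration:

```python
# Assumption: the raw byte behind the accented letter is 0xE2,
# as the repr() output in the question suggests.
raw = b'J\xe2nis'

as_cp1252 = raw.decode('cp1252')  # 0xE2 -> U+00E2 (a with circumflex)
as_cp1257 = raw.decode('cp1257')  # 0xE2 -> U+0101 (a with macron)


def can_encode(text, encoding):
    """Return True if every character of text exists in the given code page."""
    try:
        text.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False


a_macron = u'\u0101'
print(repr(as_cp1252), repr(as_cp1257))
print(can_encode(a_macron, 'cp1252'), can_encode(a_macron, 'cp1257'))
```

If the names come out right with 'cp1257', passing that codec name to codecs.open in place of 'cp1252' would be the thing to try.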


I think u'J\xe2nis' is correct, see:

>>> print u'J\xe2nis'.encode('utf-8')
Jânis

Are you getting actual errors from SQLAlchemy or in your application's output?


I had the same problem with some XML files. I solved it (on Python 3) by reading the file with ANSI encoding (Windows-1252) and writing a new file with UTF-8 encoding:

import os

path = os.path.dirname(__file__)

file_name = 'my_input_file.xml'

if __name__ == "__main__":
    # Read the whole input file as Windows-1252 ("ANSI") text
    with open(os.path.join(path, file_name), 'r', encoding='cp1252') as f1:
        lines = f1.read()
    # Write the same text back out as UTF-8
    with open(os.path.join(path, 'my_output_file.xml'), 'w', encoding='utf-8') as f2:
        f2.write(lines)
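
The snippet above relies on the `encoding` argument of the built-in `open`, which exists only in Python 3. As a rough sketch for Python 2.7 (the version used in the question), `io.open` accepts the same argument. The function name `convert_encoding` and its parameter names are hypothetical, chosen just for this example:

```python
import io


def convert_encoding(src_path, dst_path,
                     src_encoding='cp1252', dst_encoding='utf-8'):
    """Copy a text file line by line, re-encoding it on the way.

    newline='' disables newline translation, so the original line
    endings (e.g. \r\n from Windows) are written back unchanged.
    """
    with io.open(src_path, 'r', encoding=src_encoding, newline='') as src:
        with io.open(dst_path, 'w', encoding=dst_encoding, newline='') as dst:
            for line in src:
                dst.write(line)
```

Usage might look like `convert_encoding('PREvents.csv', 'PREvents_utf8.csv')`, where the output file name is made up for illustration. Reading line by line avoids loading a large file into memory at once.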
