I am writing a Python (Python 3.3) program to send some data to a web page using the POST method. Mostly for debugging, I get the resulting page and display it on screen using the print() function.
The code is like this:
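The original snippet was lost; a minimal sketch of the pattern described (the URL and form fields here are placeholders, not from the original post):

```python
import urllib.parse
import urllib.request

def post_and_read(url, fields):
    """POST the given form fields and return the response body decoded as UTF-8."""
    payload = urllib.parse.urlencode(fields).encode("ascii")
    with urllib.request.urlopen(url, data=payload) as response:
        return response.read().decode("utf-8")  # .read() returns bytes

# e.g. print(post_and_read("http://example.com/submit", {"key": "value"}))
```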
The HTTPResponse .read() method returns a bytes object encoding the page (which is a well-formatted UTF-8 document). It seemed okay until I stopped using the IDLE GUI for Windows and used the Windows console instead. The returned page contains a U+2014 character (em dash), which the print function renders fine in the Windows GUI (I presume code page 1252) but not in the Windows console (code page 850). Given the strict default error handling, I get the following error:
I could fix it using this quite ugly code:
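The snippet itself was lost; a reconstruction of the idea, not the original code (note that cp850 is hard-coded here, which is exactly the portability problem discussed below):

```python
# Re-encode for the console's code page, replacing characters it cannot
# represent, then decode back so print() receives a str again.
page = "Result \u2014 done"  # sample text containing an em dash (U+2014)
safe = page.encode("cp850", errors="replace").decode("cp850")
print(safe)  # Result ? done
```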
Now it replaces the offending character «—» with a ?. Not the ideal outcome (a hyphen would be a better replacement), but good enough for my purpose.
There are several things I do not like about this solution.
- The code is ugly, with all that decoding, encoding, and re-decoding.
- It solves the problem only for this one case. If I port the program to a system using some other encoding (latin-1, cp437, back to cp1252, etc.), it would have to recognize the target encoding, and it does not. (For instance, when using the IDLE GUI again, the em dash is also lost, which did not happen before.)
- It would be nicer if the em dash were translated to a hyphen instead of a question mark.
The problem is not the em dash (I can think of several ways to solve that particular problem); what I need is to write robust code. I am feeding the page with data from a database, and that data can come back. I can anticipate many other conflicting cases: an 'Á' (U+00C1), which can occur in my database, translates into CP850 (the DOS/Windows console encoding for Western European languages) but not into CP437 (the encoding for US English, which is the default in many Windows installations).
So, the question:
Is there a nicer solution that makes my code agnostic from the output interface encoding?
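One encoding-agnostic direction (my sketch, not code from the thread): instead of hard-coding a code page, rewrap the output stream with errors="replace", keeping whatever encoding the console itself reports.

```python
import io

def make_forgiving(stream):
    """Rewrap a text stream so characters its encoding cannot represent
    are replaced with '?' instead of raising UnicodeEncodeError."""
    return io.TextIOWrapper(
        stream.buffer,
        encoding=stream.encoding or "utf-8",
        errors="replace",
        line_buffering=True,
    )

# Typical use at program start-up:
# sys.stdout = make_forgiving(sys.stdout)
# print("em dash: \u2014")   # never raises, whatever the console encoding
```

On Python 3.7+ the same effect is available as sys.stdout.reconfigure(errors="replace"), but that did not exist in 3.3. If a hyphen is specifically preferred for U+2014, a separate text.translate({0x2014: "-"}) pass before printing handles that one character.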
Good day,
I am reading a UTF-8 file and printing it to the console. When I try to print the letter «И», an error occurs:
It can be reproduced with this example:
The first word prints fine, but the second one raises an error. I cannot find a way to get around the problem.
UPD: It seems I described the problem too broadly; to narrow it down:
How do I convert «И» from UTF-8 to cp1251? Everything works for «А», but not for «И».
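The example itself was lost; a plausible reproduction, assuming the file's UTF-8 bytes were being decoded as cp1251, which would explain exactly why «А» appears to work while «И» fails:

```python
# 'И' (U+0418) encodes to two bytes in UTF-8: 0xD0 0x98.
# Byte 0x98 is unassigned in cp1251, so decoding UTF-8 bytes
# as cp1251 blows up on «И» but silently succeeds on «А».
raw_a = "А".encode("utf-8")    # b'\xd0\x90'
raw_i = "И".encode("utf-8")    # b'\xd0\x98'

assert raw_a.decode("cp1251") == "Рђ"   # no error, but mojibake
try:
    raw_i.decode("cp1251")              # raises: 0x98 undefined in cp1251
    raised = False
except UnicodeDecodeError:
    raised = True
assert raised
```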
2 Answers
If your file is written in the UTF-8 encoding, then you must also decode it from UTF-8:
When you write text to a file in some encoding, you effectively turn the internal representation of the text into bytes in that encoding. To decode those bytes correctly back into the internal representation, you must specify the same encoding when decoding as you used when writing.
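A round-trip sketch of that principle (the file name here is arbitrary):

```python
text = "И как дела"

# Writing turns the text into UTF-8 bytes ...
with open("sample.txt", "w", encoding="utf-8") as f:
    f.write(text)

# ... so reading must decode with the very same encoding.
with open("sample.txt", encoding="utf-8") as f:
    assert f.read() == text

# Only a correctly decoded str can then be re-encoded to cp1251:
assert "И".encode("cp1251") == b"\xc8"
```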
I receive a server response, bytes:
\xd0\xa0\xd1\x83\xd0\xb1\xd0\xbb\xd0\xb8 \xd0\xa0\xd0\xa4 \xd0\x9a\xd0\xa6\xd0\x91
This is for sure Cyrillic, but I’m not sure which encoding. Every attempt to decode it in Python fails:
Both results somewhat resemble Unicode-escape, but this does not work either:
There’s a web service for recovering Cyrillic texts, it is able to decode my bytes using Windows-1251:
Output (source encoding : WINDOWS-1251)
But I don’t have any more ideas as to how to approach it.
I think I’m missing something about how encodings work, so if the problem seems trivial to you, I would greatly appreciate a bit of explanation, a link to a tutorial, or some keywords for further googling.
Solution:
Windows PowerShell uses the Windows-850 code page by default, which cannot handle some Cyrillic characters. One fix is to switch the code page to Unicode every time the shell starts:
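A sketch of that fix (chcp is the standard Windows command for switching the console code page; 65001 is the UTF-8 code page):

```shell
# Switch the current console session to the UTF-8 code page (65001).
chcp 65001
```

To avoid typing it every session, one common approach is to put the same line into the PowerShell $PROFILE script so it runs on every start-up.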
How to make that the new default is explained here.
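As a sanity check: with the backslashes that the question's byte dump lost restored, the bytes decode cleanly as UTF-8, which suggests the exception came from printing, not from decoding.

```python
# The server bytes from the question, with the stripped backslashes restored.
raw = b"\xd0\xa0\xd1\x83\xd0\xb1\xd0\xbb\xd0\xb8 \xd0\xa0\xd0\xa4 \xd0\x9a\xd0\xa6\xd0\x91"
text = raw.decode("utf-8")
assert text == "Рубли РФ КЦБ"  # decodes fine; the crash happened at print time
```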
Source: