character maps to undefined

I am writing a Python (Python 3.3) program to send some data to a webpage using the POST method. Mostly for debugging purposes, I fetch the resulting page and display it on screen using the print() function.

The code is like this:
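The original code is not shown, so here is a minimal sketch of what such a program might look like (the URL and form fields are placeholders, not from the original):

```python
import urllib.parse
import urllib.request

def post_and_print(url="http://example.com/form"):
    # Encode the form data as application/x-www-form-urlencoded bytes
    data = urllib.parse.urlencode({"field": "value"}).encode("ascii")
    with urllib.request.urlopen(url, data) as response:
        # .read() gives bytes; the page is a UTF-8 document
        page = response.read().decode("utf-8")
    print(page)  # this is the line that can fail on a limited console
```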

The HTTPResponse .read() method returns a bytes object encoding the page (which is a well-formatted UTF-8 document). It seemed okay until I stopped using the IDLE GUI for Windows and used the Windows console instead. The returned page has a U+2014 character (em-dash), which the print function renders fine in the Windows GUI (I presume code page 1252) but not in the Windows console (code page 850). Given the strict default behavior, I get the following error:
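The failure can be shown in isolation, since print() encodes with the console's code page under the hood and cp850 has no mapping for U+2014:

```python
# cp850 (the Windows console code page in this case) cannot represent U+2014
try:
    "\u2014".encode("cp850")
except UnicodeEncodeError as e:
    # reason reads roughly: 'charmap' codec can't encode character '\u2014' ...
    # character maps to <undefined>
    print(e)
```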

I could fix it using this quite ugly code:
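The workaround (not shown in the original) presumably looks roughly like this, assuming the console is cp850; the byte string stands in for response.read():

```python
# Decode the UTF-8 page, force it through the console encoding with
# errors="replace" (unmappable characters become '?'), then decode back
# so that print() receives a str it can always output.
page_bytes = b"Some text with an em dash \xe2\x80\x94 in it"
page = page_bytes.decode("utf-8")
printable = page.encode("cp850", errors="replace").decode("cp850")
print(printable)  # Some text with an em dash ? in it
```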

Now it replaces the offending character «—» with a ? . Not the ideal case (a hyphen would be a better replacement), but good enough for my purpose.

There are several things I do not like from my solution.

  1. The code is ugly with all that decoding, encoding, and decoding.
  2. It solves the problem for just this case. If I port the program to a system using some other encoding (latin-1, cp437, back to cp1252, etc.), it should recognize the target encoding, but it does not. (For instance, when using the IDLE GUI again, the em-dash is also lost, which didn't happen before.)
  3. It would be nicer if the em-dash were translated to a hyphen instead of a question mark.

The problem is not the em-dash itself (I can think of several ways to solve that particular problem); I need to write robust code. I am feeding the page with data from a database, and that data can come back. I can anticipate many other conflicting cases: an 'Á' (U+00C1, which is possible in my database) translates into CP-850 (the DOS/Windows console encoding for Western European languages) but not into CP-437 (the encoding for US English, which is the default in many Windows installations).

So, the question:

Is there a nicer solution that makes my code agnostic of the output interface's encoding?
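One answer-style sketch that stays agnostic of the output encoding: ask sys.stdout for its own encoding rather than hard-coding one, and degrade gracefully. The optional translate step handles the em-dash-to-hyphen wish; the helper name is of course made up:

```python
import sys

def safe_print(text: str) -> None:
    # Preferred substitutions first: em dash becomes a plain hyphen
    text = text.translate({0x2014: "-"})
    # Whatever the current stdout cannot represent becomes '?',
    # regardless of which code page the interface happens to use.
    enc = sys.stdout.encoding or "ascii"
    print(text.encode(enc, errors="replace").decode(enc))

safe_print("dash \u2014 and \u00c1")  # "dash - and Á", or "dash - and ?" on a limited console
```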

Good afternoon.
I am reading a UTF-8 file and printing it to the console. When I try to print the letter «И», an error occurs:

It can be reproduced with the following example:

The first word prints fine, but the second one raises the error. I cannot find a way to get around the problem.
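The example itself is missing, but here is a guess at a minimal reproduction, assuming the UTF-8 bytes were being decoded as cp1251: «А» happens to decode (to mojibake), «И» does not:

```python
data = "А И".encode("utf-8")      # b'\xd0\x90 \xd0\x98'
print(data[:2].decode("cp1251"))  # decodes (to mojibake 'Рђ') -- no error
try:
    data[3:].decode("cp1251")     # byte 0x98 is undefined in cp1251
except UnicodeDecodeError as e:
    print(e)
```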

UPD: Apparently I described the problem too broadly, so let me narrow it down:

How do I convert «И» from UTF-8 to cp1251? Everything works for «А», but not for «И».

2 Answers

If your file is written in the UTF-8 encoding, then you must also decode it from UTF-8:

When you write text to a file in some encoding, you are effectively turning the internal representation of the text into bytes in that encoding. To decode those bytes correctly back into the internal representation, you must specify the same encoding when decoding as you used when writing.
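A sketch of the round trip, with a string literal standing in for the file contents:

```python
s = "И"
utf8_bytes = s.encode("utf-8")            # b'\xd0\x98' -- what a UTF-8 file stores
assert utf8_bytes.decode("utf-8") == "И"  # same encoding in and out: works
cp1251_bytes = s.encode("cp1251")         # b'\xc8' -- converting to cp1251 is fine
# utf8_bytes.decode("cp1251") would raise, because 0x98 is undefined in cp1251;
# "А" (b'\xd0\x90') merely decodes to mojibake, which is why it *seemed* to work.
print(cp1251_bytes)
```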

I receive a server response, bytes:

\xd0\xa0\xd1\x83\xd0\xb1\xd0\xbb\xd0\xb8 \xd0\xa0\xd0\xa4 \xd0\x9a\xd0\xa6\xd0\x91

This is for sure Cyrillic, but I’m not sure which encoding. Every attempt to decode it in Python fails:

Both results somewhat resemble Unicode-escape, but this does not work either:
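For what it's worth, once the backslashes lost in transcription are restored, the bytes decode cleanly as plain UTF-8:

```python
raw = (b"\xd0\xa0\xd1\x83\xd0\xb1\xd0\xbb\xd0\xb8 "
       b"\xd0\xa0\xd0\xa4 \xd0\x9a\xd0\xa6\xd0\x91")
print(raw.decode("utf-8"))  # Рубли РФ КЦБ
```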

There’s a web service for recovering Cyrillic texts, it is able to decode my bytes using Windows-1251:

Output (source encoding: WINDOWS-1251)

But I don’t have any more ideas as for how to approach it.

I think I’m missing something about how encoding works, so if the problem seems trivial to you, I would greatly appreciate a bit of explanation/a link to a tutorial/ some keywords for further googling.

Solution:

Windows PowerShell uses the Windows-850 code page by default, which cannot represent some Cyrillic characters. One fix is to change the code page to Unicode (UTF-8) every time the shell starts:
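For example, in the PowerShell session (65001 is the UTF-8 code page):

```powershell
chcp 65001
```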

Here is an explanation of how to make it the new default

Source: computermaker.info
