it-swarm.com.de

Hier erhalten Sie eine Liste aller Kodierungen, in die Python kodieren kann

Ich schreibe ein Skript, mit dem Bytes in viele verschiedene Codierungen in Python 2.6 codiert werden sollen. Gibt es eine Möglichkeit, eine Liste verfügbarer Codierungen zu erhalten, über die ich iterieren kann?

Der Grund, warum ich dies versuche, ist, dass ein Benutzer Text hat, der nicht korrekt codiert ist. Es gibt lustige Charaktere. Ich kenne den Unicode-Charakter, der ihn durcheinander bringt. Ich möchte in der Lage sein, ihnen eine Antwort zu geben wie "Ihr Texteditor interpretiert diese Zeichenfolge als X-Kodierung, nicht als Y-Kodierung". Ich dachte, ich würde versuchen, dieses Zeichen mit einer Kodierung zu kodieren, dann mit einer anderen Kodierung wieder zu dekodieren und zu sehen, ob wir dieselbe Zeichenfolge erhalten.

etwas wie dieses:

for encoding1, encoding2 in itertools.permutation(encodinglist(), 2):
  try:
    unicode_string = my_unicode_character.encode(encoding1).decode(encoding2)
  except:
    pass
53
Rory

Leider ist encodings.aliases.aliases.keys() KEINE passende Antwort.

aliases (wie man es erwarten sollte/sollte) enthält mehrere Fälle, in denen unterschiedliche Schlüssel demselben Wert zugeordnet werden, z. 1252 und windows_1252 sind beide cp1252 zugeordnet. Sie können Zeit sparen, wenn Sie anstelle von aliases.keys()set(aliases.values()) verwenden.

ABER ES IST EIN SCHLECHTES PROBLEM: aliases enthält keine Codecs, die keine Aliase enthalten (wie cp856, cp874, cp875, cp737 und koi8_u).

>>> from encodings.aliases import aliases
>>> def find(q):
...     return [(k,v) for k, v in aliases.items() if q in k or q in v]
...
>>> find('1252') # multiple aliases
[('1252', 'cp1252'), ('windows_1252', 'cp1252')]
>>> find('856') # no codepage 856 in aliases
[]
>>> find('koi8') # no koi8_u in aliases
[('cskoi8r', 'koi8_r')]
>>> 'x'.decode('cp856') # but cp856 is a valid codec
u'x'
>>> 'x'.decode('koi8_u') # but koi8_u is a valid codec
u'x'
>>>

Es ist auch erwähnenswert, dass Sie, auch wenn Sie eine vollständige Liste der Codecs erhalten, die Codecs ignorieren können, bei denen es sich nicht um das Codieren/Decodieren von Zeichensätzen handelt, sondern eine andere Transformation, z. zlib, quopri und base64.

Das bringt uns zu der Frage, WARUM Sie versuchen möchten, "Bytes in viele verschiedene Codierungen zu codieren". Wenn wir das wissen, können wir Sie möglicherweise in die richtige Richtung lenken.

Zum einen ist das mehrdeutig. Eine DE-Codierung von Bytes in Unicode und eine EN-Codierung in Unicode. Was möchtest du tun?

Was versuchen Sie wirklich zu erreichen: Versuchen Sie zu bestimmen, welcher Codec zum Decodieren einiger eingehender Bytes verwendet werden soll, und planen Sie dies mit allen möglichen Codecs? [Hinweis: latin1 decodiert irgendetwas] Versuchen Sie, die Sprache eines Unicode-Texts zu bestimmen, indem Sie versuchen, ihn mit allen möglichen Codecs zu kodieren? [Anmerkung: utf8 codiert alles].

26
John Machin

Andere Antworten scheinen darauf hinzudeuten, dass das Erstellen dieser Liste programmatisch schwierig und mit Fallen behaftet ist. Dies ist jedoch wahrscheinlich nicht erforderlich, da die Dokumentation eine vollständige Liste der Standardkodierungen enthält, die Python unterstützt, und dies seit Python 2.3.

Sie finden diese Listen (für jede stabile Version der bisher veröffentlichten Sprache) unter:

Nachfolgend sind die Listen für jede dokumentierte Version von Python aufgeführt. Wenn Sie Abwärtskompatibilität haben möchten, anstatt nur eine bestimmte Version von Python zu unterstützen, können Sie die Liste einfach aus der Python-Version latest kopieren und prüfen, ob in Python, in dem Ihr Programm ausgeführt wird, jede Codierung vorhanden ist versuchen, es zu benutzen.

Python 2.3 (59 Kodierungen)

['ascii',
 'cp037',
 'cp424',
 'cp437',
 'cp500',
 'cp737',
 'cp775',
 'cp850',
 'cp852',
 'cp855',
 'cp856',
 'cp857',
 'cp860',
 'cp861',
 'cp862',
 'cp863',
 'cp864',
 'cp865',
 'cp869',
 'cp874',
 'cp875',
 'cp1006',
 'cp1026',
 'cp1140',
 'cp1250',
 'cp1251',
 'cp1252',
 'cp1253',
 'cp1254',
 'cp1255',
 'cp1256',
 'cp1257',
 'cp1258',
 'latin_1',
 'iso8859_2',
 'iso8859_3',
 'iso8859_4',
 'iso8859_5',
 'iso8859_6',
 'iso8859_7',
 'iso8859_8',
 'iso8859_9',
 'iso8859_10',
 'iso8859_13',
 'iso8859_14',
 'iso8859_15',
 'koi8_r',
 'koi8_u',
 'mac_cyrillic',
 'mac_greek',
 'mac_iceland',
 'mac_latin2',
 'mac_roman',
 'mac_turkish',
 'utf_16',
 'utf_16_be',
 'utf_16_le',
 'utf_7',
 'utf_8']

Python 2.4 (85 Kodierungen)

['ascii',
 'big5',
 'big5hkscs',
 'cp037',
 'cp424',
 'cp437',
 'cp500',
 'cp737',
 'cp775',
 'cp850',
 'cp852',
 'cp855',
 'cp856',
 'cp857',
 'cp860',
 'cp861',
 'cp862',
 'cp863',
 'cp864',
 'cp865',
 'cp866',
 'cp869',
 'cp874',
 'cp875',
 'cp932',
 'cp949',
 'cp950',
 'cp1006',
 'cp1026',
 'cp1140',
 'cp1250',
 'cp1251',
 'cp1252',
 'cp1253',
 'cp1254',
 'cp1255',
 'cp1256',
 'cp1257',
 'cp1258',
 'euc_jp',
 'euc_jis_2004',
 'euc_jisx0213',
 'euc_kr',
 'gb2312',
 'gbk',
 'gb18030',
 'hz',
 'iso2022_jp',
 'iso2022_jp_1',
 'iso2022_jp_2',
 'iso2022_jp_2004',
 'iso2022_jp_3',
 'iso2022_jp_ext',
 'iso2022_kr',
 'latin_1',
 'iso8859_2',
 'iso8859_3',
 'iso8859_4',
 'iso8859_5',
 'iso8859_6',
 'iso8859_7',
 'iso8859_8',
 'iso8859_9',
 'iso8859_10',
 'iso8859_13',
 'iso8859_14',
 'iso8859_15',
 'johab',
 'koi8_r',
 'koi8_u',
 'mac_cyrillic',
 'mac_greek',
 'mac_iceland',
 'mac_latin2',
 'mac_roman',
 'mac_turkish',
 'ptcp154',
 'shift_jis',
 'shift_jis_2004',
 'shift_jisx0213',
 'utf_16',
 'utf_16_be',
 'utf_16_le',
 'utf_7',
 'utf_8']

Python 2.5 (86 Kodierungen)

['ascii',
 'big5',
 'big5hkscs',
 'cp037',
 'cp424',
 'cp437',
 'cp500',
 'cp737',
 'cp775',
 'cp850',
 'cp852',
 'cp855',
 'cp856',
 'cp857',
 'cp860',
 'cp861',
 'cp862',
 'cp863',
 'cp864',
 'cp865',
 'cp866',
 'cp869',
 'cp874',
 'cp875',
 'cp932',
 'cp949',
 'cp950',
 'cp1006',
 'cp1026',
 'cp1140',
 'cp1250',
 'cp1251',
 'cp1252',
 'cp1253',
 'cp1254',
 'cp1255',
 'cp1256',
 'cp1257',
 'cp1258',
 'euc_jp',
 'euc_jis_2004',
 'euc_jisx0213',
 'euc_kr',
 'gb2312',
 'gbk',
 'gb18030',
 'hz',
 'iso2022_jp',
 'iso2022_jp_1',
 'iso2022_jp_2',
 'iso2022_jp_2004',
 'iso2022_jp_3',
 'iso2022_jp_ext',
 'iso2022_kr',
 'latin_1',
 'iso8859_2',
 'iso8859_3',
 'iso8859_4',
 'iso8859_5',
 'iso8859_6',
 'iso8859_7',
 'iso8859_8',
 'iso8859_9',
 'iso8859_10',
 'iso8859_13',
 'iso8859_14',
 'iso8859_15',
 'johab',
 'koi8_r',
 'koi8_u',
 'mac_cyrillic',
 'mac_greek',
 'mac_iceland',
 'mac_latin2',
 'mac_roman',
 'mac_turkish',
 'ptcp154',
 'shift_jis',
 'shift_jis_2004',
 'shift_jisx0213',
 'utf_16',
 'utf_16_be',
 'utf_16_le',
 'utf_7',
 'utf_8',
 'utf_8_sig']

Python 2.6 (90 Kodierungen)

['ascii',
 'big5',
 'big5hkscs',
 'cp037',
 'cp424',
 'cp437',
 'cp500',
 'cp737',
 'cp775',
 'cp850',
 'cp852',
 'cp855',
 'cp856',
 'cp857',
 'cp860',
 'cp861',
 'cp862',
 'cp863',
 'cp864',
 'cp865',
 'cp866',
 'cp869',
 'cp874',
 'cp875',
 'cp932',
 'cp949',
 'cp950',
 'cp1006',
 'cp1026',
 'cp1140',
 'cp1250',
 'cp1251',
 'cp1252',
 'cp1253',
 'cp1254',
 'cp1255',
 'cp1256',
 'cp1257',
 'cp1258',
 'euc_jp',
 'euc_jis_2004',
 'euc_jisx0213',
 'euc_kr',
 'gb2312',
 'gbk',
 'gb18030',
 'hz',
 'iso2022_jp',
 'iso2022_jp_1',
 'iso2022_jp_2',
 'iso2022_jp_2004',
 'iso2022_jp_3',
 'iso2022_jp_ext',
 'iso2022_kr',
 'latin_1',
 'iso8859_2',
 'iso8859_3',
 'iso8859_4',
 'iso8859_5',
 'iso8859_6',
 'iso8859_7',
 'iso8859_8',
 'iso8859_9',
 'iso8859_10',
 'iso8859_13',
 'iso8859_14',
 'iso8859_15',
 'iso8859_16',
 'johab',
 'koi8_r',
 'koi8_u',
 'mac_cyrillic',
 'mac_greek',
 'mac_iceland',
 'mac_latin2',
 'mac_roman',
 'mac_turkish',
 'ptcp154',
 'shift_jis',
 'shift_jis_2004',
 'shift_jisx0213',
 'utf_32',
 'utf_32_be',
 'utf_32_le',
 'utf_16',
 'utf_16_be',
 'utf_16_le',
 'utf_7',
 'utf_8',
 'utf_8_sig']

Python 2.7 (93 Kodierungen)

['ascii',
 'big5',
 'big5hkscs',
 'cp037',
 'cp424',
 'cp437',
 'cp500',
 'cp720',
 'cp737',
 'cp775',
 'cp850',
 'cp852',
 'cp855',
 'cp856',
 'cp857',
 'cp858',
 'cp860',
 'cp861',
 'cp862',
 'cp863',
 'cp864',
 'cp865',
 'cp866',
 'cp869',
 'cp874',
 'cp875',
 'cp932',
 'cp949',
 'cp950',
 'cp1006',
 'cp1026',
 'cp1140',
 'cp1250',
 'cp1251',
 'cp1252',
 'cp1253',
 'cp1254',
 'cp1255',
 'cp1256',
 'cp1257',
 'cp1258',
 'euc_jp',
 'euc_jis_2004',
 'euc_jisx0213',
 'euc_kr',
 'gb2312',
 'gbk',
 'gb18030',
 'hz',
 'iso2022_jp',
 'iso2022_jp_1',
 'iso2022_jp_2',
 'iso2022_jp_2004',
 'iso2022_jp_3',
 'iso2022_jp_ext',
 'iso2022_kr',
 'latin_1',
 'iso8859_2',
 'iso8859_3',
 'iso8859_4',
 'iso8859_5',
 'iso8859_6',
 'iso8859_7',
 'iso8859_8',
 'iso8859_9',
 'iso8859_10',
 'iso8859_11',
 'iso8859_13',
 'iso8859_14',
 'iso8859_15',
 'iso8859_16',
 'johab',
 'koi8_r',
 'koi8_u',
 'mac_cyrillic',
 'mac_greek',
 'mac_iceland',
 'mac_latin2',
 'mac_roman',
 'mac_turkish',
 'ptcp154',
 'shift_jis',
 'shift_jis_2004',
 'shift_jisx0213',
 'utf_32',
 'utf_32_be',
 'utf_32_le',
 'utf_16',
 'utf_16_be',
 'utf_16_le',
 'utf_7',
 'utf_8',
 'utf_8_sig']

Python 3.0 (89 Kodierungen)

['ascii',
 'big5',
 'big5hkscs',
 'cp037',
 'cp424',
 'cp437',
 'cp500',
 'cp737',
 'cp775',
 'cp850',
 'cp852',
 'cp855',
 'cp856',
 'cp857',
 'cp860',
 'cp861',
 'cp862',
 'cp863',
 'cp864',
 'cp865',
 'cp866',
 'cp869',
 'cp874',
 'cp875',
 'cp932',
 'cp949',
 'cp950',
 'cp1006',
 'cp1026',
 'cp1140',
 'cp1250',
 'cp1251',
 'cp1252',
 'cp1253',
 'cp1254',
 'cp1255',
 'cp1256',
 'cp1257',
 'cp1258',
 'euc_jp',
 'euc_jis_2004',
 'euc_jisx0213',
 'euc_kr',
 'gb2312',
 'gbk',
 'gb18030',
 'hz',
 'iso2022_jp',
 'iso2022_jp_1',
 'iso2022_jp_2',
 'iso2022_jp_2004',
 'iso2022_jp_3',
 'iso2022_jp_ext',
 'iso2022_kr',
 'latin_1',
 'iso8859_2',
 'iso8859_3',
 'iso8859_4',
 'iso8859_5',
 'iso8859_6',
 'iso8859_7',
 'iso8859_8',
 'iso8859_9',
 'iso8859_10',
 'iso8859_13',
 'iso8859_14',
 'iso8859_15',
 'johab',
 'koi8_r',
 'koi8_u',
 'mac_cyrillic',
 'mac_greek',
 'mac_iceland',
 'mac_latin2',
 'mac_roman',
 'mac_turkish',
 'ptcp154',
 'shift_jis',
 'shift_jis_2004',
 'shift_jisx0213',
 'utf_32',
 'utf_32_be',
 'utf_32_le',
 'utf_16',
 'utf_16_be',
 'utf_16_le',
 'utf_7',
 'utf_8',
 'utf_8_sig']

Python 3.1 (90 Kodierungen)

['ascii',
 'big5',
 'big5hkscs',
 'cp037',
 'cp424',
 'cp437',
 'cp500',
 'cp737',
 'cp775',
 'cp850',
 'cp852',
 'cp855',
 'cp856',
 'cp857',
 'cp860',
 'cp861',
 'cp862',
 'cp863',
 'cp864',
 'cp865',
 'cp866',
 'cp869',
 'cp874',
 'cp875',
 'cp932',
 'cp949',
 'cp950',
 'cp1006',
 'cp1026',
 'cp1140',
 'cp1250',
 'cp1251',
 'cp1252',
 'cp1253',
 'cp1254',
 'cp1255',
 'cp1256',
 'cp1257',
 'cp1258',
 'euc_jp',
 'euc_jis_2004',
 'euc_jisx0213',
 'euc_kr',
 'gb2312',
 'gbk',
 'gb18030',
 'hz',
 'iso2022_jp',
 'iso2022_jp_1',
 'iso2022_jp_2',
 'iso2022_jp_2004',
 'iso2022_jp_3',
 'iso2022_jp_ext',
 'iso2022_kr',
 'latin_1',
 'iso8859_2',
 'iso8859_3',
 'iso8859_4',
 'iso8859_5',
 'iso8859_6',
 'iso8859_7',
 'iso8859_8',
 'iso8859_9',
 'iso8859_10',
 'iso8859_13',
 'iso8859_14',
 'iso8859_15',
 'iso8859_16',
 'johab',
 'koi8_r',
 'koi8_u',
 'mac_cyrillic',
 'mac_greek',
 'mac_iceland',
 'mac_latin2',
 'mac_roman',
 'mac_turkish',
 'ptcp154',
 'shift_jis',
 'shift_jis_2004',
 'shift_jisx0213',
 'utf_32',
 'utf_32_be',
 'utf_32_le',
 'utf_16',
 'utf_16_be',
 'utf_16_le',
 'utf_7',
 'utf_8',
 'utf_8_sig']

Python 3.2 (92 Kodierungen)

['ascii',
 'big5',
 'big5hkscs',
 'cp037',
 'cp424',
 'cp437',
 'cp500',
 'cp720',
 'cp737',
 'cp775',
 'cp850',
 'cp852',
 'cp855',
 'cp856',
 'cp857',
 'cp858',
 'cp860',
 'cp861',
 'cp862',
 'cp863',
 'cp864',
 'cp865',
 'cp866',
 'cp869',
 'cp874',
 'cp875',
 'cp932',
 'cp949',
 'cp950',
 'cp1006',
 'cp1026',
 'cp1140',
 'cp1250',
 'cp1251',
 'cp1252',
 'cp1253',
 'cp1254',
 'cp1255',
 'cp1256',
 'cp1257',
 'cp1258',
 'euc_jp',
 'euc_jis_2004',
 'euc_jisx0213',
 'euc_kr',
 'gb2312',
 'gbk',
 'gb18030',
 'hz',
 'iso2022_jp',
 'iso2022_jp_1',
 'iso2022_jp_2',
 'iso2022_jp_2004',
 'iso2022_jp_3',
 'iso2022_jp_ext',
 'iso2022_kr',
 'latin_1',
 'iso8859_2',
 'iso8859_3',
 'iso8859_4',
 'iso8859_5',
 'iso8859_6',
 'iso8859_7',
 'iso8859_8',
 'iso8859_9',
 'iso8859_10',
 'iso8859_13',
 'iso8859_14',
 'iso8859_15',
 'iso8859_16',
 'johab',
 'koi8_r',
 'koi8_u',
 'mac_cyrillic',
 'mac_greek',
 'mac_iceland',
 'mac_latin2',
 'mac_roman',
 'mac_turkish',
 'ptcp154',
 'shift_jis',
 'shift_jis_2004',
 'shift_jisx0213',
 'utf_32',
 'utf_32_be',
 'utf_32_le',
 'utf_16',
 'utf_16_be',
 'utf_16_le',
 'utf_7',
 'utf_8',
 'utf_8_sig']

Python 3.3 (93 Kodierungen)

['ascii',
 'big5',
 'big5hkscs',
 'cp037',
 'cp424',
 'cp437',
 'cp500',
 'cp720',
 'cp737',
 'cp775',
 'cp850',
 'cp852',
 'cp855',
 'cp856',
 'cp857',
 'cp858',
 'cp860',
 'cp861',
 'cp862',
 'cp863',
 'cp864',
 'cp865',
 'cp866',
 'cp869',
 'cp874',
 'cp875',
 'cp932',
 'cp949',
 'cp950',
 'cp1006',
 'cp1026',
 'cp1140',
 'cp1250',
 'cp1251',
 'cp1252',
 'cp1253',
 'cp1254',
 'cp1255',
 'cp1256',
 'cp1257',
 'cp1258',
 'cp65001',
 'euc_jp',
 'euc_jis_2004',
 'euc_jisx0213',
 'euc_kr',
 'gb2312',
 'gbk',
 'gb18030',
 'hz',
 'iso2022_jp',
 'iso2022_jp_1',
 'iso2022_jp_2',
 'iso2022_jp_2004',
 'iso2022_jp_3',
 'iso2022_jp_ext',
 'iso2022_kr',
 'latin_1',
 'iso8859_2',
 'iso8859_3',
 'iso8859_4',
 'iso8859_5',
 'iso8859_6',
 'iso8859_7',
 'iso8859_8',
 'iso8859_9',
 'iso8859_10',
 'iso8859_13',
 'iso8859_14',
 'iso8859_15',
 'iso8859_16',
 'johab',
 'koi8_r',
 'koi8_u',
 'mac_cyrillic',
 'mac_greek',
 'mac_iceland',
 'mac_latin2',
 'mac_roman',
 'mac_turkish',
 'ptcp154',
 'shift_jis',
 'shift_jis_2004',
 'shift_jisx0213',
 'utf_32',
 'utf_32_be',
 'utf_32_le',
 'utf_16',
 'utf_16_be',
 'utf_16_le',
 'utf_7',
 'utf_8',
 'utf_8_sig']

Python 3.4 (96 Kodierungen)

['ascii',
 'big5',
 'big5hkscs',
 'cp037',
 'cp273',
 'cp424',
 'cp437',
 'cp500',
 'cp720',
 'cp737',
 'cp775',
 'cp850',
 'cp852',
 'cp855',
 'cp856',
 'cp857',
 'cp858',
 'cp860',
 'cp861',
 'cp862',
 'cp863',
 'cp864',
 'cp865',
 'cp866',
 'cp869',
 'cp874',
 'cp875',
 'cp932',
 'cp949',
 'cp950',
 'cp1006',
 'cp1026',
 'cp1125',
 'cp1140',
 'cp1250',
 'cp1251',
 'cp1252',
 'cp1253',
 'cp1254',
 'cp1255',
 'cp1256',
 'cp1257',
 'cp1258',
 'cp65001',
 'euc_jp',
 'euc_jis_2004',
 'euc_jisx0213',
 'euc_kr',
 'gb2312',
 'gbk',
 'gb18030',
 'hz',
 'iso2022_jp',
 'iso2022_jp_1',
 'iso2022_jp_2',
 'iso2022_jp_2004',
 'iso2022_jp_3',
 'iso2022_jp_ext',
 'iso2022_kr',
 'latin_1',
 'iso8859_2',
 'iso8859_3',
 'iso8859_4',
 'iso8859_5',
 'iso8859_6',
 'iso8859_7',
 'iso8859_8',
 'iso8859_9',
 'iso8859_10',
 'iso8859_11',
 'iso8859_13',
 'iso8859_14',
 'iso8859_15',
 'iso8859_16',
 'johab',
 'koi8_r',
 'koi8_u',
 'mac_cyrillic',
 'mac_greek',
 'mac_iceland',
 'mac_latin2',
 'mac_roman',
 'mac_turkish',
 'ptcp154',
 'shift_jis',
 'shift_jis_2004',
 'shift_jisx0213',
 'utf_32',
 'utf_32_be',
 'utf_32_le',
 'utf_16',
 'utf_16_be',
 'utf_16_le',
 'utf_7',
 'utf_8',
 'utf_8_sig']

Python 3.5 (98 Kodierungen)

['ascii',
 'big5',
 'big5hkscs',
 'cp037',
 'cp273',
 'cp424',
 'cp437',
 'cp500',
 'cp720',
 'cp737',
 'cp775',
 'cp850',
 'cp852',
 'cp855',
 'cp856',
 'cp857',
 'cp858',
 'cp860',
 'cp861',
 'cp862',
 'cp863',
 'cp864',
 'cp865',
 'cp866',
 'cp869',
 'cp874',
 'cp875',
 'cp932',
 'cp949',
 'cp950',
 'cp1006',
 'cp1026',
 'cp1125',
 'cp1140',
 'cp1250',
 'cp1251',
 'cp1252',
 'cp1253',
 'cp1254',
 'cp1255',
 'cp1256',
 'cp1257',
 'cp1258',
 'cp65001',
 'euc_jp',
 'euc_jis_2004',
 'euc_jisx0213',
 'euc_kr',
 'gb2312',
 'gbk',
 'gb18030',
 'hz',
 'iso2022_jp',
 'iso2022_jp_1',
 'iso2022_jp_2',
 'iso2022_jp_2004',
 'iso2022_jp_3',
 'iso2022_jp_ext',
 'iso2022_kr',
 'latin_1',
 'iso8859_2',
 'iso8859_3',
 'iso8859_4',
 'iso8859_5',
 'iso8859_6',
 'iso8859_7',
 'iso8859_8',
 'iso8859_9',
 'iso8859_10',
 'iso8859_11',
 'iso8859_13',
 'iso8859_14',
 'iso8859_15',
 'iso8859_16',
 'johab',
 'koi8_r',
 'koi8_t',
 'koi8_u',
 'kz1048',
 'mac_cyrillic',
 'mac_greek',
 'mac_iceland',
 'mac_latin2',
 'mac_roman',
 'mac_turkish',
 'ptcp154',
 'shift_jis',
 'shift_jis_2004',
 'shift_jisx0213',
 'utf_32',
 'utf_32_be',
 'utf_32_le',
 'utf_16',
 'utf_16_be',
 'utf_16_le',
 'utf_7',
 'utf_8',
 'utf_8_sig']

Python 3.6 (98 Kodierungen)

['ascii',
 'big5',
 'big5hkscs',
 'cp037',
 'cp273',
 'cp424',
 'cp437',
 'cp500',
 'cp720',
 'cp737',
 'cp775',
 'cp850',
 'cp852',
 'cp855',
 'cp856',
 'cp857',
 'cp858',
 'cp860',
 'cp861',
 'cp862',
 'cp863',
 'cp864',
 'cp865',
 'cp866',
 'cp869',
 'cp874',
 'cp875',
 'cp932',
 'cp949',
 'cp950',
 'cp1006',
 'cp1026',
 'cp1125',
 'cp1140',
 'cp1250',
 'cp1251',
 'cp1252',
 'cp1253',
 'cp1254',
 'cp1255',
 'cp1256',
 'cp1257',
 'cp1258',
 'cp65001',
 'euc_jp',
 'euc_jis_2004',
 'euc_jisx0213',
 'euc_kr',
 'gb2312',
 'gbk',
 'gb18030',
 'hz',
 'iso2022_jp',
 'iso2022_jp_1',
 'iso2022_jp_2',
 'iso2022_jp_2004',
 'iso2022_jp_3',
 'iso2022_jp_ext',
 'iso2022_kr',
 'latin_1',
 'iso8859_2',
 'iso8859_3',
 'iso8859_4',
 'iso8859_5',
 'iso8859_6',
 'iso8859_7',
 'iso8859_8',
 'iso8859_9',
 'iso8859_10',
 'iso8859_11',
 'iso8859_13',
 'iso8859_14',
 'iso8859_15',
 'iso8859_16',
 'johab',
 'koi8_r',
 'koi8_t',
 'koi8_u',
 'kz1048',
 'mac_cyrillic',
 'mac_greek',
 'mac_iceland',
 'mac_latin2',
 'mac_roman',
 'mac_turkish',
 'ptcp154',
 'shift_jis',
 'shift_jis_2004',
 'shift_jisx0213',
 'utf_32',
 'utf_32_be',
 'utf_32_le',
 'utf_16',
 'utf_16_be',
 'utf_16_le',
 'utf_7',
 'utf_8',
 'utf_8_sig']

Python 3.7 (98 Kodierungen)

['ascii',
 'big5',
 'big5hkscs',
 'cp037',
 'cp273',
 'cp424',
 'cp437',
 'cp500',
 'cp720',
 'cp737',
 'cp775',
 'cp850',
 'cp852',
 'cp855',
 'cp856',
 'cp857',
 'cp858',
 'cp860',
 'cp861',
 'cp862',
 'cp863',
 'cp864',
 'cp865',
 'cp866',
 'cp869',
 'cp874',
 'cp875',
 'cp932',
 'cp949',
 'cp950',
 'cp1006',
 'cp1026',
 'cp1125',
 'cp1140',
 'cp1250',
 'cp1251',
 'cp1252',
 'cp1253',
 'cp1254',
 'cp1255',
 'cp1256',
 'cp1257',
 'cp1258',
 'cp65001',
 'euc_jp',
 'euc_jis_2004',
 'euc_jisx0213',
 'euc_kr',
 'gb2312',
 'gbk',
 'gb18030',
 'hz',
 'iso2022_jp',
 'iso2022_jp_1',
 'iso2022_jp_2',
 'iso2022_jp_2004',
 'iso2022_jp_3',
 'iso2022_jp_ext',
 'iso2022_kr',
 'latin_1',
 'iso8859_2',
 'iso8859_3',
 'iso8859_4',
 'iso8859_5',
 'iso8859_6',
 'iso8859_7',
 'iso8859_8',
 'iso8859_9',
 'iso8859_10',
 'iso8859_11',
 'iso8859_13',
 'iso8859_14',
 'iso8859_15',
 'iso8859_16',
 'johab',
 'koi8_r',
 'koi8_t',
 'koi8_u',
 'kz1048',
 'mac_cyrillic',
 'mac_greek',
 'mac_iceland',
 'mac_latin2',
 'mac_roman',
 'mac_turkish',
 'ptcp154',
 'shift_jis',
 'shift_jis_2004',
 'shift_jisx0213',
 'utf_32',
 'utf_32_be',
 'utf_32_le',
 'utf_16',
 'utf_16_be',
 'utf_16_le',
 'utf_7',
 'utf_8',
 'utf_8_sig']

Falls sie für jeden Anwendungsfall von jedermann relevant sind, beachten Sie, dass die Dokumente auch einige Python-spezifische Kodierungen auflisten, von denen viele anscheinend hauptsächlich für Pythons Interna gedacht sind oder auf andere Weise in irgendeiner Weise seltsam sind, wie der 'undefined' Codierung, die immer eine Ausnahme auslöst, wenn Sie versuchen, sie zu verwenden. Sie möchten diese wahrscheinlich vollständig ignorieren, wenn Sie wie der Fragesteller herausfinden möchten, welche Kodierung für einen Text verwendet wurde, den Sie in der realen Welt gefunden haben. Seit Python 3.6 ist die Liste wie folgt:

["idna",
 "mbcs",
 "oem",
 "palmos",
 "punycode",
 "raw_unicode_escape",
 "rot_13",
 "undefined",
 "unicode_escape",
 "unicode_internal",
 "base64_codec",
 "bz2_codec",
 "hex_codec",
 "quopri_codec",
 "string_escape",
 "uu_codec",
 "zlib_codec"]

Für den Fall, dass Sie meine Tabellen für eine neuere Version von Python aktualisieren möchten, finden Sie hier das (grobe, nicht sehr robuste) Skript, mit dem ich sie erstellt habe:

import requests
import lxml.html
import pprint

for version, url in [
    ('2.3', 'https://docs.python.org/2.3/lib/node130.html'),
    ('2.4', 'https://docs.python.org/2.4/lib/standard-encodings.html'),
    ('2.5', 'https://docs.python.org/2.5/lib/standard-encodings.html'),
    ('2.6', 'https://docs.python.org/2.6/library/codecs.html#standard-encodings'),
    ('2.7', 'https://docs.python.org/2.7/library/codecs.html#standard-encodings'),
    ('3.0', 'https://docs.python.org/3.0/library/codecs.html#standard-encodings'),
    ('3.1', 'https://docs.python.org/3.1/library/codecs.html#standard-encodings'),
    ('3.2', 'https://docs.python.org/3.2/library/codecs.html#standard-encodings'),
    ('3.3', 'https://docs.python.org/3.3/library/codecs.html#standard-encodings'),
    ('3.4', 'https://docs.python.org/3.4/library/codecs.html#standard-encodings'),
    ('3.5', 'https://docs.python.org/3.5/library/codecs.html#standard-encodings'),
    ('3.6', 'https://docs.python.org/3.6/library/codecs.html#standard-encodings'),
]:
    html = requests.get(url).text
    doc = lxml.html.fromstring(html)
    standard_encodings_table = doc.xpath(
        '//table[preceding::h2[.//text()[contains(., "Standard Encodings")]]][//th/text()="Codec"]'
    )[0]
    codecs = standard_encodings_table.xpath('.//td[1]/text()')
    print("## Python %s (%i encodings)" % (version, len(codecs)))
    print('<pre><code>' + pprint.pformat(codecs) + '</code></pre>')
47
Mark Amery

Vielleicht sollten Sie die Bibliothek Universal Encoding Detector (chardet) verwenden, anstatt sie selbst zu implementieren.

>>> import chardet
>>> s = '\xe2\x98\x83' # ☃
>>> chardet.detect(s)
{'confidence': 0.505, 'encoding': 'utf-8'}
17
Matt Nordhoff

Sie können eine Technik verwenden, um alle Module im Paket encodings aufzulisten.

import pkgutil
import encodings

false_positives = set(["aliases"])

found = set(name for imp, name, ispkg in pkgutil.iter_modules(encodings.__path__) if not ispkg)
found.difference_update(false_positives)
print found
13
u0b34a0f6ae

Ich bezweifle, dass es eine solche Methode/Funktionalität im Codecs-Modul gibt, aber wenn Sie encoding/__init__.py sehen, durchsucht die Suchfunktion den Kodierungsmodulordner.

>>> os.listdir(os.path.dirname(encodings.__file__))
['cp500.pyc', 'utf_16_le.py', 'gb18030.py', 'mbcs.pyc', 'undefined.pyc', 'idna.pyc', 'punycode.pyc', 'cp850.py', 'big5hkscs.pyc', 'mac_arabic.py', '__init__.pyc', 'string_escape.py', 'hz.py', 'cp037.py', 'cp737.py', 'iso8859_5.pyc', 'iso8859_13.pyc', 'cp861.pyc', 'cp862.py', 'iso8859_9.pyc', 'cp949.py', 'base64_codec.pyc', 'koi8_r.py', 'iso8859_2.py', 'ptcp154.pyc', 'uu_codec.pyc', 'mac_croatian.pyc', 'charmap.pyc', 'iso8859_15.pyc', 'euc_jp.py', 'cp1250.py', 'iso8859_10.pyc', 'koi8_r.pyc', 'unicode_escape.pyc', 'cp863.pyc', 'iso8859_4.pyc', 'cp852.py', 'unicode_internal.py', 'big5hkscs.py', 'cp1257.pyc', 'cp1254.py', 'shift_jisx0213.py', 'shift_jis.pyc', 'cp869.pyc', 'hp_roman8.py', 'iso8859_4.py', 'cp775.py', 'cp1251.py', 'mac_cyrillic.pyc', 'mac_greek.pyc', 'mac_roman.pyc', 'iso8859_11.pyc', 'iso8859_6.py', 'utf_8_sig.py', 'iso8859_3.py', 'iso2022_jp_1.py', 'ascii.py', 'cp1026.pyc', 'cp1250.pyc', 'cp950.py', 'raw_unicode_escape.py', 'euc_jis_2004.pyc', 'cp775.pyc', 'euc_kr.py', 'mac
_greek.py', 'big5.pyc', 'shift_jis_2004.pyc', 'gbk.pyc', 'cp1254.pyc', 'cp1255.pyc', 'cp855.pyc', 'string_escape.pyc', 'cp949.pyc', 'cp1258.pyc', 'iso8859_3.pyc', 'mac_iceland.pyc', 'cp1251.pyc', 'cp860.py', 'cp856.py', 'cp874.py', 'iso2022_kr.py', 'cp856.pyc', 'rot_13.py', 'palmos.py', 'iso2022_jp_2.pyc', 'mac_farsi.py', 'koi8_u.pyc', 'cp1256.py', 'iso8859_10.py', 'tis_620.py', 'iso8859_14.pyc', 'cp1253.py', 'cp1258.py', 'cp437.py', 'cp862.pyc', 'mac_turkish.py', 'undefined.py', 'euc_kr.pyc', 'gb18030.pyc', 'aliases.pyc', 'iso8859_9.py', 'uu_codec.py', 'gbk.py', 'quopri_codec.pyc', 'iso8859_7.py', 'mac_iceland.py', 'iso8859_2.pyc', 'euc_jis_2004.py', 'iso2022_jp_3.pyc', 'cp874.pyc', '__init__.py', 'mac_roman.py', 'iso8859_16.py', 'cp866.py', 'unicode_internal.pyc', 'mac_turkish.pyc', 'johab.pyc', 'cp037.pyc', 'punycode.py', 'cp1253.pyc', 'euc_jisx0213.pyc', 'iso2022_jp_2004.pyc', 'iso2022_kr.pyc', 'zlib_codec.pyc', 'cp932.py', 'cp1255.py', 'iso2022_jp_1.pyc', 'cp857.pyc', 'cp424.pyc',
 'iso2022_jp_2.py', 'iso2022_jp.pyc', 'mbcs.py', 'utf_8.py', 'palmos.pyc', 'cp1252.pyc', 'aliases.py', 'quopri_codec.py', 'latin_1.pyc', 'iso2022_jp.py', 'zlib_codec.py', 'cp1026.py', 'cp860.pyc', 'cp1252.py', 'hex_codec.pyc', 'iso8859_1.pyc', 'cp850.pyc', 'cp861.py', 'iso8859_15.py', 'cp865.pyc', 'hp_roman8.pyc', 'iso8859_7.pyc', 'mac_latin2.py', 'iso8859_11.py', 'mac_centeuro.pyc', 'iso8859_6.pyc', 'ascii.pyc', 'mac_centeuro.py', 'iso2022_jp_3.py', 'bz2_codec.py', 'mac_arabic.pyc', 'euc_jisx0213.py', 'tis_620.pyc', 'shift_jis_2004.py', 'utf_8.pyc', 'cp855.py', 'mac_romanian.pyc', 'iso8859_8.py', 'cp869.py', 'ptcp154.py', 'utf_16_be.py', 'iso2022_jp_ext.pyc', 'bz2_codec.pyc', 'base64_codec.py', 'latin_1.py', 'charmap.py', 'hz.pyc', 'cp950.pyc', 'cp875.pyc', 'cp1006.pyc', 'utf_16.py', 'shift_jisx0213.pyc', 'cp424.py', 'cp932.pyc', 'iso8859_5.py', 'mac_romanian.py', 'utf_8_sig.pyc', 'iso8859_1.py', 'cp875.py', 'cp437.pyc', 'cp865.py', 'utf_7.py', 'utf_16_be.pyc', 'rot_13.pyc', 'euc_jp.p
yc', 'raw_unicode_escape.pyc', 'iso8859_8.pyc', 'utf_16.pyc', 'iso8859_14.py', 'iso8859_16.pyc', 'cp852.pyc', 'cp737.pyc', 'mac_croatian.py', 'mac_latin2.pyc', 'iso2022_jp_ext.py', 'cp1140.py', 'mac_cyrillic.py', 'cp1257.py', 'cp500.py', 'cp1140.pyc', 'shift_jis.py', 'unicode_escape.py', 'cp864.py', 'cp864.pyc', 'cp857.py', 'hex_codec.py', 'mac_farsi.pyc', 'idna.py', 'johab.py', 'utf_7.pyc', 'cp863.py', 'iso8859_13.py', 'koi8_u.py', 'gb2312.pyc', 'cp1256.pyc', 'cp866.pyc', 'iso2022_jp_2004.py', 'utf_16_le.pyc', 'gb2312.py', 'cp1006.py', 'big5.py']

da aber jeder einen Codec registrieren kann, ist die Liste nicht erschöpfend.

4
Anurag Uniyal

Der Python-Quellcode hat ein Skript unter Tools/unicode/listcodecs.py, das alle Codecs auflistet.

Unter den aufgeführten Codecs gibt es jedoch einige, die keine Unicode-zu-Byte-Konverter sind, wie base64_codec, quopri_codec und bz2_codec, wie @John Machin darauf hinweist.

2
Luciano Ramalho

Wahrscheinlich können Sie das tun:

from encodings.aliases import aliases
print aliases.keys()
0
fjarri

Es gibt eine programmatische Methode zum Auflisten aller im stdlib-Kodierungspaket definierten Kodierungen. Beachten Sie, dass hier keine benutzerdefinierten Kodierungen aufgeführt werden. Dies kombiniert einige der Tricks in den anderen Antworten, erzeugt jedoch tatsächlich eine Arbeitsliste unter Verwendung des kanonischen Namens des Codecs.

import encodings
import pkgutil
import pprint


all_encodings = set()

for _, modname, _ in pkgutil.iter_modules(
        encodings.__path__, encodings.__+ '.',
):
    try:
        mod = __import__(modname, fromlist=[str('__trash')])
    except (ImportError, LookupError):
        # A few encodings are platform specific: mcbs, cp65001
        # print('skip {}'.format(modname))
        pass

    try:
        all_encodings.add(mod.getregentry().name)
    except AttributeError as e:
        # the `aliases` module doensn't actually provide a codec
        # print('skip {}'.format(modname))
        if 'regentry' not in str(e):
            raise

pprint.pprint(sorted(all_encodings))
0
Anthony Sottile