Basics of Encoding and Decoding
What is Unicode?
Unicode assigns a unique number (a code point) to every character, irrespective of the spoken language (Japanese, English, French, etc.) it comes from or the programming language (Python, Java, etc.) it is used in.
What is the purpose of Unicode?
There are countless languages in the world. Some use the Latin writing system (English, French, Spanish, etc.), and many others, including most Asian languages, use non-Latin scripts. Unicode code points are unique numerical representations for each character of the world's major known languages.
- The uniqueness of the code points helps in transmitting information over digital channels
What are encoding and decoding?
- Computers transmit information in bytes. Encoding is the process of converting Unicode code points to bytes
- Decoding is the process of converting bytes back to Unicode code points so that humans can read them
What is the Unicode Character Set (UCS)?
- For all major languages in the world, every unique character is assigned a unique value or “code point”. This set of unique values, which also covers emojis and other symbols, is the Unicode Character Set. Unicode includes characters from Latin, Greek, Cyrillic, Arabic, Hebrew, Chinese, Japanese, Korean, and many other scripts.
- Code points are typically written in hexadecimal, such as U+0041 for the Latin capital letter “A” or U+30A2 for the Japanese katakana character “ア”.
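The mapping between characters and code points can be checked from Python itself. The following is a minimal sketch using only the built-in `ord` and `chr` functions and the standard `unicodedata` module; the variable names are illustrative.

```python
import unicodedata

# ord() maps a one-character string to its code point (an int);
# chr() maps a code point back to the character.
code_point_a = ord("A")
katakana_a = chr(0x30A2)

print(f"U+{code_point_a:04X}")        # code point of "A" in U+XXXX notation
print(katakana_a)                     # the character stored at U+30A2
print(unicodedata.name(katakana_a))   # the official Unicode name of that character
```

Note that nothing here involves bytes yet: code points are abstract numbers, and an encoding is needed only when they have to be stored or transmitted.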
What are some of the commonly used encoding techniques?
| Encoding Type | Full Description | Number of bits | Where used / supported character set |
|---|---|---|---|
| ASCII | American Standard Code for Information Interchange | 7 bits | English text; supports basic Latin letters, digits and punctuation marks |
| UTF-8 | Unicode Transformation Format, 8-bit | variable-length, 8 to 32 bits | Supports every Unicode character; 8 bits for all ASCII characters, up to 32 bits for others |
| UTF-16 | Unicode Transformation Format, 16-bit | variable-length, 16 or 32 bits | Commonly used for applications which require multi-language support |
| Latin-1 | ISO-8859-1 or Western European Encoding | 8 bits | Limited to Western European languages; does not cover the entire Unicode character set |
| UTF-32 | Unicode Transformation Format, 32-bit | fixed-length, 32 bits | Stores each code point directly as one 32-bit value; less commonly used; high storage cost |
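The bit widths in the table can be observed by encoding a few sample characters. This is a small sketch, not part of the original examples: the helper `encoded_length` and the sample characters are chosen here for illustration, and the `-le` variants of UTF-16 and UTF-32 are used so that Python does not prepend a byte order mark to the result.

```python
samples = ["A", "£", "ア", "😀"]
encodings = ["utf-8", "utf-16-le", "utf-32-le", "latin-1", "ascii"]

def encoded_length(char, encoding):
    """Return how many bytes `char` takes in `encoding`, or None if it cannot be encoded."""
    try:
        return len(char.encode(encoding))
    except UnicodeEncodeError:
        return None

# Build a {character: {encoding: byte count}} table and print one row per character.
sizes = {ch: {enc: encoded_length(ch, enc) for enc in encodings} for ch in samples}
for ch, row in sizes.items():
    print(ch, row)
```

A value of `None` means the encoding has no representation for that character, which is the case for “ア” in Latin-1 and for everything except “A” in ASCII.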
Encoding and Decoding Strings in Python
- In Python 3, all strings are Unicode strings by default
- To store or transmit a Unicode string, the computer first “encodes” it into a byte string
- By default, Python uses utf-8 encoding. You can also encode in utf-16
byte_string = "センティル・クマール".encode()
byte_string
b'\xe3\x82\xbb\xe3\x83\xb3\xe3\x83\x86\xe3\x82\xa3\xe3\x83\xab\xe3\x83\xbb\xe3\x82\xaf\xe3\x83\x9e\xe3\x83\xbc\xe3\x83\xab'
byte_string_utf16 = "センティル・クマール".encode('utf-16')
byte_string_utf16
b'\xff\xfe\xbb0\xf30\xc60\xa30\xeb0\xfb0\xaf0\xde0\xfc0\xeb0'
print(byte_string.decode())
print(byte_string_utf16.decode('utf-16'))
センティル・クマール
センティル・クマール
About Byte Strings in Python
Byte strings are used to represent binary data, such as images, audio files, or serialized objects. Binary data is not directly representable as text and needs to be stored and processed as a sequence of bytes.
>> type(byte_string)
bytes
- It is possible to create byte strings directly in Python using the prefix “b”
>> forced_byte_string = b"some_string"
>> type(forced_byte_string)
bytes
- It is NOT possible to write non-ASCII characters in a byte string literal
forced_byte_string = b"センティル・クマール"
SyntaxError: bytes can only contain ASCII literal characters.
- One example of using byte strings is when we serialize objects (such as Python objects) using the pickle module
import pickle
an_example_dict = {
    "English": "Senthil Kumar",
    "Japanese": "センティル・クマール",
    "Chinese": "森蒂尔·库马尔",
    "Korean": "센틸 쿠마르",
    "Arabic": "سينتيل كومار",
    "Urdu": "سینتھل کمار"
}
# pickle.dumps returns the serialized object as a byte string
serialized_data = pickle.dumps(an_example_dict)
print(type(serialized_data))
# the file is opened in binary mode ("wb") because we are writing bytes, not text
with open("serialized_dict.pkl", "wb") as file:
    file.write(serialized_data)
<class 'bytes'>
Encoding and Decoding Files in Python
Saving Text Files in ASCII and UTF Formats
- The code below throws no error, because the text contains only ASCII (English) characters
normal_text = 'Hot: Microsoft Surface Pro 4 Tablet Intel Core i7 8GB RAM 256GB.. now Pound 1079.00! #SpyPrice #Microsoft'
with open("saving_eng__only_text.txt","w",encoding="ascii") as f:
    f.write(normal_text)
- The code below throws an error, because the text contains the non-ASCII character “£”
non_ascii_text = 'Hot: Microsoft Surface Pro 4 Tablet Intel Core i7 8GB RAM 256GB.. now £1079.00! #SpyPrice #Microsoft'
with open("saving_a_latin_string.txt","w",encoding="ascii") as f:
    f.write(non_ascii_text)
---------------------------------------------------------------------------
UnicodeEncodeError                        Traceback (most recent call last)
Input In [21], in <cell line: 1>()
      1 with open("saving_a_latin_string.txt","w",encoding="ascii") as f:
----> 2     f.write(non_ascii_text)
UnicodeEncodeError: 'ascii' codec can't encode character '\xa3' in position 70: ordinal not in range(128)
- Changing the encoding to “utf-8” fixes the error
with open("saving_a_latin_string.txt","w",encoding="utf-8") as f:
    f.write(non_ascii_text)
Saving Non-ASCII JSON Files in different formats
- Saving a dict using json.dump, utf-8 encoding
- Saving the same dict as a json_string using json.dumps, utf-8 encoding
- Saving the same dict using json.dump, utf-16 encoding
import json
an_example_dict = {
    "English": "Senthil Kumar",
    "Japanese": "センティル・クマール",
    "Chinese": "森蒂尔·库马尔",
    "Korean": "센틸 쿠마르",
    "Arabic": "سينتيل كومار",
    "Urdu": "سینتھل کمار"
}
# ensure_ascii=False keeps non-ASCII characters as-is instead of \uXXXX escapes
with open("saving_the_names_dict_utf8.json","w",encoding="utf-8") as f:
    json.dump(an_example_dict, f, ensure_ascii=False)
an_example_dict_str = json.dumps(an_example_dict, ensure_ascii=False)
with open("saving_the_names_dict_utf8_using_json_string.json","w",encoding="utf-8") as f:
    f.write(an_example_dict_str)
with open("saving_the_names_dict_utf16.json","w",encoding="utf-16") as f:
    json.dump(an_example_dict, f, ensure_ascii=False)
- How to load the dict?
with open("saving_the_names_dict_utf8.json","r",encoding="utf-8") as f:
    loaded_dict = json.load(f)
print(loaded_dict)
{'English': 'Senthil Kumar', 'Japanese': 'センティル・クマール', 'Chinese': '森蒂尔·库马尔', 'Korean': '센틸 쿠마르', 'Arabic': 'سينتيل كومار', 'Urdu': 'سینتھل کمار'}
>> cat saving_the_names_dict_utf8.json
{"English": "Senthil Kumar", "Japanese": "センティル・クマール", "Chinese": "森蒂尔·库马尔", "Korean": "센틸 쿠마르", "Arabic": "سينتيل كومار", "Urdu": "سینتھل کمار"}
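The encoding passed to `open` when reading has to match the one used when writing. The sketch below illustrates this with a shortened version of the dictionary and a temporary directory; the file name `names_utf16.json` and the variable names are hypothetical and chosen only for this example.

```python
import json
import os
import tempfile

names = {"English": "Senthil Kumar", "Japanese": "センティル・クマール"}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "names_utf16.json")  # hypothetical file name
    with open(path, "w", encoding="utf-16") as f:
        json.dump(names, f, ensure_ascii=False)

    # Reading with the matching encoding round-trips the data.
    with open(path, "r", encoding="utf-16") as f:
        loaded = json.load(f)

    # Reading with a mismatched encoding fails: the UTF-16 byte order mark
    # at the start of the file is not a valid UTF-8 byte sequence.
    try:
        with open(path, "r", encoding="utf-8") as f:
            json.load(f)
        mismatch_error = None
    except UnicodeDecodeError as exc:
        mismatch_error = exc

print(loaded == names)
print(type(mismatch_error).__name__)
```

A mismatch does not always fail this loudly. Some wrong encodings decode without an error and silently produce garbled text, so it is safer to record the encoding alongside the file than to guess it.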
>> echo "the utf8 file size in bytes:" && wc -c saving_the_names_dict_utf8.json
>> echo "the utf8 file size in bytes:" && wc -c saving_the_names_dict_utf8_using_json_string.json
>> echo "the utf16 file size in bytes:" && wc -c saving_the_names_dict_utf16.json
the utf8 file size in bytes:
209 saving_the_names_dict_utf8.json
the utf8 file size in bytes:
209 saving_the_names_dict_utf8_using_json_string.json
the utf16 file size in bytes:
292 saving_the_names_dict_utf16.json
- In the example above, the utf16 file is larger in bytes than the utf8 file
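The same comparison can be made in memory, without writing files. The sketch below is an illustration with a hypothetical helper `encoded_sizes` and a shortened dictionary. It also suggests that the result depends on the mix of characters: ASCII-heavy text such as JSON keys and punctuation favours UTF-8, while text made up mostly of East Asian characters can come out smaller in UTF-16.

```python
import json

def encoded_sizes(text):
    """Return the UTF-8 and UTF-16 byte counts for `text`."""
    return {
        "utf-8": len(text.encode("utf-8")),
        "utf-16": len(text.encode("utf-16")),  # includes a 2-byte byte order mark
    }

# JSON text with ASCII keys and punctuation plus one Japanese value.
mixed = json.dumps(
    {"English": "Senthil Kumar", "Japanese": "センティル・クマール"},
    ensure_ascii=False,
)
# Text consisting only of Japanese characters.
japanese_only = "センティル・クマール"

print(encoded_sizes(mixed))
print(encoded_sizes(japanese_only))
```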
Conclusion
- Use utf8 everywhere | check more in the sources below
- UTF-8 can be used to encode anything that UTF-16 can, so most use cases can be met with utf-8.
- UTF-16 uses a minimum of 2 bytes (16 bits) per character and is therefore not compatible with 7-bit ASCII, whereas UTF-8 is backwards compatible with ASCII.
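The backwards compatibility mentioned above can be verified directly. In this minimal sketch (variable names are illustrative), pure ASCII text produces identical bytes under ASCII and UTF-8, while UTF-16 spends two bytes on every character. The `utf-16-le` variant is used so that no byte order mark is added.

```python
ascii_text = "Senthil Kumar"

ascii_bytes = ascii_text.encode("ascii")
utf8_bytes = ascii_text.encode("utf-8")
utf16_bytes = ascii_text.encode("utf-16-le")  # little-endian, no byte order mark

print(ascii_bytes == utf8_bytes)     # same bytes for pure ASCII text
print(utf8_bytes.decode("ascii"))    # an ASCII decoder can read UTF-8 output here
print(utf16_bytes[:4])               # every character is followed by a zero byte
```

This is why a file containing only ASCII characters is already a valid UTF-8 file, whereas the same text in UTF-16 cannot be read by an ASCII-only tool.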
Good Sources
- Why UTF-8 should be used?
- https://stackoverflow.com/a/18231475
- http://utf8everywhere.org/
- Other good resources
- Encoding-Decoding in Python 3 https://www.pythoncentral.io/encoding-and-decoding-strings-in-python-3-x/