🧩 From Bytes to Characters: What Really Lies Inside a File

📖 Overview

Every file on your computer is ultimately just a sequence of 0s and 1s stored on disk. But the terms “binary file” and “text file” kept confusing me for a long time — and I suspect I’m not alone. It wasn’t until I started digging into what actually happens at the byte level that the confusion finally cleared up. In this blog I’m sharing exactly what I found during that exploration. If you’ve had the same confusion, I hope this helps.

🧱 Everything is Bytes

When you save a file — whether it’s a text document, an image, or a music file — your computer doesn’t store words, colors, or sounds; it stores bytes. A file is nothing more than a sequence of bytes sitting on disk. The operating system doesn’t care what’s inside. It just writes bytes when you save and reads bytes when you open.

So where does the confusion between “text file” and “binary file” come from?

The truth is — both are bytes on disk. There is no difference at the storage level. The difference is only in how you interpret those bytes when reading them. A text file is bytes that are meant to be interpreted as readable characters. A binary file is bytes that are meant to be interpreted as something else — like an image, a sound, or a compiled program. But at the end of the day, it’s all just bytes.

Let’s see this in action. If you open a text file in binary mode, you’ll see the raw bytes. For example, if you have a text file containing the word “Hello” and you open it in binary mode, you’ll see something like this:

👀 Seeing it in Action

# Create a file
with open("hello.txt", "w") as f:
    f.write("Hello!")

print("** Reading the file - `hello.txt` in text mode **")
# Read the file in text mode
with open("hello.txt", "r") as f:
    data = f.read()
    print(f"Text data: {data}")
    print(f"Type of data: {type(data)}")
    print(f"Length of data: {len(data)}")
    print()

# Read the file in binary mode
print("** Reading the file - `hello.txt` in binary mode **")
with open("hello.txt", "rb") as f:
    data = f.read()
    print(f"Binary data: {data}")
    print(f"Type of data: {type(data)}")
    print(f"Length of data: {len(data)}")
    print()

print("** Raw bytes on disk **")
for index, byte in enumerate(data):
    print(f"  byte[{index}] : {byte:08b}  →  0x{byte:02X}  →  {byte:>3} →  '{chr(byte)}'")

Output:

** Reading the file - `hello.txt` in text mode **
Text data: Hello!
Type of data: <class 'str'>
Length of data: 6

** Reading the file - `hello.txt` in binary mode **
Binary data: b'Hello!'
Type of data: <class 'bytes'>
Length of data: 6

** Raw bytes on disk **
  byte[0] : 01001000  →  0x48  →   72 →  'H'
  byte[1] : 01100101  →  0x65  →  101 →  'e'
  byte[2] : 01101100  →  0x6C  →  108 →  'l'
  byte[3] : 01101100  →  0x6C  →  108 →  'l'
  byte[4] : 01101111  →  0x6F  →  111 →  'o'
  byte[5] : 00100001  →  0x21  →   33 →  '!'

Output explained:

The output above shows the same file being read in two different modes — text mode and binary mode — with no change to the file on disk.

  • In text mode, Python reads the bytes and decodes them into a human readable string “Hello!”.
  • In binary mode, Python reads the raw bytes and displays them as a bytes object b'Hello!'.

The type of data is different in each case — in text mode it’s a string (str), while in binary mode it’s a bytes object (bytes). The length of the data is the same in both cases (6), because there are 6 bytes on disk.

When we look at each byte individually, we see something interesting — H is stored as 72, e as 101, l as 108, and so on. These are just numbers. The file is just a list of numbers. This immediately raises the question: who decided that 72 means H? How does your computer know that byte 72 should be displayed as the letter H and not something else? That’s where encoding comes in.

🔤 Character Encoding

Character encoding is the agreed-upon mapping between numbers and characters. When you type the letter A, your computer doesn’t store the letter — it stores a number.
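You can see this mapping directly in Python: ord() converts a character to its number, and chr() converts the number back.

```python
# ord() maps a character to its numeric code point; chr() does the reverse
print(ord("A"))  # 65 — the number actually stored for 'A'
print(chr(65))   # 'A' — the character that number 65 maps back to
```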

🔡 ASCII

The first widely adopted encoding was ASCII. It defined a mapping for 128 characters, including:

  • English letters (A-Z, a-z)
  • Digits (0-9)
  • Some special symbols (like punctuation marks and control characters)

Examples:

Character   Decimal Code   Hex Code   Binary Code
A           65             0x41       01000001
a           97             0x61       01100001
0           48             0x30       00110000
!           33             0x21       00100001
.           46             0x2E       00101110

Note: Notice that the first bit is always 0 in ASCII, which is why it only supports 128 characters (2^7 = 128). This 7-bit design is also what makes ASCII backward-compatible with UTF-8.
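That compatibility is easy to verify: for pure ASCII text, encoding with ASCII and encoding with UTF-8 produce byte-for-byte identical output.

```python
text = "Hello"
ascii_bytes = text.encode("ascii")  # 7-bit ASCII encoding
utf8_bytes = text.encode("utf-8")   # UTF-8 encoding of the same text
print(ascii_bytes == utf8_bytes)    # True — identical bytes for ASCII-only text
```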

text = "Hello"

print("=== ASCII values ===")
for char in text:
    code = ord(char)  # ord() gives the numeric code for a character
    print(f"  '{char}'  →  decimal: {code}  →  hex: 0x{code:02X}  →  binary: {code:08b}")

Output:

=== ASCII values ===
  'H'  →  decimal: 72   →  hex: 0x48  →  binary: 01001000
  'e'  →  decimal: 101  →  hex: 0x65  →  binary: 01100101
  'l'  →  decimal: 108  →  hex: 0x6C  →  binary: 01101100
  'l'  →  decimal: 108  →  hex: 0x6C  →  binary: 01101100
  'o'  →  decimal: 111  →  hex: 0x6F  →  binary: 01101111

ASCII worked beautifully for English, but it fell apart the moment you needed to write in other languages or use special symbols. For example, how would you represent the lambda symbol (λ) or letters from non-English alphabets?

To partially solve this problem, various extended versions of ASCII (ISO-8859-1, Windows-1252, etc.) were created, but they were not standardized and were often incompatible with each other. This led to a lot of confusion and data corruption when sharing files across different systems and languages.
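To make this incompatibility concrete, here is a small sketch: the very same byte, 0x80, decodes to completely different characters depending on which extended encoding you assume.

```python
raw = b"\x80"  # one byte, above the 7-bit ASCII range

# Windows-1252 maps 0x80 to the euro sign
print(raw.decode("cp1252"))   # '€'

# ISO-8859-1 (latin-1) maps 0x80 to an invisible control character (U+0080)
print(repr(raw.decode("latin-1")))  # '\x80'
```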

What we needed was a universal encoding that could represent every character in every language — and that’s where Unicode and UTF-8 come in.

🌍 UTF-8

Unicode is a standard that assigns every character a unique number called a code point. For example, the letter A is U+0041 and the Greek letter lambda (λ) is U+03BB. Unicode just says “this character has this number” — it doesn’t say how to store that number as bytes on disk. That is where UTF-8 comes in.

UTF-8 is an encoding standard that defines how to convert those Unicode code points into bytes.

  • Variable-length encoding: UTF-8 uses a variable number of bytes to represent different characters. The number of bytes depends on the code point range of the character — see the table below.
  • Byte format: UTF-8 uses prefix bits to indicate how many bytes make up a character.
    • The first byte’s prefix bits (0, 110, 1110, 11110) indicate how many bytes (1, 2, 3, 4) are used for the character.
    • Continuation bytes (if any) always start with 10 to indicate they are part of the same character.

Below is a table showing how many bytes UTF-8 uses to encode different ranges of Unicode code points:

Code Point Range      Number of Bytes   Byte Format
U+0000 to U+007F      1 byte            0xxxxxxx
U+0080 to U+07FF      2 bytes           110xxxxx 10xxxxxx
U+0800 to U+FFFF      3 bytes           1110xxxx 10xxxxxx 10xxxxxx
U+10000 to U+10FFFF   4 bytes           11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
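We can check the table against Python’s own encoder: picking one character from each range, encode() should produce 1, 2, 3, and 4 bytes respectively.

```python
# One character from each code point range in the table above
for char in ["A", "λ", "€", "😀"]:  # U+0041, U+03BB, U+20AC, U+1F600
    encoded = char.encode("utf-8")
    print(f"U+{ord(char):04X} → {len(encoded)} byte(s): {encoded.hex().upper()}")
```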

🔬 Encoding Example: λ (U+03BB)

Let’s walk through UTF-8 encoding with an example.

  • Character: lambda (λ)
  • Unicode code point: U+03BB (Hexadecimal: 03BB)
  • Decimal code point: 955 (3BB in hexadecimal, calculated as 3*16^2 + 11*16^1 + 11*16^0)
  • Binary value of code point: 0000 0011 1011 1011

Following the UTF-8 encoding rules:

  • Since U+03BB falls in the range U+0080 to U+07FF, it requires 2 bytes for UTF-8 encoding.
  • Follow Byte Format for 2 bytes:
    • First byte will have the format 110xxxxx
    • Second byte will have the format 10xxxxxx
  • To encode U+03BB in UTF-8:
    1. Convert the code point to binary: 0000 0011 1011 1011
    2. UTF-8 encoding for 2 bytes reserves:
      1. 110 as the prefix for the first byte, leaving 5 bits for the code point.
      2. 10 as the prefix for the second byte, leaving 6 bits for the code point.
      3. Total bits available for the code point in UTF-8 encoding: 5 (from first byte) + 6 (from second byte) = 11 bits.
      4. Take the 11 least significant bits of the code point: 0000 0011 1011 1011 → 011 1011 1011
    3. Split the 11 bits into two groups and add the prefixes:
      1. First byte: 110 + first 5 bits (01110) → 1100 1110 (0xCE)
      2. Second byte: 10 + next 6 bits (111011) → 1011 1011 (0xBB)
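The steps above can also be done by hand with bit operations. Here is a minimal sketch covering just the 2-byte case, checked against Python’s built-in encoder:

```python
code_point = 0x03BB  # 955; binary 011 1011 1011 (11 significant bits)

# First byte: 110 prefix, then the top 5 of the 11 bits
first = 0b11000000 | (code_point >> 6)     # → 0xCE

# Second byte: 10 prefix, then the low 6 bits
second = 0b10000000 | (code_point & 0x3F)  # → 0xBB

manual = bytes([first, second])
print(manual == "λ".encode("utf-8"))       # True — matches Python's encoder
```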
# Encoding the character 'λ' (U+03BB) in UTF-8
char = 'λ'
code_point = ord(char)  # Get the Unicode code point (decimal)
utf8_bytes = char.encode('utf-8')  # Encode the character in UTF-8
print(f"Character: '{char}'")
print(f"Unicode code point: U+{code_point:04X} (Decimal: {code_point})")
print(f"UTF-8 bytes: {utf8_bytes} (Hex: {utf8_bytes.hex().upper()})")

# Decoding the UTF-8 bytes back to character
code_bytes = b"\xCE\xBB" 
decoded_char = code_bytes.decode('utf-8')
print(f"Decoded character: '{decoded_char}'")

Output:

Character: 'λ'
Unicode code point: U+03BB (Decimal: 955)
UTF-8 bytes: b'\xce\xbb' (Hex: CEBB)
Decoded character: 'λ'

In this example, we see how the character ‘λ’ is represented in UTF-8 as the byte sequence b'\xce\xbb' (hexadecimal CEBB). When we decode those bytes back to a character, we get ‘λ’ again, demonstrating the UTF-8 encoding and decoding process.

⚠️ What Happens with Wrong Encoding?

If you try to decode those bytes using the wrong encoding (like ASCII), you’ll get an error or garbled characters, because ASCII doesn’t know how to interpret those byte values as valid characters.

# Encode the character 'λ' to UTF-8 bytes
char = "λ"
utf8_bytes = char.encode("utf-8")  # UTF-8 encoding of 'λ'
print(f"UTF-8 bytes: {utf8_bytes} (Hex: {utf8_bytes.hex().upper()})")

# Decoding the UTF-8 bytes back to character using ASCII (which will fail)
code_bytes = b"\xce\xbb"
try:
    decoded_char = code_bytes.decode("ascii")
    print(f"Decoded character using ASCII: '{decoded_char}'")
except UnicodeDecodeError as e:
    print(f"Error decoding with ASCII: {e}")

Output:

UTF-8 bytes: b'\xce\xbb' (Hex: CEBB)
Error decoding with ASCII: 'ascii' codec can't decode byte 0xce in position 0: ordinal not in range(128)
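An error is actually the friendlier failure mode. If you decode with an encoding that happens to accept every byte value — latin-1 is one such encoding — the decode succeeds silently and you get garbled text (mojibake) instead:

```python
code_bytes = b"\xce\xbb"  # the UTF-8 bytes for 'λ'

# latin-1 maps every byte to a character, so this "works" — but gives
# the wrong result: 0xCE → 'Î' and 0xBB → '»'
print(code_bytes.decode("latin-1"))  # 'Î»'
```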

✅ Conclusion

In summary, the distinction between “binary file” and “text file” is not about how data is stored on disk — it’s about how we interpret that data when we read it. Both are just bytes on disk. The difference lies in the encoding used to represent characters as bytes. ASCII was the first widely adopted encoding, but it was limited to 128 characters. UTF-8 is a universal encoding that can represent every character in every language, using a variable-length encoding scheme. Understanding these concepts helps explain how programs read and write text, and why you sometimes see garbled characters when the wrong encoding is used.
