Text: its importance on the internet goes without saying. It’s the first “T” in “HTTP”, the only “T” in “HTML”, and virtually every website uses it somehow, be it a URL, a piece of marketing copy, a product review, a viral Tweet, or a blog post. (Hi there!)
But web text might not actually be as simple as you think. Consider the thousands of languages spoken today, or all the punctuation and symbols we can add to enhance them, or the fact that new emojis are being created to capture every human emotion. How do websites store and process all of this?
The truth is, even something as basic as text requires a well-coordinated, clearly-defined system to appear in web browsers. In this post, I’ll explain the basics of one technology central to text on the web: UTF-8. We’ll cover how text is stored and encoded, and discuss how UTF-8 helps put engaging words across your site.
UTF-8 stands for “Unicode Transformation Format - 8 bits.” That’s not helpful to us yet, so let’s rewind to the basics.
Binary: How Computers Store Information
In order to store information, computers use a binary system. In binary, all data is represented in sequences of 1s and 0s. The most basic unit of binary is a bit, which is just a single 1 or 0. The next largest unit of binary, a byte, consists of 8 bits. An example of a byte is “01101011”.
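To make that concrete, here’s a minimal Python sketch (purely illustrative, not tied to any particular website) showing that the example byte above is just the number 107 written in base 2:

```python
# A byte is eight bits; int() can interpret a string of 1s and 0s as a base-2 number.
byte = "01101011"
value = int(byte, 2)         # read the bits as a base-2 integer
print(value)                 # 107
print(format(value, "08b"))  # back to "01101011", padded to 8 bits
```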
Every digital asset you’ve ever encountered — from software to mobile apps to websites to Instagram stories — is built on this system of bytes, which are strung together in a way that makes sense to computers. When we refer to file sizes, we’re referencing the number of bytes. For example, a kilobyte is roughly one thousand bytes, and a gigabyte is roughly one billion bytes.
Text is one of many assets that computers store and process. Text is made up of individual characters, each of which is represented in computers by a string of bits. These strings are assembled to form digital words, sentences, paragraphs, romance novels, and so on.
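As a small illustration (a Python sketch that assumes the text is stored as UTF-8, which we’ll get to shortly), the example byte above, 01101011, happens to be exactly how the letter “k” is stored:

```python
# Encode a character to the raw bytes a text file would hold,
# then print each byte as a string of eight bits.
for byte in "k".encode("utf-8"):
    print(format(byte, "08b"))  # 01101011
```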
ASCII: Converting Symbols to Binary
The American Standard Code for Information Interchange (ASCII) was an early standardized encoding system for text. Encoding is the process of converting characters in human languages into binary sequences that computers can process.
ASCII’s library includes every upper-case and lower-case letter in the Latin alphabet (A, B, C…), every digit from 0 to 9, and some common symbols (like /, !, and ?). It assigns each of these characters a unique number from 0 to 127 (often written as a three-digit code, like 065 for “A”), which fits comfortably in a single byte.
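If you want to peek at those codes yourself, Python’s built-in ord() and chr() functions expose them (a minimal sketch; the characters chosen are arbitrary):

```python
# ord() returns the numeric code behind a character; chr() goes the other way.
for ch in ["A", "a", "0", "?"]:
    print(ch, ord(ch))  # A 65, a 97, 0 48, ? 63
print(chr(65))          # A
```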
Unicode: A Way to Store Every Symbol, Ever
Enter Unicode, a system designed to solve ASCII’s space problem: with only 128 codes to work with, ASCII has no room for accented letters, non-Latin alphabets, or newer symbols. Like ASCII, Unicode assigns a unique code, called a code point, to each character. However, Unicode’s more sophisticated system can produce over a million code points, more than enough to account for every character in any language.
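Code points are conventionally written as “U+” followed by a hexadecimal number. Here’s a short Python sketch (the sample characters are arbitrary) that prints a few of them:

```python
# ord() returns a character's Unicode code point; format it the way
# the Unicode standard writes it, e.g. U+00E9.
for ch in ["A", "é", "中", "😀"]:
    print(ch, f"U+{ord(ch):04X}")
# A U+0041, é U+00E9, 中 U+4E2D, 😀 U+1F600
```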
Unicode is now the universal standard for encoding all human languages. And yes, it even includes emojis.
So, we now have a standardized way of representing every character used by every human language in a single library. This solves the issue of multiple labeling systems for different languages — any computer on Earth can use Unicode.
But Unicode alone doesn’t store words in binary. Computers need a way to translate Unicode code points into binary so that characters can be stored in text files. Here’s where UTF-8 comes in.
UTF-8: The Final Piece of the Puzzle
UTF-8 is an encoding system for Unicode. It can translate any Unicode character to a matching unique binary string, and can also translate the binary string back to a Unicode character. This is the meaning of “UTF”, or “Unicode Transformation Format.”
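In Python, for example, that round trip is available through the built-in encode() and decode() string methods (a quick illustration with an arbitrary example string, not a definition of the format):

```python
text = "Olá, 世界"                  # arbitrary example text
encoded = text.encode("utf-8")      # Unicode characters -> bytes
print(encoded)                      # b'Ol\xc3\xa1, \xe4\xb8\x96\xe7\x95\x8c'
decoded = encoded.decode("utf-8")   # bytes -> Unicode characters
print(decoded == text)              # True
```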
There are other encoding systems for Unicode besides UTF-8, but UTF-8 is unique because it represents characters in one-byte units. Remember that one byte consists of eight bits, hence the “-8” in its name.
More specifically, UTF-8 converts a code point (which represents a single character in Unicode) into a set of one to four bytes. The first 128 characters in the Unicode library, which match the characters we saw in ASCII, are represented as a single byte. Characters that appear later in the Unicode library are encoded as two-byte, three-byte, and eventually four-byte binary units.
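You can see that variable width directly by checking how many bytes different characters take up (a sketch; the specific characters are just convenient examples from different ranges of the library):

```python
# UTF-8 uses one to four bytes per character, depending on its code point.
for ch in ["A", "é", "中", "😀"]:
    print(ch, len(ch.encode("utf-8")), "byte(s)")
# A 1, é 2, 中 3, 😀 4
```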