How Many Chinese Characters Do You Actually Need?

Around the turn of the 20th century, Italian economist Vilfredo Pareto is said to have noticed that 20 per cent of the pea pods in his garden produced 80 per cent of his peas. This, along with his observation that 20 per cent of landholders in Italy held 80 per cent of the land, suggested a general pattern that would eventually be called the Pareto principle: about 20 per cent of a population often seems to produce about 80 per cent of the results.

If Pareto had been a student of Chinese, he might have discovered the same thing.

The most common answer to the question of how many Chinese characters a person needs in order to read Chinese is about 4,000. This is not a bad answer, all told, but it can give the badly mistaken impression that all 4,000 of these characters are equally important. That is not at all the case.

According to data from Jun Da at Middle Tennessee State University, the share of an average text that a reader will recognize, depending on how many characters the reader knows, looks something like this.

Knowing 3,000 characters will allow a reader to recognize about 99.2 per cent of the characters in a text, and 5,000 will cover 99.9 per cent. The most frequent 100 characters alone, however, make up an enormous 42 per cent of the total. To recognize half of all the characters in an average text, a reader needs only a paltry 152 characters.
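For readers who want to run this kind of tally on their own texts, coverage figures like these can be approximated with a short script. This is only a sketch: the function names coverage_curve and chars_needed are made up for illustration, the sample string is a toy rather than anything from Jun Da’s corpus, and the script counts every non-whitespace character, punctuation included.

```python
from collections import Counter
from itertools import accumulate


def coverage_curve(text):
    """Entry k-1 is the share of the text covered by its k most frequent characters."""
    counts = Counter(ch for ch in text if not ch.isspace())
    total = sum(counts.values())
    ranked = sorted(counts.values(), reverse=True)  # most frequent first
    return [running / total for running in accumulate(ranked)]


def chars_needed(curve, target):
    """Smallest number of top-ranked characters whose coverage reaches `target`."""
    for k, share in enumerate(curve, start=1):
        if share >= target:
            return k
    return None  # the target is above 100 per cent, so it is never reached


# Toy sample (an assumption for illustration, not real corpus data).
sample = "的的的的是是是我我人"
curve = coverage_curve(sample)
half = chars_needed(curve, 0.5)
print(curve, half)
```

Run against a large corpus instead of the toy string, the same two functions would produce the kind of numbers quoted above, such as how many characters are needed to reach 50 per cent.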

To put it another way, the most frequent character in the Chinese language, 的, appears approximately once every 24 characters. The 3,000th most common character, 忖, on the other hand, appears only once every 100,000 characters (which means a medium-sized novel would, on average, use it only about five times). The least common character in Da’s study, 鴒, in 9,933rd place, will come in handy once every 200 million characters.
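The arithmetic behind these comparisons is simple enough to check by hand, but a few lines of code make it easy to try other text lengths. The sketch below assumes a novel of 500,000 characters, which is the length implied by "about five times" at one appearance per 100,000 characters; the function name is invented for illustration.

```python
def expected_occurrences(text_length, interval):
    """Average appearances of a character that shows up once every `interval` characters."""
    return text_length / interval


# Assumption: a "medium-sized novel" of 500,000 characters.
NOVEL_LENGTH = 500_000

# Intervals taken from the figures quoted in the text.
for char, interval in [("的", 24), ("忖", 100_000), ("鴒", 200_000_000)]:
    print(char, expected_occurrences(NOVEL_LENGTH, interval))

# How many such novels pass, on average, between two appearances of 鴒.
novels_per_sighting = 200_000_000 / NOVEL_LENGTH
print(novels_per_sighting)
```

Under that assumption, a reader would get through roughly 400 such novels, on average, before meeting 鴒 once.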

A mere 3,000 characters, however, will allow a reader an average comprehension of about 99 per cent, enough to be considered basically literate. And sure enough, as Pareto might have predicted, 20 per cent of that literacy requirement, or 600 characters, provides 79.6 per cent comprehension.

There are, of course, some inaccuracies in this model. For one, Jun Da’s data is old (2004), and since then Chinese vocabulary, especially on the internet, has probably become more concentrated around commonly used characters. Also, most actual Chinese learners do not start with the most frequent characters and work outwards. The character for “eggplant,” for example, is 3,136th on the list, but I learned it very early because eggplants are delicious.

That means that if you test yourself on comprehension and compare your result to Jun Da’s numbers, you will likely underestimate the number of characters you know. Likewise, if you have counted how many characters you know, your actual comprehension will likely be lower than Jun Da’s numbers suggest (unless you learned them all exactly in order).
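The gap between the two estimates can be illustrated with a small sketch. The frequency counts below are invented for the example and are not Jun Da’s figures, and the function names are hypothetical; the point is only that a learner who knows three characters out of order covers less text than one who knows the three most frequent.

```python
def coverage(known, counts):
    """Share of all character occurrences that fall on characters in `known`."""
    total = sum(counts.values())
    return sum(n for ch, n in counts.items() if ch in known) / total


def top_n(counts, n):
    """The n most frequent characters, i.e. what a strictly in-order learner would know."""
    return set(sorted(counts, key=counts.get, reverse=True)[:n])


# Invented toy counts per 100 characters of text (illustration only).
toy_counts = {"的": 40, "是": 25, "我": 20, "人": 10, "茄": 5}

# A learner who picked up the eggplant character 茄 early instead of 我.
learned = {"的", "是", "茄"}

actual = coverage(learned, toy_counts)
in_order = coverage(top_n(toy_counts, len(learned)), toy_counts)
print(actual, in_order)
```

Both learners know three characters, but the out-of-order learner covers less of the text, which is the effect described above.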