Wednesday, April 07, 2010

Chinese character frequencies

After a long time of somewhat naïvely trying to learn Chinese by adding production flashcards for new words, I realised the task was far too difficult and time-consuming. (On a production card, the front side is an English term, with hints to avoid guessing an answer that is correct but not the one on the back, and the back side is the Chinese characters and phonetic pinyin.) For each of those cards, I'd write the characters on a graphics tablet and speak them, then flip the card and fail it if I made any mistakes in either the writing or the speech. This was needlessly laborious, since there was so much redundancy and so much opportunity to make small mistakes even when most of the answer was correct: writing out 印制电路板 (printed circuit board) many times was extremely tedious and unproductive.

So some reading on Glowing Face Man's blog led me to switch things around a bit, changing my deck so that the only characters I would write (production) were single characters. There are still a great many of those (over 20,000!), but the most common 3,000 account for over 99% of what you'll see in actual modern Chinese. All the other cards changed to recognition, where the front side is the Chinese characters and the back side (what I speak out loud before flipping the card) is phonetic pinyin and a (sometimes rough) English translation. Rather than mess about with Anki's deck format or exporting/modifying/importing, I wrote a dodgy AppleScript program to automate moving through the deck interface and sending keystrokes to cut, paste and rearrange the text... even crappy automation can be better than changing 2,500 cards manually. In fact, it would still be better even if it took the same amount of time, because of the sense of reward it gives.

This has helped immensely, reducing the pain and greatly increasing throughput and efficiency. However, learning the characters still takes time - my current plan is to go through the 3,000 most common ones and learn them as production cards before carrying on with sentence recognition cards.

But why 3,000 characters? Why not half or twice that? And which ones?

That's answered here - a computer program can quickly go through a huge corpus of text and produce a sorted listing of characters by frequency. Predictably, the first couple of hundred characters account for a huge fraction of written Chinese: 200 characters will get you 55% understanding (that's "most" Chinese already, heh), 400 will get you 70%, and so on. (Of course, when I say "understanding", I'm ignoring the fact that you need to learn the grammar, idioms and so on, and which of many possible meanings a character will take on in different contexts.)
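To make the idea concrete, here is a minimal Python sketch of that kind of program. It is not the program behind the linked list: the function names, the tiny sample text, and the decision to count only characters in the basic CJK Unified Ideographs block (U+4E00 to U+9FFF) are all my own assumptions for illustration.

```python
from collections import Counter


def is_cjk(ch):
    # Assumption: only the basic CJK Unified Ideographs block counts as a
    # "Chinese character"; punctuation, Latin text and digits are skipped.
    return "\u4e00" <= ch <= "\u9fff"


def frequency_list(text):
    """Return (character, count) pairs, most frequent first."""
    counts = Counter(ch for ch in text if is_cjk(ch))
    return counts.most_common()


def coverage(freqs, n):
    """Fraction of all counted characters covered by the n most common ones."""
    total = sum(count for _, count in freqs)
    if total == 0:
        return 0.0
    return sum(count for _, count in freqs[:n]) / total


# A toy stand-in for a real corpus, which would be millions of characters.
sample = "我是学生。我的朋友也是学生,我们学中文。"
freqs = frequency_list(sample)

for n in (2, 4, len(freqs)):
    print(f"top {n:2d} characters cover {coverage(freqs, n):.0%} of the sample")
```

On a real corpus you would read the text from files rather than a string, and the coverage figures for the top 200, 400 and so on are what produce the percentages quoted above.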

A quick plot of the numbers provided produces a roughly logarithmic shape, showing diminishing returns (given the roughly constant time required to learn characters):



So it looks like the payoff is small by the time you're hitting around 2,500 characters (98.5%), but it would be nice to be able to say that fewer than 1% of the characters in written Chinese are unfamiliar once you hit 3,000 (99.2%). After that, more unfamiliar characters only need to be added to the deck as you encounter them during reading, which should happen less and less often.
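The diminishing returns can be put into rough numbers using only the four figures quoted in this post. The sketch below computes the average coverage gained per 100 characters learned between consecutive data points; the starting point of 0 characters at 0% and the name `marginal_gains` are assumptions of mine, and the full frequency list would give a much finer-grained picture.

```python
# (characters known, % of text covered), as quoted in this post.
points = [(200, 55.0), (400, 70.0), (2500, 98.5), (3000, 99.2)]


def marginal_gains(points):
    """Average coverage gain (percentage points) per 100 characters learned,
    for each interval between consecutive data points."""
    gains = []
    prev_n, prev_cov = 0, 0.0  # assumption: knowing nothing covers nothing
    for n, cov in points:
        per_100 = (cov - prev_cov) / (n - prev_n) * 100
        gains.append((prev_n, n, per_100))
        prev_n, prev_cov = n, cov
    return gains


for start, end, per_100 in marginal_gains(points):
    print(f"{start:4d} to {end:4d}: {per_100:5.2f} points per 100 characters")
```

By this measure the first 200 characters are worth about 27.5 percentage points per 100 characters learned, while the stretch from 2,500 to 3,000 is worth about 0.14.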
