Vague vagaries: Chinese


Tuesday, July 13, 2010

On a minor roll...

Been fighting off a slight backlog of flashcards in Anki, pretty much due to downloading two French card decks as a refresher before heading to a conference in France next month.

After about 25 minutes going through some of the French cards, I had to leave them and pay some attention to the Chinese deck I've been building for a couple of years now. The cards are about 25% single characters (production), 40% multiple-character words (recognition) and 35% sentences (recognition), after switching sometime in December/January from production-only words/characters to make things more practical.

For some reason, it went really well today and I was on a roll, correctly answering some 83 cards in a row. No new cards, but a few recently-added ones which weren't completely familiar. Usually I'm happy to get 80-90% right, depending on tiredness.

It seems most efficient (at least if you occasionally create backlogs by not reviewing all of the cards on the day they're due) to review cards in order of their interval (i.e. if I don't finish all my cards today, at least I'll have dealt with the ones that were most in need of revising - seeing a 1 month card two weeks late is better than seeing an 8 day card four days late), so I quit while I was ahead when asked to write the character for ginger (with an interval of around 2 months)*.
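As a rough sketch of that ordering (not Anki's actual scheduler; the Card class, the field names and the 60-day figure for "around 2 months" are all made up for illustration), sorting the due pile by interval means that whenever you stop, the short-interval cards have already been seen:

```python
from dataclasses import dataclass

@dataclass
class Card:
    front: str
    interval_days: int  # hypothetical field: current gap between reviews

def review_order(due_cards):
    """Return the due cards with the shortest interval first."""
    return sorted(due_cards, key=lambda card: card.interval_days)

# Made-up backlog; 60 days stands in for "around 2 months"
backlog = [
    Card("ginger", 60),
    Card("one-month card", 30),
    Card("eight-day card", 8),
]

for card in review_order(backlog):
    print(card.interval_days, card.front)
```

Stopping partway through this list leaves only the longest-interval cards for tomorrow.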

Minor arbitrary happy moment!

* I've been told that I should use fewer parenthesised clauses when writing. Makes sense... Actually thinking about it now, if I extracted the bits in parentheses from that sentence into footnotes, and then saw how silly the result looked, it might make refactoring the text more obvious. On the other hand, it's 2am and I'm supposed to have a meeting at 9:30am :/

Wednesday, April 07, 2010

Chinese character frequencies

After a long time of somewhat naïvely trying to learn Chinese by adding production flashcards for new words (where the front side is an English term with hints to avoid guessing an answer that was correct but not the one on the back side, and the back side is Chinese characters and phonetic pinyin), I realised the task was far too difficult and time-consuming. For each of those cards, I'd write the characters on a graphics tablet and speak them, then flip the card and fail it if I made any mistakes in either the writing or speech. This was needlessly laborious since there was so much redundancy and opportunity to make small mistakes even if most of the answer was correct (writing out 印制电路板 (printed circuit board) many times was extremely tedious and unproductive).

So some reading on Glowing Face Man's blog led me to switch things around a bit, changing my deck so that the only characters I would write (production) were single characters, of which there are still very very many (over 20,000!) but the most common 3,000 account for over 99% of what you'll see in actual modern Chinese. All the other cards changed to recognition, where the front side is the Chinese characters and the back side (what I speak out loud before flipping the card) is phonetic pinyin and a (sometimes rough) English translation. Rather than mess about with Anki's deck format or exporting/modifying/importing, I wrote a dodgy AppleScript program to automate moving through the deck interface and sending keystrokes to cut, paste and rearrange the text... even crappy automation can be better than changing 2,500 cards manually. In fact, it would still be better even if it took the same amount of time, because of the sense of reward that it spurs.
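For anyone who would rather go the export/modify/import route, the rule itself is simple. This is only a sketch of the rule, not the AppleScript I used and not Anki's file format; the dictionary keys and the layout function are invented for the example:

```python
def layout(card):
    """Return (front, back) text for one card.

    Single characters stay as production cards (English prompt, written
    answer); anything longer becomes a recognition card (Chinese prompt).
    """
    if len(card["chinese"]) == 1:
        return card["english"], f'{card["chinese"]} {card["pinyin"]}'
    return card["chinese"], f'{card["pinyin"]} / {card["english"]}'

# Two illustrative cards with hypothetical field names
cards = [
    {"chinese": "歌", "pinyin": "ge1", "english": "song"},
    {"chinese": "印制电路板", "pinyin": "yin4 zhi4 dian4 lu4 ban3",
     "english": "printed circuit board"},
]

for card in cards:
    front, back = layout(card)
    print(front, "->", back)
```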

This has helped immensely, reducing the pain and greatly increasing throughput and efficiency. However, learning the characters still takes time - my current plan is to go through the 3,000 most common ones and learn them as production cards before carrying on with sentence recognition cards.

But why 3,000 characters? Why not half or twice that? And which ones?

That's answered here - a computer program can quickly go through a huge corpus of text and produce a sorted listing of characters by frequency. Predictably, the first couple of hundred characters account for a huge fraction of written Chinese: 200 characters will get you 55% understanding (that's "most" Chinese already, heh), 400 will get you 70%, and so on. (Of course, when I say "understanding", I'm ignoring the fact that you need to learn the grammar, idioms and so on, and which of many possible meanings a character will take on in different contexts.)
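A minimal sketch of that kind of program, assuming the corpus is just a string and counting only characters in the basic CJK Unified Ideographs block (the one-sentence sample and the function name are mine, not from the linked listing):

```python
from collections import Counter

def coverage_table(text):
    """Return (character, count, cumulative share) rows, most frequent first."""
    # Keep only characters in the basic CJK Unified Ideographs block
    counts = Counter(ch for ch in text if "\u4e00" <= ch <= "\u9fff")
    total = sum(counts.values())
    rows, running = [], 0
    for ch, n in counts.most_common():
        running += n
        rows.append((ch, n, running / total))
    return rows

# Toy "corpus": a single sentence, far too small to say anything real
sample = "这首歌没有那首歌那么好听。"
for ch, n, share in coverage_table(sample):
    print(ch, n, f"{share:.0%}")
```

On a real corpus the last column is the "understanding" figure quoted above: read down to row 200 and see what share of the text has been covered so far.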

A quick plot of the numbers provided produces a roughly logarithmic shape, showing diminishing returns (given the roughly constant time required to learn characters):

So it looks like the payoff is small by the time you're hitting around 2,500 characters (98.5%), but it would be nice to be able to say that fewer than 1% of the characters in written Chinese are unknown to you when you hit 3,000 (99.2%), and after that to add unfamiliar characters to the deck only as you encounter them during reading, less and less often.
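To put a number on the diminishing returns, here is a small calculation using only the four figures quoted in this post (the function name is made up, and averaging over each stretch is a simplification):

```python
# (characters known, percent of text covered), as quoted in this post
points = [(200, 55.0), (400, 70.0), (2500, 98.5), (3000, 99.2)]

def gain_per_100(points):
    """Average percentage points gained per 100 extra characters, per stretch."""
    gains = []
    for (n0, c0), (n1, c1) in zip(points, points[1:]):
        gains.append((n0, n1, round((c1 - c0) / (n1 - n0) * 100, 2)))
    return gains

for start, end, gain in gain_per_100(points):
    print(f"{start}-{end}: {gain} points per 100 characters")
```

Between 200 and 400 characters each extra hundred buys about 7.5 points of coverage; between 2,500 and 3,000 it buys about 0.14.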

Monday, February 02, 2009

Chinese text input on OS X: ITABC vs FIT

After growing somewhat accustomed to (the disappointment of) the ITABC Pinyin Chinese input method that comes with OS X, I configured my Vista work box to add the MS Pinyin input method which I soon discovered was far superior to Apple's ITABC. Partly this is due to some crasher bugs in ITABC (which I've reported and never heard back about, so maybe it only happens on my Mac?) - for example, typing any string with "shish" in it will cause part of it to crash, so that SCIM must be manually killed to force the input method system to restart before Chinese can be typed again.
More importantly, however, it's just so much easier to type in the input method for Windows. I can type a full sentence and, often enough, the whole thing will be interpreted as I intended, or the small number of corrections can be elegantly entered without deleting intervening correct characters. The Apple ITABC method has a strange heuristic of trying to forcibly group pairs of characters at a time, even when two single characters are much more likely. This results in an erroneous offset which often propagates all the way through the sentence so that in practice one ends up correcting the input method every character or two (by hitting space and selecting the correct match) and/or accepting then going back and correcting input manually. Not only this, but some words like 儿 completely throw off the parser - if you type 'dianer' the result is '嗲呢日' (dia3 ne ri4) rather than the obvious '点儿'.

After using the Microsoft Pinyin IME briefly in college and coming home to be stuck with this again, I decided enough was enough, and started searching for alternative input methods. My brief search took me to OpenVanilla, something else that didn't work well, and finally "Fun Input Toy", a beguilingly-named input method which I downloaded from here.
After installing it mostly blind because my Chinese is absolutely not good enough to run programs or read technical documents (or, eh, any documents except for kids' books really) and wincing at the Chinese-only menus, I soon got it working (because the "Next" button in the installer wasn't translated, but you know the position it's in anyway :D). I was initially impressed, but decided to keep my enthusiasm somewhat checked before jumping to conclusions. Not for long though, because it soon became apparent that writing Chinese sentences with FIT is much easier and quicker than with ITABC, and it's not as buggy.
By way of comparison, here's the result of typing the string "zheshougemeiyounashougenamehaoting" in both input methods without corrections:
ITABC: 折寿个没有拿手个那么 (4/12 -> not gonna even try translating that mess)
FIT: 这首歌没有那首歌那么好听。(12/12 -> "This song is not as nice as that song.")

Note that in ITABC, once I'd typed the pinyin string, I had to hit space once to start parsing, which yields 折寿, then space again for 个, again for 没有, again for 拿手 and so on. Note that it terminates after 那么 because it only accepts input of up to 10 characters, which means breaking mid-sentence (in practice, after only a few words because the parser gets so confused).
Also note that I typed sentences like this a few times under both systems to allow any learning mechanisms to observe my use of less common words like 歌 (ge1: song).
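Part of the difficulty is that an unbroken pinyin string can often be cut into syllables in more than one way before any characters are chosen at all. The sketch below is not how ITABC or FIT works internally (I have no idea what either does); it just enumerates every split of a string against a tiny, hand-picked syllable list that is nowhere near the full pinyin inventory:

```python
def segmentations(text, syllables):
    """Return every way to split text into syllables from the given set."""
    if not text:
        return [[]]
    results = []
    for end in range(1, len(text) + 1):
        head = text[:end]
        if head in syllables:
            # Try to split whatever is left after this syllable
            for rest in segmentations(text[end:], syllables):
                results.append([head] + rest)
    return results

# Hypothetical, deliberately tiny syllable inventory
SYLLABLES = {"di", "dia", "dian", "an", "er", "ne",
             "zhe", "shou", "ge", "mei", "you", "na", "me", "hao", "ting"}

print(segmentations("dianer", SYLLABLES))
print(segmentations("zheshougemeiyounashougenamehaoting", SYLLABLES))
```

Even with this toy list, 'dianer' can be cut as di-an-er or dian-er, so something has to rank the candidates. The long test sentence happens to have only one valid split here, which suggests the hard part is choosing characters for the syllables, not finding the syllable boundaries.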

Also note how ITABC and FIT look once I've typed the entire string in and not hit space yet:

ITABC:

Fun Input Toy:

The FIT input window clearly shows much more information (such as the fact that it parses as much of the sentence as possible, with appropriate options for corrections, compared to ITABC only parsing a couple of characters at a time; usually two).

In summary, ITABC is pretty awful, FIT is very nice. And it's free, so use it!

Tuesday, November 11, 2008

randomly entertaining Chinese sentence

I check pretty much every new word I come across in my study materials on dict.cn, which also lists example sentences containing the term.

Here's a nice one I found when looking up the character 肥 (fei2), which according to my book means "(of clothes, shoes) loose*, (of animals) fat":

我被两个肥胖的女人紧紧地挤在中间，以致于要站起来下公共汽车都很难。
I was so tightly wedged between two fat women that it was difficult for me to get up and leave the bus.

:D :D

* although I couldn't substantiate the first meaning from the online dictionary..?
