An experiment with parsing Japanese

MultidimensionalStep · January 23, 2020, 8:50pm

Thank you! It would be nice if there were a way to convert katakana and romaji into hiragana so that (dict $) predicates could be shorter. That might be possible with rewrite rules, if words could be broken down into characters.
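To illustrate the kind of mapping I have in mind, here is a rough sketch in Python (not Dialog, and not something Dialog provides). It relies only on the layout of the Unicode blocks: katakana sits exactly 0x60 above hiragana, and full-width Latin sits 0xFEE0 above ASCII. The function name normalize is made up for this example, and it only folds full-width letters down to ASCII; actually transliterating romaji into hiragana would need a syllable table on top of this.

```python
# Facts about the Unicode block layout (nothing here is a Dialog API).
KATAKANA_START, KATAKANA_END = 0x30A1, 0x30F6    # small a .. small ke
KANA_OFFSET = 0x60                                # katakana minus hiragana
FULLWIDTH_START, FULLWIDTH_END = 0xFF01, 0xFF5E  # full-width ! .. ~
FULLWIDTH_OFFSET = 0xFEE0                         # full-width minus ASCII


def normalize(text):
    """Hypothetical helper: fold katakana to hiragana, full-width to ASCII."""
    out = []
    for ch in text:
        cp = ord(ch)
        if KATAKANA_START <= cp <= KATAKANA_END:
            out.append(chr(cp - KANA_OFFSET))
        elif FULLWIDTH_START <= cp <= FULLWIDTH_END:
            out.append(chr(cp - FULLWIDTH_OFFSET))
        else:
            # Hiragana, kanji, ASCII and the long-vowel mark pass through.
            out.append(ch)
    return "".join(out)


print(normalize("カタカナ"))
print(normalize("ＡＢｃ"))
```

Doing the same thing inside Dialog would still depend on being able to split a word into characters, which is the part I am unsure about.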

I tried implementing directions, but ran into something unfortunate:

./dialogc -o JP.aastory -t aa test.dg japanese.dg stdlib.dg
Error: Too many distinct unicode characters in the text.

I guess this is a limitation of Dialog. I can of course remove the katakana and full-width romaji characters to free up some Unicode space. Worst case, I’ll drop kanji support and do everything in hiragana, like an old NES-era RPG. But I’d like to avoid that if I can, so I’m going to dig through Dialog’s source to see if the limit can be increased, or at least to find out what it is so I can plan which characters to use wisely.

EDIT: Looks like backend_aa.c allows a total of 127 unique Unicode characters. This is hard-coded, and I’m betting it wouldn’t be easy to change. I’m going to do some thinking about how best to use this limited palette.

EDIT 2: Hiragana alone takes 87 Unicode slots for all variations, so katakana is out. That leaves 40 characters for punctuation and kanji. For punctuation, I think the four characters 「。、」 are the only ones really necessary, which leaves 36 for kanji. There are more verbs in the standard library than that (even allowing that some would share a kanji, like LOOK and SEARCH), so I can’t handle all of them. Relations and directions wouldn’t take too many kanji, but it would be a bit weird to have only those in kanji. So I think I am going to limit myself to hiragana only for now.
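For anyone who wants to check the budget against their own text, here is a small Python sketch of the arithmetic. The figures 127 and 87 are simply the ones quoted in this post. The helper names are made up, and the assumption that only non-ASCII characters count toward the limit is mine; I have not confirmed it in the source.

```python
LIMIT = 127           # figure reported above from backend_aa.c
HIRAGANA_SLOTS = 87   # this post's count for all hiragana variations
PUNCTUATION = "「。、」"


def distinct_non_ascii(text):
    """Set of characters above ASCII (assumed to be what the limit counts)."""
    return {ch for ch in text if ord(ch) > 0x7F}


def kanji_budget(limit=LIMIT, hiragana=HIRAGANA_SLOTS, punctuation=PUNCTUATION):
    """Slots left for kanji after reserving hiragana and punctuation."""
    return limit - hiragana - len(distinct_non_ascii(punctuation))


print(kanji_budget())
print(len(distinct_non_ascii("「みる。」 look")))
```

Running distinct_non_ascii over the concatenated .dg files would give a rough idea of how close a story is to the limit before dialogc complains.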