HYA: the mailing list for Honyaku Archive technical discussion

I have started a mailing list, on which I hope people will help me put together a new full-text search archive for Honyaku. I'm also hoping to gain a smidgen of experience in running a mailing list.

This is the home for the archives of Honyaku, an Internet mailing list for discussion of translation between Japanese and English. More information about Honyaku can be found at the Honyaku Home Page. More information about Japanese-English translation can be found here.

Opening Post

Here is a copy of my first post to the list, setting out my current (not particularly well-organised) ideas.

Preliminaries

The full-text search of the Honyaku mailing list provided by Asako Mizuno is not going to be available indefinitely. I assume that a continuation of this function is desirable. I also think it is important for this to be quite independent of the list hosting service, at least while this is a proprietary entity (such as Google or Yahoo).

A few months ago, Tom Gally announced that he wanted to pass the (separate) text archive web page on to someone else, and I offered to host it. For some time before that I had been thinking of ways to improve on the existing archive search (mainly the mojibake problems, and junk elimination (in*digest*ion)). So it seems this may all be serendipitous.

So far, I have moved the text archive here: http://imaginatorium.org/honyaku
The text archive is horribly out of date (as is the Mizuno DB).

Things to be done

(my current ideas: please correct any misunderstandings or false assumptions you think I am making)

Automatic archiving. AFAICS, I can just subscribe a script to Honyaku (wherever it is hosted), and use that to save the posts automatically. This would cut out the manual operations - all this saving in mail readers and so on. I suppose the thing to do is to save the posts untouched - i.e. before messing with the encoding.
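As a sketch of what that subscribed script might do (the directory and file names here are my own assumptions; the raw message would arrive on stdin via procmail or similar):

```python
import os
import time

RAW_DIR = "/home/honyaku/raw"  # assumed location for untouched posts

def save_raw_message(data: bytes, raw_dir: str = RAW_DIR) -> str:
    """Append one raw, untouched message (bytes, before any encoding
    fiddling) to this month's archive file; return the file path."""
    path = os.path.join(raw_dir, time.strftime("honyaku-%Y-%m.mbox"))
    with open(path, "ab") as f:
        f.write(data)
        if not data.endswith(b"\n"):
            f.write(b"\n")
    return path

# in a delivery pipeline the message arrives on stdin, so the call
# would be: save_raw_message(sys.stdin.buffer.read())
```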

I guess the way to put the stuff in a DB is to convert it all to one encoding (UTF-8 being the obvious choice). (I guess that one problem with the Mizuno system is that it uses a 'legacy' encoding such as EUC.)
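A minimal sketch of that conversion step (the candidate list and fallback order are my own assumptions; a real version would pull the declared charset out of the MIME headers first):

```python
# Assumed fallback order for Japanese mail; adjust to taste.
CANDIDATES = ("iso-2022-jp", "euc-jp", "shift_jis", "utf-8")

def to_utf8(raw, declared=None):
    """Decode raw message bytes to a Unicode string, trying the
    declared charset first, then each candidate in turn."""
    for enc in ([declared] if declared else []) + list(CANDIDATES):
        try:
            return raw.decode(enc)
        except (UnicodeDecodeError, LookupError):
            continue
    # last resort: keep what we can, marking undecodable bytes
    return raw.decode("utf-8", errors="replace")
```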

Then we need "full-text" search. Happily, MySQL is all set up to do this, but unhappily, this won't do, because it only knows about explicit-word-separation languages. The simplest approach to this is to make the Japanese bits look "like" English, by splitting them up into "words": this is called "parsing". The Mizuno system (based on something from IBM called Infosearch) appears not to parse, which has the advantage that you can find any string within the text. (I recently found an article by Jim Breen, comparing the behaviour of various services, including Google etc, in this regard; can't lay my hands on a URL though.)

The simplest approach would be, I believe, to use a standard parser such as Chasen, to break up the text, then submit to MySQL's full-text indexing. In principle this involves only bolting together standard components, and writing the web pages for accessing them, so it should not take too long. The question is whether non-parsed searching really is desirable (feasible, etc.), and the metaquestion is how can we get an opinion on this from general users.
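To make the trade-off concrete, here is a toy sketch of the other obvious option: instead of a morphological parser like Chasen, index Japanese runs as overlapping character bigrams, which preserves any-substring searching (any query of two or more characters can itself be turned into bigrams). This is purely my own illustration, not what the Mizuno system does:

```python
import re

# Hiragana, katakana, and the common kanji block.
JP_RUN = re.compile(r"[\u3040-\u30ff\u4e00-\u9fff]+")

def bigrams(run):
    """Overlapping two-character chunks; a lone character stays whole."""
    return [run[i:i + 2] for i in range(len(run) - 1)] or [run]

def tokenize(text):
    """Split text into English words plus Japanese character bigrams,
    ready to feed to a word-based full-text indexer."""
    tokens, pos = [], 0
    for m in JP_RUN.finditer(text):
        tokens += re.findall(r"\w+", text[pos:m.start()])
        tokens += bigrams(m.group())
        pos = m.end()
    tokens += re.findall(r"\w+", text[pos:])
    return tokens
```

The cost is a fatter index and some false positives at bigram boundaries; the benefit is that no dictionary or parser ever mis-segments a query.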

Then there's WHAM! technology. "Wikification of (Headers And) Metadata". Obviously the text is not to be interfered with, but user feedback could be used to untangle encoding problems. The way to do this is to associate a "Mojibake" button with each presented message: when the user can't read something, she* presses the button, and gets a bunch of candidate reencodings of portions of the text - then with luck, selects the correct one. This is a "vote" for reencoding that message. Since everything is actually shown on the website in UTF-8, this is a bit messy: it means, for example, getting a message converted to UTF-8 from what was claimed to be ISO-2022-JP, and reinterpreting it as Shift-JIS, then converting the Shift-JIS to UTF-8 (only actually one conversion, isn't it). But as these votes accumulate, the stuff gradually gets unbaked. Of course there are still problems, when people quote bits of one encoding within another, but again user feedback could be used to identify these cases, and an administrator could edit the quoted bit only. (It's just too complicated having users identifying subsections of posts - I think.)

* Yeah, this is sexist. It assumes all real Boys can read mojibake anyway.
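The core of that reencoding trick is a single round trip, sketched below (function names are mine). Note it only works when the original wrong decode was lossless, so the test below uses latin-1 as the stand-in "wrong" encoding; in real mail the safest source is the untouched raw bytes:

```python
# Candidate encodings to offer after a Mojibake vote (my assumption).
ENCODINGS = ("iso-2022-jp", "shift_jis", "euc-jp", "cp932")

def rebake(stored, wrong_enc, candidate_enc):
    """Undo a wrong decode: re-encode the stored text with the encoding
    we (wrongly) decoded it as, recovering the original bytes, then
    reinterpret those bytes with the candidate encoding."""
    try:
        return stored.encode(wrong_enc).decode(candidate_enc)
    except (UnicodeError, LookupError):
        return None

def reencoding_candidates(stored, wrong_enc):
    """The 'bunch of candidate reencodings' shown to the voting user."""
    out = {}
    for enc in ENCODINGS:
        fixed = rebake(stored, wrong_enc, enc)
        if fixed is not None and fixed != stored:
            out[enc] = fixed
    return out
```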

Similarly, the dreaded digest quotations could be flagged as suspects with a little bit of work, then users could vote to have them removed. (Obviously, not actually _removed_: you always have an option to display the full set. Or an admin could chop out the digest part - this could be made almost automatic.)

Another feature which would be useful, and easy to implement, is a by-thread view (modulo the obvious problems with thread headers; but WHAM could help here too). (I guess, too, that if we move to GG, a search could provide a link to the thread view at Google for people who want that particular format.) Suppose we have the text all successfully being stored in an up-to-date fashion, and publish every note as an html page (is this wanted? Honyaku content is not currently exposed in a web search), then a quick hack to get something going is to get them indexed at Google.
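For the by-thread view, the standard approach hangs each message off its In-Reply-To / References headers; here is a much-simplified sketch (real threading, e.g. JWZ's algorithm, also copes with missing parents and Subject-line fallbacks):

```python
import email
from collections import defaultdict

def build_threads(messages):
    """Group parsed email.message.Message objects into threads using
    In-Reply-To (or the last References entry). Returns the thread
    roots plus a parent-id -> children map."""
    by_id = {}
    for msg in messages:
        by_id[msg.get("Message-ID", "").strip()] = msg

    roots, children = [], defaultdict(list)
    for msg in messages:
        parent = (msg.get("In-Reply-To") or
                  (msg.get("References", "").split() or [""])[-1]).strip()
        if parent and parent in by_id:
            children[parent].append(msg)
        else:
            roots.append(msg)   # no known parent: starts a thread
    return roots, children
```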

Before this, though: another task is to munge the email addresses within messages. Does anyone know any standard components for doing this? (I think that Google archives show a link for an email address, which can be used to send a message to that address if it is still valid; but none of us - i.e. the Honyaku membership - can get access to the address itself. If for some archival reason someone wants to know the address J Tanaka used in 1997, they can't find it.) (Hmm, I suppose an enterprising spammer could search for "mailing list archive", download the zip files, and scrape the addies out...)
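I don't know of a standard component either, but the basic munging is nearly a one-liner; a rough sketch (the regex is deliberately crude, and the "user at domain" style is just one possible choice):

```python
import re

# Crude but serviceable address pattern for display munging.
ADDR = re.compile(r"\b([A-Za-z0-9._%+-]+)@([A-Za-z0-9.-]+\.[A-Za-z]{2,})\b")

def munge(text):
    """Replace user@example.com with 'user at example.com' in the
    published page text. Deliberately lossy: spammers' scrapers look
    for the @ sign, archived readers can still reconstruct it."""
    return ADDR.sub(lambda m: f"{m.group(1)} at {m.group(2)}", text)
```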

OK, that's enough general stuff. I also have a few immediate specific questions...

I use a shared server at pair.com (aka pairlist.net) for web hosting, and have experience using PHP and MySQL. Full-text searching is "not for the faint of heart" (Jim Breen - thanks), and I don't have a good feel for possible scaling problems here. (This isn't helped by not having any information at all about the frequency of archive searches, but perhaps I can find something out about that.) I think the total volume of raw text is around 200-250 MB, which is not vast, nowadays.

Mailbox format: the raw archive files are in "standard Unix mailbox" format. AFAICFOFAWS, there is no formal standard; you just concatenate the raw mail files. Trouble is, the spec for these seems to say they start with a line "From so-and-so", but there's no indication of how you would distinguish this from a line in the message body that happened to start "From so-and-so" (without the quotes!). Any experts?
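For what it's worth, the usual answer is the "mboxo" convention: writers quote body lines beginning "From " as ">From ", so readers can treat any unquoted "From " at the start of a line as a separator. Python's mailbox module applies the quoting automatically; a small sketch:

```python
import mailbox

def append_to_mbox(body, path):
    """Append one message to an mbox file. mailbox.mbox mangles body
    lines starting with "From " into ">From " so a reader cannot
    mistake them for a new-message separator line."""
    box = mailbox.mbox(path)
    msg = mailbox.mboxMessage()
    msg["From"] = "brian@example.org"   # hypothetical address
    msg["Subject"] = "mbox demo"
    msg.set_payload(body)
    box.add(msg)
    box.flush()
    box.close()
```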

Finally: Apologies for the name of this mailing list! I should have called it HYA-tech, or anything other than the first thing that came into my head. Let's use HYA for "Honyaku Archives" as a sort of nickname.

Brian Chandler
http://imaginatorium.org

HYA: main Honyaku Archive pages

Imaginatorium

Hosted at http://imaginatorium.org - Brian Chandler's personal website

Created February 2006 - WDG validated