Following the Harvard Berkman conference, I suggested possible synergy between multilingual blogging and machine translation, and later asked for information about Arabic/English machine translation vendors. I promised a report-back in the latter, and here it is, along with related information regarding multilingual blogging platforms and a nascent effort to define minimal specs for tagging translated posts. I've been posting the raw information on my home blog as it has developed, so this will be a summary of findings and discussions. And if you think this is an attempt to get others interested in contributing, you're right.
Multilingual Blogging Platforms
Though this topic started in the Arabic realm, really any platform capable of handling two-byte characters should be adaptable if its interface is translated. Blogger and SixApart will follow their commercial imperatives in this regard, so I went looking for open source options to supplement the Spirit of America Arabic blog project.
For a scalable hosted service, the leading option is the open source version of LiveJournal, which has several translation projects underway. A few days after that post, SixApart swooped in and bought the commercial LiveJournal service, so Ben and Mena now have access to that code base along with MT and Typepad. For a standalone UTF-8 capable blog, the open source leader appears to be Mambo, which also does a number of other content management tricks, and appears to already have one Arabic translation in existence.
Arabic/English Machine Translation
Thanks to commenters here and on Due Diligence, to the AAMT Compendium, and several U.S. Government sources, I've come up with six sources of Arabic/English translation that appear to have networkable products, all private companies scattered between the US, UK, Europe and the Middle East.
If you just want to check them out directly, they are:
Half of these vendors once had free Web based translation sites, noted in my survey article, but every one of them has recently gone for-pay. Wonder why? Finally, the ubiquitous Systran has started on an Arabic project, getting some of the bill paid by the EU, but the quality appears to be lacking so far.
Markup For Translated Posts
Following a good suggestion by Winds contributor Lewy14 that a minimal markup to denote translation would be a good idea, I put up a proposed set of requirements for it. Meanwhile, Luke Razzell had made a similar suggestion. Both were noted by Kevin Marks, who suggested most of the goals could be accomplished in existing HTML. In parallel, Lewy14 and I had been discussing things in e-mail, and that dialog, interpolated with Kevin's relevant input, has been posted here, with some of the more obscure issues in a second post. The comments to the first post are the best summary of where we stand at the moment. To summarize: it looks like something useful can be done within the bounds of current W3C specs, but with some serious limits on the ability to denote individual translated posts within a document. We're looking for more input, and likely for follow-up in the form of people willing to tweak templates for existing blogging systems to accommodate the results.
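To make the "existing HTML" idea concrete, here is a rough sketch of what a template tweak might emit for a translated post. It is only an illustration, not the markup anyone in the discussion has settled on: the function name `render_translation` and the `translation` class are made up for this example, and the only standard pieces it leans on are the HTML `lang`, `dir`, `rel="alternate"` and `hreflang` attributes.

```python
from html import escape


def render_translation(text, lang, source_url, source_lang, direction="ltr"):
    """Wrap a translated post in plain HTML that points back at its original.

    Hypothetical helper: the "translation" class name is an assumption,
    not part of any agreed spec.
    """
    return (
        # lang/dir describe the translated text itself
        f'<div class="translation" lang="{escape(lang)}" dir="{escape(direction)}">\n'
        f'  <p>{escape(text)}</p>\n'
        # rel="alternate" plus hreflang marks the link target as the
        # same content in another language (here, the original post)
        f'  <a rel="alternate" hreflang="{escape(source_lang)}" '
        f'href="{escape(source_url)}">original</a>\n'
        f'</div>'
    )


if __name__ == "__main__":
    print(render_translation("مرحبا", "ar", "http://example.org/post/1", "en", "rtl"))
```

Note that `rel="alternate"` is defined in terms of whole documents, which is one way to see the limit mentioned above: on a page holding many posts, it does not cleanly say which post a given link translates.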
Rosetta Bots
Here's the place to acknowledge a genius bit of coinage by Lewy: A Rosetta bot is a crawler that collects parallel, translated texts from the blogosphere, either to build a translation database that facilitates further human translations, or to create a training database for a corpus-based machine translator. The type of markup we are exploring will allow human or machine translators of blog posts to precisely denote the relationship to the original, and should do something useful in conventional HTML browsers. But an important secondary use may be as a cue for Rosetta bots, which should be easy to derive from existing crawlers/HTML scrapers such as Technorati.
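For the curious, the core of such a bot could be quite small. The sketch below is a guess at the cue-spotting step only, assuming pages flag their translations with `rel="alternate"` links carrying an `hreflang` attribute (one plausible convention, not a settled one). The names `TranslationCueParser` and `find_parallel_pages` are invented for this example; fetching pages and aligning the paired texts are left out entirely.

```python
from html.parser import HTMLParser


class TranslationCueParser(HTMLParser):
    """Collects the page language and any links flagged as translations."""

    def __init__(self):
        super().__init__()
        self.page_lang = None   # value of <html lang="...">, if present
        self.alternates = []    # list of (hreflang, href) pairs

    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        if tag == "html" and a.get("lang"):
            self.page_lang = a["lang"]
        rel = (a.get("rel") or "").lower().split()
        # Requiring hreflang skips rel="alternate" feed links, which
        # normally carry a type attribute but no language.
        if (tag in ("a", "link") and "alternate" in rel
                and a.get("hreflang") and a.get("href")):
            self.alternates.append((a["hreflang"], a["href"]))


def find_parallel_pages(html_text):
    """Return (page language, [(language, URL), ...]) for one HTML page."""
    parser = TranslationCueParser()
    parser.feed(html_text)
    return parser.page_lang, parser.alternates
```

A real bot would then fetch each URL it finds and store the two texts as a candidate parallel pair.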








Wow, I love the Rosetta bot! Lewy should trademark it. :)
Here's Marti Hearst on the Search Problem:
Another development in the field of computational linguistics is the manual creation of enormous lexical ontologies, which are then used to build axioms and rules about language use. These modern ontologies, unlike their predecessors, are of a large enough scale and simple enough design to be useful, although this work is in the early stages. There are also many attempts to build such ontologies automatically from large text collections; the most promising approach seems to be to combine the automated and the manual approaches.
I wanna buy stock in your and lewy's company. :)