Winds of Change.NET: Liberty. Discovery. Humanity. Victory.


An Update on Multilingual and Translated Blogs


Following the Harvard Berkman conference, I suggested possible synergy between multilingual blogging and machine translation, and later asked for information about Arabic/English machine translation vendors. I promised a report back on the latter, and here it is, along with related information regarding multilingual blogging platforms and a nascent effort to define minimal specs for tagging translated posts. I've been posting the raw information on my home blog as it has developed, so this will be a summary of findings and discussions. And if you think this is an attempt to get others interested in contributing, you're right.

Multilingual Blogging Platforms

Though this topic started in the Arabic realm, really any platform capable of handling two-byte characters should be adaptable if its interface is translated. Blogger and SixApart will follow their commercial imperatives in this regard, so I went looking for open source options to supplement the Spirit of America Arabic blog project.

For a scalable hosted service, the leading option is the open source version of LiveJournal, which has several translation projects underway. A few days after that post, SixApart swooped in and bought the commercial LiveJournal service, so Ben and Mena now have access to that code base along with MT and Typepad. For a standalone UTF-8 capable blog, the open source leader appears to be Mambo, which also does a number of other content management tricks, and appears to already have one Arabic translation in existence.
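As a quick check on the "two-byte characters" point, here is a minimal sketch, assuming the platform stores its text as UTF-8. The sample word is only an illustration and is not taken from any of the projects above. Each Arabic letter occupies two bytes in UTF-8, so a platform has to count and truncate by characters rather than bytes.

```python
# Sample Arabic word ("salaam"), written with escapes so the source stays ASCII.
word = "\u0633\u0644\u0627\u0645"

# Encode to UTF-8, the storage encoding assumed for this sketch.
encoded = word.encode("utf-8")

# Four characters, eight bytes: each letter in the Arabic block takes two bytes.
print(len(word), len(encoded))  # prints: 4 8

# Round-tripping must give back the same text if the platform is encoding-safe.
assert encoded.decode("utf-8") == word
```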

Arabic/English Machine Translation

Thanks to commenters here and on Due Diligence, to the AAMT Compendium, and several U.S. Government sources, I've come up with six sources of Arabic/English translation that appear to have networkable products, all private companies scattered between the US, UK, Europe and the Middle East.

If you just want to check them out directly, they are:

Half of these vendors once had free Web-based translation sites, noted in my survey article, but every one of them has recently gone for-pay. Wonder why? Finally, the ubiquitous Systran has started on an Arabic project, with some of the bill paid by the EU, but the quality appears to be lacking so far.

Markup For Translated Posts

Following a good suggestion by Winds contributor Lewy14 to the effect that a minimal markup to denote translation would be a good idea, I put up a proposed set of requirements for same. Meanwhile, Luke Razzell had made a similar suggestion. Both were noted by Kevin Marks, who suggested most of the goals could be accomplished in existing HTML. In parallel, Lewy14 and I had been discussing things in e-mail, and that dialog, interpolated with Kevin's relevant input, has been posted here, with some of the more obscure issues in a second post. The comments to the first post are the best summary of where we stand at the moment. To summarize: it looks like something useful can be done within the bounds of current W3C specs, but with some serious limits on the ability to denote individual translated posts within a document. We're looking for more input, and likely for followup in the form of people willing to tweak templates for existing blogging systems to accommodate the results.
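To make the "existing HTML" idea concrete, here is a rough sketch of what a blog template might emit for one translated post. It is not the markup agreed in the discussion above. The `translation` class name and the function name are hypothetical; `lang`, `rel="alternate"` and `hreflang` are ordinary HTML attributes.

```python
from html import escape


def translated_post_html(body_text, lang, source_url, source_lang):
    """Wrap one translated post in plain HTML naming its language and its original.

    The class name "translation" is an assumption made for this sketch.
    """
    return (
        f'<div class="translation" lang="{escape(lang)}">\n'
        f'  <p>{escape(body_text)}</p>\n'
        # The link back to the original carries the original's language.
        f'  <a rel="alternate" hreflang="{escape(source_lang)}" '
        f'href="{escape(source_url)}">Original post</a>\n'
        f'</div>'
    )


# Hypothetical post: an English translation of an Arabic original.
fragment = translated_post_html(
    "Peace & greetings", "en", "http://example.org/original", "ar"
)
print(fragment)
```

Wrapping each post in its own element is one way around the limit noted above, since a page-level `<link>` can only describe the document as a whole.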

Rosetta Bots

Here's the place to acknowledge a genius bit of coinage by Lewy: A Rosetta bot is a crawler that collects parallel, translated texts from the blogosphere, to build a translation database to facilitate further human translations, or to create a training database for a corpus-based machine translator. The type of markup we are exploring will allow human or machine translators of blog posts to precisely denote the relationship to the original, and should do something useful in conventional HTML browsers. But an important secondary use may be as a cue for Rosetta bots, which should be easily derived from existing crawlers/HTML scrapers such as Technorati.
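The parsing half of a Rosetta bot might look something like the sketch below. It assumes a purely hypothetical convention, not a settled spec: each translated post sits in a `<div class="translation" lang="...">` and links to its original with `<a rel="alternate" hreflang="..." href="...">`. Fetching pages and storing results are left out, and the class and function names are invented for illustration.

```python
from html.parser import HTMLParser


class RosettaParser(HTMLParser):
    """Collects translated posts marked with the hypothetical convention above."""

    def __init__(self):
        super().__init__()
        self.pairs = []        # finished records, one per translated post
        self._current = None   # record being built, or None outside a post
        self._depth = 0        # <div> nesting depth inside the current post
        self._in_link = False  # True while inside the link to the original

    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        if self._current is None:
            # Only a div carrying the "translation" class opens a record.
            if tag == "div" and "translation" in (a.get("class") or "").split():
                self._current = {"lang": a.get("lang"), "source_lang": None,
                                 "source_url": None, "text": []}
                self._depth = 1
            return
        if tag == "div":
            self._depth += 1
        elif tag == "a" and "alternate" in (a.get("rel") or "").split():
            self._current["source_lang"] = a.get("hreflang")
            self._current["source_url"] = a.get("href")
            self._in_link = True

    def handle_endtag(self, tag):
        if self._current is None:
            return
        if tag == "a":
            self._in_link = False
        elif tag == "div":
            self._depth -= 1
            if self._depth == 0:
                # The post's outer div closed: finish the record.
                self._current["text"] = " ".join(self._current["text"])
                self.pairs.append(self._current)
                self._current = None

    def handle_data(self, data):
        # Keep the post's own text, skipping the label of the "original" link.
        if self._current is not None and not self._in_link and data.strip():
            self._current["text"].append(data.strip())


def collect_pairs(html_text):
    """Return one record per translated post found in html_text."""
    parser = RosettaParser()
    parser.feed(html_text)
    parser.close()
    return parser.pairs


SAMPLE = """
<div class="translation" lang="en">
  <p>Hello, world.</p>
  <a rel="alternate" hreflang="ar" href="http://example.org/original">Original</a>
</div>
<div lang="en"><p>An ordinary post.</p></div>
"""

pairs = collect_pairs(SAMPLE)
print(pairs)
```

Each record pairs the translated text with the URL and language of its original, which is what a bot would need in order to fetch the other half of the parallel text.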

1 Comment

Wow, I love the Rosetta bot! Lewy should trademark it. :)
Here's Marti Hearst on the Search Problem:

Another development in the field of computational linguistics is the manual creation of enormous lexical ontologies, which are then used to build axioms and rules about language use. These modern ontologies, unlike their predecessors, are of a large enough scale and simple enough design to be useful, although this work is in the early stages. There are also many attempts to build such ontologies automatically from large text collections; the most promising approach seems to be to combine the automated and the manual approaches.

I wanna buy stock in your and lewy's company. :)
