Winds of Change.NET: Liberty. Discovery. Humanity. Victory.

An Update on Multilingual and Translated Blogs

Following the Harvard Berkman conference, I suggested possible synergy between multilingual blogging and machine translation, and later asked for information about Arabic/English machine translation vendors. I promised a report-back on the latter, and here it is, along with related information on multilingual blogging platforms and a nascent effort to define minimal specs for tagging translated posts. I've been posting the raw information on my home blog as it has developed, so this will be a summary of findings and discussions. And if you think this is an attempt to get others interested in contributing, you're right.

Multilingual Blogging Platforms

Though this topic started in the Arabic realm, really any platform capable of handling two-byte characters should be adaptable if its interface is translated. Blogger and SixApart will follow their commercial imperatives in this regard, so I went looking for open source options to supplement the Spirit of America Arabic blog project.

For a scalable hosted service, the leading option is the open source version of LiveJournal, which has several translation projects underway. A few days after that post, SixApart swooped in and bought the commercial LiveJournal service, so Ben and Mena now have access to that code base along with MT and Typepad. For a standalone UTF-8 capable blog, the open source leader appears to be Mambo, which also does a number of other content management tricks, and appears to already have one Arabic translation in existence.

Arabic/English Machine Translation

Thanks to commenters here and on Due Diligence, to the AAMT Compendium, and to several U.S. Government sources, I've come up with six sources of Arabic/English translation that appear to have networkable products. All are private companies, scattered across the US, UK, Europe and the Middle East.

If you just want to check them out directly, they are:

Half of these vendors once had free Web-based translation sites, noted in my survey article, but every one of them has recently gone for-pay. Wonder why? Finally, the ubiquitous Systran has started on an Arabic project, getting some of the bill paid by the EU, but the quality appears to be lacking so far.

Markup For Translated Posts

Following a good suggestion by Winds contributor Lewy14 that a minimal markup to denote translation would be a good idea, I put up a proposed set of requirements for same. Meanwhile, Luke Razzell had made a similar suggestion. Both were noted by Kevin Marks, who suggested most of the goals could be accomplished in existing HTML. In parallel, Lewy14 and I had been discussing things in e-mail, and that dialog, interpolated with Kevin's relevant input, has been posted here, with some of the more obscure issues in a second post. The comments to the first post are the best summary of where we stand at the moment. To summarize: it looks like something useful can be done within the bounds of current W3C specs, but with some serious limits on the ability to denote individual translated posts within a document. We're looking for more input, and likely for followup in the form of people willing to tweak templates for existing blogging systems to accommodate the results.
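To make the "existing HTML" idea a bit more concrete, here is a minimal sketch of what a template tweak might emit. It is not the markup from the discussion: the `translation` class name, the choice of a `div` wrapper, the helper name `translated_post_html` and the example URL are all my assumptions for illustration. The only standard pieces are the HTML `lang`, `rel`, `hreflang` and `href` attributes.

```python
from html import escape


def translated_post_html(body, lang, source_url, source_lang):
    """Wrap a translated post in plain HTML that points back to its original.

    Hypothetical helper: the class name and element choices are
    illustrative assumptions, not an agreed convention.
    """
    return (
        # lang= declares the language of the translated text itself
        f'<div class="translation" lang="{escape(lang, quote=True)}">\n'
        f'  <p>{escape(body)}</p>\n'
        # hreflang= declares the language of the linked original
        f'  <a rel="alternate" hreflang="{escape(source_lang, quote=True)}" '
        f'href="{escape(source_url, quote=True)}">Original post</a>\n'
        f'</div>'
    )


# Example: an English translation of an Arabic post (URL is made up)
print(translated_post_html(
    "Hello, world", "en", "http://example.org/ar/post-1", "ar"))
```

Because each post gets its own wrapper element, several translated posts can sit on one index page. Whether that fully gets around the per-post limits mentioned above is exactly the kind of thing we'd like input on.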

Rosetta Bots

Here's the place to acknowledge a genius bit of coinage by Lewy: A Rosetta bot is a crawler that collects parallel, translated texts from the blogosphere, either to build a translation database that facilitates further human translations, or to create a training database for a corpus-based machine translator. The type of markup we are exploring will allow human or machine translators of blog posts to precisely denote the relationship to the original, and should do something useful in conventional HTML browsers. But an important secondary use may be as a cue for Rosetta bots, which should be easily derived from existing crawlers/HTML scrapers such as Technorati.
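For a rough feel of how little a Rosetta bot's parsing step might need, here is a sketch using only Python's standard `html.parser`. It assumes a purely hypothetical convention in which a translated post is wrapped in `<div class="translation" lang="...">` and links to its original with `<a rel="alternate" hreflang="..." href="...">`. The class name, the `RosettaParser` name and the sample page are mine, not anything settled in the discussion, and a real bot would also need the crawling, fetching and alignment parts.

```python
from html.parser import HTMLParser


class RosettaParser(HTMLParser):
    """Collect translated posts marked with the assumed convention."""

    def __init__(self):
        super().__init__()
        self.records = []      # finished records, one per translated post
        self._current = None   # record being built, or None outside a post
        self._depth = 0        # div nesting depth inside the current post
        self._in_link = False  # True while inside the link to the original

    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        if self._current is None:
            # Look for the (hypothetical) wrapper that opens a translated post
            if tag == "div" and "translation" in (a.get("class") or "").split():
                self._current = {"lang": a.get("lang"), "source_lang": None,
                                 "source_url": None, "text": []}
                self._depth = 1
            return
        if tag == "div":
            self._depth += 1
        elif tag == "a" and "alternate" in (a.get("rel") or "").split():
            # The link back to the original names its language and location
            self._current["source_lang"] = a.get("hreflang")
            self._current["source_url"] = a.get("href")
            self._in_link = True

    def handle_endtag(self, tag):
        if self._current is None:
            return
        if tag == "a":
            self._in_link = False
        elif tag == "div":
            self._depth -= 1
            if self._depth == 0:  # wrapper closed: the record is complete
                self._current["text"] = " ".join(self._current["text"])
                self.records.append(self._current)
                self._current = None

    def handle_data(self, data):
        # Keep the post text, but skip the label of the link to the original
        if self._current is not None and not self._in_link and data.strip():
            self._current["text"].append(data.strip())


# Made-up sample page containing one translated post
page = """
<div class="translation" lang="en">
  <p>Hello, world</p>
  <a rel="alternate" hreflang="ar" href="http://example.org/ar/post-1">Original post</a>
</div>
"""
parser = RosettaParser()
parser.feed(page)
print(parser.records)
```

Each record pairs the translated text and its language with the URL and language of the original, which is the raw material for a parallel-text database once the original is fetched as well.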

1 Comment

Wow, I love the Rosetta bot! Lewy should trademark it. :)
Here's Marti Hearst on the Search Problem:

Another development in the field of computational linguistics is the manual creation of enormous lexical ontologies, which are then used to build axioms and rules about language use. These modern ontologies, unlike their predecessors, are of a large enough scale and simple enough design to be useful, although this work is in the early stages. There are also many attempts to build such ontologies automatically from large text collections; the most promising approach seems to be to combine the automated and the manual approaches.

I wanna buy stock in your and Lewy's company. :)
