Winds of Change.NET: Liberty. Discovery. Humanity. Victory.


Machine Translation and the Global Blogosphere


(This post is a more formal version, with links, of my rant about machine translation (MT) at the Global Voices session at the Harvard Berkman conference. There's an earlier backgrounder on the marketplace on my home blog, Due Diligence, and also see this general background at Wikipedia.)

An obviously valuable addition to the global blogosphere would be automatic language translation. The good news is that more non-English tools like Spirit of America's Arabic blogging tool are popping up, enabling those who don't speak English to join the blogosphere. One downside is the potential for creating language islands, isolating those without bilingual skills. What are the prospects for the blogosphere getting access to state-of-the-art machine translation (MT) technology on reasonable (preferably free) terms? Better than you might think.

(Disclaimer: I am not a technical expert on machine translation. I'm a technology investor and analyst, and I invite additions or corrections from real experts in the comments. That said, the earlier post referenced above was a partial record of a fairly detailed diligence project on the domain, and I'm recycling that research here. Details proprietary to any particular company have been deliberately deleted.)

Machine translation is actually a small business. It's forecast to be about $100m in 2007, which is tiny as high tech markets go. Meanwhile, the human translation market is about ten billion dollars. MT is 1% of that – pretty pathetic. The reason is simple. Machine translation is actually pretty bad, except when you have no alternative because of the volume of text you must handle. This small market size makes it difficult for would-be MT startups to get capital from people like me, since they have to persuade us that they can profitably take a large amount of market share from entrenched competitors, or radically expand the market. Hard sell, to say the least.

The result is that most new MT work is funded by military and intelligence agencies (not just the American ones) which require massive volumes of translation impossible to do with people. The language pairings required by these funders don't necessarily correspond to those the blogosphere might need; unsurprisingly there's a lot of interest right now in Arabic to English, but rather less on languages like Swahili or Malay that were discussed at the panel, and generally less on translating from English.

Machine translation technology is also in transition. Like a lot of natural language processing, it's moving from older rule-based systems to corpus-based techniques. Without going into details, that means using the increasing power of computers to find statistical patterns in language. What it requires is a huge amount of data to be analyzed before you get any useful output. In the case of a translator, what you need is hundreds of megabytes of parallel, equivalent text in both languages. Gathering that can be very expensive - and remember that these companies have trouble raising money. That leads to things like one project which started by using the proceedings of the Canadian parliament, because they are kept in both French and English by law.
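To make "finding statistical patterns" a little more concrete, here is a minimal sketch of the idea, not how any particular vendor does it. It counts which words co-occur across aligned sentence pairs and scores each word pair with the Dice coefficient. The four-pair corpus and the function names are invented for illustration; real systems need the hundreds of megabytes mentioned above and far more sophisticated models.

```python
from collections import Counter
from itertools import product

# Toy parallel corpus (invented for illustration): English / French pairs.
PARALLEL = [
    ("the house", "la maison"),
    ("the car", "la voiture"),
    ("a house", "une maison"),
    ("a car", "une voiture"),
]

def dice_table(pairs):
    """Score every (source word, target word) pair by the Dice coefficient."""
    src_count, tgt_count, joint = Counter(), Counter(), Counter()
    for src, tgt in pairs:
        src_words, tgt_words = set(src.split()), set(tgt.split())
        src_count.update(src_words)
        tgt_count.update(tgt_words)
        # Every source word co-occurs with every target word in the same pair.
        joint.update(product(src_words, tgt_words))
    return {
        (s, t): 2 * n / (src_count[s] + tgt_count[t])
        for (s, t), n in joint.items()
    }

def best_translation(word, table):
    """Return the target word with the highest Dice score, or None if unseen."""
    candidates = {t: score for (s, t), score in table.items() if s == word}
    return max(candidates, key=candidates.get) if candidates else None

table = dice_table(PARALLEL)
print(best_translation("house", table))
print(best_translation("car", table))
```

Even on this tiny sample the words that always appear together score highest, which is why the quantity and quality of parallel text matters more than hand-written rules in this approach.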

Meanwhile, what about the other 99% of the market ($9.9 billion worth) – the human translators? That market is largely controlled by translation bureaus, which are increasingly Internet-connected, global networks of people with skills in particular language pairs. How do these bureaus differentiate themselves and compete? Their core assets – beyond their human networks – are large databases of parallel texts, which their translators can search for previous similar examples that make their work more efficient. This part of their business is so important that the cost of software to support these databases, whether internally developed or openly sold, is likely nearly as large as the entire MT market.
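For readers who haven't seen one of these databases in use, the following is a rough sketch of the kind of lookup involved: given a new sentence, find the most similar previously translated one. It uses Python's standard `difflib` for the similarity score. The two stored segments, the `lookup` name, and the 0.7 threshold are assumptions made up for the example, not a description of any bureau's software.

```python
from difflib import SequenceMatcher

# Hypothetical translation memory: previously translated (source, target) segments.
MEMORY = [
    ("Press the red button to stop the machine.",
     "Appuyez sur le bouton rouge pour arrêter la machine."),
    ("Store the device in a dry place.",
     "Conservez l'appareil dans un endroit sec."),
]

def lookup(segment, memory, threshold=0.7):
    """Return (score, source, target) for the closest stored segment, or None."""
    best = None
    for source, target in memory:
        # ratio() is 1.0 for identical strings and falls toward 0.0 as they diverge.
        score = SequenceMatcher(None, segment.lower(), source.lower()).ratio()
        if best is None or score > best[0]:
            best = (score, source, target)
    return best if best and best[0] >= threshold else None

match = lookup("Press the green button to stop the machine.", MEMORY)
if match:
    score, source, target = match
    print(f"{score:.2f} match: {source!r} -> {target!r}")
```

The translator still has to change "rouge" to "vert" by hand, but starting from a near match is much faster than starting from nothing, and that is the efficiency the bureaus are protecting.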

So, do you think these service bureaus are going to make their databases of translations open to the corpus based MT guys, so they can try and take market share? Right...

Now let me drag in (by reference) the open source movement and suggest how it might apply to creating mutual benefit for the blogosphere and the MT vendors and their other users.

First, where open source is not necessary, or perhaps even feasible, is in the translation technology itself. Unlike something like an eVoting system, the input and output of an MT engine are readily inspected for error by many experts. Second, most of the deep experts in the domain are already busy – what we want to do is make it easier (read, cheaper) for them and their companies to develop more language pairs.

So, the assets that really need to be open are the various collections of parallel texts, and perhaps a database engine that allows them to be gathered, stored, and retrieved. The terms of openness may need to include some variant of Creative Commons that allows these texts to be used to train MT engines which will then be sold, but only if the vendor agrees to some useful giveback to the blogosphere, such as availability of a limited-functionality, XML-enabled server (or service) that can be plugged into blogging or other community software for free.

And the blogosphere's contribution is obviously the translated texts. The good news is we already have a lot of bilingual, motivated people with plenty to talk about, many of them drawn from the diasporas that were discussed at the panel. What's needed is to help them organize translation of blog posts in a way that is immediately valuable to their communities, and that over time bootstraps the translation repository so that we no longer have to depend solely on their time and goodwill.

Since all machine translation techniques work better when the domain of discourse is known and used to categorize both the training data and the text to be translated, there may be some advantage in at least creating hooks for the various categorization and taxonomy proposals floating around the blogging and open source communities.

OK, that's hopefully enough to launch the meme. Please tear it up, refine, and propagate. I need to do a little research that might yield more specific action items, which might yield a further post.

1 TrackBack

Tracked: December 16, 2004 9:27 AM
Excerpt: In Andrew Joscelynes Weblog Blogos, zu Sprachtechnologien, ist ein Eintrag auf Englisch über den Gebrauch von Übersetzungssystemen bei Weblogs. Andrew Joscelyne has an entry in Blogos on possibilities for machine translation in weblogs. One s...

11 Comments

Thanks for the primer on where MT is these days. Used to share offices with a translation bureau (about 2 decades ago) and the first "computer-assisted translations" were beyond funny. But the entire rule-based approach has been a real stumper. It's not just fragments and slang or colloquialisms that present problems -- there's no way to get to any of the "blends" that the cognitive/linguistic types talk about.

I'm interested in learning more about some of the issues. Can you point me to info on the sorts of proposals you referred to here? The more I deal with blogs, the more I find the whole area of taxonomies becomes interesting/essential:
"Since all machine translation techniques work better when the domain of discourse is known and used to categorize both the training data and text to be translated, there may be some advantage in at least creating hooks for the various categorization and taxonomy proposals floating around the blogging and open source communities."

Much obliged!

Nadezha, good question, and I will put together a few followup links, maybe a post. But right now I got back from drinks and a big dinner with AL, Blackfive, Jeff Jarvis, Omar and Mohammed and the Spirit of America folks and I am wasted. Tomorrow is airplane and helping my wife with a holiday gathering, so likely Monday before I get to it, just don't want you to think I'm blowing it off. Brain cells shutting down in 5,4,3....

Hmm nice idea.
I can see a few problems (and there is a lot of IMHO here):
1) The opensource community may be this industry's worst enemy. Once an opensource project gets going it tends to get a bad case of mission growth: every feature under the sun will be included, and new sub projects will form which basically support the main project. (Cite: Wikipedia, Apache, Linux, etc.)
If a general machine translation project gets going, expect it to eventually try to do everything in the general area of machine translation; industry would have to either compete, or stay in niche markets (which opensource doesn't seem to do so well).
2) Opensource projects are very long term, and generally quite slow. From the moment something is first done on the project, to the time when it starts to become a 'large' (and useful, one might imagine) project could easily be measured in years. OTOH 'large' OS projects will prolly last decades. (Cite: Emacs and gcc have both been going for 20-odd years; Emacs predated the GPL.)
3) OS projects are run by doers, not planners. The founding members will be doing all the grunt work for the start of the project; the ppl that will form the basis of the development group in a mature project will come from the userbase; the only way to get a userbase is by developing a useful project, and that takes a lot of work.

"The terms of openness may need to include some variant of Creative Commons that allows these texts to be used to train MT engines which will then be sold, but only if the vendor agrees to some useful giveback to the blogosphere, such as availability of a limited functionality, XML enabled server (or service) that can be plugged into blogging or other community software for free."
I can't see this working. I could see it working if the translating volunteers were providing data to an existing OS project, but in that setup it gives the company running the engine too much power; the project would be too much of a cog in the company's machine.

Hmm that turned into a bit of a waffle.

I've worked as a professional technical translator for 30 years and I'll guarantee no software at any price renders a faithful translation. Had I a blog, I would not want it translated by software, because it would surely say things I never intended, some of which would be ******* ******.

PacRim Jim: I'm fully aware of the limits of MT, having used it in projects before. Where the blog situation differs from most document translation projects is the presence of people who can almost always work around the flaws of the translation. I've in fact seen this work in a 'social software' setting, almost a decade ago - something I should probably write up.

Factory - all good issues, and worth thinking about, particularly in terms of bootstrapping, i.e., what is the smallest useful thing that could be done. There's always a chance of mission creep, but doing actual translation software would probably be the last target, considering the lead time to working software it would imply. As far as licensing terms, IANAL, what I wrote was a requirements statement for someone who is an attorney.

...heading to airport...

PacRim Jim: I wouldn't give up hope just yet-- consider the non-trivial [English language] speech recognition problems of the eighties which have been basically solved. Hmm, didn't Robin have some stuff on Star Trek translators?

I'll look for some DARPA/DLI links, but I remember reading about a proposed machine translation contest, like the autonomous vehicle contest earlier this year.

I personally think we have to jettison the stodgy old Chomsky computational linguistics models that we've been stuck with for a half century. :(

I think you've missed a couple of points -

For one, the translation in blogspace doesn't have to be perfect (as it would if someone were trying to translate some technical report or business contract and correspondence, or whatnot) - as long as the ideas can come through, even somewhat garbled, we can get the gist of what the person was saying - at least enough to have an argument with them, which is the ultimate goal.

The other is one of the more important aspects of Open technologies - that's community.

I could envision creating technology that enables a community of people, who are trying to translate works back and forth between language pairs, to work collaboratively.

This, in concert with the notion of hypertext (where a word doesn't just have its text, but can 'mean' several things on different levels) has been limited in its role in the web - we can do better. For instance, if I hover over a word in a translation, I might get several variants of the potential meanings of the translation, some definitions, etc...

Let me give you a real-life example of how this might play out in an Open Source/Community: let's say I want to translate some article from some website from Arabic to English for the noblest of all reasons - I want to find out what they're saying about me (cough).

I speak English. I don't care that the translation is not perfect - nor do I care that the translation output is not proper English and I might have to mouse-over some words or phrases - I'm trying to get the meaning and sense of the article. (This is plenty good enough for blogspace - each person writes in their native language, the reader reads (with the added difficulty of machine translation) in their native language.)

First, we let the machines do what they do best - handling lists and generating several provisional translations for each word, making a 'guess' as to which one was right, but that guess doesn't need to be right with hypertext. Then, syntactical analysis allows one to reorder subject/object/verb and related magic (I'm deliberately hand-waving, but a bad translation is tolerable to make the system start working, for reasons referenced below, and ultimately things should get better).

The software can make a statistical guess, and allow people who have some knowledge of both languages the ability to define better translations and add that to its statistical analysis in the future.

But, in the more general case, you would let the native speaker in the target language select what makes the most sense - since language has context, it's usually easy enough to go from a syntactical and definition translation to a semantic one. That 'most sense' pick comes back to the server to inform the next set of 'best guess'.

Finally, if something doesn't translate right, or a definition isn't found, the English speaker could flag it, and then, some kind-hearted community-spirited multi-lingual folk (or greedy ones - we can probably pay a per phrase charge) can translate the line, and that translation is fed back into the system, starting the cycle over.

This distribution of effort with retention of results would make the system self-propagating and limit the impact of the system on any one person, so as time goes on, things should become better and better through greater use.
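As a rough illustration of the cycle described above (machine proposes, reader picks, picks feed the next guess, gaps get flagged for a human), here is a small sketch. Everything in it is hypothetical: the `CandidateStore` class, its method names, and the Spanish example word are made up for the purpose, and a real system would work on phrases in context rather than single words.

```python
from collections import Counter, defaultdict

class CandidateStore:
    """Keeps provisional translations per source phrase and counts reader picks."""

    def __init__(self):
        self.votes = defaultdict(Counter)

    def add_candidate(self, source, translation):
        # Register a machine-generated candidate without giving it any votes.
        self.votes[source][translation] += 0

    def record_pick(self, source, translation):
        # A reader (or bilingual volunteer) chose or supplied this translation.
        self.votes[source][translation] += 1

    def candidates(self, source):
        """All known candidates, most-picked first (what a hover popup might list)."""
        return [t for t, _ in self.votes.get(source, Counter()).most_common()]

    def best_guess(self, source):
        """Current top candidate, or None if the phrase needs a human translator."""
        ranked = self.candidates(source)
        return ranked[0] if ranked else None

store = CandidateStore()
store.add_candidate("banco", "bank")
store.add_candidate("banco", "bench")
first_guess = store.best_guess("banco")      # provisional guess before any feedback

# Readers of a post about park furniture pick the sense that fits the context.
store.record_pick("banco", "bench")
store.record_pick("banco", "bench")
store.record_pick("banco", "bank")
print(store.candidates("banco"))

# Nothing stored for this phrase, so it would be flagged for a volunteer.
needs_human = store.best_guess("desconocido") is None
```

The point of keeping every pick is the one made above: no single person does much work, but the results are retained, so the default guess improves with use.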

Obviously, the software folks will continue making the software better/smarter, because that's what they do. The multi-lingual folks will always have a chance to translate or transliterate jargon, and handle euphemism and literary license, and make suggestions for better handling of certain syntactical constructs.

Frankly, and I've recommended this to them, this is a job for Google. Or the State Department. Or both.

Tim, all, some ideas:

For a minimal first step, how about an RSS tag element to indicate that “this post is a translation from language [source] to language [dest] of the post at [url]”. This would allow bots which harvest translations to monitor RSS feeds to get new translations for the purpose of accumulating a corpus. Some HTML scraping tools would be required in order to actually harvest both of the posts.
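To show what such a marker might look like, here is a sketch only. RSS 2.0 can be extended with namespaced elements, but the namespace URI, the `translationOf` element, and its `source`/`dest`/`href` attributes below are all invented for this example and are not an existing standard. The Python reads the marker the way a harvesting bot might.

```python
import xml.etree.ElementTree as ET

# Hypothetical namespace and element name; not an existing RSS extension.
NS = "http://example.org/ns/translation"

ITEM = f"""
<item xmlns:tr="{NS}">
  <title>Machine Translation and the Global Blogosphere (French)</title>
  <link>http://example.org/fr/post-1</link>
  <tr:translationOf source="en" dest="fr" href="http://example.org/en/post-1"/>
</item>
"""

def translation_info(item_xml):
    """Return (source_lang, dest_lang, original_url, translated_url), or None."""
    item = ET.fromstring(item_xml)
    # ElementTree addresses namespaced tags as "{namespace-uri}localname".
    marker = item.find(f"{{{NS}}}translationOf")
    if marker is None:
        return None  # an ordinary post, not a translation
    return (
        marker.get("source"),
        marker.get("dest"),
        marker.get("href"),
        item.findtext("link"),
    )

print(translation_info(ITEM))
```

With both URLs in hand, the bot would still need the HTML scraping step mentioned above to pull the two post bodies and store them as an aligned pair.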

The point about categorization and taxonomy is dead on.

Consider: while speech recognition has indeed progressed substantially (as jinnderella indicated), natural language understanding has “strong AI” requirements (it’s the “understanding” that’s really hard). Real robust MT will likely require strong natural language understanding. This is A Ways Off (at least it is unwise to count on it soon).

Further, even with strong understanding, domain context is crucial to creating decent translations. I’m a regular user of web based machine translation – I read the bike racing news from French sources. My knowledge of French is very poor, but when combined with a poor machine translation and my knowledge of the domain (bike racing), is sufficient to render a good translation – likely better, in fact, than a translation from someone who was an expert in French but not bike racing. The idiomatic mappings are sufficiently regular that an approach which combined categorization (based on machine learning / statistical computing) and corpus based machine translation (also based on machine learning / statistical computing) could create very good translations, I believe.

Finally, as indicated above, a human with poor language skills but decent domain expertise can use poor quality MT to create a decent translation, which could be reposted (to be read by humans or used to train a corpus based MT system). Categorization and taxonomy software could be used to “route” posts to the domain experts for translation. In fact, this kind of “taxonomy based routing” could make a boffo service in itself, and may be a great stand-alone “first step”. Taxonomy / classification systems which use Machine Learning also require a corpus of material. This corpus (the “training set”) must be hand categorized by topic. The categorization community uses several such training sets to benchmark their algorithms; one major one is a corpus of Reuters news stories from the 80’s. I’d suggest that one modern and relevant corpus might be the archive of “Winds of Change” itself, having a substantial volume and a detailed taxonomy. Same for other blogs which keep a detailed classification system for their archives.
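A very small sketch of “taxonomy based routing”, under loud assumptions: the four training sentences, the category names, the volunteer handles, and the word-overlap scoring are all invented here to show the shape of the idea. A real classifier would be trained on a large hand-categorized archive and would use a proper statistical model.

```python
from collections import Counter, defaultdict

# Hand-categorized training set (invented examples standing in for a blog archive).
TRAINING = [
    ("cycling", "the peloton chased the breakaway on the final climb"),
    ("cycling", "the sprinter won the stage ahead of the peloton"),
    ("politics", "the parliament passed the election reform bill"),
    ("politics", "the minister lost the election debate"),
]

# Hypothetical volunteers with domain expertise, keyed by category.
EXPERTS = {"cycling": ["bike_fan"], "politics": ["policy_wonk"]}

# Common words that carry no topical signal.
STOPWORDS = {"the", "of", "on", "by", "a", "in"}

def train(examples):
    """Count how often each word appears under each hand-assigned category."""
    counts = defaultdict(Counter)
    for category, text in examples:
        counts[category].update(text.split())
    return counts

def classify(text, counts):
    """Pick the category whose training words overlap the text most, else None."""
    words = [w for w in text.lower().split() if w not in STOPWORDS]
    scores = {cat: sum(c[w] for w in words) for cat, c in counts.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None

def route(text, counts, experts):
    """Return the volunteers who should see this post (empty if uncategorized)."""
    return experts.get(classify(text, counts), [])

counts = train(TRAINING)
print(route("breakaway caught by the peloton before the climb", counts, EXPERTS))
```

The same category label could also be passed along to a corpus based MT engine, so that the translation model and the human reviewer are both chosen by domain.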

Tim, I’m not an MT guru but I have some small knowledge of (and substantial interest in) the classification / taxonomy space, and some (IMHO) cool ideas in that space. I’d be very interested in a follow up post about “the various categorization and taxonomy proposals floating around the blogging and open source communities.” Personally I’m casting about for an OSS project to contribute my modest skills to. Color me interested in helping out.

Good to meet you in silly valley today, Tim (as well as, of course, Omar & Mo). The Arabic blogging tool sounds great!

That's great. But you know, machine translation has its own disadvantages; the main one is that a machine usually makes a lot of mistakes in translation, which can cause misunderstanding and even serious consequences.

