December 12, 2004
Machine Translation and the Global Blogosphere
by Tim Oren at December 12, 2004 2:23 AM
(This post is a more formal version, with links, of my rant about machine translation (MT) at the Global Voices session at the Harvard Berkman conference. There's an earlier backgrounder on the marketplace on my home blog, Due Diligence, and also see this general background at Wikipedia.)
An obviously valuable addition to the global blogosphere would be automatic language translation. The good news is that more non-English tools like Spirit of America's Arabic blogging tool are popping up, enabling those who don't speak English to join the blogosphere. One downside is the potential for creating language islands, isolating those without bilingual skills. What are the prospects for the blogosphere getting access to state-of-the-art machine translation (MT) technology on reasonable (preferably free) terms? Better than you might think.
(Disclaimer: I am not a technical expert on machine translation. I'm a technology investor and analyst, and I invite additions or corrections from real experts in the comments. That said, the earlier post referenced above was a partial record of a fairly detailed diligence project on the domain, and I'm recycling that research here. Details proprietary to any particular company have been deliberately deleted.)
Machine translation is actually a small business. It's forecast to be about $100m in 2007, which is tiny as high tech markets go. Meanwhile, the human translation market is about ten billion dollars. MT is 1% of that - pretty pathetic. The reason is simple. Machine translation is actually pretty bad, except when you have no alternative because of the volume of text you must handle.
This small market size makes it difficult for would-be MT startups to get capital from people like me, since they have to persuade us that they can profitably take a large amount of market share from entrenched competitors, or radically expand the market. Hard sell, to say the least.
The result is that most new MT work is funded by military and intelligence agencies (not just the American ones), which require massive volumes of translation impossible to do with people. The language pairings required by these funders don't necessarily correspond to those the blogosphere might need; unsurprisingly there's a lot of interest right now in Arabic to English, but rather less in languages like Swahili or Malay that were discussed at the panel, and generally less in translating from English.
Machine translation technology is also in transition. Like a lot of natural language processing, it's moving from older rule-based systems to corpus-based techniques. Without going into details, that means using the increasing power of computers to find statistical patterns in language. What it requires is a huge amount of data to be analyzed before you get any useful output. In the case of a translator, what you need is hundreds of megabytes of parallel, equivalent text in both languages. Gathering that can be very expensive - and remember that these companies have trouble raising money. That leads to things like one project which started by using the proceedings of the Canadian parliament, because they are kept in both French and English by law.
Meanwhile, what about the other 99% of the market ($9.9 billion worth) - the human translators? That market is largely controlled by translation bureaus, which are increasingly Internet-connected, global networks of people with skills in particular language pairs. How do these bureaus differentiate themselves and compete? Their core assets - beyond their human networks - are large databases of parallel texts, which their translators can search for previous similar examples that make their work more efficient. This part of their business is so important that the cost of software to support these databases, whether internally developed or openly sold, is likely nearly as large as the entire MT market.
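As a rough illustration of what "finding statistical patterns" in parallel text means, here is a minimal Python sketch. Everything in it is an assumption made for illustration: the four sentence pairs are invented (loosely in the spirit of bilingual parliamentary records), and the Dice co-occurrence score is a deliberately simple stand-in, not the method any real MT vendor uses.

```python
from collections import Counter
from itertools import product

# Invented toy parallel corpus: (English sentence, French sentence).
PARALLEL = [
    ("the house is open", "la chambre est ouverte"),
    ("the house is closed", "la chambre est fermée"),
    ("the session is open", "la séance est ouverte"),
    ("the session is closed", "la séance est fermée"),
]

def dice_table(pairs):
    """Score every (source word, target word) pair by how often they co-occur."""
    src_count, tgt_count, joint = Counter(), Counter(), Counter()
    for src, tgt in pairs:
        s_words, t_words = set(src.split()), set(tgt.split())
        src_count.update(s_words)   # sentences containing each source word
        tgt_count.update(t_words)   # sentences containing each target word
        joint.update(product(s_words, t_words))  # sentence pairs containing both
    # Dice coefficient: 2 * joint / (source count + target count)
    return {
        (s, t): 2 * n / (src_count[s] + tgt_count[t])
        for (s, t), n in joint.items()
    }

def best_match(word, table):
    """Return the target word with the highest score, or None if unseen."""
    candidates = {t: score for (s, t), score in table.items() if s == word}
    return max(candidates, key=candidates.get) if candidates else None

table = dice_table(PARALLEL)
print(best_match("house", table))
print(best_match("open", table))
```

On this toy corpus, content words such as "house" and "open" pair off cleanly with "chambre" and "ouverte", while a function word like "the" stays ambiguous because it co-occurs equally with "la" and "est". Resolving that kind of ambiguity is part of why so much parallel text is needed.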
So, do you think these service bureaus are going to make their databases of translations open to the corpus-based MT guys, so they can try and take market share? Right...
Now let me drag in (by reference) the open source movement and suggest how it might apply to creating mutual benefit for the blogosphere and the MT vendors and their other users. First, where open source is not necessary or perhaps feasible is in the translation technology itself. Unlike something like an eVoting system, the input and output of an MT engine are readily inspected for error by many experts. Second, most of the deep experts in the domain are already busy - what we want to do is make it easier (read, cheaper) for them and their companies to develop more language pairs.
So, the assets that really need to be open are the various collections of parallel texts, and perhaps a database engine that allows them to be gathered, stored, and retrieved. The terms of openness may need to include some variant of Creative Commons that allows these texts to be used to train MT engines which will then be sold, but only if the vendor agrees to some useful giveback to the blogosphere, such as availability of a limited functionality, XML enabled server (or service) that can be plugged into blogging or other community software for free.
And the blogosphere's contribution is obviously the translated texts. The good news is we already have a lot of bilingual, motivated people with plenty to talk about, many of them drawn from the diasporas that were discussed at the panel. What's needed is to help them organize translation of blog posts in a way that is immediately valuable to their communities, and that over time bootstraps the translation repository so that we no longer have to depend solely on their time and goodwill.
Since all machine translation techniques work better when the domain of discourse is known and used to categorize both the training data and the text to be translated, there may be some advantage in at least creating hooks for the various categorization and taxonomy proposals floating around the blogging and open source communities.
OK, that's hopefully enough to launch the meme. Please tear it up, refine, and propagate. I need to do a little research that might yield more specific action items, which might yield a further post.
Tracked: December 16, 2004 9:27 AM
Excerpt: Andrew Joscelyne's weblog Blogos, on language technologies, has an entry in English about the use of translation systems in weblogs. Andrew Joscelyne has an entry in Blogos on possibilities for machine translation in weblogs. One s...
Comments
Thanks for the primer on where MT is these days. Used to share offices with a translation bureau (about 2 decades ago) and the first "computer-assisted translations" were beyond funny. But the entire rule-based approach has been a real stumper. It's not just fragments and slang or colloquialisms that present problems; there's no way to get to any of the "blends" that the cognitive/linguistic types talk about.
I'm interested in learning more about some of the issues. Can you point me to info on the sorts of proposals you referred to here? The more I deal with blogs, the more I find the whole area of taxonomies becomes interesting/essential. Much obliged!
Nadezha, good question, and I will put together a few followup links, maybe a post. But right now I just got back from drinks and a big dinner with AL, Blackfive, Jeff Jarvis, Omar and Mohammed and the Spirit of America folks and I am wasted. Tomorrow is airplane and helping my wife with a holiday gathering, so likely Monday before I get to it, just don't want you to think I'm blowing it off. Brain cells shutting down in 5,4,3....
#3 from Factory at 5:34 am on Dec 12, 2004
Hmm nice idea.
"The terms of openness may need to include some variant of Creative Commons that allows these texts to be used to train MT engines which will then be sold, but only if the vendor agrees to some useful giveback to the blogosphere, such as availability of a limited functionality, XML enabled server (or service) that can be plugged into blogging or other community software for free."
Hmm that turned into a bit of a waffle.
I've worked as a professional technical translator for 30 years and I'll guarantee no software at any price renders a faithful translation. Had I a blog, I would not want it translated by software, because it would surely say things I never intended, some of which would be ******* ******.
PacRim Jim: I'm fully aware of the limits of MT, having used it in projects before. Where the blog situation differs from most document translation projects is the presence of people who can almost always work around the flaws of the translation. I've in fact seen this work in a 'social software' setting, almost a decade ago - something I should probably write up.
Factory - all good issues, and worth thinking about, particularly in terms of bootstrapping, i.e., what is the smallest useful thing that could be done. There's always a chance of mission creep, but doing actual translation software would probably be the last target, considering the lead time to working software it would imply. As far as licensing terms go, IANAL; what I wrote was a requirements statement for someone who is an attorney. ...heading to airport...
PacRim Jim: I wouldn't give up hope just yet. Consider the non-trivial [English language] speech recognition problems of the eighties, which have been basically solved. Hmm, didn't Robin have some stuff on Star Trek translators? I'll look for some DARPA/DLI links, but I remember reading about a proposed machine translation contest, like the autonomous vehicle contest earlier this year.
I personally think we have to jettison the stodgy old Chomsky computational linguistics models that we've been stuck with for a half century. :(
I think you've missed a couple of points. For one, the translation in blogspace doesn't have to be perfect (as it would if someone were trying to translate some technical report or business contract and correspondence, or whatnot). As long as the ideas can come through, even somewhat garbled, we can get the gist of what the person was saying - at least enough to have an argument with them, which is the ultimate goal.
The other is one of the more important aspects of Open technologies: community. I could envision creating technology that enables a community of people, who are trying to translate works back and forth between language pairs, to work collaboratively. This, in concert with the notion of hypertext (where a word doesn't just have its text, but can 'mean' several things on different levels), has been limited in its role in the web - we can do better. For instance, if I hover over a word in a translation, I might get several variants of the potential meanings of the translation, some definitions, etc...
Let me give you a real-life example of how this might play out in an Open Source/Community: let's say I want to translate some article from some website from Arabic to English for the noblest of all reasons - I want to find out what they're saying about me (cough). I speak English. I don't care that the translation is not perfect, nor do I care that the translation output is not proper English and I might have to mouse-over some words or phrases - I'm trying to get the meaning and sense of the article. (This is plenty good enough for blogspace: each person writes in their native language, and the reader reads (with the added difficulty of machine translation) in their native language.)
First, we let the machines do what they do best - handling lists and generating several provisional translations for each word, making a 'guess' as to which one was right, but that guess doesn't need to be right with hypertext. Then, syntactical analysis allows one to reorder subject/object/verb and related magic (I'm deliberately hand-waving, but a bad translation is tolerable to make the system start working, for reasons referenced below, and ultimately things should get better).
The software can make a statistical guess, and allow people who have some knowledge of both languages to define better translations and add them to its statistical analysis in the future. But, in the more general case, you would let the native speaker in the target language select what makes the most sense - since language has context, it's usually easy enough to go from a syntactical and definition translation to a semantic one. That 'most sense' pick comes back to the server to inform the next set of 'best guesses'.
Finally, if something doesn't translate right, or a definition isn't found, the English speaker could flag it, and then some kind-hearted, community-spirited multi-lingual folk (or greedy ones - we can probably pay a per-phrase charge) can translate the line, and that translation is fed back into the system, starting the cycle over.
This distribution of effort with retention of results would make the system self-propagating and limit the impact of the system on any one person, so as time goes on, things should become better and better through greater use. Obviously, the software folks will continue making the software better/smarter, because that's what they do. The multi-lingual folks will always have a chance to translate or transliterate jargon, handle euphemism and literary license, and make suggestions for better handling of certain syntactical constructs.
Frankly, and I've recommended this to them, this is a job for Google. Or the State Department.
Or both.
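The feedback cycle described in this comment (provisional candidates, a reader picking the one that makes the most sense, unknown phrases flagged for bilingual volunteers) can be sketched in a few lines of Python. This is only a guess at one possible shape: the class name `CommunityGlossary`, its methods, and the sample phrase are all hypothetical, and a plain vote count stands in for whatever statistics a real system would keep.

```python
from collections import defaultdict

class CommunityGlossary:
    """Hypothetical store of candidate translations ranked by reader picks."""

    def __init__(self):
        # phrase -> {candidate translation -> number of readers who chose it}
        self.votes = defaultdict(dict)
        # phrases nobody could translate yet, waiting for a bilingual volunteer
        self.flagged = set()

    def add_candidate(self, phrase, translation):
        """Register a provisional translation (from the machine or a person)."""
        self.votes[phrase].setdefault(translation, 0)

    def candidates(self, phrase):
        """All known candidates, most-picked first (what a mouse-over would show)."""
        ranked = sorted(self.votes.get(phrase, {}).items(), key=lambda kv: -kv[1])
        return [translation for translation, _ in ranked]

    def best_guess(self, phrase):
        """Return the current best guess, or flag the phrase if none exists."""
        options = self.candidates(phrase)
        if not options:
            self.flagged.add(phrase)
            return None
        return options[0]

    def record_choice(self, phrase, translation):
        """Feed a reader's 'makes the most sense' pick back into the ranking."""
        counts = self.votes[phrase]
        counts[translation] = counts.get(translation, 0) + 1
        self.flagged.discard(phrase)

glossary = CommunityGlossary()
glossary.add_candidate("banco", "bench")
glossary.add_candidate("banco", "bank")
glossary.record_choice("banco", "bank")   # a reader picks the sense that fits
print(glossary.best_guess("banco"))
```

The point of the sketch is the retention of results: each pick changes what the next reader sees first, and a phrase with no candidates lands in `flagged` rather than disappearing.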
#8 from lewy14 at 12:59 am on Dec 13, 2004
Tim, all, some ideas:
For a minimal first step, how about an RSS tag element to indicate that “this post is a translation from language [source] to language [dest] of the post at [url]”? This would allow bots which harvest translations to monitor RSS feeds to get new translations for the purpose of accumulating a corpus. Some HTML scraping tools would be required in order to actually harvest both of the posts.
The point about categorization and taxonomy is dead on. Consider: while speech recognition has indeed progressed substantially (as jinnderella indicated), natural language understanding has “strong AI” requirements (it’s the “understanding” that’s really hard). Real robust MT will likely require strong natural language understanding. This is A Ways Off (at least it is unwise to count on it soon).
Further, even with strong understanding, domain context is crucial to creating decent translations. I’m a regular user of web-based machine translation - I read the bike racing news from French sources. My knowledge of French is very poor, but when combined with a poor machine translation and my knowledge of the domain (bike racing), it is sufficient to render a good translation - likely better, in fact, than a translation from someone who was an expert in French but not bike racing.
The idiomatic mappings are sufficiently regular that an approach which combined categorization (based on machine learning / statistical computing) and corpus-based machine translation (also based on machine learning / statistical computing) could create very good translations, I believe.
Finally, as indicated above, a human with poor language skills but decent domain expertise can use poor quality MT to create a decent translation, which could be reposted (to be read by humans or used to train a corpus-based MT system). Categorization and taxonomy software could be used to “route” posts to the domain experts for translation.
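To make the RSS suggestion above a little more concrete, here is a minimal Python sketch using the standard library's `xml.etree.ElementTree`. The element name `translationOf`, its `source`, `dest` and `href` attributes, and the `http://example.org/ns/translation` namespace are all invented for illustration; no such RSS extension is being claimed to exist.

```python
import xml.etree.ElementTree as ET

# Hypothetical namespace for the proposed "this post is a translation" marker.
TR_NS = "http://example.org/ns/translation"
ET.register_namespace("tr", TR_NS)

def make_item(title, link, source_lang, dest_lang, original_url):
    """Build an RSS <item> carrying the hypothetical translation marker."""
    item = ET.Element("item")
    ET.SubElement(item, "title").text = title
    ET.SubElement(item, "link").text = link
    ET.SubElement(
        item,
        f"{{{TR_NS}}}translationOf",
        {"source": source_lang, "dest": dest_lang, "href": original_url},
    )
    return item

def harvest(feed_xml):
    """Return (source, dest, original URL, translation URL) for marked items."""
    root = ET.fromstring(feed_xml)
    pairs = []
    for item in root.iter("item"):
        marker = item.find(f"{{{TR_NS}}}translationOf")
        if marker is not None:
            pairs.append((
                marker.get("source"),
                marker.get("dest"),
                marker.get("href"),
                item.findtext("link"),
            ))
    return pairs

# Usage: a feed with one translated post, as a harvesting bot might see it.
rss = ET.Element("rss", {"version": "2.0"})
channel = ET.SubElement(rss, "channel")
channel.append(make_item("Hello", "http://example.org/en/1",
                         "ar", "en", "http://example.org/ar/1"))
print(harvest(ET.tostring(rss, encoding="unicode")))
```

A harvesting bot would still need to fetch and scrape both URLs, as noted above; the marker only tells it which pairs of posts are worth fetching.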
In fact, this kind of “taxonomy based routing” could make a boffo service in itself, and may be a great stand-alone “first step”.
Taxonomy / classification systems which use Machine Learning also require a corpus of material. This corpus (the “training set”) must be hand categorized by topic. The categorization community uses several such training sets to benchmark their algorithms; one major one is a corpus of Reuters news stories from the 80’s. I’d suggest that one modern and relevant corpus might be the archive of “Winds of Change” itself, having a substantial volume and a detailed taxonomy. Same for other blogs which keep a detailed classification system for their archives.
Tim, I’m not an MT guru but I have some small knowledge of (and substantial interest in) the classification / taxonomy space, and some (IMHO) cool ideas in that space. I’d be very interested in a follow up post about “the various categorization and taxonomy proposals floating around the blogging and open source communities.” Personally I’m casting about for an OSS project to contribute my modest skills to. Color me interested in helping out.
Good to meet you in silly valley today, Tim (as well as, of course, Omar & Mo). The Arabic blogging tool sounds great!
That's great. But you know, machine translation has its own disadvantages; the main one is that a machine usually makes a lot of mistakes in translation, which can cause misunderstanding and even serious consequences.
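For readers curious what lewy14's “taxonomy based routing” might look like in practice, here is a small Python sketch. It is an assumption-laden toy: the hand-categorized training posts, the category names, and the reader names in `EXPERTS` are invented, and a bare-bones naive Bayes classifier with add-one smoothing stands in for whatever a real categorization system would use.

```python
import math
from collections import Counter, defaultdict

# Invented, hand-categorized "training set" of posts.
TRAINING = [
    ("cycling", "peloton sprint stage rider climb"),
    ("cycling", "rider wins mountain stage after breakaway"),
    ("politics", "parliament votes on election law"),
    ("politics", "minister debates election reform in parliament"),
]

# Hypothetical readers with domain expertise in each category.
EXPERTS = {"cycling": ["reader_a"], "politics": ["reader_b"]}

def train(examples):
    """Count words per category and posts per category."""
    word_counts = defaultdict(Counter)
    doc_counts = Counter()
    for category, text in examples:
        doc_counts[category] += 1
        word_counts[category].update(text.lower().split())
    return word_counts, doc_counts

def classify(text, word_counts, doc_counts):
    """Pick the category with the highest naive Bayes log-probability."""
    vocab = {w for counts in word_counts.values() for w in counts}
    total_docs = sum(doc_counts.values())
    best, best_score = None, -math.inf
    for category, counts in word_counts.items():
        total_words = sum(counts.values())
        score = math.log(doc_counts[category] / total_docs)  # category prior
        for word in text.lower().split():
            # Add-one smoothing so unseen words do not zero out a category.
            score += math.log((counts[word] + 1) / (total_words + len(vocab)))
        if score > best_score:
            best, best_score = category, score
    return best

def route(text, model, experts):
    """Return the readers who should be offered this post for translation."""
    return experts.get(classify(text, *model), [])

model = train(TRAINING)
print(route("the rider attacked on the final climb", model, EXPERTS))
```

With a real archive, the training set would be the blog's own categorized posts, as the comment suggests; the routing step itself stays this simple.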