(This post is a more formal version, with links, of my rant about machine translation (MT) at the Global Voices session at the Harvard Berkman conference. There's an earlier backgrounder on the marketplace on my home blog, Due Diligence, and also see this general background at Wikipedia.)
An obviously valuable addition to the global blogosphere would be automatic language translation. The good news is that more non-English tools, like Spirit of America's Arabic blogging tool, are popping up, enabling those who don't speak English to join the blogosphere. One downside is the potential for creating language islands, isolating those without bilingual skills. What are the prospects for the blogosphere getting access to state-of-the-art machine translation (MT) technology on reasonable (preferably free) terms? Better than you might think.
(Disclaimer: I am not a technical expert on machine translation. I'm a technology investor and analyst, and I invite additions or corrections from real experts in the comments. That said, the earlier post referenced above was a partial record of a fairly detailed diligence project on the domain, and I'm recycling that research here. Details proprietary to any particular company have been deliberately deleted.)
Machine translation is actually a small business. It's forecast to be about $100m in 2007, which is tiny as high tech markets go. Meanwhile, the human translation market is about ten billion dollars. MT is 1% of that – pretty pathetic. The reason is simple. Machine translation is actually pretty bad, except when you have no alternative because of the volume of text you must handle. This small market size makes it difficult for would-be MT startups to get capital from people like me, since they have to persuade us that they can profitably take a large amount of market share from entrenched competitors, or radically expand the market. A hard sell, to say the least.
The result is that most new MT work is funded by military and intelligence agencies (not just the American ones) that require massive volumes of translation impossible to do with people. The language pairings required by these funders don't necessarily correspond to those the blogosphere might need; unsurprisingly there's a lot of interest right now in Arabic to English, but rather less in languages like Swahili or Malay that were discussed at the panel, and generally less in translating from English.
Machine translation technology is also in transition. Like a lot of natural language processing, it's moving from older rule-based systems to corpus-based techniques. In brief, that means using the increasing power of computers to find statistical patterns in language. What it requires is a huge amount of data to be analyzed before you get any useful output. In the case of a translator, what you need is hundreds of megabytes of parallel, equivalent text in both languages. Gathering that can be very expensive - and remember that these companies have trouble raising money. That leads to things like one project which started by using the proceedings of the Canadian parliament, because they are kept in both French and English by law.
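To make the corpus idea concrete, here's a toy sketch of the kind of statistics such a system extracts from parallel text: counting which words co-occur across aligned sentence pairs. The three-sentence "corpus" and the crude scoring are mine, grossly simplified for illustration — real systems use proper alignment models over vastly larger corpora — but it shows why all those megabytes of parallel text matter.

```python
from collections import Counter
from itertools import product

# Toy parallel corpus: aligned English/French sentence pairs,
# in the spirit of the Canadian parliament example above.
parallel = [
    ("the house", "la maison"),
    ("a house", "une maison"),
    ("the flower", "la fleur"),
]

pair_counts = Counter()  # (english word, french word) co-occurrences
en_counts = Counter()    # total co-occurrences per English word

for en, fr in parallel:
    for e, f in product(en.split(), fr.split()):
        pair_counts[(e, f)] += 1
        en_counts[e] += 1

def score(e, f):
    """Crude translation score: fraction of e's co-occurrences that involve f."""
    return pair_counts[(e, f)] / en_counts[e]

# The candidate that co-occurs with "house" most consistently wins.
best = max((f for (e, f) in pair_counts if e == "house"),
           key=lambda f: score("house", f))
print(best)  # → maison
```

With more data, the same counting separates "house"/"maison" from accidental neighbors like "la" — which is exactly why the corpus, not the engine, is the scarce asset.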
Meanwhile, what about the other 99% of the market ($9.9 billion worth) – the human translators? That market is largely controlled by translation bureaus, which are increasingly Internet-connected, global networks of people with skills in particular language pairs. How do these bureaus differentiate themselves and compete? Their core assets – beyond their human networks – are large databases of parallel texts, which their translators can search for previous similar examples, making their work more efficient. This part of their business is so important that the cost of software to support these databases, whether internally developed or commercially sold, is likely nearly as large as the entire MT market.
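As a rough sketch of what such a database lookup does — the segments and the similarity threshold here are invented for illustration; real bureau tools use indexed databases and more sophisticated fuzzy matching — a translator's query amounts to "find the closest sentence I've translated before":

```python
import difflib

# Hypothetical in-memory "translation memory": (source, translation) segments.
memory = [
    ("The committee will meet on Tuesday.", "Le comité se réunira mardi."),
    ("The meeting is adjourned.", "La séance est levée."),
]

def best_match(sentence, threshold=0.6):
    """Return the most similar previously translated segment, if close enough."""
    scored = [
        (difflib.SequenceMatcher(None, sentence, src).ratio(), src, tgt)
        for src, tgt in memory
    ]
    score, src, tgt = max(scored)
    return (src, tgt, score) if score >= threshold else None

match = best_match("The committee will meet on Friday.")
```

A near-miss like "Friday" for "Tuesday" still surfaces the old translation, which the human translator then adapts — the efficiency gain the bureaus are protecting.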
So, do you think these service bureaus are going to make their databases of translations open to the corpus based MT guys, so they can try and take market share? Right...
Now let me drag in (by reference) the open source movement and suggest how it might apply to creating mutual benefit for the blogosphere and the MT vendors and their other users.
First, where open source is not necessary, and perhaps not feasible, is in the translation technology itself. Unlike something like an eVoting system, the input and output of an MT engine are readily inspected for errors by many experts. Second, most of the deep experts in the domain are already busy – what we want to do is make it easier (read, cheaper) for them and their companies to develop more language pairs.
So, the assets that really need to be open are the various collections of parallel texts, and perhaps a database engine that allows them to be gathered, stored, and retrieved. The terms of openness may need to include some variant of Creative Commons that allows these texts to be used to train MT engines which will then be sold, but only if the vendor agrees to some useful giveback to the blogosphere, such as availability of a limited-functionality, XML-enabled server (or service) that can be plugged into blogging or other community software for free.
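To make the "XML-enabled service" idea a bit more concrete, here's a hypothetical sketch of what such a service's request/response wrapper might look like. The element names and the stub translator are invented for illustration, not any vendor's actual schema:

```python
import xml.etree.ElementTree as ET

def translate_stub(text, source, target):
    """Stand-in for a real MT engine call; a vendor's engine would go here."""
    return "[%s->%s] %s" % (source, target, text)

def build_response(text, source, target):
    """Wrap a translation in a simple XML envelope a blog tool could consume."""
    root = ET.Element("translation", {"src": source, "tgt": target})
    ET.SubElement(root, "input").text = text
    ET.SubElement(root, "output").text = translate_stub(text, source, target)
    return ET.tostring(root, encoding="unicode")

xml_out = build_response("Hello, world", "en", "fr")
print(xml_out)
```

The point isn't the schema, which would need to be standardized — it's that a free, machine-readable endpoint like this is the kind of giveback blogging software could actually plug into.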
And the blogosphere's contribution is obviously the translated texts. The good news is we already have a lot of bilingual, motivated people with plenty to talk about, many of them drawn from the diasporas that were discussed at the panel. What's needed is to help them organize translation of blog posts in a way that is immediately valuable to their communities, and over time bootstraps the translation repository so that we no longer depend solely on their time and goodwill.
Since all machine translation techniques work better when the domain of discourse is known and used to categorize both the training data and the text to be translated, there may be some advantage in at least creating hooks for the various categorization and taxonomy proposals floating around the blogging and open source communities.
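As a minimal illustration of such a hook — the domain tags and segments here are invented — a parallel-text repository could simply carry a category field that training or lookup then filters on:

```python
# Hypothetical repository entries, each tagged with a domain of discourse.
corpus = [
    {"domain": "politics", "en": "The motion carries.", "fr": "La motion est adoptée."},
    {"domain": "cooking", "en": "Stir the sauce.", "fr": "Remuez la sauce."},
]

def segments_for(domain):
    """Select only the parallel segments from the requested domain."""
    return [(s["en"], s["fr"]) for s in corpus if s["domain"] == domain]

print(segments_for("politics"))
```

Training an engine on "politics" segments alone is what keeps it from rendering kitchen vocabulary with parliamentary idioms, and a shared taxonomy is what would let different blogs tag consistently.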
OK, that's hopefully enough to launch the meme. Please tear it up, refine, and propagate. I need to do a little research that might yield more specific action items, which might yield a further post.