Barry Briggs has an interesting article about the future of computing, and the likely developments within the field. I think he's on to something here:
bq. "Put a different way, in 200 terabytes one can store the entire accumulated knowledge of the human race with much more than half the disk to spare. All this capacity raises two fundamental questions: if we can put all this knowledge on a single commodity disk, how will we ever find anything? And if all that only requires a fraction of the available space, what will we use the rest for? The answers, it turns out, are related; and, I think, as we outline them we will also begin to see the shape of the next great revolution in computing."

If the 200TB hard drive becomes common then Microsoft Windows will take 500GB of space to install (without the extra 'media pack') and each Word document will take half a Gig.
Throw in some HDTV-quality movies from Kazaa and it'll fill up fast enough.
Waste not a moment! Go forth and track down Hal Draper's hilarious and prescient 1965 story, "MS FND IN A LBRY," and laugh your head off!
Virginia Tech's 4.4 terabyte supercomputer, made up of 1,100 Apple G5s, is coming online today. More here:
*VT Supercomputer*
Barry points to an important challenge and correctly, I think, identifies where there will be a lot of developmental emphasis in the future, but he's way too optimistic about the time frame in which this all will happen. Except for well-structured databases, we haven't made a lot of progress in organizing information over the years. We won't be inverting those tables in any serious way for quite some time.
In 1965, I took a course in what was then called "Information Retrieval" given by Gerry Salton at Harvard. He'd done some of the leading work in evaluating strategies to retrieve useful information from a data repository (databases, as in Barry's description, hadn't really been invented yet.) His test bed was a collection of scientific research papers which were known entities (and one could get a handle on what should or should not have been retrieved depending on the search request.) We were sure back then we'd solve all the issues in a few years. Lack of "big" storage devices and significant computing power certainly hampered some of the experimentation, but at this point the discipline of retrieval was still struggling with the basic concepts. Almost 40 years later, we may be smarter with some of the basic concepts, but we're still struggling with most of the rest of information organization and retrieval outside of modern structured databases.
Look at where we were then. There are two basic measures for retrieval -- relevance and precision. The first, relevance (usually called recall in the retrieval literature), is basically how much of the total relevant data is identified and retrieved, and the second, precision, is basically how much of the retrieved data is relevant. (In practice results are ranked, not just retrieved or not, but the principle is the same.) Salton developed these metrics back then and they are still the important ones. Getting all the relevant data is not useful if you have so much irrelevant data mixed in you can't tell the difference, nor is missing a lot of relevant data very helpful, even if your retrieved data has very little noise.
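To make the two measures concrete, here is a minimal sketch in Python. The function name, the document IDs, and the two sample result sets are all made up for illustration; the first value returned is the measure called "relevance" above (recall in the usual terminology), the second is precision.

```python
def recall_and_precision(retrieved, relevant):
    """Return (recall, precision) for one search result.

    recall    = share of all relevant documents that were retrieved
    precision = share of the retrieved documents that are relevant
    """
    retrieved = set(retrieved)
    relevant = set(relevant)
    hits = retrieved & relevant  # relevant documents that were actually found
    recall = len(hits) / len(relevant) if relevant else 0.0
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    return recall, precision


# Hypothetical collection: four documents are truly relevant to the query.
relevant_docs = {"d1", "d2", "d3", "d4"}

# A broad strategy finds everything relevant but drags in a lot of noise.
broad = ["d1", "d2", "d3", "d4", "d5", "d6", "d7", "d8", "d9", "d10"]
# A narrow strategy returns only relevant documents but misses half of them.
narrow = ["d1", "d2"]

broad_result = recall_and_precision(broad, relevant_docs)    # (1.0, 0.4)
narrow_result = recall_and_precision(narrow, relevant_docs)  # (0.5, 1.0)
print(broad_result, narrow_result)
```

The broad search scores perfectly on the first measure and poorly on the second; the narrow search does the reverse. That is the trade-off in miniature.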
We compared strategies like keywords and text index searches, which, as Barry notes, are the foundations of web searching now. Both methods were really reliable only with small or very structured data sets. We experimented with other methods and methods within methods and the challenge was always the same. Strategies that tended to retrieve a higher percentage of the total relevant documents also brought in more irrelevant stuff. Strategies that made sure that most of the retrieved documents were truly relevant to your search criteria did so at a cost of missing a lot of other relevant documents. Barry's search for "data" probably got all the important references to "data" but with so low a precision as to be useless. (Yes, you can ask a sensible question and do better, but it illustrates the problem.)
The trick, obviously, is to find methods and data organizations that don't look like a zero-sum game between relevance and precision. One such "exciting" method back then was to use bibliographical (cross-)references as a search and evaluation aid on the theory that relevant documents probably have some connection. Both precision and relevance measures went up. Think about this. In effect, this is all Google is doing by using web page links combined with text and keyword strategies. (Just think of those future millionaires making it big with a 40-year-old idea.)
Barry indirectly makes the point that the more relationship structure (meta-data) you build into the data repository, the more one can improve both the relevance and precision metrics in searching. And yes, all the database folks will wince at the idea of even more meta-data, but they sometimes miss how much society has organized itself to fit into relational databases, rather than vice versa. As Barry points out, the new challenge just won't do that.
Bibliographic data (or links) are source meta-data, but most meta-data needs to be created -- i.e. adding keywords is creating meta-data. Keywords are cheap since they don't take a lot of effort and they are still the most used form of meta-data outside of databases. Everything else starts to get expensive. Even the database columns and rows highlighted by Barry are hugely expensive meta structures. We simply can't afford to put that kind of structure on most of the information we deal with. Barry's point that meta-data will be computer generated is certainly true at some point, but given that there is currently no satisfactory way even to generate reliable keywords based on semantic analysis, we've got a long way to go. It's disappointing that we haven't moved farther than we have. Even an area as well documented as case law searching is only slightly ahead of the answers we had years ago. It's still more art than science doing research in case law data repositories.
With the development of relational databases, most of the other forms of conceptual meta-data (how you organize the descriptions of your data) got relegated to other places, like Artificial Intelligence and Philosophy (hey, they've worried about that for centuries, right?). There's been some progress over the years, but an awful lot of problems still aren't understood well, much less solved. Bigger storage devices and faster CPUs both give us more brute-force options nowadays especially in areas like pattern recognition, but the hard problems of meta-data organization, structure, relationships, representation, etc. are still conceptual, not computational. And with the exceptions of relational and object databases, we're not that far ahead in implementation from some very old ideas.
However, I have to doubt the 200TB figure given in the article. I personally have over 200GB of used storage in 3 computers (0.1%), and have worked on a supercomputer with 20GB of main memory and over a TB of storage. I almost filled the supercomputer's memory with a large computational fluid dynamics model (don't ask, just a very large data set). So I used 1 part in 10000 of the total knowledge of the human race to solve one tiny ordinary everyday problem? I don't think so. My aerospace department generates terabytes of data on a daily basis.
Close to 50GB of my 200GB is taken up with anime, which is very easily organized. There are maybe a total of a couple hundred files, but what matters is that there is a rather small number of series.
I guess my point is that it is not the number of TB which counts, but the number of files we put the TB in. As storage grows larger, I believe the average file size grows larger.
How will we find anything? The same way we do now. Libraries are particularly effective at organizing knowledge (There are more than 200TB in a large university library, I am sure). I continue to be amazed by the effectiveness of the combination of a good index and a good bibliography in the books. So the question becomes how do we get a good bibliography into files? The World Wide Web is a good start. We should have the ability to attach links, a bibliography of sorts, to an arbitrary file.
To HiRandy, what about a terraced approach between relevance and precision (width and depth)? Is there a method to alternate between the two strategies to refine the results? Perhaps they could run a few iterations of each search, using the results to further refine queries dynamically...
I'm sure this isn't a revolutionary idea, but I am curious how this has been approached (and my own searches have yielded inadequate results :)
Barry Briggs' blog doesn't seem to allow comments, so I'll comment here.
First, on the growth of hard disks: For many years, hard drives were growing at the same rate as memory, doubling in size every 18 months. About six years ago, significant technological advances allowed this to shift into high gear, with sizes doubling in just 12 months. We now seem to have fallen back to the slower rate, though it's too early yet to tell.
Researchers at Seagate think that they have about another factor of 100 to go before they run out of improvements to make. Beyond that, we may have to finally ditch the idea of storing data on spinning rust-coated disks and move to solid-state storage.
So we probably won't have 200TB disks in ten years' time, but we will have 20 to 50TB disks. Near enough.
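The arithmetic behind that estimate can be sketched in a few lines of Python. The 250GB starting capacity is my own assumption for a large commodity drive today, not a figure from Barry's article, and the function name is just for illustration.

```python
def projected_capacity_gb(start_gb, years, doubling_months):
    """Capacity after `years` if drive size doubles every `doubling_months`."""
    doublings = years * 12 / doubling_months
    return start_gb * 2 ** doublings


START_GB = 250  # assumption: a large commodity drive at the time of writing

slow = projected_capacity_gb(START_GB, 10, 18)  # doubling every 18 months
fast = projected_capacity_gb(START_GB, 10, 12)  # doubling every 12 months

print(f"18-month doubling: about {slow / 1000:.0f} TB")
print(f"12-month doubling: about {fast / 1000:.0f} TB")
```

Under these assumptions the slower rate gives roughly a hundredfold increase over ten years, which lands in the tens of terabytes. Only the faster 12-month rate, sustained for the whole decade, would get past 200TB.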
As for the suggestion that 200TB can store all of human knowledge (with room to spare) ... Well.
I have over a terabyte of data here at the office. I have another terabyte at home. I presume Barry isn't counting corporate databases as knowledge. He certainly isn't counting video or music. Or photos.
[Either that or I control 2% of all human knowledge. Bwahahaha!]
Heck, Usenet alone ships over half a terabyte a day.
Metadata is great, but it's not going to cause any huge explosion in storage requirements. Also, it's pretty much useless unless someone indexes it in a database.
On the other hand, MS FND IN A LBRY is brilliant and disturbingly prescient.
By the way, Virginia Tech's MacCluster has 4.4 TB of RAM. If it's equipped with the 160GB drives that come standard with that model, then it has 176TB of disk.