September 24, 2003
Terabytes and Metadata
by Joe Katzman at September 24, 2003 4:57 AM
Barry Briggs has an interesting article about the future of computing, and the likely developments within the field. I think he's on to something here:
Comments
#1 from Iblis at 4:59 am on Sep 24, 2003
If the 200TB hard drive becomes common then Microsoft Windows will take 500GB of space to install (without the extra 'media pack') and each Word document will take half a gig. Throw in some HDTV-quality movies from Kazaa and it'll fill up fast enough.

Waste not a moment! Go forth and track down Hal Draper's hilarious and prescient 1965 story, "MS FND IN A LBRY," and laugh your head off!

Virginia Tech's 4.4 terabyte supercomputer, made up of 1,100 Apple G5s, is coming online today. More here:
#4 from HiRandy at 6:23 pm on Sep 24, 2003
Barry points to an important challenge and correctly, I think, identifies where there will be a lot of developmental emphasis in the future, but he's way too optimistic about the time frame in which this all will happen. Except for well-structured databases, we haven't made a lot of progress in organizing information over the years. We won't be inverting those tables in any serious way for quite some time.

In 1965, I took a course in what was then called "Information Retrieval" given by Gerry Salton at Harvard. He'd done some of the leading work in evaluating strategies to retrieve useful information from a data repository (databases, as in Barry's description, hadn't really been invented yet). His test bed was a collection of scientific research papers which were known entities (so one could get a handle on what should or should not have been retrieved for a given search request). We were sure back then we'd solve all the issues in a few years. Lack of "big" storage devices and significant computing power certainly hampered some of the experimentation, but at that point the discipline of retrieval was still struggling with the basic concepts. Almost 40 years later, we may be smarter about some of the basic concepts, but we're still struggling with most of the rest of information organization and retrieval outside of modern structured databases.

Look at where we were then. There are two basic measures for retrieval -- relevance and precision. The first, relevance (usually called recall today), is basically how much of the total relevant data is identified and retrieved, and the second, precision, is basically how much of the retrieved data is relevant. (In practice results are ranked, not just retrieved or not, but the principles are the same.) Salton developed these metrics back then and they are still the important ones.
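The two measures can be made concrete with a small sketch. The Python below is illustrative only: the function name `precision_recall`, the paper ids, and the toy test bed are assumptions, not anything from Salton's course, and `recall` is the name used here for what the comment calls relevance.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for a single search request.

    retrieved: document ids the search returned
    relevant:  document ids known in advance to be relevant
    """
    retrieved, relevant = set(retrieved), set(relevant)
    hits = retrieved & relevant  # relevant documents that were actually found
    # precision: share of the retrieved documents that are relevant
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    # recall: share of all relevant documents that were retrieved
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical test bed: four papers are known to be relevant to the request.
relevant = {"p1", "p2", "p3", "p4"}
# A made-up search result: two good papers and three irrelevant ones.
retrieved = ["p1", "p2", "p7", "p8", "p9"]

precision, recall = precision_recall(retrieved, relevant)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

Here two of the five retrieved papers are relevant (precision 0.40) and two of the four relevant papers were found (recall 0.50).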
Getting all the relevant data is not useful if you have so much irrelevant data mixed in you can't tell the difference, nor is missing a lot of relevant data very helpful, even if your retrieved data has very little noise. We compared strategies like keywords and text index searches, which, as Barry notes, are the foundations of web searching now. Both methods were really reliable only with small or very structured data sets.

We experimented with other methods and methods within methods and the challenge was always the same. Strategies that tended to retrieve a higher percentage of the total relevant documents also brought in more irrelevant stuff. Strategies that made sure that most of the retrieved documents were truly relevant to your search criteria did so at a cost of missing a lot of other relevant documents. Barry's search for "data" probably got all the important references to "data" but with so low a precision as to be useless. (Yes, you can ask a sensible question and do better, but it illustrates the problem.)

The trick, obviously, is to find methods and data organizations that don't look like a zero-sum game between relevance and precision. One such "exciting" method back then was to use bibliographical (cross-)references as a search and evaluation aid on the theory that relevant documents probably have some connection. Both precision and relevance measures went up. Think about this. In effect, this is all Google is doing by using web page links combined with text and keyword strategies (just think of those future millionaires making it big with a 40-year-old idea). Barry indirectly makes the point that the more relationship structure (meta-data) you build into the data repository, the more one can improve both the relevance and precision metrics in searching.
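The trade-off, and the way cross-references can soften it, can be sketched with a toy example. Everything in the Python below is an assumption made for illustration: the scores, the document ids, the citation links, the `weight` of 0.5, and the helper names `retrieve` and `boost_by_links`. It is not Salton's procedure and it is not how Google ranks pages; it only shows the shape of the idea.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall); recall is what the comment calls relevance."""
    hits = set(retrieved) & set(relevant)
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall


def retrieve(scores, cutoff):
    """Return the ids scoring at least `cutoff`, best first."""
    ranked = sorted(scores, key=scores.get, reverse=True)
    return [doc for doc in ranked if scores[doc] >= cutoff]


def boost_by_links(scores, links, weight=0.5):
    """Give each cited document a share of the text score of every document citing it."""
    boosted = dict(scores)
    for source, targets in links.items():
        for target in targets:
            boosted[target] = boosted.get(target, 0.0) + weight * scores.get(source, 0.0)
    return boosted


# Made-up keyword-match scores for six documents; four are truly relevant.
text_scores = {"d1": 0.9, "d2": 0.8, "d3": 0.6, "d4": 0.5, "d5": 0.3, "d6": 0.2}
relevant = {"d1", "d2", "d4", "d6"}
# Made-up bibliography: d1 cites d6, and d2 cites d4 and d6.
links = {"d1": ["d6"], "d2": ["d4", "d6"]}

# A looser cutoff finds more of the relevant set but lets in more noise.
for cutoff in (0.7, 0.4, 0.1):
    p, r = precision_recall(retrieve(text_scores, cutoff), relevant)
    print(f"text only  cutoff={cutoff}: precision={p:.2f} recall={r:.2f}")

# Adding the citation signal lifts relevant documents that matched weakly on text.
boosted = boost_by_links(text_scores, links)
for cutoff in (0.7, 0.4):
    p, r = precision_recall(retrieve(boosted, cutoff), relevant)
    print(f"with links cutoff={cutoff}: precision={p:.2f} recall={r:.2f}")
```

With this toy data, tightening the text-only cutoff from 0.1 to 0.7 raises precision from about 0.67 to 1.0 while recall falls from 1.0 to 0.5. With the link bonus added, the 0.7 cutoff keeps precision at 1.0 and recall rises to 1.0, because the weakly matching d4 and d6 are cited by strong matches.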
And yes, all the database folks will wince at the idea of even more meta-data, but they sometimes miss how much society has organized itself to fit into relational databases, rather than vice versa. As Barry points out, the new challenge just won't do that.

Bibliographic data (or links) are source meta-data, but most meta-data needs to be created -- i.e. adding keywords is creating meta-data. Keywords are cheap since they don't take a lot of effort and they are still the most used form of meta-data outside of databases. Everything else starts to get expensive. Even the database columns and rows highlighted by Barry are hugely expensive meta structures. We simply can't afford to put that kind of structure on most of the information we deal with. Barry's point that meta-data will be computer generated is certainly true at some point, but given that, even now, there is no satisfactory way even to generate reliable keywords based on semantic analysis, we've got a long way to go.

It's disappointing that we haven't moved farther than we have. Even such a well-documented area as case law searching is only slightly ahead of the answers we had years ago. It's still more art than science doing research in case law data repositories. With the development of relational databases, most of the other forms of conceptual meta-data (how you organize the descriptions of your data) got relegated to other places, like Artificial Intelligence and Philosophy (hey, they've worried about that for centuries, right?). There's been some progress over the years, but an awful lot of problems still aren't understood well, much less solved.

Bigger storage devices and faster CPUs both give us more brute-force options nowadays, especially in areas like pattern recognition, but the hard problems of meta-data organization, structure, relationships, representation, etc. are still conceptual, not computational. And with the exceptions of relational and object databases, we're not that far ahead in implementation from some very old ideas. However,
#5 from Anonymous Coward at 3:12 am on Sep 25, 2003
I have to doubt the 200TB figure given in the article. I personally have over 200GB of used storage in 3 computers (0.1%), and have worked on a supercomputer with 20GB of main memory and over a TB of storage. I almost filled the supercomputer's memory with a large computational fluid dynamics model (don't ask, just a very large data set). So I used 1 part in 10000 of the total knowledge of the human race to solve one tiny ordinary everyday problem? I don't think so. My aerospace department generates terabytes of data on a daily basis.

Close to 50GB of my 200GB is taken up with anime, which is very easily organized. There are maybe a total of a couple hundred files, but what matters is that there is a rather small number of series. I guess my point is that it is not the number of TB which counts, but the number of files we put the TB in. As storage grows larger, I believe the average file size grows larger.

How will we find anything? The same way we do now. Libraries are particularly effective at organizing knowledge (there are more than 200TB in a large university library, I am sure). I continue to be amazed by the effectiveness of the combination of a good index and a good bibliography in the books. So the question becomes how do we get a good bibliography into files? The World Wide Web is a good start. We should have the ability to attach links, a bibliography of sorts, to an arbitrary file.
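One way to picture "a bibliography of sorts" attached to an arbitrary file is a small sidecar file kept next to it. The Python below is only a sketch under stated assumptions: the `.links.json` naming convention, the helpers `sidecar_path`, `add_link` and `read_links`, the file name, and the example.org URLs are all hypothetical, and nothing here is an existing standard or something the comment proposes in detail.

```python
import json
import tempfile
from pathlib import Path


def sidecar_path(target: Path) -> Path:
    """Hypothetical convention: links for 'x.dat' live in 'x.dat.links.json'."""
    return target.with_name(target.name + ".links.json")


def read_links(target: Path) -> list:
    """Return the file's bibliography, or an empty list if it has none."""
    side = sidecar_path(target)
    return json.loads(side.read_text()) if side.exists() else []


def add_link(target: Path, url: str, note: str = "") -> None:
    """Append one reference to the file's sidecar bibliography."""
    links = read_links(target)
    links.append({"url": url, "note": note})
    sidecar_path(target).write_text(json.dumps(links, indent=2))


# Demonstration in a throwaway directory so nothing is left behind.
with tempfile.TemporaryDirectory() as tmp:
    data_file = Path(tmp) / "cfd_run_042.dat"  # made-up data file
    data_file.write_bytes(b"\x00" * 16)        # the data itself is never modified

    add_link(data_file, "https://example.org/mesh-notes", "mesh used for this run")
    add_link(data_file, "https://example.org/solver-paper")

    links = read_links(data_file)
    print(len(links), "links attached to", data_file.name)
```

The design choice worth noting is that the data file stays untouched, so the approach works for any format, at the cost that the sidecar can be lost if the file is moved without it.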
#6 from Ed at 1:54 am on Sep 26, 2003
To HiRandy, what about a terraced approach between relevance and precision (width and depth)? Is there a method to alternate between the two strategies to refine the results? Perhaps they could run a few iterations of each search, using the results to further refine queries dynamically... I'm sure this isn't a revolutionary idea, but I am curious how this has been approached (and my own searches have yielded inadequate results :)

Barry Briggs' blog doesn't seem to allow comments, so I'll comment here.

First, on the growth of hard disks: For many years, hard drives were growing at the same rate as memory, doubling in size every 18 months. About six years ago, significant technological advances allowed this to shift into high gear, with sizes doubling in just 12 months. We now seem to have fallen back to the slower rate, though it's too early yet to tell. Researchers at Seagate think that they have about another factor of 100 to go before they run out of improvements to make. Beyond that, we may have to finally ditch the idea of storing data on spinning rust-coated disks and move to solid-state storage. So we probably won't have 200TB disks in ten years' time, but we will have 20 to 50TB disks. Near enough.

As for the suggestion that 200TB can store all of human knowledge (with room to spare) ... Well. I have over a terabyte of data here at the office. I have another terabyte at home. I presume Barry isn't counting corporate databases as knowledge. He certainly isn't counting video or music. Or photos. [Either that or I control 2% of all human knowledge. Bwahahaha!] Heck, Usenet alone ships over half a terabyte a day.

Metadata is great, but it's not going to cause any huge explosion in storage requirements. Also, it's pretty much useless unless someone indexes it in a database.

On the other hand, MS FND IN A LBRY is brilliant and disturbingly prescient.

By the way, Virginia Tech's MacCluster has 4.4 TB of RAM. If it's equipped with the 160GB drives that come standard with that model, then it has 176TB of disk.
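The growth and cluster figures above are easy to check with a few lines of arithmetic. The Python below is a rough sketch: the 160GB starting drive is an assumption borrowed from the Virginia Tech remark (the comment does not say what size it projects from), the function name `projected_capacity_tb` is made up, and decimal units (1TB = 1000GB) are assumed throughout.

```python
def projected_capacity_tb(start_tb, years, doubling_months):
    """Capacity after `years` if drive size doubles every `doubling_months` months."""
    doublings = years * 12 / doubling_months
    return start_tb * 2 ** doublings


START_TB = 0.16  # assumption: a 160GB drive as the 2003 starting point
YEARS = 10

# Compare the slower (18-month) and faster (12-month) doubling rates.
for months in (18, 12):
    tb = projected_capacity_tb(START_TB, YEARS, months)
    print(f"doubling every {months} months: about {tb:.0f} TB after {YEARS} years")

# The quoted "another factor of 100", applied to the same starting drive.
print(f"factor-of-100 ceiling: about {START_TB * 100:.0f} TB")

# Virginia Tech cluster: 1,100 machines, each with a 160GB drive (the comment's "if").
NODES = 1100
cluster_disk_tb = NODES * 160 / 1000
ram_per_node_gb = 4.4 * 1000 / NODES
print(f"cluster disk: {cluster_disk_tb:.0f} TB, RAM per node: {ram_per_node_gb:.0f} GB")
```

Under these assumptions the 18-month rate lands at roughly 16TB, about the same place as the factor-of-100 ceiling, while the 12-month rate would reach roughly 164TB. That is why a 200TB drive in ten years looks plausible only if the faster pace had held. A larger starting drive shifts the slow-rate figure upward into the 20TB-and-up range.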