The Internet for Everyone -- Chapter 18, Section 6

Organizing and Cataloging Internet Resources

If you walked into any modern research library, you would not even consider the prospect of browsing the shelves to find something. A building with two million books would take a lot of browsing before you found what you were looking for. Instead, you would make use of the online catalog. Modern online catalogs allow searching in the traditional ways -- author's name, title, and subject -- as well as keyword searching of the online abstracts. You narrow your search down to a short list of titles, you jot down the call numbers, and you head for the stacks.

When the quantity of resources on the Internet grows to a scale reaching or exceeding the sum total of the paper titles in all research libraries on Earth, it will become at least as unthinkable to browse to find the document you are looking for. We will have to rely on a similar strategy for Internet searches as for bricks-and-mortar library searches: First you will perform a catalog search, then you will select among the titles that seem promising.

From the perspective of users, Internet navigation already is a serious concern. Organizing tools like Archie and Veronica are a beginning, but they are likely to prove inadequate over time. One of the reasons why library catalogs work well is that they are created with a great deal of human effort. At first glance it may seem appealing to try to shoehorn Internet cataloging into an existing scheme, such as the Dewey Decimal System or the Library of Congress system. But these systems have a different goal than one wants in an Internet catalog: the call number system used in a conventional catalog is really a way to associate a specific shelf location with an entry in an index. Neither Dewey nor LC numbering really attempts to create a rational ordering for all knowledge, in which all logically related items are placed in adjacent spots in the catalog. Knowledge is a multidimensional web that cannot be represented in a simple hierarchical cataloging scheme.

But this is not to say that the Internet could not benefit from the work of professional catalogers. Librarians who specialize in cataloging explain that their use of "controlled vocabularies" in creating catalog records is what makes catalogs successful. An Internet catalog might have multiple keywords associated with a given title; users will hone their searches until they obtain a manageable list of candidate documents. Because professional catalogers assign descriptive keywords from generally accepted lists of descriptive terms, there is a measure of consistency in the cataloging of similar items. When a user does a subject search, related items show up in the online list, even if the call numbers are not adjacent.
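
As a rough illustration of how a controlled vocabulary lets users hone a search, here is a minimal Python sketch. The vocabulary, the catalog records, and the function names are all invented for this example; no real catalog system is being described. Each word a user types is mapped to one approved subject term, and adding terms narrows the list of candidate titles.

```python
# Hypothetical controlled vocabulary: free-text variants map to one approved term.
CONTROLLED_VOCABULARY = {
    "autos": "Automobiles",
    "cars": "Automobiles",
    "automobiles": "Automobiles",
    "web": "Internet",
    "net": "Internet",
    "internet": "Internet",
}

# Hypothetical catalog records; each title carries several approved subject terms.
CATALOG = [
    {"title": "Repairing Classic Cars", "subjects": {"Automobiles", "Repair"}},
    {"title": "Selling Cars Online", "subjects": {"Automobiles", "Internet"}},
    {"title": "Navigating the Net", "subjects": {"Internet"}},
]


def normalize(term):
    """Map a user's word to the approved term; title-case it if it is unknown."""
    return CONTROLLED_VOCABULARY.get(term.lower(), term.title())


def search(*terms):
    """Return the titles whose subjects include every normalized search term."""
    wanted = {normalize(t) for t in terms}
    return [record["title"] for record in CATALOG if wanted <= record["subjects"]]


# A broad search, then a narrower one with a second term added.
print(search("autos"))
print(search("autos", "web"))
```

Because "autos" and "cars" both resolve to the same approved term, related items turn up together no matter which word the searcher happens to choose.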

Libraries already face the interesting question of deciding whether to list networked information resources in their online catalogs. Already some libraries have done so for selected e-journals. With the proliferation of networked resources, each library undertaking to catalog the Internet faces the same issues of what ought to be included in the virtual collection that confront Gopher administrators. Moreover, the effort involved in preparing a catalog record exceeds by a considerable margin the cost of adding a Gopher link. In fact, Martin Dillon, Director of Research at OCLC, has proposed a four-tier approach to cataloging Internet resources, with the effort expended proportional to the scholarly value of the item being cataloged. In brief, these levels are:

1. Cataloging is equivalent to that used for scholarly materials today.
2. "Brief record" cataloging as used by libraries for materials not of the highest rank.
3. Cataloging by automated means, supplemented by some human editing.
4. Catalog record is supplied by item's creator and collected automatically.

Should libraries undertake to catalog networked resources, they may wish to adopt a scheme such as Dillon proposes. Once a collection of documents reaches a certain size, readers must have a viable catalog in order to navigate the collection.

Libraries collectively will probably find it essential to form consortia and employ specialization to provide a much-needed division of labor in cataloging. Rather than each library undertaking to catalog the entire Internet, network-wide cooperative cataloging efforts may evolve. Professional societies such as IEEE and the American Mathematical Society could serve as organizers of such efforts.

One challenge confronting anyone who tries to catalog Internet resources is the question of where one resource ends and another begins. A given server on the Internet could be considered a single resource, or it could be a repository of thousands of documents, each of which deserves cataloging. For instance, consider an electronic book archive on a Gopher server. For each book, sub-folders are used to break the book into usable chunks. One book might be broken down into chapters; another into major sections, chapters, and parts of chapters. In neither case is there any agreed-upon pointer that marks one entry as the actual start of the book. Gopher simply does not have the built-in structure necessary for an automated indexer, or for a human, to reliably define the beginning and end of an online document.

Members of the Internet Engineering Task Force have devised a scheme called IAFA (for "Internet Anonymous FTP Archives") which may provide an answer for how resources are identified by Internet information providers, for the sake of automated catalogers like Archie. An Internet resource provider who wants to have his or her resource cataloged would fill out an IAFA template and place the information online. This fits exactly with items 3 and 4 of Dillon's model. In cases where a title merits the labor-intensive effort of human cataloging, the IAFA records could serve as a starting point. Here is the beginning of the proposed IAFA template for documents:

Template-Type:          (any one of DOCUMENT, IMAGE or SOUND)
Category:  
Title:  
Author-Name:  
Author-Organization-Name:
Author-Organization-Type:  
Author-Work-Phone:  
Author-Work-Fax:
Author-Work-Postal:  
Author-Job-Title:  
Author-Department:
Author-Email:  
Author-Handle:  
Author-Home-Phone:  
Author-Home-Postal:
Author-Home-Fax:  
Record-Last-Modified-Date:
Record-Last-Modified-Department:  
Record-Last-Modified-Email:
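
To suggest how an automated cataloger might collect such records, here is a minimal Python sketch that reads a filled-in template into a dictionary. It assumes only the "Field-Name: value" layout shown above, with one field per line, and ignores any further rules the actual proposal may define. The sample values and the function name are invented for illustration.

```python
def parse_template(text):
    """Parse 'Field-Name: value' lines into a dict; empty fields become ''."""
    record = {}
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        # Split on the first colon only, so values may themselves contain colons.
        name, separator, value = line.partition(":")
        if not separator:
            continue  # skip lines that are not in field form
        record[name.strip()] = value.strip()
    return record


# Hypothetical filled-in record; none of these values come from a real archive.
SAMPLE = """\
Template-Type: DOCUMENT
Title: A Hypothetical Report
Author-Name: Jane Example
Author-Email: jane@example.org
Record-Last-Modified-Date:
"""

record = parse_template(SAMPLE)
print(record["Title"])
print(sorted(record))
```

A human cataloger could start from the resulting dictionary and add subject terms by hand for titles that merit the extra effort.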

Another IETF effort, the work to define standard Uniform Resource Identifiers, may yield both a scheme that fosters cataloging of Internet resources and a standard mechanism to allow access to those resources. One goal of the URI effort is to define a Uniform Resource Name -- roughly analogous to an International Standard Book Number -- that could be "resolved" into a particular Uniform Resource Locator. As an analogy, consider how a customer might walk into a bookstore armed with the ISBN for a book on theoretical physics. A clerk looks up the book on the store's inventory computer, determines that the book is in stock, and helps the customer fetch the desired title from the shelf. Similarly, Internet users may someday be able to submit Uniform Resource Names to an automated service that locates copies of the work in question, and returns a list of Uniform Resource Locators -- i.e., specific pointers to actual copies of the work online. A client program like Mosaic could automatically fetch a copy of the work from a nearby site.
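
The resolution step can be pictured with a small Python sketch. Everything in it is an assumption made for illustration: the name syntax, the locators, the "hops" distance measure, and the function names are invented, since the URI work had not settled on any of these details. The sketch shows only the idea that one name maps to several locators, and that a client picks the nearest copy.

```python
# Hypothetical resolver table: one name maps to every known copy of the work.
# The name, the locators, and the hop counts are invented for illustration.
RESOLVER_TABLE = {
    "urn:example:physics-text-001": [
        {"url": "gopher://archive.example.edu/00/physics.txt", "hops": 12},
        {"url": "http://mirror.example.org/physics.txt", "hops": 3},
    ],
}


def resolve(urn):
    """Return all known locators for a name, nearest copy first."""
    copies = RESOLVER_TABLE.get(urn, [])
    return [copy["url"] for copy in sorted(copies, key=lambda c: c["hops"])]


def fetch_nearest(urn):
    """Pick the locator a client would try first, or None if the name is unknown."""
    urls = resolve(urn)
    return urls[0] if urls else None


print(resolve("urn:example:physics-text-001"))
print(fetch_nearest("urn:example:no-such-work"))
```

In the bookstore analogy, the table plays the part of the inventory computer and the returned locator is the shelf where the clerk finds the book.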

Such mechanisms are not trivial to define. We have a pretty good idea of what we mean when we refer to titles of books. (Of course, even that can be murky; ask your book clerk for The Bible, and you will certainly be asked to clarify which edition.) With online resources, a standard resource identifier has to be able to point to ephemeral documents -- "The current weather forecast for Austin, Texas" or "today's New York Times." Although the answers are not yet in sight, it is clear we need something like the URI mechanism to make online publishing -- and cataloging -- workable.

As online library catalogs point to more and more Internet resources, the question arises: When does the catalog end, and the delivery of documents begin? Why not have "hot links" to online resources in the catalog itself? As early as November 1992, one library automation vendor, VTLS, was demonstrating a system capable of doing exactly that. Someday soon the catalog in your public library may be as adept at offering a Mosaic-like view of a document fetched across the Internet as it is at showing you a path to a shelf location.

What Constitutes an Electronic Journal?

Just about anyone can declare an electronic serial to be an e-journal. By what standard do we decide whether to point to it in a library's catalog? Include all journals? Those that manage to produce two issues? Those that have been assigned International Standard Serial Numbers? Those that appear to be "scholarly"? Those that are peer reviewed? Over time, librarians will probably have to apply the same sort of collection development procedures to online information as they do for their print holdings. Although most e-journals are currently delivered without a fee, there are costs to including substandard material--the human and machine costs of preparing and cataloging the material as well as the time the patron spends avoiding the chaff.

Some may believe that Internet-based publishing removes the need for discrete documents and journals. After all, why not just identify a resource as "the Gopher at Carnegie Mellon" or "the Web server at the Sorbonne" with the associated host names and ports? The answer lies in the need to be able to refer to particular issues of documents in order to continue discussion of the ideas contained therein. If documents on the Internet have no definite editions or realizations, but rather exist as part of an undifferentiated mass that can change at the whim of the author or editor, readers may find themselves trying to discuss moving targets.

So far, serious attempts at mounting e-journals have continued the practice of delivering regular issues, with some sort of editorial review process prior to the event of "publishing" online. Once publication occurs, the articles in a given issue are frozen; the author does not have the luxury of correcting errors in the online edition. These aspects of the traditional publishing process are essential for an Internet-based literature to develop.

Although most e-journals do follow the practice of having definite, numbered editions, the model of online publishing does offer certain liberating aspects. For instance, an e-journal offered at no charge is under no obligation to meet a particular printing schedule. Subscribers cannot complain that the periodical is late when they have no money at stake. If there are no advertisers, there is no pressure to publish. Of course, readers will expect that something called a "periodical" will arrive on a somewhat regular schedule, or the journal will cease to be worth looking for.

Another liberating aspect is that there need not be any pressure to "pad" an issue with a certain number of articles. A given issue of an e-journal could have only one article, or it could have many. Editors of print journals must live with a "feast or famine" cycle in which the number of good articles in the hopper seldom corresponds to the number of ad pages that have been sold.

Certain problems arise in the handling of online e-journals. For instance, with many sites capturing e-journals as published, and with all such local archives equally visible on the Internet, the question arises as to which archives are definitive. Already one prominent publisher of an e-journal archive has complained bitterly that out-of-date copies of experimental samples of the journal appeared in an archive. Those with a historical perspective might argue that old issues, no matter how experimental, are valid holdings for an archive; no print magazine has the luxury of demanding that a library toss out old issues that for whatever reason fail to measure up to current standards.

A related concern for the e-journal publisher is assuring that archives are complete. If a site offers to the Net an archive service purporting to include a particular journal, without qualifying what is in the collection, the authors and editors associated with that journal will expect the online compilation to be complete. Over time we can expect some online archives to develop a reputation as being more reliable than others; readers, authors, and publishers will flock to these superior archives.

Finally, at some point it will become necessary for e-text archives to adopt policies of collection development. As self-declared e-journals offered freely on the Internet proliferate, serious collections will have to exercise some judgment as to which items merit the human and computer resources of cataloging, and which items are no longer useful for retention in a collection.

Internet Literature

One sign that the Internet has come into its own as a medium will be when we begin to see Internet works cited in general literature. [13] Brewster Kahle, the inventor of WAIS, argues that the ability for a reader to fetch a document cited in another document will be critical to the creation of literature on the Internet:

"B[ulletin] Board systems have not produced any astounding works of literature, I suggest, because it is difficult to reference older works. If older works were easy to find and reference, then people would be more inclined to make better entries. Better entries would get more references and be used more. No BBoard systems, that I know of, make this easy. Since editors, content searching, and archiving are all fundamental parts of the WAIS architecture, we stand a better chance of high quality works being produced." [14]

Scholarly Text Analysis Projects

Computers are widely used in the creation of print materials and in electronic journals. They are also increasingly being used to assist in the scholarly analysis of literature. Many of the efforts involved in such research find the Internet to be a natural medium to support research and dissemination of results. Organizations involved in such efforts include:

The Text Encoding Initiative: This is a multinational cooperative effort to encode online texts such as classical writings; active organizations include Oxford University, the University of Chicago, the University of Virginia, and others. They have produced a set of TEI Guidelines, specifying rules for SGML markup for online e-texts.
Center for Electronic Texts in the Humanities (CETH): Based at Princeton, this organization seeks to advance scholarship in the humanities through the use of high-quality electronic texts. One CETH project calls for placing a comprehensive corpus of early works by women online.
The Electronic Text Center at the University of Virginia: This organization makes texts available for online research as part of its effort to extend Thomas Jefferson's vision of an "Academical Village" to the electronic realm. It is encoding a large body of writings in SGML, and has available on its campus search software that allows sophisticated searches, for instance following the flow of an idea across time, both in online full-text versions of classical works and in online reference documents such as the Oxford English Dictionary. [15]
The ARTFL project: The American and French Research on the Treasury of the French Language (ARTFL) is a cooperative project established in 1981 by the Centre National de la Recherche Scientifique and the University of Chicago. ARTFL began with a French project to use computers to assist in the creation of a dictionary of the French language; a large collection of French language texts was transcribed over a 20-year period. Its objectives over the last eight years have been to restructure this database in such a way as to make it accessible to the research community, and to develop tools for its analysis.

Electronic text analysis projects are using the Internet to support collaboration and to publish information about their research.

Where to Find Electronic Texts and Journals

You can find more information about various collections of electronic journals and online texts at these locations:

Indiana University Library: Point your Gopher client to: gopher.indiana.edu (port 1067, selector 1/letrs/gopher).

CICNet Gopher e-journals archive: Point your Gopher client to: gopher.cic.net (port 70, selector 1/e-serials).

Electronic Text Center at the University of Virginia: http://www.lib.virginia.edu/etext/ETC.html
