The Internet for Everyone -- Chapter 18, Section 2

SGML and Electronic Publishing

We saw in Chapter 13 that the language of the World-Wide Web, HTML, is "SGML-compliant." In other words, HTML conforms to the older, more general standard known as the Standard Generalized Markup Language. SGML is an outgrowth of a language originally developed on mainframe computers under the name Generalized Markup Language. The purpose of SGML is to provide a mechanism for identifying the "elements" of a document. We saw examples of the sorts of elements one might tag with HTML -- headlines, paragraph boundaries, links to other documents, etc.

But that is only one kind of markup that one might employ. One might also want to tag parts of a document for later textual analysis -- for instance, whenever a proper noun is used, or whenever a piece of slang appears, or when a concept such as revolution or religion is mentioned in a certain way. Thus SGML allows multiple "views" of a document -- one user might read a document sequentially, with the display of semantic tags suppressed; another user might ask the SGML browser to display only those passages that relate to a particular concept, as identified by the tags.
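
The idea of two views over one tagged text can be sketched in a few lines of Python. This is only an illustration: the tag name "concept" and its "type" attribute are invented here and belong to no real DTD, and Python's standard HTML parser stands in for a true SGML parser.

```python
from html.parser import HTMLParser

# Hypothetical markup: <concept type="..."> is an invented semantic tag.
SAMPLE = (
    "<p>The <concept type='revolution'>uprising of 1848</concept> "
    "spread quickly.</p>"
    "<p>Trade resumed the following spring.</p>"
)

class ViewBuilder(HTMLParser):
    """Collects the running text and, separately, the concept-tagged passages."""

    def __init__(self):
        super().__init__()
        self.plain = []       # every piece of text, tags suppressed
        self.tagged = []      # (concept type, text) pairs
        self._current = None  # concept type we are inside, if any

    def handle_starttag(self, tag, attrs):
        if tag == "concept":
            self._current = dict(attrs).get("type")

    def handle_endtag(self, tag):
        if tag == "concept":
            self._current = None

    def handle_data(self, data):
        self.plain.append(data)
        if self._current is not None:
            self.tagged.append((self._current, data))

def build_views(markup):
    """Return (sequential reading view, list of tagged passages)."""
    builder = ViewBuilder()
    builder.feed(markup)
    builder.close()
    return "".join(builder.plain), builder.tagged

plain_text, passages = build_views(SAMPLE)
print(plain_text)  # first view: the text with all tags suppressed
print([text for kind, text in passages if kind == "revolution"])  # second view
```

The same source text feeds both views; only the treatment of the tags differs.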

SGML is general enough to embrace all these different sorts of applications. SGML per se does not define the myriad kinds of element tags one might devise. Instead, SGML provides a framework so that an author, publisher, or scholar can define a set of tags suited to a particular application. For any particular application, the elements that make up a document are defined in a Document Type Definition, or DTD. Like any other SGML application, HTML has a DTD that defines the legal elements.
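
To make the role of a DTD more concrete, here is a toy stand-in written as a Python dictionary. It is a deliberate simplification: the element names (memo, to, sender, body, para) are hypothetical, and a real DTD also expresses the order and repetition of elements, which this sketch ignores.

```python
# A toy stand-in for a DTD: each element maps to what it may contain.
# All element names are hypothetical. In SGML itself the first entry
# would be declared roughly as:
#   <!ELEMENT memo - - (to, sender, body)>
MEMO_DTD = {
    "memo": ["to", "sender", "body"],
    "to": ["#PCDATA"],       # #PCDATA means ordinary character data
    "sender": ["#PCDATA"],
    "body": ["para"],
    "para": ["#PCDATA"],
}

def is_declared(element, dtd):
    """True if the element is one of the legal elements of this application."""
    return element in dtd

def may_contain(parent, child, dtd):
    """True if the DTD permits `child` directly inside `parent`."""
    return child in dtd.get(parent, [])

print(is_declared("memo", MEMO_DTD))
print(is_declared("blink", MEMO_DTD))
print(may_contain("body", "para", MEMO_DTD))
```

A different application would simply supply a different table; the framework for asking "is this element legal here?" stays the same.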

Besides allowing multiple views of a document, SGML has been described as an object-oriented language for text. With SGML, text can be marked in such a way that it can be used by many different computer programs and processes. In fact, one of the goals of SGML is to provide a way for documents to be prepared for later re-use. The commercial word processing program of choice is subject to the whims of fashion; SGML has endured for many years and is expected to last. Authors and publishers either use special "authoring tools" that understand the elements of a given DTD, or they employ translators that allow them to move documents between SGML and a word processing program or desktop publishing package.

Anyone who has ever programmed a computer is aware that a computer language is subject to errors in syntax. The same is true of SGML; it is possible to compose an SGML document that contains errors. In such a case the document is said not to conform to the DTD. Commercial SGML products include a "parser" that validates the conformance of a given document. Note that in the particular case of HTML, client programs such as Mosaic, Cello, or Lynx explicitly do not take on the role of validating the language. Instead, they do the best job they can of rendering the document, and they leave it up to the author, using whatever parser tools might be at his or her disposal, to validate each document.
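
The contrast between a validating parser and a forgiving browser can be sketched as follows. This is a minimal illustration under stated assumptions, not how any commercial parser or the browsers named above actually work: the element names and the tiny "DTD" table are hypothetical, attributes are not handled, and SGML's rules for omitting tags are ignored.

```python
import re

# Hypothetical miniature "DTD": element -> set of permitted child elements.
ALLOWED_CHILDREN = {
    "doc": {"title", "para"},
    "title": set(),
    "para": {"emph"},
    "emph": set(),
}

# Matches simple start and end tags without attributes, e.g. <para> or </para>.
TAG = re.compile(r"<(/?)([A-Za-z][A-Za-z0-9]*)>")

def validate(markup, root="doc"):
    """Return a list of error messages; an empty list means the document conforms."""
    errors = []
    stack = []  # elements currently open, outermost first
    for match in TAG.finditer(markup):
        closing = match.group(1) == "/"
        name = match.group(2).lower()
        if name not in ALLOWED_CHILDREN:
            errors.append(f"undeclared element <{name}>")
            continue
        if closing:
            if stack and stack[-1] == name:
                stack.pop()
            else:
                errors.append(f"unexpected </{name}>")
        else:
            if not stack:
                if name != root:
                    errors.append(f"expected <{root}> first, found <{name}>")
            elif name not in ALLOWED_CHILDREN[stack[-1]]:
                errors.append(f"<{name}> not allowed inside <{stack[-1]}>")
            stack.append(name)
    errors.extend(f"unclosed <{name}>" for name in reversed(stack))
    return errors

def render_leniently(markup):
    """Mimic a forgiving client: drop every tag and show whatever text remains."""
    return TAG.sub("", markup)

good = "<doc><title>Memo</title><para>Hello <emph>world</emph></para></doc>"
bad = "<doc><title>Memo <para>Hi</title></doc>"

print(validate(good))         # an empty list: the document conforms
print(validate(bad))          # a list of complaints
print(render_leniently(bad))  # still displays the text despite the errors
```

The validator refuses to stay silent about the malformed document, while the lenient renderer displays it anyway, which is the division of labor described above.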

Following is an example of a page of a book being prepared for publication using an SGML authoring tool from a company called Arbortext. Note the similarity between this example and the tags we saw in HTML. Later in this chapter we will discuss how some scholarly text analysis endeavors benefit from SGML technology.
