The Text Encoding Initiative

Elizabeth A. R. Brown, March 1990

Editor's Note: The following report was submitted by Elizabeth A. R. Brown in her capacity as the AHA's appointee on the Advisory Board of the Text Encoding Initiative.

Hosted by the University of Illinois, Chicago, the first meeting of the Text Encoding Initiative's Advisory Board brought together seventeen representatives from key professional and learned societies. These groups represented the spectrum of academic disciplines from computer science to lexicography to literary studies as well as professional librarians and publishers. The purpose of the meeting was to seek the views of the newly constituted Advisory Board concerning the structure and proposed strategy of the Text Encoding Initiative (TEI), to explain its relevance to the interests of these groups, and to encourage active participation in the work of the Initiative by the groups' members.

The Initiative began in the fall of 1987 at the instigation of the Association for Computers and the Humanities (ACH), which was concerned about the variety of formats and the deficiencies of encoding schemes used in the preparation of computer-readable texts for scholarly analysis. Participants at an ACH planning conference in November 1987 at Vassar College agreed that it was necessary and feasible to define guidelines for both the interchange of existing encoded texts and the creation of newly encoded texts. The guidelines would specify both what features should be encoded (at a minimum) and how they should be encoded, as well as suggest ways to describe the resulting encoding scheme and its relationship with pre-existing schemes. Compatibility with existing schemes would be sought where possible, and in particular, ISO standard 8879, Standard Generalized Markup Language (SGML), would provide the basic syntax for the guidelines if feasible.

After the Vassar meeting, ACH joined with the Association for Literary and Linguistic Computing (ALLC) and the Association for Computational Linguistics (ACL) as co-sponsors of the project and defined a four-year work plan to achieve the project's goals. Funding for the work plan has since been provided by substantial grants from the National Endowment for the Humanities and the European Economic Community. Additional funding is being sought from industry and private foundations.

The work plan is coordinated by a six-member steering committee, comprising representatives from the sponsoring organizations. An Advisory Board of representatives from almost twenty participating scholarly organizations ensures that a broad range of interested researchers are able to participate in the development of the guidelines. Two editors coordinate the work of the project's four working committees, each of which is responsible for a distinct part of the work plan.

Committee 1, the Committee for Text Documentation, with a membership drawn largely from the library and archival management communities, is dealing with issues concerning the cataloguing and description of key features of encoded texts. It is drawing on work already done in this field for bibliographic and social science data, for example in the Anglo-American Cataloguing Rules, the American National Standard for Bibliographic Reference, and the Standard Study Description used by a number of social science data archives. All the committees are expected to work within established frameworks where these are available, as they are here.

Committee 2, the Committee for Text Representation, is concerned with the encoding of such features as layout and character sets. It will provide precise recommendations covering those features of continuous discourse for which a convention already exists in printed or written sources. This will involve a consideration of the character sets of all alphabetic scripts currently used in computer-based research. Explicit consideration of non-alphabetic scripts, though not excluded, has been deferred; transcriptions of spoken language will, however, be included. The committee will also recommend ways of representing the structural divisions of a text (book, chapter, paragraph, etc.) and all other features conventionally signalled in printed or written texts, such as emphasis, quotation, and critical apparatus.

Committee 3, the Committee for Text Analysis and Interpretation, has the largest and most open-ended set of responsibilities of the four. It will provide discipline-specific sets of tags appropriate to the analytic procedures favored by each discipline, but in such a way as to permit their extension and generalization to other disciplines using analogous procedures. Because this is a very large task, Committee 3 is focussing initially on a single discipline (linguistics), chosen primarily because of its clear relevance to all other text-based types of analysis. As work proceeds, the focus of this committee will shift toward literary analysis and other humanistic disciplines.

Committees 1, 2, and 3, with an average membership of ten, will set up subcommittees to do the preliminary design work for tag sets within specialized areas. Committee 3 already has one subcommittee, concerned with tag sets for dictionary markup, which has produced a set of preliminary guidelines for monolingual dictionaries. A subcommittee of Committee 2, concerned with the tagging of historical sources, is also being formed to take advantage of the substantial progress already made in this area by a network of European scholars collaborating on the Kleio project.

Committee 4, the Syntax and Metalanguage Committee, has determined that the syntactic framework of SGML is adequate for all foreseeable applications within the TEI's scope, and that SGML will thus provide the basic syntax. The guidelines will depart from SGML only if it proves inadequate to the needs of research. The committee is currently attempting to determine the extent to which all features of SGML can be recommended. This committee is also surveying major existing schemes and developing a formal metalanguage with which to describe these schemes and the scheme developed for the guidelines, and to provide a formally specifiable mapping between them. Among the committee's other tasks are validation and testing of the guidelines as they emerge and arbitration on matters of SGML conformance.

In addition to the three sponsoring organizations and the AHA, the following associations are currently represented on the Advisory Board: American Anthropological Association; American Philological Association; American Society for Information Science; Association for Computing Machinery; Association for Documentary Editing; Association for History and Computing; Association Internationale Bible et Informatique; Canadian Linguistic Association; Dictionary Society of North America; Electronic Publishing SIG; International Federation of Library Associations and Institutions; Linguistic Society of America; Modern Language Association.

After an initial presentation about the history, background, objectives, and structure of the TEI, delegates were invited to comment on their own interests and the constituencies they served. A series of presentations concerning the implications of the TEI for humanities research, for computational linguistics, and for the language and information industries followed. The goals and responsibilities of each of the working committees were then described, as outlined above. The second full day of the meeting began with a brief tutorial on SGML and a longer description of the design principles, scope, and end products of the guidelines. After a wide-ranging and useful discussion, in which some constructively critical reactions were voiced, members of the Advisory Board expressed approval of the objectives, organizational structure, and design goals of the Initiative, as they had been presented at the meeting.

If you would like more information about the TEI, or if you would like to volunteer your services for one of the committees, please contact C.M. Sperberg-McQueen, Computer Center MC 135, University of Illinois at Chicago, Box 6998, Chicago, IL 60680 (BITNET: U35395@UICVM).

Elizabeth A.R. Brown
AHA Delegate to TEI
Brooklyn College and The Graduate School
The City University of New York