DATA STRUCTURES FOR BIOLOGICAL RECORDING

Specifications for databases to record occurrences of organisms

CABI Bioscience, Bakeham Lane, Egham, Surrey, TW20 9TY, UK

Over the last century, the British Mycological Society (BMS) has held over 1000 residential forays and one-day field meetings, during which many hundreds of thousands of individual observations have been made of fungi: of when, where and on what they occur. No-one knows the full extent of those observations. Humans have a wonderful capacity to take in information, and a single walk in the countryside may provide vast amounts of data, of which the proportion transmitted to storage outside the human mind (in earlier days to paper, and now to computer) is inevitably only a small part of the whole. The collective consciousness of our Society's membership forms one of its great assets. Nevertheless, even this small proportion stored outside human memory forms an important contribution to the international effort to describe fungi and note their occurrence and distribution. The Society's centenary provides an opportunity to reflect on that contribution, to assess it within the context of the present interest in biodiversity, to evaluate its importance, and to make plans to ensure that it not only continues through the Society's next century, but is also refined and kept up-to-date.

Up to about 1970, virtually all recording of living organisms was done on paper. Since then, and particularly since the early 1980s, there has been a huge trend towards using computers. At the same time, the amount of recording being carried out has soared: mycology has been no exception, and there the BMS has led. By the end of the century, it seems likely that virtually all biological recording will use computer technology. During the last fifteen years there has been a great debate on how to cope with all this machine-readable information: how to generate, receive, store, output and otherwise share it.

This website looks at some of these questions from the point of view of mycologists and their fungi. In doing so, it is important to recognize that data handling technologies are currently evolving at a phenomenal rate (it has been said that you should throw out everything you've learned about databases every two years and start again). Questions of what computer hardware and commercial software should be used, being ephemeral, will therefore rarely be addressed here; nor will back-up regimes, computer virus protection systems, or the wonderful but rapidly changing possibilities being opened up by information superhighways.

The great benefits of having computerized information are rapid searching, mechanical manipulation, and easy duplication of data. These benefits, however, are by no means guaranteed (nor are they always necessarily benefits). The adage 'garbage in, garbage out' will always be true: if the information on disk is incorrect, or has been stored in a poor structure, it may be even less accessible for searching than its paper equivalent, mechanical manipulation may be impossible, and the ease with which it is duplicated may turn out to be a two-edged and very sharp sword! Until artificial intelligence truly arrives and computers can make allowances for such defects, we have to rely on man's limited albeit natural abilities to avoid them.

It is thus very important to recognize the constraints under which late-twentieth-century computers work and to make sure our recording systems function correctly within those constraints. The chapter will accordingly concentrate on the rather less rapidly changing practicalities of gathering mycological information, editing it and using it for output. Practical problems in setting up feeder databases for gathering data will be briefly examined, and when output is considered, it should be obvious that creation of distribution maps, while fascinating and useful, constitutes only a small facet of what it is possible to produce from a properly-structured system.

In particular, though, the first and largest part of the chapter will be devoted to the question of what information we are collecting, and how it should be structured. A lot of work has been published by various groups and individuals, including TDWG (the Taxonomic Databases Working Group), IOPI (the International Organization for Plant Information), ERIN (the Environmental Resources Information Network) and MINE (the Microbial Information Network for Europe), but much of this work is in a form not readily accessible to, or easily assimilated by, the field biologist, and almost none of it is presented from the point of view of mycology.

The meat of this first part comes from hitherto unpublished work accumulated over the last ten years at the IMI, designing, constructing and using databases for recording fungi and data associated with them. The need for scores of tables and hundreds of fields has been recognized at IMI and elsewhere, and the task of gathering data and allocating its elements into such a structure is daunting. At IMI, this work is in progress, but by no means complete: many fields, and sometimes even whole tables of data are not yet in use, or are only in use by one or two individuals within the Institute, some still experimentally. Many others, though, are in daily and successful operation.

In this review, only tables and fields of interest in biological recording (in a broad sense) will be considered in detail. For each, a comparison will be made with the fields used by the BMS, and possible improvements suggested by recent experience will be discussed. In this way, some assessment will be made of the BMS's contribution to recording fungi. Lest this assessment should seem at times critical, I would like to point out that almost all the BMS database fields are virtually unchanged in design since I set them up as Foray Secretary in the mid-1980s. Any criticisms therefore reflect purely on myself and on the rate of change of my own perception of problems in recording by computer. A fundamental familiarity with relational database concepts of tables, fields and indexes, and a certain basic knowledge of computers is assumed.

General considerations

Major data groups

When an analysis is made of information used in recording fungi, seven major groups of data stand out. In this chapter, each of these groups is treated as a separate database. These databases are not exhaustive, and each one is actually a suite of different but inter-connected tables. The seven databases, all identified in the following paragraph, are each of roughly similar size and complexity. In the present chapter there is not room to review them all. Only the first database, being the most important from the point of view of 'field' recording, will be reviewed, and even then the review will concentrate only on those parts which are most relevant to foraying. The ways in which this database links with and relates to the other databases will, however, be indicated wherever appropriate.

The first database comprises information about observations on the occurrence of fungi and of other organisms associated with them in time and space, and about living or dried reference collections of these organisms. Although this database deals with not only fungi, but also plants and animals, and not just with collections, but with field observations not backed by voucher material, it will for convenience be referred to as the Collections Database. The second database comprises information about names of organisms and their taxonomic position. This will be referred to as the Nomenclature & Taxonomy Database. The third, the Bibliography Database, deals with relevant publications: books, journals, pamphlets and other printed material. The fourth, the People Database, contains information about relevant individuals. The fifth, the Descriptions Database, stores descriptions of fungi, while the sixth, the Illustrations Database, stores information about illustrations of fungi, and links to digitized illustrations. Lastly, there is the Geography Database, which stores data relating to different locations.

Of those seven databases, two BMS databases correspond roughly to the Collections Database: the BMS Foray Records Database (the paper-based foray lists found in back-numbers of its publications, now computerized) and the BMS/JNCC Database (all the records of British fungi referred to in BMS publications, now being computerized). Through publication of various checklists over the years, the BMS has built up a significant but often ageing, and mainly paper-based, collection of records on the nomenclature and taxonomy of British fungi. It also holds considerable bibliographic information, again mostly on paper. Its computerized membership records form a potentially valuable start (but only a start) to a database of its human resources for field mycology. Through the present joint project on the ascomycete flora of Britain, it is building up a computerized database of descriptions, to which must be added a vast but disorganized paper database of descriptions and illustrations within the back-numbers of its publications. Specialist geographical information is not held by the BMS, but can be imported.

Field names

All field names used in this chapter are identified by being placed within square brackets (e.g. [CloxAccouA]). To assist in keeping track of all of the fields of these tables, a convention used in IMI to name fields will be adopted: each field has a unique name exactly ten characters long, and a standard unique short description less than one hundred characters long. In the present chapter, each field will be identified by its name in the heading preceding discussion of that field. The short description which accompanies that name in the heading is not the standard unique short description used at IMI, but is a short description tailored for the needs of the current chapter. The BMS employs no conventions for naming fields. Field names in BMS databases vary in length, and tend to be a short description of the main contents.

The IMI convention for field names is structured: the first character identifies the database to which the field belongs ('C', collections; 'N', nomenclature and taxonomy; 'B', bibliography; 'P', people; 'D', descriptions; 'I', illustrations; 'G', geography), and is always upper case; the next three characters, all lower case letters or digits, identify the individual table containing that field within the database (e.g. 'Clox' identifies the Locality Cross-reference Table within the Collections Database); the next five characters (the first an upper case letter, the remaining four lower case letters, digits or the underline character '_') constitute a mnemonic of the function of the field; the last character, which is also upper case, indicates the data type to which the field belongs ('N', numeric; 'A', alphanumeric; 'D', date).

General data conventions

A note on some general conventions used for storage of data in the text fields of these databases may be useful. The first character of a text field is never upper case, unless it relates to a proper noun or would for some other reason be upper case if it were to appear in the middle of a normal text sentence. The last character of a text field is never a full stop, unless it relates to an abbreviation. In some alphanumeric fields which contain a text version of data which could alternatively have been expressed in numeric form, missing or imprecise information is expressed using the character 'x' (thus, for example, the field [CpexSyearA] (year in which action began) should contain '18xx' for a collection known to have been made in the 19th century): for such fields, an alphanumeric format has the advantage that uncertainties and imprecisions of this type can be expressed. The standard ASCII code values 32, 33 and 35-127 are used for text entry. ASCII value 34 (double inverted commas), not permitted because its presence in data disrupts output of comma-delimited ASCII files, is represented as the embedded typesetting command ''. ASCII values higher than 127 for accented characters are also used where available in the Hewlett Packard 'PC 8' symbol set.
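Two of the conventions above lend themselves to a short illustration: the 'x' placeholder in text-coded years, and the clean-up of text on entry (the removal of leading, trailing and duplicate spaces, and the exclusion of ASCII 34). The following Python sketch is only one possible reading of those rules. The function names `year_range` and `normalise_text` are hypothetical and are not part of the IMI system, and the sketch assumes that a year pattern is always exactly four characters long.

```python
import re
from typing import Tuple


def year_range(text_year: str) -> Tuple[int, int]:
    """Return the earliest and latest years consistent with a pattern such as '18xx'.

    Assumption: the pattern is four characters, each a digit or the letter 'x'.
    """
    if not re.fullmatch(r"[0-9x]{4}", text_year):
        raise ValueError(f"not a four-character year pattern: {text_year!r}")
    earliest = int(text_year.replace("x", "0"))  # every unknown digit at its minimum
    latest = int(text_year.replace("x", "9"))    # every unknown digit at its maximum
    return earliest, latest


def normalise_text(value: str) -> str:
    """Remove leading, trailing and duplicate spaces; refuse ASCII 34."""
    if '"' in value:
        raise ValueError("double inverted commas (ASCII 34) are not permitted")
    return re.sub(r" {2,}", " ", value.strip(" "))


print(year_range("18xx"))
print(normalise_text("  dead  twig   of beech "))
```

Storing the year as text keeps the imprecision visible in the record itself; a numeric range is derived only when a search or sort needs one.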

Where an accented character is not available in the Hewlett Packard 'PC 8' symbol set, or where the keyboarder is unsure, accented characters and other typesetting commands are embedded in the text within single chevrons (examples: 'a', 'u', 'o', 'e', 'e', 'l', 'i', 'z', '', '', '', '', '' etc.). Leading, trailing and duplicate spaces are automatically eliminated. Where information is in other alphabets, the national standards tend to use ASCII values higher than 127 for those alphabet characters, and those standards are followed, with switches to indicate the change in character set from and back to the Hewlett Packard 'PC 8' series (e.g. '', '', ''). Although its practice is probably similar to what has just been described, the BMS appears not to have any clearly-defined general data conventions.

Data problems

At first glance, the information involved in recording fungi does not seem too complex: like the lady from Khartoum in the limerick, all we seem to need to know is 'on what, where, when, and by whom?' A typical record might read: 'Ascodichaena rugosa, on dead twig of beech, Chobham Common, Surrey, England, 8 October 1994, identified by M. Cooke'. Within that record, we can identify the different main data elements: the fungus name, the substratum, an associated organism, the exact locality, county and country, the date of observation, and name of the person identifying the fungus. The trouble is that experience quickly shows how each of these apparently simple elements can contain problems.

Fungi are organisms with unstable names and an unstable taxonomy: what is an acceptable name for a distinct species may, as opinion changes, become a synonym of a different species (quite possibly in a very different taxonomic group). It is not clear whether the substratum represents the material on which the fungus was growing, or its source of nutrition or both. The associated organism is referred to by a vernacular rather than a scientific name. Exact localities can change their names: the city of Milton Keynes didn't exist fifty years ago. So can counties: Avon, Cleveland and Gwynedd are all examples of county names created through local government re-arrangements over the last fifty years. Even countries can change: the USSR ceased to exist as a country in the early 1990s. The date of observation is usually straightforward, until, for example, records from Tsarist Russia (there are an awful lot!) are accessed: these were all generated using the unreformed calendar. Finally, the person making the observation may change their name following marriage. They may even have trans-sexual surgery, or an alter ego: examples of both are known in mycology!

It is clear from this brief analysis that information, even apparently simple and factual information, is inherently unstable. As a result, a most important principle adopted in the design of all the databases described in this chapter is that, wherever possible, the original information for each record is preserved in the form it was received into the system. The corollary of this is that each database has to be designed with additional fields, and sometimes even whole tables, to permit the expression of different opinions as to what that original information means, and to record the dates on which and the people by whom those opinions were expressed. In the BMS Foray Records Database, there is little protection of original data: only the name of the fungus can be stored in both an original and current form.

The Collections Database

This group of tables stores information about individual observations of organisms. The term 'observation' is interpreted in a wide sense. In addition to observations backed by physical material in a living or preserved reference collection, the term may also cover field observations and records derived from literature for which no physical material is available for examination. It is important to note that observations of unsuccessful searches are also stored in the Collections Database. The tables are structured so that observations of any type of living organism can be stored, and the ecological relationship between any individual organism and any other individual organism observed together can be noted. In the notes in the following two paragraphs, tables which contain fields of particular interest for foraying will be italicized, and the fields from those tables relevant to foraying will then be commented on.

Core information, including information for which indexed one-to-one links are required, is stored in the Collections Core Table ([Ccor......]). This table has fixed-length fields which are virtually all numeric pointers to other databases. Each record in the Collections Core Table represents a single observation of a single organism. Where several organisms are observed simultaneously in association with one another (for example a fungus on its host), a separate record is made in the Collections Core Table for each organism observed, but the records are linked through a common number in [CcorColnoN] (the Collections Core Table individual collection identifier). The Collections Core Supplementary Table ([Cco0......]) stores the remaining core information, particularly the longer textual information for which a variable-length field structure is more appropriate.
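The linkage just described, one core record per organism with associated organisms sharing a value in [CcorColnoN], can be sketched in Python as follows. This is an illustration under stated assumptions, not the IMI table definition: only the field name `CcorColnoN` comes from the text, while `record_id`, `organism_ref` and the sample numbers are hypothetical.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, Iterable, List


@dataclass(frozen=True)
class CoreRecord:
    record_id: int     # hypothetical key for this single observation
    CcorColnoN: int    # collection identifier shared by organisms observed together
    organism_ref: int  # hypothetical numeric pointer to a name held in another database


def group_by_collection(records: Iterable[CoreRecord]) -> Dict[int, List[CoreRecord]]:
    """Bring together the records that share a collection identifier."""
    groups: Dict[int, List[CoreRecord]] = defaultdict(list)
    for record in records:
        groups[record.CcorColnoN].append(record)
    return dict(groups)


records = [
    CoreRecord(record_id=1, CcorColnoN=5001, organism_ref=101),  # a fungus
    CoreRecord(record_id=2, CcorColnoN=5001, organism_ref=202),  # its host, same collection
    CoreRecord(record_id=3, CcorColnoN=5002, organism_ref=303),  # an unrelated observation
]
groups = group_by_collection(records)
print([r.record_id for r in groups[5001]])
```

Keeping one record per organism, rather than one record per collection with a host column, means that fungus and associated organism are treated identically and that any number of associated organisms can be attached to the same collection.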

Information which requires indexing, and for which many-to-one links between the Collections Core Table and other tables are needed may be stored in the following series of cross-reference tables: Collections Administration Table ([Cadm......]); Collections Bibliography Cross-reference Table ([Cbix......]); Collections Description Cross-reference Table ([Cdex......]); Collections Illustration Cross-reference Table ([Cilx......]); Collections Culture Collection Table ([Ciso......]); Collections Culture Collection Maintenance Table ([Cisx......]); Collections Locality Cross-reference Table ([Clox......]); Collections Other Collections Cross-reference Table ([Coth......]); Collections People Cross-reference Table ([Cpex......]); Collections Properties Table ([Cpro......]); Collections Substratum Cross-reference Table ([Csux......]); Collections Taxonomists' Table ([Ctax......]); Collections Technicians' Table ([Ctec......]).
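The bracketed identifiers in this list give only the four-character table prefix, with dots standing for the rest of a field name. A complete field name follows the ten-character IMI convention set out under 'Field names', and can be decoded mechanically. The Python sketch below is one hedged reading of that convention: the names `FieldName` and `parse_field_name` are hypothetical, and the regular expression encodes the character classes exactly as the text describes them.

```python
import re
from typing import NamedTuple

DATABASES = {
    "C": "collections", "N": "nomenclature and taxonomy", "B": "bibliography",
    "P": "people", "D": "descriptions", "I": "illustrations", "G": "geography",
}
DATA_TYPES = {"N": "numeric", "A": "alphanumeric", "D": "date"}

# database letter + 3-character table code + 5-character mnemonic + data-type letter
FIELD_NAME = re.compile(r"([CNBPDIG])([a-z0-9]{3})([A-Z][a-z0-9_]{4})([NAD])")


class FieldName(NamedTuple):
    database: str
    table: str      # four-character table identifier, e.g. 'Clox'
    mnemonic: str
    data_type: str


def parse_field_name(name: str) -> FieldName:
    """Split a name such as '[CloxAccouA]' into its four components."""
    match = FIELD_NAME.fullmatch(name.strip("[]"))
    if match is None:
        raise ValueError(f"not a valid ten-character field name: {name!r}")
    db, table, mnemonic, dtype = match.groups()
    return FieldName(DATABASES[db], db + table, mnemonic, DATA_TYPES[dtype])


print(parse_field_name("[CcorColnoN]"))
```

A table placeholder such as [Clox......] is deliberately rejected by this sketch, since it is not a complete field name.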

The principal uses of the Collections Database are to store information about fungi in living and dried reference collections, about occurrences of fungi reported in the literature and elsewhere, about observations of fungi backed by neither a literature reference nor a specimen, and about organisms associated with the fungi in all of those records. Information about unsuccessful searches is also stored here. This information can be used during curation of living and dried reference collections, administration of identification services, and can form part of standard catalogues and other publications, such as the Index of Fungi, Bibliography of Systematic Mycology and Distribution Maps of Plant Diseases. The information can also be incorporated in many research projects.

Because its design permits storage of data on collections of any organisms in any reference collection, and on observations arising from the literature and from foraying, and because its design permits these sources of data to be distinguished, this group of tables has important further uses in the production of a wide range of other scientific documents, for example host and country check lists. Examples of mechanically-produced publications deriving a large amount of their data from this group of tables are the checklists of fungi described by Batista and co-workers (Da Silva & Minter, 1995), and of fungi on Eucalyptus (Sankaran et al., 1996). Furthermore, the tables have the potential for expansion to cover identification work on any group of living organisms.

By D.W. Minter
