Data Structures for Biological Recording: CCOR database specifications

DATA STRUCTURES FOR BIOLOGICAL RECORDING

Specifications for databases to record occurrences of living organisms

THE CORE DATABASE CCOR

Introductory notes

[CcorLink_N]

A unique observation identifier, 8 characters, indexed

Each record in the Collections Database represents a single observation about a single taxon within one space of time and in one place. To ensure that each record can be identified separately, every record should be issued with a record number which is unique, stored in [CcorLink_N]. It is worth noting that, all over the world, many people are setting up lots of `collections databases', and they are all starting their `unique numbering' at the same point: number one. Sooner or later, there is going to be a need to amalgamate data from different sources, and when that happens, this multiplicity of so-called unique numbers is probably going to be a real problem. There are plenty of forward-lookers expressing concern about this. The BMS Databases at present do not have a properly-designated unique number of this type in any of their records. The BMS should consider restructuring its database to accommodate such numbers, and should participate in the setting up of a global observation numbering system.
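The amalgamation problem described above can be made concrete with a short Python sketch. It is illustrative only: the two sources, their contents, and the device of qualifying each local number with a source code are assumptions made for the example. This specification does not prescribe how a global observation numbering system should work.

```python
# Two hypothetical collections databases, each numbering its records from 1.
source_a = {1: "Amanita muscaria", 2: "Polyporus squamosus"}
source_b = {1: "Beauveria bassiana", 2: "Cantharellus"}


def colliding_numbers(a, b):
    """Return the record numbers that occur in both sources."""
    return sorted(set(a) & set(b))


def merge_qualified(sources):
    """Merge sources, keying each record by (source code, local number).

    The source code is an assumed device for this sketch; any scheme that
    makes the combined key unique would serve the same purpose.
    """
    merged = {}
    for code, records in sources.items():
        for number, taxon in records.items():
            merged[(code, number)] = taxon
    return merged


clashes = colliding_numbers(source_a, source_b)
merged = merge_qualified({"A": source_a, "B": source_b})
print("numbers claimed by both sources:", clashes)
print("records kept after qualified merge:", len(merged))
```

Every locally `unique' number clashes as soon as the two sources meet, whereas the qualified key keeps all four records distinct.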

[CcorAbestA]

A flag to mark an unsuccessful search, 1 character

As already noted, each record in the Collections Database represents a single observation about a single taxon at one time and in one place. It is human nature to remember discoveries, and quietly forget unsuccessful searches: not surprisingly, the default condition of records in the Collections Database indicates a successful discovery. In recording fungi, this seems particularly reasonable when it is remembered that most records relate to larger basidiomycetes and, in particular, the Agaricales, where, owing to their often ephemeral and unpredictable fruiting, serendipity is often the largest single factor in generating a record: `when I go on a foray, there are thousands of toadstools I don't see! You don't seriously expect me to record them?' Of course not.

On the other hand, an unsuccessful search, at the right season, of a site where Cantharellus has regularly been recorded for many years but has been becoming less abundant could be significant. There is growing concern in European mycology about the decline of larger basidiomycetes as a result of pollution. How can the unwelcome but very real extinction aspects of that decline be recorded except through unsuccessful searches? And how is such information to be stored?

Besides, larger basidiomycetes are not the only fungi, and even of them many produce long-lived fruitbodies. Among the numerically far larger ascomycetes, very many produce identifiable colonies, stromata, thalli and fruitbodies which are long-persisting. For these, serendipity is a far smaller factor: if the correct associated organism or other substratum is present, the recording of an unsuccessful search becomes all the more meaningful.

Without an ability to record unsuccessful searches, our recording system remains open to the criticism that its distribution maps merely show where mycologists have been. Without it, there are no means to record the often considerable time spent looking for fungi in marginal habitats. Without it, there is no evidence that a fungus which appears to be increasing its range was not simply unnoticed before. Without it, there is no evidence that an apparent decline is not merely due to lack of attention from present day mycologists. To record an unsuccessful search is not the same as saying the fungus is absent, but it is meaningful, and often important. IMI holds several hundred records of unsuccessful searches. Neither BMS database has provision for recording unsuccessful searches.
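One way such a flag might be stored and queried is sketched below, using Python's built-in sqlite3 module. Only [CcorLink_N] and [CcorAbestA] come from this specification. The `taxon', `site' and `year' columns, the use of `Y' to mark an unsuccessful search, and the four rows of data are all assumptions invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE ccor (
        CcorLink_N INTEGER PRIMARY KEY,       -- unique observation identifier
        CcorAbestA TEXT NOT NULL DEFAULT '',  -- assumed: 'Y' = unsuccessful search
        taxon      TEXT NOT NULL,             -- hypothetical column
        site       TEXT NOT NULL,             -- hypothetical column
        year       INTEGER NOT NULL           -- hypothetical column
    )
    """
)

# Invented example rows: two finds followed by two unsuccessful searches.
conn.executemany(
    "INSERT INTO ccor VALUES (?, ?, ?, ?, ?)",
    [
        (1, "", "Cantharellus", "Site X", 1990),
        (2, "", "Cantharellus", "Site X", 1991),
        (3, "Y", "Cantharellus", "Site X", 1992),
        (4, "Y", "Cantharellus", "Site X", 1993),
    ],
)


def search_history(conn, taxon, site):
    """Return (year, found) pairs, where found is False for an unsuccessful search."""
    cur = conn.execute(
        "SELECT year, CcorAbestA FROM ccor "
        "WHERE taxon = ? AND site = ? ORDER BY year",
        (taxon, site),
    )
    return [(year, flag != "Y") for year, flag in cur]


history = search_history(conn, "Cantharellus", "Site X")
for year, found in history:
    print(year, "found" if found else "searched for, not found")
```

Because the default value of the flag is empty, ordinary records of successful discoveries need no extra data entry, which matches the default condition described above.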

[CcorAcclkN]

A numeric link to the currently accepted organism name, 8 characters, indexed

The second option for a link field is to allocate a unique number to each record in the Nomenclature & Taxonomy Database, and to use that number to identify the currently accepted organism name. Provision for this option exists within the IMI database structure, but at present it remains unused. It may be appropriate now to consider the relative advantages and disadvantages of the two options.

The advantage of the numeric option is that the link information is compact, and of a uniform size for all records. Since computers are, ultimately, number-crunching machines, having link information in the form of, say, eight bytes of binary data is simpler for the computer to process, and the presentation of data derived from links is likely to be faster. Since the database software is likely to reserve the same number of bytes for this link in all records, it follows that if the link is changed (perhaps a different name has become the accepted one), the change in the data is unlikely to entail a restructuring of the location of the data on the hard disk. The text option is disadvantageous for precisely the same reason.

The big disadvantage of the numeric option is that it is hard for humans to use. A table composed purely of numbers will never be so easy to use as a table containing data which is meaningful to humans. `Fagus sylvatica' has a familiarity to the field biologist which `0009238' will probably never have! Of course it is possible to devise systems for data-entry whereby the user never comes into contact with such numbers, but these systems add a whole tier of complexity to the interactive software providing the user editing and viewing facilities, and in any case, as anyone who has had to deal with corrupted data will confirm, it is far easier to identify that there is something wrong with `Fagxs syxvatxxa' and to guess what it originally meant than it is to do the same for `0019238'!
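The point about corrupted data can be illustrated with a rough Python sketch using the standard difflib module. The lists of valid names and valid link numbers are hypothetical, and fuzzy matching is simply one convenient way to mimic the guessing a human editor would do; it is not a procedure proposed in this specification.

```python
import difflib

# Hypothetical lists of values known to be valid in the database.
known_names = ["Fagus sylvatica", "Quercus robur", "Picea abies"]
known_numbers = ["0009238", "0019288", "0019239"]


def candidates(corrupted, valid_values):
    """Return the valid values that plausibly match a corrupted entry."""
    return difflib.get_close_matches(corrupted, valid_values, n=3, cutoff=0.6)


name_guesses = candidates("Fagxs syxvatxxa", known_names)
number_guesses = candidates("0019238", known_numbers)

print("plausible originals for the corrupted name:", name_guesses)
print("plausible originals for the corrupted number:", number_guesses)
```

With these assumed lists, the damaged name points to a single candidate, while the damaged number is equally close to several valid numbers, so there is nothing to indicate which was meant, or even that anything is wrong.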

With [Cco0OrinaA] and [Cco0AccnaA] or [CcorAcclkN] it is possible to store the primary data, and the present editorial opinion of the identity of the current organism for any Collections Database record. The separate storing of these two items has a particular advantage when dealing with organisms with unstable names, in that it is possible to change the current opinion without destroying the original data. It is analogous to drawing a polite pencil line through the original identification on a herbarium packet, and writing your own opinion below. Of course, it may happen that a particular record is controversial, and different scientists may identify the organism differently. Provision for such events is considered later in the Collections People Cross-reference Table ([Cpex......]).
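A minimal Python sketch of this separation follows. The nomenclature entries are invented placeholder names, the attribute names are hypothetical stand-ins for the fields named above, and the in-memory dictionary merely stands in for the Nomenclature & Taxonomy Database.

```python
from dataclasses import dataclass

# Hypothetical extract of the Nomenclature & Taxonomy Database:
# unique record number -> organism name (placeholder names, not real taxa).
nomenclature = {101: "Examplea prima", 102: "Examplea secunda"}


@dataclass
class Observation:
    link: int            # unique observation identifier, as in [CcorLink_N]
    original_name: str   # the name as first recorded (primary data)
    accepted_link: int   # numeric link to the accepted name, as in [CcorAcclkN]


def accepted_name(obs, names):
    """Resolve the numeric link to the currently accepted organism name."""
    return names[obs.accepted_link]


obs = Observation(link=1, original_name="Examplea prima", accepted_link=101)

# Editorial opinion changes: only the link is updated, like the pencil line
# on the herbarium packet. The original identification is left untouched.
obs.accepted_link = 102

print("originally recorded as:", obs.original_name)
print("currently accepted as:", accepted_name(obs, nomenclature))
```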

[CcorCabi_A]

CABI individual collection identifier, 12 characters

[CcorEnqlkN]

CABI identification service enquiry identifier, 8 characters

[CcorColnoN]

A unique collection identifier, 8 characters, indexed

The recording of higher plants, and of many animals, is very frequently carried out with no regard for other organisms with which they might be associated: those who map the distribution of forest trees rarely bother to note in their databases the many fungi, insects or other smaller living things which live on, in or under these huge plants. Mycologists can be proud that they have done rather better for many years. Because fungi are heterotrophs, when observing a fungus it is normal to note at least one, and often several other organisms associated with it: phrases such as `Amanita muscaria, on soil beneath birch, with hazel and oak nearby', or `Beauveria bassiana, on a lepidopteran larva on Morus sp.' are common in the history of recording fungi.

The first computerized `collections databases' designed for recording fungi simply copied the practice of old printed records. Each observation of a fungus was regarded as being a separate record, and information about associated organisms was treated as subsidiary data. There are undoubtedly a lot of advantages to this procedure from the point of view of easy gathering of data: mycologists tend to think in terms of the fungi they observe, and the associated organisms, while interesting, and even quite often important, are nevertheless merely an adjunct to the main fact that a fungus has been observed. Not surprisingly, the present BMS Databases reflect this viewpoint: within one record, there are fields for the fungus name and information associated with that fungus, and there are other fields for the name of an associated organism, with its associated information.

A little reflexion shows, however, that such a structure is very problematic for long-term storage of the data. In the first place, there is provision for information about only one associated organism, yet it is the norm that many organisms can live together in associations of many different kinds. During recent compilation at IMI of the records of microfungi from Brazil made by Batista and co-workers, a very large proportion of observations related to multiple groups of organisms, to fungi, algae, higher plants and animals all associated with each other in various ways. A structure which permits each record to have only one fungus and one associated organism simply cannot express such complexities.

The second problem is that in a significant number of records, those dealing with fungi parasitic on other fungi, one of the two fungi must be an associated organism if the structures currently used by the BMS are employed. The person generating a record of Eudarluca on a rust, for example, will put the rust as the fungus and the Eudarluca as the associated organism, or the other way round, depending on their particular mycological point of view. The result is that, to find all records of Eudarluca by a mechanical search, not only the fungus fields, but also the associated organism fields have to be scanned.

The third problem is that this simple structure encourages those generating records to treat information relating to associated organisms as, in some way, second class data. Thus there is no provision within the BMS Foray Records Database to note down who identified the oak, or when they identified it, and you cannot record how many oaks were observed, nor in what condition they were. This objection may seem trivial to recorders in Britain (`all British field biologists know what an oak looks like'), but on a global scale it starts to look important. If British records are to be used by a researcher in the tropics who is unfamiliar with the genus Quercus, information about who identified this host could be important, and British researchers would similarly appreciate knowing who identified the Acacia in mycological data coming from Africa.

The fourth difficulty is that, by treating associated organism information as second class data, our databases are not functioning efficiently. During a year of collecting data on fungi of Ukraine, of the 24,000 records gathered, over 23,000 contained floristic information about higher plants of Ukraine. Properly structured, our mycological records are also a vast resource of data on the occurrence, distribution, ecological preferences and associations of many other organisms.

The solution to this problem is to recognize that, for long-term storage of data, each fungus and each associated organism observed actually represent separate floristic records. Thus, if Polyporus squamosus is observed on sycamore, two records should be generated in the Collections Database, one for Polyporus squamosus, and the other for sycamore. If Eudarluca is recorded on a rust on Picea, then three records should similarly be generated. If thirty-eight different insects, twelve fungi, three nematodes, and two spiders are observed on the leaf of one plant, fifty-six records should be generated: with such a solution, possibilities for recording relationships become open-ended. For the BMS Databases to achieve this, major restructuring will be necessary.

To link this cluster of different related observations made at the same time, a unique collection identifier is needed (`collection' is not used here in the sense of `herbarium collection'). This should be a number, stored in [CcorColnoN]. The BMS Foray Records Database and the BMS/JNCC Database do issue such numbers ([BMS Accession Number] and [BMSFRD Record number] respectively), since each record in their structures usually comprises a collection of two different observations made at the same time, one for the fungus, the other for the associated organism, but at present neither database makes any distinction between the unique observation identifier and the unique collection identifier.
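The relationship between the two identifiers can be sketched in Python as follows. The tuple layout and the identifier values are assumptions for illustration; the organisms are those of the Polyporus squamosus and Eudarluca examples given earlier.

```python
from collections import defaultdict

# Each tuple: (observation identifier, collection identifier, organism),
# standing in for [CcorLink_N], [CcorColnoN] and the organism name.
records = [
    (1, 1, "Polyporus squamosus"),
    (2, 1, "sycamore"),
    (3, 2, "Eudarluca"),
    (4, 2, "rust"),
    (5, 2, "Picea"),
]


def group_by_collection(records):
    """Gather the organisms observed together under each collection number."""
    groups = defaultdict(list)
    for _link, colno, organism in records:
        groups[colno].append(organism)
    return dict(groups)


def collections_containing(records, organism):
    """Find every collection that includes the organism, whatever its role."""
    return sorted({colno for _link, colno, name in records if name == organism})


groups = group_by_collection(records)
eudarluca_collections = collections_containing(records, "Eudarluca")
print(groups)
print(eudarluca_collections)
```

Since every organism has a record of its own, a search for Eudarluca scans a single name field, and a collection can hold as many associated organisms as were observed.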

Treating different observations made at the same time and place as different records in the Collections Database has implications for information about substratum and ecological relationships. That is another area where the structures of the BMS Databases are inadequate for properly-structured long-term storage of information. These implications will be considered later.

[CcorIndivN]

A unique individual identifier, 8 characters

According to the Guinness Book of Records, the largest individual fungus fruitbody in the world is to be found in Kew Gardens, in what used to be the grounds of IMI. It is regularly and proudly but anxiously measured. If that information were to be recorded in database format, its occurrence in time and space would be floristic information, and its size would be descriptional, but for the sequence of observations to be meaningful, there would have to be some way of noting that they all came from the same individual. The present field provides the opportunity to issue an individual organism with a number which is unique, so that linked sequential observations on that individual can be stored in the Collections Database (and by linkage in the Descriptions Database and the Illustrations Database).

What constitutes an individual in the fungal world is rarely so straightforward as the case just considered, and doubtless some of us would quite reasonably argue that on a genetic basis even that huge fruitbody represented a population rather than an individual. But the Collections Database deals not just with fungi, but with any organism, and in particular with organisms associated with fungi. It means that this field can be used, for example, to define a particular tree from which regular samples are being taken, and the presence of this field opens the door to researchers wishing to use the Collections Database as a vehicle for sequential observations of many different sorts. The BMS Foray Records Database and the BMS/JNCC Database do not have this field, but could easily have a need for it if, for example, the database were to be used as part of a strategy for conserving rare polypores.
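How sequential observations of one individual might be retrieved is sketched below in Python. The record layout, the dates, the identifier values, and the label `sampled tree' are all invented for the example; only the idea of a shared individual identifier, as stored in [CcorIndivN], comes from this specification.

```python
from typing import NamedTuple, Optional


class Observation(NamedTuple):
    link: int                  # unique observation identifier
    individual: Optional[int]  # individual identifier; None if not tracked
    organism: str
    date: str                  # ISO date, so text order equals date order


# Invented data: one tree sampled repeatedly, plus an unrelated record.
observations = [
    Observation(1, 7, "sampled tree", "1995-04-01"),
    Observation(2, None, "Amanita muscaria", "1995-09-12"),
    Observation(3, 7, "sampled tree", "1996-04-03"),
    Observation(4, 7, "sampled tree", "1994-04-02"),
]


def history_of(individual, observations):
    """Return all observations of one individual in date order."""
    matching = [o for o in observations if o.individual == individual]
    return sorted(matching, key=lambda o: o.date)


tree_history = history_of(7, observations)
for obs in tree_history:
    print(obs.date, obs.organism, "observation", obs.link)
```

Records with no individual identifier are simply left out of any such sequence, so the field imposes nothing on ordinary foray records.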