Classification and Metadata

(prepared for an ACS Summer Workshop, "Planning Digital Collections for Education and Research", Southwestern University 19-21 June 2003)

Classification and Metadata Resources

Collections consist of items, and the most basic task is organizing them: sorting into categories, naming the sets and subsets. A good general background resource on classification is Bowker and Star 2002 Sorting Things Out: Classification and its consequences

my extracts include a lot of the essential general points. The book provides extended examples from the International Classification of Diseases (ICD) and Nursing Interventions Classification (NIC).

Here's a nice example of a Problem: the items here are a set of images of ...xxx...

(from http://www.dlib.org/dlib/june02/shabajee/06shabajee.html)

(from a google image search)

We can say pretty clearly what these images are about (Rafflesia spp, the largest flower in the world, found in Malaya, Sumatra, and Borneo), but handling them in a collection of images presents some problems. How are they to be retrieved? What needs to come with them (explanatory text, pointers to related resources, attribution, geographical coordinates...?)

For any collection, it's a good idea to think through these questions before going on to the question of how to organize and classify the component items:
what are the ITEMS in your collections?
Are there different types of items that have different characteristics?
What are the elements of each type of item?
What elements are mandatory/optional?
Will items have pedagogical notes? Annotations?
Will items be developed by your team, or accessed from other locations?
Will you need to track multiple exemplars, i.e, masters and copies for use?

How will inclusion in your library add value to items?
Increased access through more users?
More detailed access (for example full-text search)?
New use through associations not possible before?
What subsidiary information will be tied to items?
How do copyright laws pertain to your items?

Once the requirements and intentions are clearly established, the organizational tasks can begin. In most cases, we work primarily with words: our problems are to a considerable degree lexical and syntactical. Indexes are a basic way of taming the lexical.

There are indexing methods with various degrees of constraint:

a keyword index is uncontrolled;
an indexing langage is controlled, but not structured (the headers in the Yellow Pages are an example);
thesauri are both controlled and structured (specify synonyms, equivalents, broader and narrower terms, related terms --MeSH [Medical Subject Headings], Art and Architecture Thesaurus);
classification systems are controlled, structured, and coded (Library of Congress Subject Headings, Dewey Decimal Classification --see Renardus for an application of Dewey Decimal to Web content)

For some domains the choices of terminology seem pretty clear, because existing thesauri, controlled vocabularies, and ontologies may be appropriate sources for standardized and consistent (and authoritative) sets of terms appropriate to items and collections. See

A-Z of Thesauri from the High Level Thesaurus Project
and
Traugott Koch's Controlled vocabularies, thesauri and classification systems
for some examples (alas, some links don't work...).

When no single controlled vocabulary seems appropriate, it's necessary to construct a pragmatic terminological/organizational system that can handle the variety to be classified. If nobody but the creator ever uses the classification, it can be idiosyncratic... thus, I can [usually] find the books and records in my own collections, but nobody else can. If a collection is ever to be used by others, it may eventually become necessary to transform an ad hoc and personal classification into a standardized one. Some would argue that one should begin with an orderly system... but that's a matter of personal style.

If one needs to consider building a classification system, or adapting an existing system to one's own materials, some of these resources from the Web may be helpful:

What is a Controlled Vocabulary, and how is it useful? (basics, primarily for image collections --but links are worth following)
Flow of work in thesaurus construction (diagram)
Keywording Theory
Keywording systems fall into either one of two styles.
Most commonly used, although not always the best, is the open vocabulary. If you look at an image and use whatever words it inspires for the keywords, you're using an open vocabulary. One day you may keyword an image as Ship, Harbor. Entered on another day, you might have keyworded the very same image as Boat, Port. Another time you might use Watercraft and Harbour.
When searching for this sort of image, who knows what you’ll find or overlook because of inconsistencies in an open vocabulary. Of course, you can look for all entries that contain Harbor or Port or Anchorage or Boat or Ship or Vessel, but it’s hit or miss proposition that requires a vast patience or a good memory. If you find no entry for Boat, you must keep looking, trying other words in your keywords. When you discover Ship, you move on, never discovering all the others that got listed as Watercraft.
In a controlled vocabulary your keywords are limited to a specific set of words. Ideally this list of words is printed in an index or available on a computer for your reference as you apply keywords to images.

Converting a controlled vocabulary into an ontology: the case of GEM [Gateway to Educational Materials] (Jian Qin & Stephen Paling)
The Role and Use of Controlled Vocabulary in Increasing the Sharing and Use of Ecological Data Information International (PowerPoint presentation, 1998 conference)

Decoding the acronyms in this realm is a constant challenge. MetaMap from U. Montréal is a lovely example of presentation of complex information.

Bryan Alexander contributed this example of a digital library project, a nice example of an in-process project:

The Online Burma/Myanmar Library (http://www.ibiblio.org/obl/) is a database which functions as an annotated, classified and hyperlinked index to full texts of individual Burma documents on the Internet. It also houses a growing collection of articles, conference papers, theses, books, reports, archives and directories on-site (e.g. the 17MB archive of the Burma Press Summary). The Librarian requests help from specialists to refine the structure and add content.
Need for the Library
The Internet currently holds more than 100,000 Burma-related documents, from short news items to complete books, scattered over more than 500 websites (not all of which have internal search functions) run by the UN system, governments, academic institutions, media sites, listserv archives, human rights and other NGOs, activist groups and individuals. The volume is growing rapidly as more and more organisations choose to publish on the Internet. Even using modern search engines it is difficult and time-consuming to research this widely-scattered material. There is clearly need for a central index.
Structure
This is what the Online Burma/Myanmar Library seeks to provide. Launched in October 2001, it is organised on a database (using MySQL software, in combination with PHP) into 60 top-level categories based on traditional library classifications, with a hierarchy of some 850 sub-categories. These hold approximately 4000 links (mostly annotated, with keywords and descriptions) to individual documents, and about 400 links to websites which in turn give access to another 100,000 or so documents...

Another example of a digital collection under development is my own summer project on access to Civil War materials from W&L's Special Collections. Consider one of the gems our online catalog offers: the Annie record for the Frank Reader diary --which is an end-user version of a MARC record that only a librarian would care about. To add value and make the resource more accessible than it can be tucked away in Special Collections, we have digitized the diary, both as .jpg images of the pages and by transcription of the text. Carrying the process one step further, I have linked the specific place references via a geographical interface. We are now working out the infrastructure for adding more items and their necessary metadata.

Metadata
Gail Hodge cuts to the chase:

Metadata is structured information
that describes, explains, locates, or otherwise makes it easier
to retrieve, use or manage an information resource.
(http://www.niso.org/news/Metadata_simpler.pdf)

Many writers have been exquisitely boring on this subject, but it's really pretty interesting once one gets past the "data about data" cliché, and into the practicalities. On the way, let's take a detour through Wikipedia's entry and look at the bottom of the page... and compare with METADATA®. A lost cause...

It can be argued that a collection that will only be used by its creator doesn't need explicit or orderly/systematic metadata... but it's remarkable how often something that begins as a private collection grows into a public resource, and in any case it's a good thing to learn a few of the basics of metadata examples that are being talked about pretty widely.

Different Types of Metadata and Their Functions
(from Anne J. Gilliland-Swetland "Defining Metadata")
(http://www.getty.edu/research/institute/standards/intrometadata/2_articles/index.html)

Type Definition Examples

Administrative Metadata used in managing and administering information resources - Acquisition information
- Rights and reproduction tracking
- Documentation of legal access requirements
- Location information
- Selection criteria for digitization
- Version control and differentiation between similar information objects
- Audit trails created by recordkeeping systems

Descriptive Metadata used to describe or identify information resources - Cataloging records
- Finding aids
- Specialized indexes
- Hyperlinked relationships between resources
- Annotations by users
- Metadata for recordkeeping systems generated by records creators

Preservation Metadata related to the preservation management of information resources - Documentation of physical condition of resources
- Documentation of actions taken to preserve physical and digital versions of resources, e.g., data refreshing and migration

Technical Metadata related to how a system functions or metadata behave - Hardware and software documentation
- Digitization information, e.g., formats, compression ratios, scaling routines
- Tracking of system response times
- Authentication and security data, e.g., encryption keys, passwords

Use Metadata related to the level and type of use of information resources - Exhibit records
- Use and user tracking
- Content re-use and multi-versioning information

Type	Definition	Examples
Administrative	Metadata used in managing and administering information resources	- Acquisition information - Rights and reproduction tracking - Documentation of legal access requirements - Location information - Selection criteria for digitization - Version control and differentiation between similar information objects - Audit trails created by recordkeeping systems
Descriptive	Metadata used to describe or identify information resources	- Cataloging records - Finding aids - Specialized indexes - Hyperlinked relationships between resources - Annotations by users - Metadata for recordkeeping systems generated by records creators
Preservation	Metadata related to the preservation management of information resources	- Documentation of physical condition of resources - Documentation of actions taken to preserve physical and digital versions of resources, e.g., data refreshing and migration
Technical	Metadata related to how a system functions or metadata behave	- Hardware and software documentation - Digitization information, e.g., formats, compression ratios, scaling routines - Tracking of system response times - Authentication and security data, e.g., encryption keys, passwords
Use	Metadata related to the level and type of use of information resources	- Exhibit records - Use and user tracking - Content re-use and multi-versioning information

There are many metadata Standards (Biology [NBII], Spatial Data [FGDC ], Darwin Core for natural history collections, Encoded Archival Description [EAD]... and many others). See an excellent Canadian Government introduction to the general subject.

Addendum: I should have known about and included METS, Metadata Encoding & Transmission Standard, but I confess I hadn't heard of it in June...

Dublin Core is probably the most commonly-mentioned metadata standard: 15 Elements, extendable ('qualified DC') and readily applicable to many domains, adaptable to others, and widely used in Web settings. Not all things to all people, but a good place to begin.

Dublin Core main pages and Projects, and some of the [seemingly] most important documents:

Using Dublin Core
DC Metadata Element Set
DCMI Metadata Terms
Some helper apps:
Dublin Core metadata editor
Type the URL of the page you want to describe... This service will retrieve a Web page and automatically generate Dublin Core metadata, either as HTML <meta> tags or as RDF/XML, suitable for embedding in the <head>...</head> section of the page. The generated metadata can be edited using the form provided and converted to various other formats (USMARC, SOIF, IAFA/ROADS, TEI headers, GILS, IMS or RDF) if required. Optional, context sensitive, help is available while editing.

Dublin Core Metadata Template
This service is provided by the Nordic Metadata Project in order to assure good support for the creation of Dublin Core metadata to the Nordic "Net-publisher" community. If you use the metadata created by this form and follow our examples, term lists and recommendations, your HTML documents will carry high quality metadata.
(see also Short and Simple Template)
DONOR metadatagenerator (also offers DONOR metadata generator "for lokal use" (tar.gz) from Netherlands Royal Library)

To translate/map from one standard to another, one may use a crosswalk:

Crosswalks: The Path to Universal Access? (Mary Woodley)
All About Crosswalks (Jean Godby)
MARC to Dublin Core Crosswalk

A Guide to the Virginia Canals and Navigations Society Collection 1978-1983 (an example of an EAD record on the Web, a Finding Aid linked via W&L's OPAC)

Looking Ahead
I seem always to accumulate piles of books and papers that are fraught with portent for the future, and the 'digital library' piles certainly outrun what I know enough to talk about in mid-June 2003. Here I'll mention some subjects and materials that I'm in the process of trying to integrate, largely as a heads-up for topics that will surely assert themselves in the near future.

Extracts from Primary Multimedia Objects and 'Educational Metadata': A Fundamental Dilemma for Developers of Multimedia Archives (Paul Shabajee)
Semantic Web from w3.org (awaiting my attention is The Semantic Web : A Guide to the Future of XML, Web Services, and Knowledge Management --see the companion Website for this book, which includes a sample chapter) by Michael C. Daconta (Wiley 2003)
Latent Semantic Indexing (LSI): "LSI can retrieve relevant documents even when they do not share any words with a query..." (Telcordia Technologies).
Here's an area where CiteSeer is an especially useful resource, because so much of the development has taken place via Web exchange of working documents. See the record for Christos H. Papadimitriou et al. Latent Semantic Indexing: A Probabilistic Analysis (1998) for citation links, and note the Related button... To explore this tip-of-iceberg, try demos from Telcordia
See also the LSI section from the excellent "Patterns in Unstructured Data: Discovery, Aggregation, and Visualization" from a CET/NITLE presentation to the Mellon Foundation

Personalisation and Recommender Systems in Digital Libraries: Joint NSF-EU DELOS Working Group Report May 2003 Jamie Callan et al.
Digital libraries must move from being passive, with little adaptation to their users, to being more proactive in offering and tailoring information for individuals and communities, and in supporting community efforts to capture, structure, and share knowledge...
...if the library can tailor its services and materials for a wider range of users, the impact and utility of the library is magnified greatly. The next generation of digital libraries must support a wide range of personalized services that support the activities of a wide range of users.
About XML
One of the clearest expositions I've found on the subject is XML, the third way, Sebastian Rahtz's presentation from a 1999 conference. I'm wrestling with this topic myself, and have a heap of things to read and absorb... but I think I'm almost getting it. I should also know much more than I do about the Text Encoding Initiative (see a list of projects, and a D-Lib Magazine article Designing Documents to Enhance the Performance of Digital Libraries: Time, Space, People and a Digital Library on London by Gregory Crane of the Perseus Project)
Some of the connections between XML and Dublin Core are also on my just-over-the-horizon list:
Guidelines for Implementing Dublin Core in XML
Expressing Qualified DC in HTML/XHTML meta and link elements
Expressing Simple DC in RDF/XML
Expressing Qualified DC in RDF/XML
From another valuable summary:
XML is not a solution in itself, but a new way of approaching the often-taxing area of content structuring and reuse. In isolation, an XML file does very little, but it is through the combination of XML with dedicated 'helper' utilities that its massive power becomes harnessable and apparent. (2)
XML defines a framework that can be used to create solutions, but in isolation does very little apart from produce highly readable and organised documents. (5)
(from The XML Family of Technologies, Technology Watch Briefing 7, DigiCULT)
Two 'helper' utility examples:
XML Linking Language (XLink --"allows elements to be inserted into XML documents in order to create and describe links between resources...")
XML Path Language (XPath --"a language for addressing parts of an XML document...". See chapter from XML in a Nutshell for more)
An example of an XML-based digital library is The Centre for Whistler Studies' correspondence of James McNeill Whistler.

If there's a single resource to keep an eye on, it's probably D-Lib Magazine,11 issues a year and all available online and free. It's there that I found a pointer to the DigiCULT article linked above, which was a real eye-opener: many pennies dropped and realizations dawned. DigiCULT, which "monitors and assesses resarch and technological developments in and for the cultural heritage sector in Europe" is itself a valuable site to visit and revisit.
Another organization to know about is ERPANET (Electronic Resource Preservation and Access Network), a European consortium --see, for example, XML as a Preservation Strategy (report of an October 2002 workshop)

On copyright: Copyright Issues Relevant to the Creation of a Digital Archive: A Preliminary Assessment y June M. Besek (Commissioned for and sponsored by the National Digital Information Infrastructure and Preservation Program, Library of Congress)

On the economics of digital library projects: Journal of Digital Information Volume 4 Issue 2 - Economic Factors of Digital Libraries