Culture Grid Profile

Page history last edited by Mia 13 years, 4 months ago

Culture Grid and Metadata – Object/ Item records

 

[Mia: I'm adding a bit of context cos not everyone will be familiar with the Culture Grid... As an indication of the type of content that's available through the Culture Grid, I've copied this text from some of their about pages: "It contains over 1 million records from over 50 UK collections, covering a huge range of topics and periods.  Records mostly refer to images but also text, audio and video resources and are mostly about museum objects with library, archive and other kinds of collections also included."  So, that's:

  • "information about items in collections (referencing the images, video, audio or other material you offer online about the things in your collections)
  • information about collections as a whole (their scope, significance and access details)
  • information about collecting organisations (contact and access details)"]

 

 

A bit of history …

An application profile of Dublin Core was produced by UKOLN, in consultation with Knowledge Integration and MLA, in 2005 for use with the Peoples Network ‘Discover’ Service.  Conformance to this profile, known as PNDS DCAP, was (in theory at least) mandated for all collections which received MLA funding to make their metadata available to PNDS.  The method of supply was also stipulated to be OAI-PMH.

As time progressed, this simple vision became somewhat diluted.  Many projects encountered problems.  Implementing OAI-PMH, producing well-formed XML, producing valid XML and populating mandatory elements were all issues.  At the aggregator level, it was decided to be as lenient as possible to allow the maximum number of records to be ingested.  We also developed ‘data push’ interfaces to allow access to contributors who could not support OAI-PMH. As a result of ‘political’ pressure, it was decided to incorporate existing data sets that did not conform to PNDS DCAP.  It was also decided to allow some extensions to the profile to support a project (Exploring 20th Century London) which wanted to use the aggregation platform but required additional metadata elements.

We therefore reached the position where we had a large corpus of metadata records but the quality of the records was extremely variable.  The decision to use the platform as the national aggregator platform for Europeana meant that there was now a requirement to map some of this metadata to the Europeana Semantic Elements metadata profile (ESE).

 

What we do now

At the moment we have handlers for oai_dc and pnds_dc formatted metadata.  These are invoked on ingest whether the metadata is supplied via OAI-PMH or via upload.  These handlers transform the metadata to our canonical format for storage in a database but also create alternate representations of the metadata (e.g. ESE) as required.  The advantage of creating these alternate representations on ingest, rather than ‘on the fly’, is that it vastly improves system performance when the Grid is presenting search results or acting as an OAI-PMH target.
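The handler arrangement described above might be sketched roughly as follows. This is a minimal illustration only, not the Grid's actual code: the function names, the dictionary-based record shape and the in-memory store are all assumptions made for the example, and the ESE step is a placeholder that simply keeps fields in the dc: and dcterms: namespaces.

```python
from typing import Callable, Dict, List

# Hypothetical record shape: element name -> list of values.
Record = Dict[str, List[str]]


def handle_oai_dc(raw: Record) -> Record:
    """Hypothetical oai_dc handler: keep only plain dc:* elements."""
    return {k: list(v) for k, v in raw.items() if k.startswith("dc:")}


def handle_pnds_dc(raw: Record) -> Record:
    """Hypothetical pnds_dc handler: keep every supplied element."""
    return {k: list(v) for k, v in raw.items()}


# One handler per supplied format, chosen at ingest time.
HANDLERS: Dict[str, Callable[[Record], Record]] = {
    "oai_dc": handle_oai_dc,
    "pnds_dc": handle_pnds_dc,
}


def to_ese(canonical: Record) -> Record:
    """Placeholder for a real ESE mapping: keep dc:/dcterms: elements only."""
    return {
        k: list(v)
        for k, v in canonical.items()
        if k.startswith(("dc:", "dcterms:"))
    }


def ingest(fmt: str, raw: Record, store: Dict[str, dict], record_id: str) -> None:
    """Transform once on ingest and store every representation side by side."""
    canonical = HANDLERS[fmt](raw)
    store[record_id] = {"canonical": canonical, "ese": to_ese(canonical)}


store: Dict[str, dict] = {}
ingest(
    "pnds_dc",
    {"dc:title": ["Teapot"], "pnsterms:thumbnail": ["http://example.org/t.jpg"]},
    store,
    "r1",
)
```

Because the ESE form is built at ingest, serving it later is a simple lookup (`store["r1"]["ese"]`) rather than a transformation per request, which is the performance trade-off the paragraph above describes.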

In addition to these standard handlers, we also have custom handlers for some data providers.  These are used to correct errors in the supplied data and sometimes to add missing values.  The creation of custom handlers is very resource intensive and is not something we wish to continue in future.

One thing we learnt at an early stage was never to throw away any data submitted to us or harvested by us.  Re-harvesting collections is such a hit-and-miss process that we always preserve a copy of the records in their original (raw) format.  This means that we can go back and re-process records if mappings change or if a new transformation is required.  Whilst this is a useful safety net, it is also resource intensive and should only be used as a last resort, not as part of a general strategy for coping with new requirements.
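As a rough sketch of why keeping the raw copy matters, the example below rebuilds a derived representation for every stored record from its untouched original. The `StoredRecord` class, the `reprocess` function and the toy transform are hypothetical names invented for the illustration; the real storage layer is not described here.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict


@dataclass
class StoredRecord:
    raw: str  # original submission, never modified after ingest
    derived: Dict[str, dict] = field(default_factory=dict)


def reprocess(
    store: Dict[str, StoredRecord],
    name: str,
    transform: Callable[[str], dict],
) -> int:
    """Rebuild one derived representation for every record from its raw copy."""
    for record in store.values():
        record.derived[name] = transform(record.raw)
    return len(store)


# Toy transform standing in for a new or changed mapping.
records = {"r1": StoredRecord(raw="<record/>")}
count = reprocess(records, "length", lambda raw: {"chars": len(raw)})
```

Every record is touched on each run, which is one way of seeing why this is costly as a routine strategy.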

 

Why Change?

Our experience of ingesting more than 1 million metadata records from over 50 collections has highlighted several issues with PNDS DCAP.  A few of these issues are intrinsic to the profile itself but many are related to inconsistent interpretation of the requirements by data providers.  Also, the increasing requirement to create ESE representations of supplied data has highlighted the incompatibilities between these profiles.  The difficulty in transforming PNDS DCAP records into ESE is increasing with each new version of ESE.  Indeed, the fact that Europeana’s requirements are such a moving target is a problem which must be shared by other aggregators.

Amongst the main issues with the current use of the Grid are:

  • The use of dc:identifier to point to “the URI of the resource” does not work well when there are multiple renditions of a single item (e.g. different views of a 3 dimensional physical object or different encodings of a multimedia file).  It is not possible at the moment to make the distinction made in ESE between isShownAt and isShownBy.
  • There is a lot of confusion over whether some of the metadata elements in PNDS DCAP should be used to describe the original artefact, its digital rendition or the metadata record itself.  This is particularly apparent with the dcterms:license and dcterms:rightsHolder elements.  In practice, it appears that data providers populate these and other elements with whatever metadata they have already, resulting in a mixture of metadata describing digital and physical objects in the Grid.
  • Ad hoc extensions to the original DCAP have been made to accommodate projects such as Exploring 20th Century London, but the presence of these extended element sets has not been well publicised.  What is required is either an agreed set of generic extensions or a better defined extension mechanism to allow data providers to define their own custom extensions.
  • Although a cursory glance at PNDS DCAP and ESE suggests that there are a large number of common elements, closer analysis reveals differences in the encoding schemes mandated for these elements.  It is worth reviewing these to see if either common vocabularies could be used or if crosswalks between vocabularies can be defined.
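A crosswalk of the kind mentioned in the last point could be as simple as a lookup table. The sketch below assumes, purely for illustration, that supplied dc:type values come from the DCMI Type Vocabulary and that the target is the small set of europeana:type values (TEXT, IMAGE, SOUND, VIDEO); the mapping choices and function names are assumptions, not an agreed crosswalk.

```python
from typing import Iterable, List, Optional

# Assumed mapping from DCMI Type terms to europeana:type values.
DCMITYPE_TO_EUROPEANA = {
    "StillImage": "IMAGE",
    "Image": "IMAGE",
    "MovingImage": "VIDEO",
    "Sound": "SOUND",
    "Text": "TEXT",
}


def crosswalk_type(dc_type: str) -> Optional[str]:
    """Return the europeana:type for a dc:type value, or None if unmapped."""
    return DCMITYPE_TO_EUROPEANA.get(dc_type.strip())


def unmapped(values: Iterable[str]) -> List[str]:
    """List the distinct supplied values that the crosswalk cannot handle."""
    return sorted({v for v in values if crosswalk_type(v) is None})
```

Running `unmapped` over a contributor's data shows where the vocabularies diverge; `PhysicalObject`, for instance, has no obvious counterpart in this assumed table, which is the kind of gap a review would need to resolve.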

 

Options

If we were to make a change to the recommended profile for submitting metadata to the Grid there are 4 main options:

 

1     Adopt ESE

Adopting ESE as the preferred format for submitting metadata would obviously aid onward transmission to Europeana (although the submitted records would still need to be processed to add the europeana:provider element).  Whilst there is certainly a case for making this one of the acceptable formats for data submission, making it the preferred format would have some disadvantages.  These include:

  • ESE does not include all the data elements that existing contributors feel are important.  This is particularly apparent in the case of participants in Exploring 20th Century London.  The ability with PNDS DCAP to point explicitly to a thumbnail rendition (to be cached locally on the Grid) is also much appreciated by data providers.
  • ESE does not make elements such as dc:title mandatory.  This can cause problems with formatting hit list displays, etc.
  • Not all data providers want their content to be included in Europeana and adopting a profile designed specifically around Europeana’s requirements may not appear appropriate to everyone.
  • The schema is relatively strict in that it enforces order, etc.  Users already have issues generating PNDS DCAP, which is much more lenient.  Users are likely to find it more difficult to create valid ESE records than to create valid PNDS DCAP records.

 

2     Adopt a more comprehensive metadata standard

A lot of people have pointed out the inherent problems in using simple Dublin Core based profiles to convey detailed metadata.  DC based profiles tend to be focussed on resource discovery metadata whereas some data providers wish to make more detailed curatorial metadata also available.  Based on the principle that it is easier to map from a detailed representation of metadata to a simpler one than vice versa, there is a case for making a more comprehensive metadata format, such as CDWA, CIDOC-CRM or LIDO, the preferred submission format.

The main disadvantages of this approach are:

  • Such a change would require extensive modifications to the Grid’s internal data model.  It is not clear whether, even if the resources could be found for this work, the demand for this level of detail would justify the costs.
  • One of the strengths of the Grid is that it allows contributions from a range of data providers, some of whom already find the requirements of a simple DC based metadata profile extremely demanding.  Adopting a more complex data structure, even if the majority of data elements were optional, would almost certainly discourage some of these smaller, niche, providers from contributing.

 

3     Extend PNDS DCAP, making it backwardly compatible

This would, in effect, be formalising the ad hoc arrangements that have already been made to cope with E20CL and similar projects.  The advantage of this approach is that existing metadata supplied in PNDS DCAP format would still be valid against the old profile.  The disadvantage is that it would not allow some of the problems with the original profile, such as the use of dc:identifier, to be addressed easily.

 

4     Produce a new profile and deprecate PNDS DCAP

This option leaves the maximum freedom to address the concerns of current and potential data providers, get real input from practitioners and address the problems found to date.  However, it does have the disadvantage that it adds further to the complexity of the situation with there being an even greater range of profiles to choose from.  For this reason, it is recommended that this option is only pursued if there is real buy-in from the community.

 

Recommendation

So long as there is community support, we feel that producing a new profile would provide the best platform for moving the Grid forward.  PNDS DCAP is around 6 years old and has served its initial purpose.  A forward looking strategy should not necessarily be constrained by it.  We therefore recommend that a new profile is developed in consultation with the community and that the Grid’s ingest processing be updated to add a handler for records in this format.  To complement this, though, we would also recommend that the following policies be adopted:

  • The Grid should add an ESE handler to allow ingest processing of records supplied in ESE format.
  • PNDS DCAP records should still be accepted and processed, at least in the short to medium term.
  • Except in exceptional circumstances, the Grid should stop accepting new records supplied in oai_dc format.
  • Irrespective of the supplied format, the Grid should continue to retain the original (_raw) version of uploaded metadata records and make these available to third parties if requested.
  • The internal data model of the Grid should be updated to reflect the new profile.
  • A mapping tool should be developed which allows metadata to be submitted in a wider range of formats (so long as mandatory elements are present).  These formats could include LIDO as well as native export formats from common Collection Management Systems.
  • The community should be consulted about the possibility of introducing the concept of multiple conformance levels for the new profile.  This would allow users to request only records which conform at a certain level.  This might encourage data providers to ensure that their metadata is of the highest possible quality.
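To make the idea of conformance levels concrete, here is one possible sketch. The three numbered levels, the function names and the particular elements checked at each level are assumptions for the example (the element names are a small subset borrowed from the straw man in Annex 2); the actual levels would be for the community to define.

```python
from typing import Dict, List

Record = Dict[str, List[str]]

# Illustrative subsets only; the real lists would come from the agreed profile.
MANDATORY = ("dc:title", "dc:description", "dc:type")
RECOMMENDED = ("dc:creator", "dc:subject", "dc:date")


def is_populated(record: Record, name: str) -> bool:
    """True if the element has at least one non-blank value."""
    return any(value.strip() for value in record.get(name, []))


def conformance_level(record: Record) -> int:
    """0 = mandatory elements missing, 1 = mandatory only, 2 = recommended too."""
    if not all(is_populated(record, n) for n in MANDATORY):
        return 0
    if not all(is_populated(record, n) for n in RECOMMENDED):
        return 1
    return 2


def at_level(records: List[Record], minimum: int) -> List[Record]:
    """Return only the records that conform at the requested level or above."""
    return [r for r in records if conformance_level(r) >= minimum]
```

A consumer wanting only the richer records would call `at_level(records, 2)`, while the Grid itself could still ingest and hold records at every level.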

Assuming that the above recommendations are accepted, we present a ‘straw man’ recommendation for the elements to be included within such a profile, along with some indication of the obligation requirements that should be placed against these elements.  Note that this is NOT intended to be a profile in itself.  We are looking to the community for input in defining the data model and producing XML bindings, schemas, conformance tests, etc.

 

Annex 1       Comparison of PNDS & ESE

| PNDS | ESE | Comments |
|---|---|---|
| Required/ Strongly Recommended Elements | | |
| dc:identifier | Additional | |
| dc:title | Strongly Recommended | |
| dc:description | Recommended | |
| dc:subject | Recommended | |
| dc:type | Recommended | specific encoding scheme |
| dcterms:license | Not Present | |
| dcterms:rightsHolder | Not Present | |
| Not Present | dcterms:alternative | |
| Not Present | dc:date | |
| Not Present | dcterms:created | |
| Not Present | dcterms:issued | |
| Recommended Elements | | |
| dc:creator | Strongly Recommended | |
| dc:contributor | Strongly Recommended | |
| dc:publisher | Recommended | |
| dc:language | Recommended | |
| dcterms:spatial | Recommended | specific encoding scheme |
| dcterms:temporal | Recommended | specific encoding scheme |
| dcterms:audience | Not Present | specific encoding scheme |
| dcterms:isPartOf | Recommended | |
| pnsterms:thumbnail | Not Present | |
| Not Present | dc:coverage | |
| Not Present | dc:source | |
| E20CL Extension Elements | | |
| dc:relation | Additional | used for related object |
| e20cl:materials | Not Present | = dcterms:medium? |
| e20cl:size | Not Present | |
| e20cl:creditLine | Not Present | |
| e20cl:relatedSubject | Not Present | |
| e20cl:relatedPerson | Not Present | |
| e20cl:relatedOrganisation | Not Present | |
| Optional/ Additional Elements | | |
| dc:format | Additional | |
| dcterms:spatial | Recommended | specific encoding schemes |
| dcterms:temporal | Recommended | specific encoding schemes |
| Not Present | dcterms:extent | |
| Not Present | dcterms:medium | = e20cl:materials? |
| Not Present | dcterms:rights | similar to dcterms:license |
| Not Present | dcterms:provenance | |
| Not Present | dcterms:conformsTo | |
| Not Present | dcterms:hasFormat | |
| Not Present | dcterms:isFormatOf | |
| Not Present | dcterms:isVersionOf | |
| Not Present | dcterms:hasPart | |
| Not Present | dcterms:isReferencedBy | |
| Not Present | dcterms:references | |
| Not Present | dcterms:isReplacedBy | |
| Not Present | dcterms:replaces | |
| Not Present | dcterms:isRequiredby | |
| Not Present | dcterms:requires | |
| Not Present | dcterms:tableOfContents | |
| Europeana Extension Elements | | |
| Not Present | europeana:country | |
| Not Present | europeana:dataProvider | |
| Not Present | europeana:hasObject | |
| Not Present | europeana:isShownAt | |
| Not Present | europeana:isShownBy | |
| Not Present | europeana:language | different to dc:language |
| Not Present | europeana:object | |
| Not Present | europeana:provider | |
| Not Present | europeana:rights | different to dc:rights |
| Not Present | europeana:type | different to dc:type |
| Not Present | europeana:unstored | |
| Not Present | europeana:uri | |
| Not Present | europeana:userTag | |
| Not Present | europeana:year | |

 

 

Annex 2       Straw Man proposal for data elements in CG profile

 

| Element | Repeatable? | Ordered? | Comments |
|---|---|---|---|
| Required/ Mandatory | | | |
| dc:identifier* | N | - | * only one of these elements should be populated.  Remove dc:identifier altogether? |
| europeana:isShownAt * | N | - | |
| europeana:isShownBy * | N | - | |
| dc:title | Y | Y | if multiple titles, first one used in display |
| dc:description | Y | N | |
| dc:type | N | - | Align encoding scheme with ESE? |
| dcterms:license | N | - | not sure if should be mandatory? |
| dcterms:rightsHolder | N | - | not sure if should be mandatory? |
| Recommended | | | |
| dc:creator | Y | N | |
| dc:subject | Y | N | |
| dc:date | Y | N | |
| dc:contributor | Y | N | |
| dc:publisher | Y | N | |
| dc:language | Y | N | |
| dc:coverage | Y | N | |
| dc:source | Y | N | |
| dc:relation | Y | N | use for related person, subject, organisation & object via XML attribute |
| dcterms:spatial | Y | N | which encoding schemes? |
| dcterms:temporal | Y | N | which encoding schemes? |
| pnsterms:thumbnail | Y | Y | 1st specified is used in hit list display |
| Optional | | | |
| dc:format | Y | N | |
| dcterms:alternative | Y | N | |
| dcterms:audience | Y | N | |
| dcterms:isPartOf | Y | N | |
| dcterms:created | Y | N | |
| dcterms:issued | Y | N | |
| dcterms:extent | Y | N | |
| dcterms:medium | Y | N | |
| dcterms:rights | Y | N | |
| dcterms:provenance | Y | N | |
| dcterms:conformsTo | Y | N | |
| dcterms:hasFormat | Y | N | |
| dcterms:isFormatOf | Y | N | |
| dcterms:isVersionOf | Y | N | |
| dcterms:hasPart | Y | N | |
| dcterms:isReferencedBy | Y | N | |
| dcterms:references | Y | N | |
| dcterms:isReplacedBy | Y | N | |
| dcterms:replaces | Y | N | |
| dcterms:isRequiredby | Y | N | |
| dcterms:requires | Y | N | |
| dcterms:tableOfContents | Y | N | |
| e20cl:materials | Y | N | |
| e20cl:size | Y | N | |
| e20cl:creditLine | Y | N | |
| europeana:dataProvider | Y | N | |
| europeana:isShownAt | Y | N | |
| europeana:isShownBy | Y | N | |
| europeana:object | Y | N | |
| europeana:rights | Y | N | |

 

 

Comments (10)

Vincent Kelly said

at 10:02 am on Oct 20, 2010

This is very interesting and timely. I think providing a mapping tool for contributors is essential, particularly if it can separate out different levels of data, i.e. from mandatory to CDWA or similar. Which brings me onto a point I wanted to make about the BBC/PCF developing the Your Paintings website, which will be making use of data provided by museums and galleries. Contributors have been told that data should comply with Getty CDWA Lite standards; does this add any weight to the argument for adopting CDWA Lite as a core standard that can be built on for data mapping etc.? The other issue that should not be forgotten is that if you develop a new standard for exporting and importing data, it should be presented in a way that is manageable for smaller organisations who may not have the resources of staff or money to produce data exports. That's my twopennorth worth to get things started.

Mia said

at 5:19 pm on Oct 24, 2010

Breaking into parts cos you can only leave 1000 characters worth of comment...

I don't have a sense of what contributors are doing when they contribute, and how closely the profile might match what they're already doing - are they generally contributing because it's a project requirement, or because they need to use and publish the data elsewhere (as for Exploring 20th Century London), as part of contributing to Europeana... ? If you don't intend to re-use the data yourself, or to offer it to others for re-use, then your requirements are going to be different to organisations that are using PNDS to publish for re-use.

Some other thoughts - I'm considering using LIDO for things like http://museum-api.pbworks.com/Science-Museum-linked-data so a mapping or transformation on ingest would be good.

On our pages, we're aiming for lots of links (ideally qualified e.g. maker, used by, inventor etc) to subject authorities like people, places and events. Support for links to subject authorities (again, as with X20CL) would be really useful.

Mia said

at 5:25 pm on Oct 24, 2010

Can you give providers guidance on whether they're meant to be providing metadata about an object or its representation? Or add a flag - object, surrogate/representation, mixed? Can you bundle different representations of the same object together in a meta-record?

My gut instinct is that DC:Identifier is useful but I haven't looked at europeana:isShownAt, europeana:isShownBy.

At linking museums meetups, developers always ask for licence info, so I'd suggest making it mandatory - it does mean an organisation has to decide on a licence, but hopefully they'd be doing that anyway. It is an area where guidance would be useful - perhaps the Collections Trust will be doing that?

Finally, what about the Europeana Data Model (EDM)? I know it's not meant to replace domain-specific schemas, but as a common output format for discoverability it's something we'd have to look at.

Mia said

at 5:25 pm on Oct 24, 2010

All very rough thoughts, sorry! I don't spend enough time in this space to have exactly the right terms so I may have munged concepts a bit.

cheers, Mia

jeremy said

at 8:41 pm on Oct 24, 2010

Yes, as Mia says, check out EDM 5.2 if you've not already done so. ESE will be deprecated, though EDM will be backwards compatible. It's concerned with modelling data within systems whilst LIDO is concerned with data interchange; they should work well together.
Cheers, Jeremy

andy.powell@... said

at 3:07 pm on Nov 1, 2010

Without knowing too much about the details, it looks to me as though the problems described in the "Why change?" section are largely caused by the flat-world model that underpins the use of DC metadata here. What I mean by that is that there is no 'application profile' model of the entities that need to be described (physical items, digital items, physical collections, digital collections, people, organisations, etc.), the key relationships between those entities, and the properties that should be used to describe each of them. I think it is inappropriate to 'blame' DC for this problem. It is the application of DC that is at fault (I say this as someone who was probably responsible in some way for the current usage!).

andy.powell@... said

at 3:11 pm on Nov 1, 2010

To move forward... I think you need to model the world you are interested in exposing (in terms of entities and relationships) and then agree a set of properties to describe each of the entities - this needs to be done against a statement of what functional requirements you are trying to meet. This should be done in line with RDF and Linked Data. It may be possible to short-cut this process by adopting (or adapting) the EDM from Europeana - but only if that model meets your functional needs.

Mia said

at 12:55 am on Nov 2, 2010

I suspect the EDM isn't rich enough to make the data re-usable in its own right, though it sounds useful for discovering resources in an aggregated environment.

I've been thinking about it, and I have a feeling that putting something out there and letting people try it (ideally in a space where you can watch them using it and capture requirements on the spot) might be the best way to work up a robust model. Dealing with possibly competing needs and/or preferences might be tricky, but bringing it back to the needs of your core users/stakeholders usually helps.

huberrob said

at 10:50 am on Nov 3, 2010

I don't know if this helps, but I have played a while with LIDO, DC, museumdat and other standards in order to use them as exchange formats for my CollectConcept toy. Some examples for LIDO in the wild can be found here: http://www.collectconcept.de/index.php?articleid=4
I have not yet enabled OAI for LIDO, but you can visit the OAI here: http://www.collectconcept.de/oai.php?verb=ListMetadataFormats to see more DC and museumdat examples

Richard Light said

at 11:56 am on Nov 3, 2010

I've just produced a test LIDO 1.0 file from real (if outdated) museum data, see:

http://light.demon.co.uk/wt1990-lido.xml

Having done this, my feeling is that LIDO suffers from the same problem as full CIDOC-CRM for Linked Data applications: the data is too widely separated by structure. (Whereas typical DC-based LD apps, dbpedia, etc. suffer the opposite problem: simple lists of triples, i.e. not enough structure to support co-contextual relationships.)
