LOD-LAM Messy data and same-as

Saved by Mia on June 2, 2011
 

Attendees: Mia Ridge, Shawn, David Henry, Rob Warren, John Deck, Eric Kansa, Asa Letourneau, Lisa Dawn Colvin, Gerry Parsons, Antoine Isaac

 

Intros round: issues around data that's beyond messy - some data is unknown, lots of ambiguity, lots of gaps in 'authoritative' data, lots of incompleteness, what are the semantics of 'same-as', separating facts from policies.

 

First question - how clean does data have to be before you move forward? If it's the best information you have, can you still publish it?

 

Learning to live with messiness: making it ok for things to be messy, because otherwise only a tiny percentage of data will ever be released.

 

What do you do when data doesn't fit the model - there's a gap between what currently fits into the schema and what's too messy.

 

Finding ways to manage ambiguity - ways to store information as it's resolved.

 

Finding the right point to assert certainty in data, and how to mark things as unknown/messy.

 

Feeding corrections back - have to think through lifecycle of data - not just putting it out there but being able to ingest any improvements back in. [Action point - recommendations or models for different LAM domains?] - otherwise lost opportunities (again, theme from crowdsourcing session).  Which leads to managing provenance of data or corrections (again)...

 

Using something like Freebase as a way of linking information in different repositories...

 

Gate-keeping around core collections systems (and the documentation backlog) might lead to a layered model for storing additional information (still doesn't deal with corrections to core records, but it's a start) - 'thou shalt not touch a record'. External systems for discovery of other links, e.g. through other APIs, crowdsourced data... Being able to follow links - the first step to cleaning up data is being able to link to other sources... Store the same-as links in other services.
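
A rough sketch of that layered pattern in Python with rdflib (all the example.org URIs, graph names, and the VIAF match are invented for illustration): the core record sits in one named graph that is never edited, while later-discovered same-as links live in a separate graph that can be published, corrected, or discarded independently.

```python
from rdflib import Dataset, Namespace, URIRef
from rdflib.namespace import OWL

EX = Namespace("http://example.org/")      # hypothetical namespace
VIAF = Namespace("http://viaf.org/viaf/")  # hypothetical link target

ds = Dataset()

# Core collection record: treated as read-only ('thou shalt not touch').
core = ds.graph(URIRef("http://example.org/graph/core"))
core.parse(data="""
    @prefix ex: <http://example.org/> .
    @prefix dcterms: <http://purl.org/dc/terms/> .
    ex:object42 dcterms:title "Ceremonial mask" .
""", format="turtle")

# Layer of additional links kept outside the core system; it can be
# published, corrected, or thrown away without touching the record.
links = ds.graph(URIRef("http://example.org/graph/links"))
links.add((EX.object42, OWL.sameAs, VIAF["12345"]))  # hypothetical match

print(ds.serialize(format="trig"))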

 

Is it an open secret that same-as is already messy?  Idealistic to assume that everything already asserted is true?

 

But what if you need to reason with it?  Already need to pull data in and deal with it locally before trying to do reasoning with it?

 

Different levels of messiness - some might be more acceptable than others. Need to distinguish between metadata about the resource itself (e.g. its creation date) and metadata about the descriptions of the resource - some data providers haven't done this for e.g. Europeana, and it causes problems for attempts at reasoning.
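
To make the distinction concrete, a small rdflib sketch (all URIs invented): the object and the catalogue record that describes it get separate identifiers, so a creation date attaches to the object while an edit date attaches to the record.

```python
from rdflib import Graph, Literal, Namespace
from rdflib.namespace import DCTERMS, FOAF, XSD

EX = Namespace("http://example.org/")  # hypothetical

g = Graph()

# Metadata about the resource itself: the object was made around 1850.
g.add((EX.object42, DCTERMS.created, Literal("1850", datatype=XSD.gYear)))

# Metadata about the description of the resource: the catalogue record
# (a separate node) was last edited in 2011.
g.add((EX.record42, FOAF.primaryTopic, EX.object42))
g.add((EX.record42, DCTERMS.modified,
       Literal("2011-06-02", datatype=XSD.date)))

print(g.serialize(format="turtle"))
```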

 

Adding provenance information to same-as? 

 

Same-as vs 'might be the same-as' or 'is quite similar to'...
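
SKOS already provides weaker alternatives to owl:sameAs for exactly this situation - though strictly its matching properties are defined for concepts. A sketch (rdflib, invented URIs) of picking the predicate that matches the actual level of confidence:

```python
from rdflib import Graph, Namespace
from rdflib.namespace import OWL, SKOS

EX = Namespace("http://example.org/")     # hypothetical
OTHER = Namespace("http://example.net/")  # hypothetical

g = Graph()

# Strict identity: everything said about one is true of the other.
g.add((EX.person1, OWL.sameAs, OTHER.person1))

# Weaker claims for when identity isn't certain.
g.add((EX.concept2, SKOS.exactMatch, OTHER.concept2))  # interchangeable
g.add((EX.concept3, SKOS.closeMatch, OTHER.concept3))  # 'quite similar to'

print(g.serialize(format="turtle"))
```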

 

What's the minimum level of data quality to put something up on the web - when is it too poor to be useful?

 

Do you need to track provenance at the level of a triple or a set of triples?
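
At the level of a single triple, RDF reification is the classic (if verbose) option. A sketch with rdflib; the URIs and the ex:assertedBy property are invented:

```python
from rdflib import BNode, Graph, Literal, Namespace
from rdflib.namespace import OWL, RDF

EX = Namespace("http://example.org/")  # hypothetical

g = Graph()
g.add((EX.a, OWL.sameAs, EX.b))

# Reify the triple so statements can be made about it individually.
stmt = BNode()
g.add((stmt, RDF.type, RDF.Statement))
g.add((stmt, RDF.subject, EX.a))
g.add((stmt, RDF.predicate, OWL.sameAs))
g.add((stmt, RDF.object, EX.b))
g.add((stmt, EX.assertedBy, Literal("record-matching script, 2011-06")))

print(g.serialize(format="turtle"))
```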

 

Design patterns for provenance and named graphs [action point?]
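
For provenance over a set of triples, named graphs are the lighter-weight pattern: one statement about the graph covers everything in it. A sketch using rdflib's Dataset and the W3C PROV vocabulary (the example.org URIs and graph names are invented):

```python
from rdflib import Dataset, Literal, Namespace, URIRef
from rdflib.namespace import OWL, XSD

EX = Namespace("http://example.org/")           # hypothetical
PROV = Namespace("http://www.w3.org/ns/prov#")  # W3C PROV vocabulary

ds = Dataset()

# All the links from one matching run go into a single named graph...
batch = URIRef("http://example.org/graph/match-run-1")
g = ds.graph(batch)
g.add((EX.a, OWL.sameAs, EX.b))
g.add((EX.c, OWL.sameAs, EX.d))

# ...and provenance is stated once, about the graph itself.
meta = ds.graph(URIRef("http://example.org/graph/meta"))
meta.add((batch, PROV.wasAttributedTo, EX.matcherService))
meta.add((batch, PROV.generatedAtTime,
          Literal("2011-06-02T20:31:00", datatype=XSD.dateTime)))

print(ds.serialize(format="trig"))
```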

 

Maybe you can't get away from having a human in the loop to find and reason across data sets - is it a pipe dream? Or can you encode provenance and parameterise trust for data sets?  Or create the equivalent of 'page rank' for data sets?

 

Marking inferred vs created data... e.g. data generated by reasoning up and down the graph.
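
The same named-graph trick can keep asserted and inferred triples apart. A sketch (rdflib, invented URIs) that materialises the symmetric closure of sameAs into its own graph, so machine-generated data can be regenerated or dropped wholesale:

```python
from rdflib import Dataset, Namespace, URIRef
from rdflib.namespace import OWL

EX = Namespace("http://example.org/")  # hypothetical

ds = Dataset()
asserted = ds.graph(URIRef("http://example.org/graph/asserted"))
inferred = ds.graph(URIRef("http://example.org/graph/inferred"))

asserted.add((EX.a, OWL.sameAs, EX.b))

# sameAs is symmetric, so derive the reverse links - but keep them in
# the separate 'inferred' graph rather than mixing them with hand-made data.
for s, _, o in asserted.triples((None, OWL.sameAs, None)):
    inferred.add((o, OWL.sameAs, s))

print(ds.serialize(format="trig"))
```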

 

Concluding that we need the ability to record confidence and provenance before we can reason confidently.
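
Pulling those threads together, a closing sketch (rdflib again; the ex:confidence property and all URIs are invented) that records a confidence score alongside provenance for a graph of links - the minimum a consumer would need to decide what to trust before reasoning:

```python
from rdflib import Dataset, Literal, Namespace, URIRef
from rdflib.namespace import OWL, XSD

EX = Namespace("http://example.org/")           # hypothetical
PROV = Namespace("http://www.w3.org/ns/prov#")

ds = Dataset()
links = URIRef("http://example.org/graph/links-001")
ds.graph(links).add((EX.a, OWL.sameAs, EX.b))

# Confidence and provenance recorded once, on the graph of links.
meta = ds.graph(URIRef("http://example.org/graph/meta"))
meta.add((links, PROV.wasAttributedTo, EX.volunteerTranscribers))
meta.add((links, EX.confidence, Literal("0.7", datatype=XSD.decimal)))

# A consumer can now choose: only reason over graphs above a threshold.
trusted = [g for g, conf in meta.subject_objects(EX.confidence)
           if float(conf) >= 0.7]
print(trusted)
```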
