
LOD-LAM Messy data and same-as

Page history last edited by Mia 10 years, 7 months ago

Attendees: Mia Ridge, Shawn, David Henry, Rob Warren, John Deck, Eric Kansa, Asa Letourneau, Lisa Dawn Colvin, Gerry Parsons, Antoine Isaac


Intros round: issues around data that's beyond messy - some data is unknown, lots of ambiguity, lots of gaps in 'authoritative' data, lots of incompleteness, what are the semantics of 'same-as', separating facts from policies.


First question - how clean does data have to be before you move forward? If it's the best information you have, can you still publish it?


Learning to live with messiness: making it ok for things to be messy, because otherwise only a tiny percentage of data will ever be released.


What do you do when data doesn't fit the model - there's a gap between what currently fits into the schema and what's too messy.


Finding ways to manage ambiguity - ways to store information as it's resolved.


Finding right point to assert certainty in data and how to mark things as unknown/messy.


Feeding corrections back - have to think through lifecycle of data - not just putting it out there but being able to ingest any improvements back in. [Action point - recommendations or models for different LAM domains?] - otherwise lost opportunities (again, theme from crowdsourcing session).  Which leads to managing provenance of data or corrections (again)...


Using something like Freebase as way of linking information in different repositories...


Gate-keeping around core collections systems (and documentation backlogs) might lead to a layered model for storing additional information - it doesn't solve corrections to core records, but it's a start - 'thou shalt not touch a record'. External systems for discovery of other links, e.g. through other APIs, crowdsourced data... Being able to follow links - the first step to cleaning up data is being able to link to other sources... Store the same-as links in other services.
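A minimal sketch of the layered model described above, in Python: core records stay read-only, while same-as links discovered elsewhere (other APIs, crowdsourcing) live in a separate layer keyed by the core record's URI. All class and identifier names here are illustrative, not from any specific collections system.

```python
from collections import defaultdict

class LinkLayer:
    """Stores externally discovered same-as links without modifying core records."""

    def __init__(self):
        # record URI -> set of (external URI, source of the assertion)
        self._links = defaultdict(set)

    def add_same_as(self, record_uri, external_uri, source):
        # Record who asserted the link so it can be reviewed or revoked later.
        self._links[record_uri].add((external_uri, source))

    def links_for(self, record_uri):
        # Sorted so results are stable for display or export.
        return sorted(self._links[record_uri])

layer = LinkLayer()
layer.add_same_as("museum:obj/42", "http://viaf.org/viaf/12345", "crowdsourced")
layer.add_same_as("museum:obj/42", "http://dbpedia.org/resource/Example", "api-match")
print(layer.links_for("museum:obj/42"))
```

Because the layer only references core record URIs, the 'thou shalt not touch a record' rule is respected while links still accumulate.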


Is it an open secret that same-as is already messy?  Idealistic to assume that everything already asserted is true?


But what if you need to reason with it?  Already need to pull data in and deal with it locally before trying to do reasoning with it?


Different levels of messiness, some might be more acceptable than others.  Need to distinguish between metadata about the resource (and e.g. creation dates etc for it) and metadata about the descriptions of the resource - some data providers haven't done this for e.g. Europeana and it causes problems for attempts at reasoning.


Adding provenance information to same-as? 


Same-as vs 'might be the same-as' or 'is quite similar to'...
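One way to sketch graded identity links: store each link with an explicit relation and a confidence score, so 'might be the same as' and 'is quite similar to' can coexist with exact matches. The relation names and scores below are illustrative; SKOS offers real analogues (skos:exactMatch, skos:closeMatch).

```python
# (subject, relation, object, confidence) - illustrative identifiers.
links = [
    ("museum:obj/42", "exactMatch", "viaf:12345", 0.99),
    ("museum:obj/42", "closeMatch", "dbpedia:Example", 0.7),
    ("museum:obj/43", "possibleMatch", "viaf:67890", 0.4),
]

def usable_links(links, min_confidence):
    """Filter links to those confident enough for a given application."""
    return [l for l in links if l[3] >= min_confidence]

# A cautious reasoner might only accept near-certain identity links:
print(usable_links(links, 0.9))
```

A discovery interface could use a lower threshold than a reasoner, so the same messy links serve both.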


What's the minimum level of data quality to put something up on the web - when is it too poor to be useful?


Do you need to track provenance at the level of a triple or a set of triples?


Design patterns for provenance and named graphs [action point?]
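A minimal sketch of the named-graph pattern mentioned above: statements are grouped into graphs, and provenance (source, confidence) is attached once per graph rather than once per triple. All identifiers are illustrative.

```python
# Quads: (subject, predicate, object, graph) - the fourth element names
# the graph the statement belongs to.
quads = [
    ("obj:42", "dc:creator", "person:7", "graph:museumA"),
    ("obj:42", "owl:sameAs", "viaf:12345", "graph:crowd2011"),
]

# Provenance is recorded per graph, not per triple.
graph_provenance = {
    "graph:museumA": {"source": "Museum A catalogue", "confidence": "high"},
    "graph:crowd2011": {"source": "crowdsourcing round 2011", "confidence": "medium"},
}

def provenance_of(quad):
    """Look up the provenance for the graph a statement belongs to."""
    return graph_provenance[quad[3]]

print(provenance_of(quads[1])["source"])
```

This answers the triple-vs-set question pragmatically: provenance attaches to the set (graph), and a single triple can still get its own graph when needed.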


Maybe you can't get away from having a human in the loop for finding and reasoning across data sets - is it a pipe dream? Or can you encode provenance and parameterise trust for data sets?  Or create the equivalent of 'page rank' for data sets?
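A sketch of what 'parameterising trust' could mean in practice: each dataset gets a trust weight, and a claim is accepted when the combined weight of the datasets asserting it crosses a threshold. The weights and threshold are illustrative knobs, not an established algorithm.

```python
# Trust weights per dataset (illustrative values).
dataset_trust = {"museumA": 0.9, "aggregatorB": 0.6, "crowd": 0.4}

# Each claim maps to the datasets that assert it.
claims = {
    ("obj:42", "sameAs", "viaf:12345"): ["museumA", "crowd"],
    ("obj:43", "sameAs", "viaf:99999"): ["crowd"],
}

def accepted(claim, threshold=0.8):
    """Accept a claim when the summed trust of its asserters meets the threshold."""
    score = sum(dataset_trust[d] for d in claims[claim])
    return score >= threshold

print(accepted(("obj:42", "sameAs", "viaf:12345")))  # 0.9 + 0.4 = 1.3
print(accepted(("obj:43", "sameAs", "viaf:99999")))  # 0.4 alone
```

The human stays in the loop by setting the weights and threshold rather than reviewing every link.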


Marking inferred vs created data... eg generated data created by reasoning up and down the graph.
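A sketch of marking asserted vs inferred statements: here the inferred set comes from closing same-as links under symmetry and transitivity, so a consumer can distinguish original claims from reasoning products. The pairs are illustrative.

```python
# Same-as pairs asserted directly by data providers (illustrative).
asserted = {("a", "b"), ("b", "c")}

def infer_closure(pairs):
    """Return same-as pairs derivable by symmetry and transitivity,
    excluding the originally asserted pairs."""
    linked = set(pairs) | {(b, a) for a, b in pairs}  # symmetry
    changed = True
    while changed:  # naive transitive closure, fine for a sketch
        changed = False
        for a, b in list(linked):
            for c, d in list(linked):
                if b == c and a != d and (a, d) not in linked:
                    linked.add((a, d))
                    changed = True
    return linked - set(pairs)

inferred = infer_closure(asserted)
print(("a", "c") in inferred)  # derived via "b", so tagged as inferred
```

Keeping the two sets separate means a bad asserted link can be retracted and the inferred set simply recomputed.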


Concluding that we need the ability to record confidence and provenance before we can reason confidently.  URIs for assertions... at what point does the model start to break?  A new version of RDF?


Conclusion: don't feel bad about your messy data; work together to fix it and share solutions.  Using the data and testing it is a great way to start fixing it.
