Wikidata:Events/Data Quality Days 2022/Conversation1


Conversation #1: Round-tripping data

Facilitation: Manuel Merz, Lydia Pintscher

👥 Number of participants (including speakers): at 10:58, 11 people; at 11:00, 14; at 11:13, 17; at session end, 14

🎯 Key takeaways and outcomes

☑️ Action plan

  • Next steps
    • Talk about creating a new property for contacting external sources (for contact information, website page, or other means)
    • Create landing page on Wikidata for all of this
  • Where? When? 
    • We will dedicate the open session on Sunday 10:00 UTC to this

🖊️ General notes

  • There are excellent gold-standard sources out there that Wikidata can use, but even those make mistakes. The same is true if you use data from Wikidata for your own project. Therefore, syncs from and to Wikidata should ideally go in both directions (so called “data round-tripping”). Unfortunately, it is currently not as simple as it should be to set this up sustainably.
  • Goals of the session:
    • Collect existing hurdles for setting up round-tripping
    • Additional collection (likely no focus in the discussion)
      • Share examples of where this works particularly well already (we could use these for sharing best practices)
      • Collect examples where building up new syncs with external sources would be of great benefit to data quality
    • Discuss how we can help users who want to set up sustainable round-tripping
    • Find allies to improve the status-quo

❓ Did you like this new format? What can we improve?

  • Probably we need longer sessions ... Yes we do. +1
  • It would be great to have some of the external data source providers participate in the session next time.
  • Suggestion: More free form sessions

Issue #1: Sometimes it's not easy to contact external data providers

💬 Discussion about this issue

  • Description: For many properties we have no contact information for reporting possible errors/mismatches, or if there is some, the reports often get ignored. A collected list of possible ways to contact the data providers in question would be nice (for those where we already have a working contact); for the others we will need to establish such a connection.
  • This is the main problem: establishing a connection with providers that give no hint about how to report mistakes. Sometimes providers that do offer a way to send reports don't actually read them, so establishing a true contact with them is also needed (9 July 2022, 11:23)
  • Is this a topic to have?
    • Round tripping should be something that goes around and around automatically 
      • The other person would be involved anyways
    • To set up round-tripping you need to make contact first, and even when the workflow is in place, there are exceptions that need discussion
    • Looking at examples (e.g. Sweden), it's not only hard to make contact; it requires a huge buy-in before you reach this collaboration point. Maybe this is thinking too small? We would only have round-tripping with people who already want active participation with Wikidata
    • It is also something that interests Wikidata regarding data quality, especially big databases (VIAF, ISNI, bibliographical databases, big libraries). There are sometimes data inputs that go to Wikidata; if the data is wrong we can mark it as deprecated, but we keep it so that we can avoid future issues. With round-tripping, though, we could avoid having this deprecated data at all
    • ^ +1 from chat
    • Yupik says: Jan: I have to contact externals often for Saami-related stuff, because the information is often just plain wrong.
      • maybe they can be a source for roundtripping?
      • maybe, but is that round-tripping or "ordinary data correction"?
    • Round tripping = gold standard?
      • Lydia: not sure; whichever source we round-trip with, there will be an adjustment phase, unless we get a source whose data is just so perfect/amazing
        • Jan Ainali (User:Ainali) says: I would be surprised if we found such a high-quality data source, Lydia 😃
      • Manuel: So we should start with excellent sources where we think Wikidata should adopt their data
      • Lydia: If these are accessible, sure? But is it feasible? What about licensing, machine readability, and all these secondary issues?
      • Manuel: They can also make mistakes! For data sources with more frequent corrections, automated interactions could make sense.
      • Manuel: What was your experience with your sources?
        • Jan: I wasn't involved in round-tripping, ??? Do we now have a way to record, per property, how to report an error? That would be a good start for it, perhaps. The Swedish MP data are good, but you have to look on their page for contact information because we don't have it at the moment; we can only record the way/email to contact the service on the property talk page
        • In this case it would also be machine-readable...
        • Lydia: let's bring this to tomorrow's free slot?
        • Camillo: a single property is difficult because there are at least two possible datatypes: email and website (online web form); so maybe two properties (https://de.wikipedia.org/wiki/Wikipedia:GND/Fehlermeldung)
        • Do we have an idea on why an institution might want to collaborate and why they might not want it?
          • One way is that they're improving their data too, so more eyes (i.e. Wikidata's eyes) are good
          • Or they benefit from the connections that Wikidata brings
          • error correction is big (less so that they gain new data). Could be that they miss something and Wikidata can help.
        • Good food for thought for the chapters!
        • Camillo (Epìdosis) says: a simple example of the sync problem: the Vatican Library (https://opac.vatlib.it/) has an excellent authority file, vital for a lot of religious authors; but it has some mistakes and no way at all to report them. I have sent mistakes to at least three persons and offices, with no result ...
          • Vatican is VERY top-down. They were opening up to VIAF and for them this was a huge step forward. They are still focused on their data and making sure it's the best it can be. No outward indication that they can make mistakes (and this is unlikely to change) 
          • Sotho Tal Ker says: Maybe the WMF itself has to step in as an organization, instead of a small user
    • Sotho Tal Ker says: First we have to establish a stable contact, then we can try to create an automatic process to provide corrections to them, and they to us.
      • Sotho Tal Ker says: Maybe we should start by collecting contact information for these data providers first; then we can see if they are even interested in data round-tripping, especially if we want to create a tool.
      • [discussion about if it's possible for WMF and/or chapters to step in to help]
    • Camillo: we need some way to report mistakes both in a semi-automated way (on the basis of constraints) and in a manual way (for more problematic issues)

Issue #2: A new tool could possibly improve our sync with external databases?

💬 Comments about this issue

  • From the comments
    • some features of a possible new tool:
      • allows access for Wikidata users and for interested institutions managing the databases on which one or more external IDs on Wikidata are based
      • two types of reports from Wikidata to the external databases:
        • Wikidata users can report an issue manually, through the tool interface or directly from a Wikidata item through a dedicated gadget
        • an automatic system periodically harvests constraint violations and creates an automatic issue for each of them
        • the automatic system should also harvest statements referenced with the database but deprecated with qualifier P2241: Q29998666 (reason: error in referenced source)
        • the institution can solve reports manually: each employee of the institution can log in and solve issues
        • possible improvement: give the institution some possibility to solve reports also semiautomatically
    • Please do not make the tool work based solely on string matching labels. Add in other checks too or we will end up with a lot of conflation again when the tool is used by people unfamiliar with the subject. Or then like with the scholarly articles, no connection between the author and their item...
    • The tool would be mainly meant to overcome the problem of "not easy to contact external data providers"
    • no string matching labels, of course
    • if an ID X is present in an item Y, and the ID X contains something wrong, then it could be reported through the tool
  • ...The tool is the second step imho, after we established stable contacts from #1. As already mentioned, some sources seem not to be interested in corrections. Also the tool should only provide hints, but not do fully automated edits/reports. I still want a human to look at the data first. :)
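The automatic harvesting step sketched in the feature list can be expressed as a SPARQL query against the Wikidata Query Service: find statements of a given external-ID property that carry deprecated rank with P2241 (reason for deprecated rank) = Q29998666 (error in referenced source). This is only an illustration of the idea, not part of any existing tool; the `harvest` helper and its User-Agent string are invented for the example.

```python
import json
import urllib.parse
import urllib.request

WDQS = "https://query.wikidata.org/sparql"

# Deprecated statements of one external-ID property, where the stated
# deprecation reason (qualifier P2241) is "error in referenced source".
QUERY_TEMPLATE = """
SELECT ?item ?extId WHERE {{
  ?item p:{prop} ?statement .
  ?statement ps:{prop} ?extId ;
             wikibase:rank wikibase:DeprecatedRank ;
             pq:P2241 wd:Q29998666 .
}}
LIMIT 100
"""

def build_query(prop: str) -> str:
    """Fill the template for one external-ID property (e.g. 'P227' for GND ID)."""
    return QUERY_TEMPLATE.format(prop=prop)

def harvest(prop: str) -> list:
    """Run the query against the Wikidata Query Service (needs network access)."""
    url = WDQS + "?" + urllib.parse.urlencode(
        {"query": build_query(prop), "format": "json"}
    )
    req = urllib.request.Request(
        url, headers={"User-Agent": "round-tripping-sketch/0.1 (example)"}
    )
    with urllib.request.urlopen(req, timeout=60) as resp:
        return json.load(resp)["results"]["bindings"]
```

A periodic job could run `harvest` per external-ID property and turn each binding into an issue for the corresponding institution.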

Issue #3: Modeling differences between the other data source and Wikidata

💬 Comments about this issue

  • Description: +1 (VIGNERON), modeling or granularity differences can be very painful. Example: human and person (only one item on Wikidata, one item per person in some bibliographical databases, e.g. the IFLA-LRM model)
    • pseudonyms are usually handled as separate entities, e.g. in ISNI and GND
    • +1 on the problem of pseudonyms, on Wikidata they are usually in the same item as the person unless some Wikipedia has a specific article for the pseudonym
  • Related: We need to map the other data source and Wikidata to each other (correctly).
    • Mergeable with "Modeling differences between the other data source and Wikidata"?
      • I think those are two different steps. We have to put in the work to map even if the modeling was the same in both.
  • How do we model the differences between different data sources? Where one data source and Wikidata have a 1-to-1 mapping, but some other source has it many-to-1?

Issue #4: It is hard to monitor recent changes for a specific property

💬 Comments about this issue
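One possible workaround, sketched here rather than raised in the session: Wikidata's autogenerated edit summaries link the edited property (e.g. `[[Property:P569]]`), so the recent-changes feed can be filtered on the summary text. The helper names and User-Agent string are invented, and the filter is only a heuristic, since hand-written summaries need not mention the property.

```python
import json
import urllib.parse
import urllib.request

API = "https://www.wikidata.org/w/api.php"

def summary_mentions_property(comment: str, prop: str) -> bool:
    """Autogenerated Wikidata edit summaries link the edited property,
    e.g. '/* wbsetclaim-update:2||1 */ [[Property:P569]]: 11 March 1952'.
    Heuristic only: free-text summaries may not mention the property."""
    return f"[[Property:{prop}]]" in comment

def recent_changes_for_property(prop: str, limit: int = 500) -> list:
    """Fetch recent changes and keep edits whose summary links `prop`
    (needs network access)."""
    params = urllib.parse.urlencode({
        "action": "query",
        "list": "recentchanges",
        "rcprop": "title|comment|timestamp",
        "rclimit": limit,
        "format": "json",
    })
    req = urllib.request.Request(
        API + "?" + params,
        headers={"User-Agent": "property-watch-sketch/0.1 (example)"},
    )
    with urllib.request.urlopen(req, timeout=60) as resp:
        changes = json.load(resp)["query"]["recentchanges"]
    return [c for c in changes
            if summary_mentions_property(c.get("comment", ""), prop)]
```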

Issue #5: Prevent IP addresses from editing identifiers to preserve data integrity

💬 Comments about this issue

  • -1 there is no proven data about IPs doing more harm than good; most IP vandalism is connected to highly sensitive items, just as happens on Wikipedia (9 July 2022, 12:17)
  • (Epìdosis) I would suggest maybe preventing IPs from editing existing identifiers (both changing and removing them), but not preventing them from adding new identifiers (9 July 2022, 12:17)
  • German Wikipedia has a system called "Sichten", where untrusted edits have to be approved before they are shown in the general article view (to not-logged-in users). Maybe make edits from IPs and new users on mobile a two-step process in which edits have to be approved by a well-known editor first. (9 July 2022, 12:20)
  • it could be a good solution, if technically applicable (9 July 2022, 12:21)
  • Currently, patrolling is an option, but it seems very rarely used and the process itself is not so straightforward: https://www.wikidata.org/wiki/Wikidata:Patrol

Full documentation of the board

Existing hurdles for setting up round-tripping 

  • Sometimes it is not easy to contact external data providers (6❤️)
    • For many properties we have no contact information where to report possible errors/mismatches or if there is, the reports often get ignored.
    • A collected list of possible ways to contact the data providers in question would be nice (for those where we already have a working contact), for others we will need to establish such a connection.
      • This is the main problem: establish a connection with providers which don't provide any hint for reporting mistakes; sometimes also providers giving a way to send reports in fact don't read those reports, so also establishing a true contact with them is needed
  • A new tool could possibly improve our sync with external databases? (5❤️)
    • some features of a possible new tool:
      • allows access for Wikidata users and for interested institutions managing the databases on which one or more external IDs on Wikidata are based
      • two types of reports from Wikidata to the external databases:
        • Wikidata users can report an issue manually, through the tool interface or directly from a Wikidata item through a dedicated gadget
        • an automatic system periodically harvests constraint violations and creates an automatic issue for each of them
          • the automatic system should also harvest statements referenced with the database but deprecated with qualifier P2241: Q29998666 (reason: error in referenced source)
      • the institution can solve reports manually: each employee of the institution can log in and solve issues
      • possible improvement: give the institution some possibility to solve reports semi-automatically as well
    • Please do not make the tool work based solely on string matching labels. Add in other checks too or we will end up with a lot of conflation again when the tool is used by people unfamiliar with the subject. Or then like with the scholarly articles, no connection between the author and their item...
    • The tool would be mainly meant to overcome the problem of "not easy to contact external data providers"
    • no string matching labels, of course
    • if an ID X is present in an item Y, and the ID X contains something wrong, then it could be reported through the tool
  • Modeling differences between the other data source and Wikidata (4❤️)
    • +1 (VIGNERON), modeling or granularity differences can be very painful. Example: human and person (only one item on Wikidata, one item per person in some bibliographical databases, e.g. the IFLA-LRM model)
    • pseudonyms are usually handled as separate entities, e.g. in ISNI and GND
    • +1 on the problem of pseudonyms, on Wikidata they are usually in the same item as the person unless some Wikipedia has a specific article for the pseudonym
  • Prevent IP addresses from editing identifiers to preserve data integrity. (3❤️)
    • -1 there is no proven data about IPs doing more harm than good; most IP vandalism is connected to highly sensitive items, just as happens on Wikipedia
    • (Epìdosis) I would suggest maybe preventing IPs from editing existing identifiers (both changing and removing them), but not preventing them from adding new identifiers
    • German Wikipedia has a system called "Sichten", where untrusted edits have to be approved before they are shown in the general article view (to not-logged-in users). Maybe make edits from IPs and new users on mobile a two-step process in which edits have to be approved by a well-known editor first.
    • it could be a good solution, if technically applicable
  • We need to map the other data source and Wikidata to each other (correctly). (2❤️)
    • Mergeable with "Modeling differences between the other data source and Wikidata"?
    • I think those are two different steps. We have to put in the work to map even if the modeling was the same in both.
    • How do we model the differences between different data sources? Where one data source and Wikidata have a 1-to-1 mapping, but some other source has it many-to-1?

Examples of where this already works particularly well 

  • Genewiki's curation workflows (2❤️)

Examples where building up new syncs with external sources would be of great benefit to data quality 

Related issues:

  • ORCID (2❤️)
    • I really wish ORCID would connect to Wikidata by putting QIDs on ORCID pages (let me dream, OK?)

Follow-up session

Contributors: User:Epìdosis, User:Sotho_Tal_Ker, Manuel


👥 Number of participants (including speakers): 4

🖊️ Notes & links


LANDING PAGE

Title: Wikidata: Data Roundtripping

https://www.wikidata.org/wiki/Wikidata:Data_round-tripping

Structure:

  • definition of roundtripping
  • general introduction + why is this important & attractive to external sources
  • current situation and goals (how we want to change it in the future)
  • clear indication of existing ways to report external errors
  • hopefully an explanation of how the new tool works ^^
  • best practice examples
  • maybe bibliography 


Related: 


PROPERTY

  • easier to find compared to the discussion page of the property (most reachable way)


Ways to implement this

  • at least 2 properties (email, website)
  • 1 property (string value), potentially with qualifiers
  • use 2 existing properties with qualifiers
  • the simplest way to do this might be to create a link to an 'errors' wiki page, similar to https://de.wikipedia.org/wiki/Wikipedia:GND/Fehlermeldung (see last comment)
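If the single-property route with a URL datatype were chosen, the resulting statement would carry an ordinary Wikibase URL snak. A sketch of the JSON shape such a statement takes in the Wikibase data model; the property ID P9999 and the URL are placeholders invented for illustration, not real values.

```python
# Shape of a statement a hypothetical "mistake reporting URL" property
# could carry on a property's page; P9999 and the URL are placeholders.
error_report_statement = {
    "type": "statement",
    "rank": "normal",
    "mainsnak": {
        "snaktype": "value",
        "property": "P9999",      # hypothetical property ID
        "datatype": "url",
        "datavalue": {
            "type": "string",     # URL values are stored as plain strings
            "value": "https://example.org/report-errors",
        },
    },
}
```

The two-property option (email plus web form) would look the same, just with two property IDs and an `email` vs. `url` datatype split.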

TOOL


some features of a possible new tool:

  • allows access for Wikidata users and for interested institutions managing the databases on which one or more external IDs on Wikidata are based

  • two types of reports from Wikidata to the external databases:
    • Wikidata users can report an issue manually, through the tool interface or directly from a Wikidata item through a dedicated gadget
    • an automatic system periodically harvests constraint violations and creates an automatic issue for each of them
      • the automatic system should also harvest statements referenced with the database but deprecated with qualifier P2241: Q29998666 (reason: error in referenced source)
  • the institution can solve reports manually: each employee of the institution can log in and solve issues
  • possible improvement: give the institution some possibility to solve reports semi-automatically as well


=> primary goal should be a structured way of reporting (equivalent to the Mismatch Finder, but for the external sources)

  • Wikidata editor goes to property in the tool and adds an entry about a mistake

-> the institution can access this and work on it

+

  • would spare editors time (currently it's complicated: e.g. language difficulties, different processes)
  • would make it more likely that the institution can and will work on the issues (more practical than emails)
  • everyone could see the number of unsolved errors per external data source (more public than emails)
  • maybe we can use some work from the Mismatch Finder? or maybe it is more like Phabricator?
    • the Mismatch Finder also has external errors to report back (if something was reported as an error)


Structure

  • Property
    • List of issues
      • External ID
      • WD Item
      • Status (can be changed by institution)
      • Comment by the reporter (why do we think this is wrong; automatic or manual)
  • Maybe something to report systematic errors (that exist in a group of values of the external ID)
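The structure outlined above can be sketched as a small data model. All names and the status values are illustrative, not a finished schema; the status workflow in particular is an assumption about how an institution might process reports.

```python
from dataclasses import dataclass, field
from enum import Enum

class Status(Enum):
    # Possible workflow states; the institution would move reports along.
    OPEN = "open"
    IN_PROGRESS = "in progress"
    RESOLVED = "resolved"
    REJECTED = "rejected"

@dataclass
class MistakeReport:
    """One entry in a property's list of issues."""
    external_id: str               # the external ID value that looks wrong
    wd_item: str                   # QID of the Wikidata item carrying it
    comment: str                   # why we think this is wrong (automatic or manual)
    status: Status = Status.OPEN   # can be changed by the institution

@dataclass
class PropertyIssueList:
    """All reports collected for one external-ID property."""
    property_id: str                                        # e.g. "P227"
    issues: list = field(default_factory=list)              # MistakeReport entries
    systematic_errors: list = field(default_factory=list)   # free-text notes on group-level errors
```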


Mismatch table


Gadget

  • a button for each external ID value: "add mistake report"
  • should allow entering a manual report



Needs research on what the institutions need to work on this. 


❓ Questions and discussions


Q.: Should old errors remain on Wikidata (with a suitable deprecation reason) after they have been changed/fixed in the referenced source?
(Myself, I think they should, especially if the reference has an access date; but the tool should be aware of this)

Comment: The simplest property would just be a link to an existing 'errors' wiki page. The property would be useful even if only to track which properties have these.

☑️ Next steps