Wikidata:Events/Data Quality Days 2022/Conversation1


Conversation #1: Round-tripping data

Facilitation: Manuel Merz, Lydia Pintscher

👥 Number of participants (including speakers): at 10:58, 11 people; at 11:00, 14; at 11:13, 17; at session end, 14

🎯 Key takeaways and outcomes

☑️ Action plan

  • Next steps
    • Talk about creating a new property for contacting external sources (for contact information, website page, or other means)
    • Create landing page on Wikidata for all of this
  • Where? When? 
    • We will dedicate the open session on Sunday 10:00 UTC to this

🖊️ General notes

  • There are excellent gold-standard sources out there that Wikidata can use, but even those make mistakes. The same is true if you use data from Wikidata for your own project. Therefore, syncs from and to Wikidata should ideally go in both directions (so called “data round-tripping”). Unfortunately, it is currently not as simple as it should be to set this up sustainably.
  • Goals of the session:
    • Collect existing hurdles for setting up round-tripping
    • Additional collection (likely no focus in the discussion)
      • Share examples of where this works particularly well already (we could use these for sharing best practices)
      • Collect examples where building up new syncs with external sources would be of great benefit to data quality
    • Discuss how we can help users who want to set up sustainable round-tripping
    • Find allies to improve the status-quo

❓ Did you like this new format? What can we improve?

  • Probably we need longer sessions ... Yes we do. +1
  • It would be great to have some of the external data source providers participate in the session next time.
  • Suggestion: More free form sessions

Issue #1: Sometimes it's not easy to contact external data providers

💬 Discussion about this issue

  • Description: For many properties we have no contact information for reporting possible errors/mismatches, or if there is some, the reports often get ignored. A collected list of possible ways to contact the data providers in question would be nice (for those where we already have a working contact); for the others we will need to establish such a connection.
  • This is the main problem: establishing a connection with providers that give no hint about how to report mistakes. Sometimes providers that do offer a way to send reports don't actually read them, so establishing a true contact with them is also needed (9 July 2022, 11:23)
  • Is this a topic to have?
    • Round tripping should be something that goes around and around automatically 
      • The other person would be involved anyways
    • To set up round-tripping you need to make contact first, and even when the workflow is in place, there are exceptions that need discussion
    • Looking at examples (e.g. Sweden), it's not only hard to make contact; it requires a huge buy-in before you reach this collaboration point. Maybe this is thinking too small? We would only have round-tripping with people who already want active participation with Wikidata
    • It is also something that interests Wikidata regarding data quality, especially big databases (VIAF, ISNI, bibliographical databases, big libraries). There are sometimes data inputs that go to Wikidata; if the data is wrong we can mark it as deprecated, but we keep it so that we can avoid future issues. With round-tripping, though, we could avoid having this deprecated data at all
    • ^ +1 from chat
    • Yupik says: Jan: I have to contact externals often for Saami-related stuff, because the information is often just plain wrong.
      • maybe they can be a source for roundtripping?
      • maybe, but is that round-tripping or "ordinary data correction"?
    • Round tripping = gold standard?
      • Lydia: not sure; whichever source we round-trip with, there will be an adjustment phase, unless we get a source whose data is just so perfect/amazing
        • Jan Ainali (User:Ainali) says: I would be surprised if we found such a high-quality data source, Lydia 😃
      • Manuel: So we should start with excellent sources where we think Wikidata should adopt their data
      • Lydia: If these are accessible, sure? But is it feasible? What about licensing, machine readability, and all these secondary issues?
      • Manuel: They can also make mistakes! For data sources with more frequent corrections, automated interactions could make sense.
      • Manuel: What was your experience with your sources?
        • Jan: I wasn't involved in round-tripping, ??? Do we now have a way to record, per property, how to report an error? That would be a good start for it, perhaps. The Swedish MP data are good, but you have to look on their page for contact information because we don't have it at the moment; we can only record the way/email to contact the service on the property talk page
        • In this case it would also be machine-readable...
        • Lydia: let's bring this to tomorrow's free slot?
        • Camillo: a single property is difficult because there are at least two possible datatypes: email and website (online web form); so maybe two properties (https://de.wikipedia.org/wiki/Wikipedia:GND/Fehlermeldung)
        • Do we have an idea on why an institution might want to collaborate and why they might not want it?
          • One way is that they're improving their data too, so more eyes (i.e. Wikidata's eyes) are good
          • Or they benefit from the connections that Wikidata brings
          • error correction is big (less so that they gain new data). Could be that they miss something and Wikidata can help.
        • Good food for thought for the chapters!
        • Camillo (Epìdosis) says: a simple example of the sync problem: the Vatican Library (https://opac.vatlib.it/) has an excellent authority file, vital for a lot of religious authors; but it has some mistakes and no way at all to report them. I have sent mistakes to at least three persons and offices, with no result ...
          • Vatican is VERY top-down. They were opening up to VIAF and for them this was a huge step forward. They are still focused on their data and making sure it's the best it can be. No outward indication that they can make mistakes (and this is unlikely to change) 
          • Sotho Tal Ker says: Maybe the WMF itself has to step in as an organization, instead of a small user
    • Sotho Tal Ker says: First we have to establish a stable contact, then we can try to create an automatic process to provide corrections to them, and they to us.
      • Sotho Tal Ker says: Maybe we should start by collecting contact information for these data providers first; then we can see if they are even interested in data round-tripping, especially if we want to create a tool.
      • [discussion about if it's possible for WMF and/or chapters to step in to help]
    • Camillo: we need some way to report mistakes both in a semi-automated way (on the basis of constraints) and in a manual way (for more problematic issues)

Issue #2: A new tool could possibly improve our sync with external databases?

💬 Comments about this issue

  • From the comments
    • some features of a possible new tool:
      • allows access for Wikidata users and for interested institutions managing the databases on which one or more external IDs on Wikidata are based
      • two types of reports from Wikidata to the external databases:
        • Wikidata users can report an issue manually, through the tool interface or directly from a Wikidata item through a dedicated gadget
        • an automatic system periodically harvests constraint violations and creates an automatic issue for each of them
        • the automatic system should also harvest statements referenced with the database but deprecated with qualifier P2241: Q29998666 (reason: error in referenced source)
        • the institution can solve reports manually: each employee of the institution can log in and solve issues
        • possible improvement: give the institution some possibility to solve reports also semiautomatically
    • Please do not make the tool work based solely on string matching labels. Add in other checks too or we will end up with a lot of conflation again when the tool is used by people unfamiliar with the subject. Or then like with the scholarly articles, no connection between the author and their item...
    • The tool would be mainly meant to overcome the problem of "not easy to contact external data providers"
    • no string matching labels, of course
    • if an ID X is present in an item Y, and the ID X contains something wrong, then it could be reported through the tool
  • ...The tool is the second step imho, after we established stable contacts from #1. As already mentioned, some sources seem not to be interested in corrections. Also the tool should only provide hints, but not do fully automated edits/reports. I still want a human to look at the data first. :)
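The automatic harvesting step sketched in the feature list can be expressed as a SPARQL query against the Wikidata Query Service: find statements of a given external-ID property that carry deprecated rank with P2241 (reason for deprecated rank) = Q29998666 (error in referenced source). This is only an illustration of the idea, not part of any existing tool; the `harvest` helper and its User-Agent string are invented for the example.

```python
import json
import urllib.parse
import urllib.request

WDQS = "https://query.wikidata.org/sparql"

# Deprecated statements of one external-ID property, where the stated
# deprecation reason (qualifier P2241) is "error in referenced source".
QUERY_TEMPLATE = """
SELECT ?item ?extId WHERE {{
  ?item p:{prop} ?statement .
  ?statement ps:{prop} ?extId ;
             wikibase:rank wikibase:DeprecatedRank ;
             pq:P2241 wd:Q29998666 .
}}
LIMIT 100
"""

def build_query(prop: str) -> str:
    """Fill the template for one external-ID property (e.g. 'P227' for GND ID)."""
    return QUERY_TEMPLATE.format(prop=prop)

def harvest(prop: str) -> list:
    """Run the query against the Wikidata Query Service (needs network access)."""
    url = WDQS + "?" + urllib.parse.urlencode(
        {"query": build_query(prop), "format": "json"}
    )
    req = urllib.request.Request(
        url, headers={"User-Agent": "round-tripping-sketch/0.1 (example)"}
    )
    with urllib.request.urlopen(req, timeout=60) as resp:
        return json.load(resp)["results"]["bindings"]
```

A periodic job could run `harvest` per external-ID property and turn each binding into an issue for the corresponding institution.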

Issue #3: Modeling differences between the other data source and Wikidata

💬 Comments about this issue

  • Description: +1 (VIGNERON), modeling or granularity differences can be very painful. Example: human and person (only one item on Wikidata, one item per person in some bibliographical databases, e.g. the IFLA-LRM model)
    • pseudonyms are usually handled as separate entities, e.g. in ISNI and GND
    • +1 on the problem of pseudonyms, on Wikidata they are usually in the same item as the person unless some Wikipedia has a specific article for the pseudonym
  • Related: We need to map the other data source and Wikidata to each other (correctly).
    • Mergeable with "Modeling differences between the other data source and Wikidata"?
      • I think those are two different steps. We have to put in the work to map even if the modeling was the same in both.
  • How do we model the differences between different data sources? Where one data source and Wikidata have a 1-to-1 mapping, but some other source has it many-to-1?

Issue #4: It is hard to monitor recent changes for a specific property

💬 Comments about this issue
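One possible workaround, sketched here rather than raised in the session: Wikidata's autogenerated edit summaries link the edited property (e.g. `[[Property:P569]]`), so the recent-changes feed can be filtered on the summary text. The helper names and User-Agent string are invented, and the filter is only a heuristic, since hand-written summaries need not mention the property.

```python
import json
import urllib.parse
import urllib.request

API = "https://www.wikidata.org/w/api.php"

def summary_mentions_property(comment: str, prop: str) -> bool:
    """Autogenerated Wikidata edit summaries link the edited property,
    e.g. '/* wbsetclaim-update:2||1 */ [[Property:P569]]: 11 March 1952'.
    Heuristic only: free-text summaries may not mention the property."""
    return f"[[Property:{prop}]]" in comment

def recent_changes_for_property(prop: str, limit: int = 500) -> list:
    """Fetch recent changes and keep edits whose summary links `prop`
    (needs network access)."""
    params = urllib.parse.urlencode({
        "action": "query",
        "list": "recentchanges",
        "rcprop": "title|comment|timestamp",
        "rclimit": limit,
        "format": "json",
    })
    req = urllib.request.Request(
        API + "?" + params,
        headers={"User-Agent": "property-watch-sketch/0.1 (example)"},
    )
    with urllib.request.urlopen(req, timeout=60) as resp:
        changes = json.load(resp)["query"]["recentchanges"]
    return [c for c in changes
            if summary_mentions_property(c.get("comment", ""), prop)]
```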

Issue #5: Prevent IP addresses from editing identifiers to preserve data integrity

💬 Comments about this issue

  • -1 there is no proven data about IPs doing more harm than good; most IP vandalism is connected to highly sensitive items, just as happens on Wikipedia (9 July 2022, 12:17)
  • (Epìdosis) I would suggest maybe preventing IPs from editing existing identifiers (both changing and removing them), but not preventing them from adding new identifiers (9 July 2022, 12:17)
  • German Wikipedia has a system called "Sichten", where untrusted edits have to be approved before they are shown in the general article view (to not-logged-in users). Maybe make edits from IPs and new users on mobile a two-step process in which edits have to be approved by a well-known editor first. (9 July 2022, 12:20)
  • it could be a good solution, if technically applicable (9 July 2022, 12:21)
  • Currently, patrolling is an option, but it seems very rarely used and the process itself is not so straightforward: https://www.wikidata.org/wiki/Wikidata:Patrol

Full documentation of the board

Existing hurdles for setting up round-tripping 

  • Sometimes it is not easy to contact external data providers (6❤️)
    • For many properties we have no contact information where to report possible errors/mismatches or if there is, the reports often get ignored.
    • A collected list of possible ways to contact the data providers in question would be nice (for those where we already have a working contact), for others we will need to establish such a connection.
      • This is the main problem: establish a connection with providers which don't provide any hint for reporting mistakes; sometimes also providers giving a way to send reports in fact don't read those reports, so also establishing a true contact with them is needed
  • A new tool could possibly improve our sync with external databases? (5❤️)
    • some features of a possible new tool:
      • allows access for Wikidata users and for interested institutions managing the databases on which one or more external IDs on Wikidata are based
      • two types of reports from Wikidata to the external databases:
        • Wikidata users can report an issue manually, through the tool interface or directly from a Wikidata item through a dedicated gadget
        • an automatic system periodically harvests constraint violations and creates an automatic issue for each of them
          • the automatic system should also harvest statements referenced with the database but deprecated with qualifier P2241: Q29998666 (reason: error in referenced source)
      • the institution can solve reports manually: each employee of the institution can log in and solve issues
      • possible improvement: give the institution some possibility to solve reports semi-automatically as well
    • Please do not make the tool work based solely on string matching labels. Add in other checks too or we will end up with a lot of conflation again when the tool is used by people unfamiliar with the subject. Or then like with the scholarly articles, no connection between the author and their item...
    • The tool would be mainly meant to overcome the problem of "not easy to contact external data providers"
    • no string matching labels, of course
    • if an ID X is present in an item Y, and the ID X contains something wrong, then it could be reported through the tool
  • Modeling differences between the other data source and Wikidata (4❤️)
    • +1 (VIGNERON), modeling or granularity differences can be very painful. Example: human and person (only one item on Wikidata, one item per person in some bibliographical databases, e.g. the IFLA-LRM model)
    • pseudonyms are usually handled as separate entities, e.g. in ISNI and GND
    • +1 on the problem of pseudonyms, on Wikidata they are usually in the same item as the person unless some Wikipedia has a specific article for the pseudonym
  • Prevent IP addresses from editing identifiers to preserve data integrity. (3❤️)
    • -1 there is no proven data about IPs doing more harm than good; most IP vandalism is connected to highly sensitive items, just as happens on Wikipedia
    • (Epìdosis) I would suggest maybe preventing IPs from editing existing identifiers (both changing and removing them), but not preventing them from adding new identifiers
    • German Wikipedia has a system called "Sichten", where untrusted edits have to be approved before they are shown in the general article view (to not-logged-in users). Maybe make edits from IPs and new users on mobile a two-step process in which edits have to be approved by a well-known editor first.
    • it could be a good solution, if technically applicable
  • We need to map the other data source and Wikidata to each other (correctly). (2❤️)
    • Mergeable with "Modeling differences between the other data source and Wikidata"?
    • I think those are two different steps. We have to put in the work to map even if the modeling was the same in both.
    • How do we model the differences between different data sources? Where one data source and Wikidata have a 1-to-1 mapping, but some other source has it many-to-1?

Examples of where this already works particularly well 

  • Genewiki's curation workflows (2❤️)

Examples where building up new syncs with external sources would be of great benefit to data quality 

Related issues:

  • ORCID (2❤️)
    • I really wish ORCID would connect to Wikidata by putting QIDs on ORCID pages (let me dream, OK?)

Follow-up session

Contributors: User:Epìdosis, User:Sotho_Tal_Ker, Manuel


👥 Number of participants (including speakers): 4

🖊️ Notes & links


LANDING PAGE

Title: Wikidata: Data Roundtripping

https://www.wikidata.org/wiki/Wikidata:Data_round-tripping

Structure:

  • definition of roundtripping
  • general introduction + why is this important & attractive to external sources
  • current situation and goals (how we want to change it in the future)
  • clear indication of existing ways to report external errors
  • hopefully an explanation of how the new tool works ^^
  • best practice examples
  • maybe bibliography 


Related: 


PROPERTY

  • easier to find compared to the discussion page of the property (most reachable way)


Ways to implement this

  • at least 2 properties (email, website)
  • 1 property (string value), potentially with qualifiers
  • use 2 existing properties with qualifiers
  • the simplest way to do this might be to create a link to an 'errors' wiki page, similar to https://de.wikipedia.org/wiki/Wikipedia:GND/Fehlermeldung (see last comment)
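If the single-property route with a URL datatype were chosen, the resulting statement would carry an ordinary Wikibase URL snak. A sketch of the JSON shape such a statement takes in the Wikibase data model; the property ID P9999 and the URL are placeholders invented for illustration, not real values.

```python
# Shape of a statement a hypothetical "mistake reporting URL" property
# could carry on a property's page; P9999 and the URL are placeholders.
error_report_statement = {
    "type": "statement",
    "rank": "normal",
    "mainsnak": {
        "snaktype": "value",
        "property": "P9999",      # hypothetical property ID
        "datatype": "url",
        "datavalue": {
            "type": "string",     # URL values are stored as plain strings
            "value": "https://example.org/report-errors",
        },
    },
}
```

The two-property option (email plus web form) would look the same, just with two property IDs and an `email` vs. `url` datatype split.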

TOOL


some features of a possible new tool:

  • allows access for Wikidata users and for interested institutions managing the databases on which one or more external IDs on Wikidata are based

  • two types of reports from Wikidata to the external databases:
    • Wikidata users can report an issue manually, through the tool interface or directly from a Wikidata item through a dedicated gadget
    • an automatic system periodically harvests constraint violations and creates an automatic issue for each of them
      • the automatic system should also harvest statements referenced with the database but deprecated with qualifier P2241: Q29998666 (reason: error in referenced source)
  • the institution can solve reports manually: each employee of the institution can log in and solve issues
  • possible improvement: give the institution some possibility to solve reports semi-automatically as well


=> primary goal should be a structured way of reporting (equivalent to the Mismatch Finder, but for the external sources)

  • Wikidata editor goes to property in the tool and adds an entry about a mistake

-> the institution can access this and work on it

+

  • would spare editors time (currently it's complicated: e.g. language difficulties, different processes)
  • would make it more likely that the institution can and will work on the issues (more practical than emails)
  • everyone could see the number of unsolved errors per external data source (more public than emails)
  • maybe we can use some work from the Mismatch Finder? or maybe it is more like Phabricator?
    • the Mismatch Finder also has external errors to report back (if something was reported as an error)


Structure

  • Property
    • List of issues
      • External ID
      • WD Item
      • Status (can be changed by institution)
      • Comment by the reporter (why do we think this is wrong; automatic or manual)
  • Maybe something to report systematic errors (that exist in a group of values of the external ID)
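The structure outlined above can be sketched as a small data model. All names and the status values are illustrative, not a finished schema; the status workflow in particular is an assumption about how an institution might process reports.

```python
from dataclasses import dataclass, field
from enum import Enum

class Status(Enum):
    # Possible workflow states; the institution would move reports along.
    OPEN = "open"
    IN_PROGRESS = "in progress"
    RESOLVED = "resolved"
    REJECTED = "rejected"

@dataclass
class MistakeReport:
    """One entry in a property's list of issues."""
    external_id: str               # the external ID value that looks wrong
    wd_item: str                   # QID of the Wikidata item carrying it
    comment: str                   # why we think this is wrong (automatic or manual)
    status: Status = Status.OPEN   # can be changed by the institution

@dataclass
class PropertyIssueList:
    """All reports collected for one external-ID property."""
    property_id: str                                        # e.g. "P227"
    issues: list = field(default_factory=list)              # MistakeReport entries
    systematic_errors: list = field(default_factory=list)   # free-text notes on group-level errors
```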


Mismatch table


Gadget

  • a button for each external ID value: "add mistake report"
  • should allow entering a manual report



Needs research on what the institutions need to work on this. 


❓ Questions and discussions


Q.: Should old errors remain on Wikidata (with a suitable deprecation reason) after they have been changed/fixed in the referenced source?
(Myself, I think they should, especially if the reference has an access date; but the tool should be aware of this)

Comment: The simplest property would just be a link to an existing 'errors' wiki page. The property would be useful even if only to track which properties have these.

☑️ Next steps