Wikidata talk:Requests for comment/Mapping and improving the data import process

New section with general tasks?

Hi

Does there need to be a new section with general tasks and best practices, e.g. translating pages and making sure pages are well linked together?

Thanks

--John Cummings (talk) 11:59, 12 December 2017 (UTC)

Yes, I think a best practice area will be an important addition. It may be that we need to embed this kind of thing all the way through the documentation, but it will probably need its own section as well. NavinoEvans (talk) 00:17, 15 December 2017 (UTC)
Perhaps this could also include general goals and principles, e.g. lowering the barrier to entry for contributing, providing a high-quality service to data partners comparable with other popular websites, etc.? --John Cummings (talk) 18:23, 15 December 2017 (UTC)
Sounds like a useful addition to me. NavinoEvans (talk) 14:25, 16 December 2017 (UTC)

Similar work for Structured Commons

Hello John, Nav and others! For Structured Data on Wikimedia Commons, we are working on similar things (outlining upload processes - both for Commons and Wikidata - and showing gaps and pain points). It seems useful to me to align our efforts. You can check what I've done so far (with quite a bit of input from both the Structured Commons team and community members) in the second, third, fourth and fifth tabs of this spreadsheet. Would there be a way to co-ordinate our efforts? All the best! SandraF (WMF) (talk) 12:25, 13 December 2017 (UTC)

SandraF (WMF), yes, absolutely, send me an email with a good time to talk :) --John Cummings (talk) 15:20, 13 December 2017 (UTC)
Update: SandraF (WMF), John Cummings and I had an online meeting to discuss the overlap between our work on the data import process and the requirements for Structured Commons. We've identified several areas where we can work together and share work to avoid duplicating effort. See the meeting notes here. NavinoEvans (talk) 12:48, 21 December 2017 (UTC)

Points of contact

A Wiki project or task force that can be the first contact point for external organisations who need to find out which parts (if any) of their data set are notable enough for Wikidata. Ideally we need a non-wiki alternative for this, like a mailing list email address or even a person who can be contacted by phone/Skype.

The problem with using a non-wiki alternative is that it can't provide consensus that the Wikidata community approves of a dataset being imported in the way an on-wiki solution can.

I could imagine a process that works like Wikimedia Commons photo submission. Alice, who is interested in importing data, sends an email to partnership@wikidata.org, which produces an OTRS ticket. Volunteer Bob decides to handle Alice's request and creates the necessary on-wiki discussions, meaning bot approval requests, property proposals for external IDs and Data Hub entries.

One advantage of this is that the team behind "partnership@wikidata.org" will also be able to send outgoing email from that address, making it possible to proactively engage with organisations and cooperate with them.

The problem with such a solution is that it requires volunteers who are willing to put in the effort. At the moment the number of volunteers who are interested in getting property proposals or bot requests to move forward and find consensus is quite limited. Maybe this problem can be solved by letting people explicitly sign up to be responsible for handling tickets. ChristianKl () 20:42, 13 December 2017 (UTC)

Perhaps one way to address this is to have a well-structured and publicly recorded process for data imports where the partner makes the request (which is very easy to do) and can then be asked questions by a group of people? The issue I can see with having a single person is that it would require them to understand the entire upload process, and it would also create a single point of failure. --John Cummings (talk) 13:05, 14 December 2017 (UTC)
That gets you the status quo, where nobody really feels responsible and you have bot approval requests and property proposals that can take months to get a decision. ChristianKl () 13:42, 14 December 2017 (UTC)
What are some ways to mitigate these issues? Relying on one person to know all the steps to upload a dataset correctly and follow through with it does not (at the moment) seem realistic. OTRS for Commons handles small, short-term tasks, whereas uploading a dataset often takes a long time. Bot requests and property proposals would need additional people even if one person promised to do the rest. --John Cummings (talk) 16:19, 14 December 2017 (UTC)
Bot requests and property proposals do need additional people, but that's a matter of asking/pinging additional people to comment. Having one person who feels responsible for bringing the conversation forward is very useful. ChristianKl () 17:04, 14 December 2017 (UTC)
P.S. I added your points about property proposals and bot requests taking a long time to the document. --John Cummings (talk) 17:02, 14 December 2017 (UTC)
Could you describe in more detail how you would see a solution with a mailing list working? ChristianKl () 21:17, 14 December 2017 (UTC)

The mailing list option I was thinking of would basically be an initial point of contact to help people with the first steps, subscribed to by a few volunteers (and/or Wikimedia staff) who understand the import process and can answer questions that get emailed to the list. Really it would just be another way of making initial contact with someone who knows where to start. Your initial suggestion sounds like a very good approach to me:

"I could imagine a process that works like WikiCommons photo submission. Alice who's interested to import data sents an email to partnership@wikidata.org which produces an OTRS ticket. Volunteer Bob decides to handle the request of Alice and creates necessary on-wiki discussions, which means "bot approval requests"/"property proposal for external ID"/"Data Hub entries". "

I think whatever system we end up with should have this basic setup: a request comes in and someone with wiki experience liaises with the community to check notability. If the request gets past this stage, we go ahead and create the data import hub entry. Of course, when the external person/organisation has the necessary wiki/data experience, they can manage it themselves. NavinoEvans (talk) 00:13, 15 December 2017 (UTC)

The difference between a mailing list and a ticket system is that one person can take ownership of a ticket. The OTRS default policy is that if one person answers a ticket, follow-up emails are by default also answered by the same person, and only if that person doesn't answer for 10 days is another person supposed to take over the ticket. Of course, the basic answer can also be that the person is supposed to write their own Data Hub entry/bot request/property proposal. ChristianKl () 13:36, 15 December 2017 (UTC)
Yes, I see what you mean. The mailing list is really a 'would be nice' extra for the process rather than something essential. The main point of the initial suggestion was to have a place where you can send a quick message without having to learn wikitext. It may be that just having a really good FAQ section would do the same job. NavinoEvans (talk) 14:14, 15 December 2017 (UTC)
A Flow (Structured Discussions) talk page could also do the job of providing a way to ask questions without needing to know wikitext. ChristianKl () 13:28, 16 December 2017 (UTC)
Good point, this could be the simplest way to address it for now. NavinoEvans (talk) 14:29, 16 December 2017 (UTC)

Re: Usage of data on other Wikimedia projects

When I read this section, I feel it doesn't address the complexity of our relationship with EnWiki. There are many issues involved and a lot of them aren't directly related to the topic of the data import process. It's related in the sense that organisations who donate data to Wikidata want that data to show up in EnWiki, but a lot of the issues are separate. The problem isn't just about EnWiki folks not understanding Wikidata. There are issues like reference URL (P854) being used with Wikipedia URLs (hopefully soon fixed via https://www.wikidata.org/wiki/Wikidata:Bot_requests#Move_reference_URL_(P854)_references_to_Wikipedia_to_Wikimedia_import_URL_(P4656)), complaints from EnWiki folks about Wikidata's lack of a BLP policy (hopefully soon fixed via https://www.wikidata.org/wiki/Wikidata:Living_persons_(draft)), and abuse of stated in (P248) to reference data contained in other Wikidata items (solution still needed).

There are valid concerns about dealing with vandalism that likely need ways to have the right content show up in Wikipedia watchlists ("right" meaning that only those changes that are visible in Wikipedia show up, and that all changes that are visible in Wikipedia show up). Additionally, editing via Wikipedia is also important for making it easier for Wikipedians to remove vandalism that comes via Wikidata when they see it.

One issue that might be worth discussing here is the modification of infoboxes. We do have the outdated https://www.wikidata.org/wiki/Wikidata:WikiProject_Infoboxes. It might be worth bringing it up to date and using it as a place where people who have questions about introducing Wikidata to Wikipedia infoboxes can ask for help. ChristianKl () 22:08, 13 December 2017 (UTC)

The graphics in https://www.wikidata.org/wiki/Wikidata:Wikidata_in_Wikimedia_projects are very pretty. Maybe it's worth adding a few pretty graphics to some high-traffic EnWiki pages so that we can use them as colorful examples of the value we provide? ChristianKl () 23:06, 13 December 2017 (UTC)

ChristianKl, thanks very much, I made some additions (diff), please feel free to add to or correct them. I think you're completely right that there is a lot of groundwork that needs to go into this. I guess this is partly to do with Wikidata being a newer project than e.g. en.wiki, but we have an opportunity to do things well and learn from where other projects have had issues. --John Cummings (talk) 13:15, 14 December 2017 (UTC)
@ChristianKl: We have a policy: see Help:Sources, but people don't want to follow rules. And few people want to add more constraints to data imports or data curation. WD is still a kind of libertarian movement which doesn't accept strong constraints. Snipre (talk) 23:19, 16 December 2017 (UTC)
Help:Sources isn't a policy page. Policy pages say at the top that they are policy pages. New policy needs an RfC, not just someone writing up how they want sources to look. ChristianKl () 23:27, 16 December 2017 (UTC)

To be added to Maintain data quality

Suggestions from Magnus Sälgö

Get external actors with trusted information to adopt the semantic web:

  1. make them understand that Wikidata trusts them and tries to "mirror" that data into Wikidata/Wikipedia
  2. that they deliver a machine-readable solution
  3. that Wikidata has a framework for quality checks:
    1. checks are done regularly
    2. the process is transparent:
      1. readers of the data can see that a given fact has data source xxx
      2. xxx is a "trusted" source --> it has a quality process and delivers quality data
    3. differences in data between Wikidata and xxxx are documented:
      1. warnings are displayed when we have a mismatch; this mismatch should also be visible in the Wikipedia infobox, i.e. the data displayed does not match a trusted source

--John Cummings (talk) 08:08, 15 December 2017 (UTC)

  • I don't see a good reason why a Wikipedia infobox should work that way. Wikidata infoboxes are supposed to be simple in how they display data.
Signed edits would be a better solution. ChristianKl () 15:59, 15 December 2017 (UTC)
  • >> I don't see a good reason why a Wikipedia infobox should work that way....
    • It's about trust: showing data without sources is an en:Anti-pattern. Today on the internet you don't trust data if it's not from an authoritative source. The basic idea of the semantic web was that you should be able to choose which sources you trust. Wikidata is doing it the other way around:
      • it gathers sources of very different quality
      • "mass uploading" doesn't check for constraint violations
      • many people think the uploader is not responsible for quality-assuring that the added data follows constraints
      • in Wikidata we normally don't document why a statement has a different rank or why we add another date with a different precision
      • data in Wikidata that has not been quality-checked is displayed directly in Wikipedia infoboxes
The above suggestion is about how to build trust in Wikidata's data by checking it against trusted authoritative sources and also having an "ecosystem" to ensure that the data in Wikidata matches those sources. - Salgo60 (talk) 12:59, 29 December 2017 (UTC)
Suggestions for how to move "forward" in the direction of trust:
  1. Wikidata needs to help the reader understand which sources have "quality"
  2. This quality needs to be "communicated" to the reader, and the person using SPARQL to query needs an easy way to retrieve the facts based on the best sources (the ranking concept is good but has proven not good enough when people add facts without setting a rank)
  3. As Wikipedia has the wikibase:badge approach to "rank" articles, the same is needed for sources
  4. Example of how to rank sources:
    1. If the source has:
      1. a well-documented quality process
      2. a reputation for quality over recent years
      3. people skilled in the field who rank it as very trustworthy
      4. facts based on primary sources
      5. primary sources that are mentioned and can be checked
      6. ....
  5. Wikidata queries need a new, easy-to-use statement type like wdt: but one that returns the best data value from the best-ranked source. Maybe this is in the scope of WikiCite?!
- Salgo60 (talk) 23:30, 2 January 2018 (UTC)
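
For reference, here is a minimal Wikidata Query Service sketch of how ranks and stated sources can already be retrieved today; date of birth (P569) is used purely as an example property, and a "best value from the best-ranked source" shortcut of the kind suggested above would still have to be built on top of something like this:

SELECT ?item ?value ?rank ?source WHERE {
  ?item p:P569 ?statement .                                       # full statement node, not just the "truthy" wdt: value
  ?statement ps:P569 ?value ;
             wikibase:rank ?rank .                                # Preferred / Normal / Deprecated
  OPTIONAL { ?statement prov:wasDerivedFrom/pr:P248 ?source . }   # stated in (P248), if the reference has one
}
LIMIT 100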

Primary sources tool

Hi everyone,

I'm really glad to come across this effort, since it happens to be in sync with the StrepHit project agenda. It is part of the work package on the Wikidata:Primary_sources_tool (cf. task D2) to contribute a data release tutorial for third-party data providers.

If you agree, I'll be pleased to add a section explaining how to add datasets to the primary sources tool, which is almost ready to accept imports in QuickStatements and Wikidata RDF formats. What do you think? Just let me know.

Cheers,
--Hjfocs (talk) 09:42, 15 December 2017 (UTC)

Hi Hjfocs, thanks very much. So as far as I understand it, StrepHit is a method of importing data into Wikidata from external sources (where data isn't in a structured format) and could be used to improve data quality by adding references to existing statements? Is this right? --John Cummings (talk) 13:20, 15 December 2017 (UTC)
This is great! Thanks a lot for the suggestion, this should definitely be in the import guide. John Cummings, if I understand correctly it would be for statements as well as references :) There are lots of data sets that would be easier to import this way (e.g. when human checking is needed on each statement). NavinoEvans (talk) 14:22, 15 December 2017 (UTC)
I've added this to section 5, 'Data importing', on the project page. It's part of the breakdown of "Complete list of tools that are needed/useful for data imports, with detailed instructions on how to use each one", so I've also added the other main tools that sprang to mind (which all need varying degrees of work on documentation). NavinoEvans (talk) 14:42, 16 December 2017 (UTC)

Quality control and autocompletion powered by knowledge embedding

This is a really interesting RFC. As my research at school involves knowledge representation, a knowledge-embedding model seems able to help predict a possible item given a certain relationship and item. For example, given Douglas Adams and instance of, the model predicts that he is a human. It might also be useful for the data importing process. What do you think of this idea? --Fantasticfears (talk) 11:39, 15 December 2017 (UTC)

Thanks Fantasticfears, do you have any sources I could look at to understand this model better? At the moment I don't understand how this works. Thanks, --John Cummings (talk) 13:16, 15 December 2017 (UTC)
  • If there's an algorithm that produces decent output, the outputs could be put into the Primary Sources tool. I think the usefulness would be greater for manually created items than for items that are already imported in bulk. If you import 10,000 artists from your database, you can easily set them all to instance of (P31) human (Q5).
Another way to put such technology to use would be to offer suggestions of possible values. Currently we do have property suggestions, but once the user enters a property, there are no suggestions about the values they might pick. It would be very valuable to have good suggestions. ChristianKl () 13:54, 15 December 2017 (UTC)
This sounds brilliant, I'm also interested to read more about it. Presumably it could also be easily adapted to highlight possibly incorrect data? NavinoEvans (talk) 14:31, 15 December 2017 (UTC)
We have two automated ways to determine possibly incorrect data: there's an algorithm that scores edits for whether or not they are correct, and we have our system of property constraints. I don't think it will be easy for something externally developed to be practically used to highlight possibly incorrect data. ChristianKl () 16:08, 15 December 2017 (UTC)
ChristianKl, can you link to these two automated ways please? Thanks, --John Cummings (talk) 16:11, 15 December 2017 (UTC)
To give links: (1) our property constraint system: https://www.wikidata.org/wiki/Wikidata:WikiProject_property_constraints and (2) ORES: https://www.researchgate.net/publication/314080995_Building_automated_vandalism_detection_tools_for_Wikidata . ChristianKl () 16:34, 15 December 2017 (UTC)

Adding labels, descriptions and aliases to entities from a dataset

Dear Mr. John Cummings,

I thank you for inviting me to give my opinions about this important topic. When I attended the AICCSA 2017 conference, I found that there are many datasets that include labels of entities in Arabic dialects. However, automatically adding them to Wikidata using a bot is not at all practical or easy. I ask if there is a tool (like Descriptioner or QuickStatements, not a bot) that allows adding labels, descriptions or aliases to Wikidata entities when their Wikidata IDs (beginning with a P or a Q) or their labels in a particular language are known. --Csisc (talk) 10:07, 17 December 2017 (UTC)

QuickStatements should be able to add labels, descriptions and aliases. From its description:
To add a label in a specific language to an item, use "Lxx" instead of the property, with "xx" as the language code.
To add an alias in a specific language to an item, use "Axx" instead of the property, with "xx" as the language code.
To add a description in a specific language to an item, use "Dxx" instead of the property, with "xx" as the language code. ChristianKl () 14:01, 17 December 2017 (UTC)
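
For illustration, a minimal QuickStatements batch of this kind might look like the following sketch (V1 command format, with TAB-separated columns; Q4115189 is the Wikidata sandbox item, and "aeb" is assumed here to be the language code for Tunisian Arabic):

Q4115189	Lfr	"label text in French"
Q4115189	Laeb	"label text in Tunisian Arabic"
Q4115189	Aen	"an alias in English"
Q4115189	Den	"a description in English"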
ChristianKl: Do you mean that if I have an Excel file containing words in French and their synonyms in Tunisian Arabic, I can add the synonyms as labels to Wikidata entities using QuickStatements? --Csisc (talk) 14:25, 17 December 2017 (UTC)
Yes, that should be possible. ChristianKl () 14:26, 17 December 2017 (UTC)
ChristianKl: Does this work even if the Wikidata IDs of the entities do not exist in the dataset (just words in one language in the first column and their synonyms in another language in the second column)? --Csisc (talk) 22:49, 17 December 2017 (UTC)
If the Wikidata ID doesn't exist, you would need to create a new Wikidata item. Unfortunately, a Wikidata item that has just a label in two languages and no statements is not considered notable according to our notability policy. ChristianKl () 00:22, 18 December 2017 (UTC)
To add to this, you only need one statement (ideally with a reference) to stop this from happening, e.g. 'instance of' = 'human' is acceptable for a person. --John Cummings (talk) 09:30, 18 December 2017 (UTC)
ChristianKl, John Cummings: I think that you have not understood what I meant. I meant that our dataset includes labels of Wikidata entities in English and labels of the same Wikidata entities in South Levantine. It does not include the Wikidata IDs of the entities. The required tool should search for the entity corresponding to the English label and add the label in South Levantine. --Csisc (talk) 19:51, 21 December 2017 (UTC)
I think I understand you. You can do that as long as there are existing items with that English label. There is a way to match an English label to a Wikidata ID (John Cummings can tell you better than me about the best way to do this).
Unfortunately, for those cases where there is no existing item with that English label, you can't simply create a new item as long as the new item would just contain labels in multiple languages and no referenced statements. ChristianKl () 20:09, 21 December 2017 (UTC)
ChristianKl: This will not be a problem. If an item with a particular English label does not exist, we can create it. The most important point is that the tool we need adds South Levantine labels to Wikidata entities given their English or French labels. Also, the required tool should use a table of English words and their South Levantine equivalents as the input dataset. --Csisc (talk) 22:54, 21 December 2017 (UTC)
Creating a new item isn't just a matter of saying there should be an item for "appreciating" or "appreciation". You will need to add statements with references to the item for the sake of Wikidata's notability policy. For arbitrary words in a dictionary, that's not a task that you will be able to do without human attention to every individual item that's created. ChristianKl () 01:29, 22 December 2017 (UTC)
ChristianKl: I know. If we create new entities, we will have to add statements and references to them. However, this is not our purpose for the moment. The most important thing for us now is to create the tool I talked about: a tool that retrieves the Wikidata IDs of the entities whose English labels are in an input table and that adds South Levantine labels to those entities from the same table. --Csisc (talk) 11:51, 22 December 2017 (UTC)
You can use OpenRefine for this and use the Wikidata reconciliation feature. However, as has been pointed out before, it would be unwise to just ignore the classes in your terminology document. It's not just about translating labels, it's about applying the right label to the right item. Words tend to be polysemic (fr:canapé --> en:sofa; fr:canapé --> en:appetizer). --Beat Estermann (talk) 12:28, 22 December 2017 (UTC)
Beat Estermann: Luckily, we avoided matching polysemic words in our dataset. That is why I am sure that OpenRefine will return precise results. However, we will certainly add statements as conditions if we do match polysemic words. Thank you for your helpful answer. --Csisc (talk) 13:57, 24 December 2017 (UTC)
As far as polysemic words go, it's worth noting that Wikidata includes poems and songs whose name is sometimes just one word. In those cases the name is mostly capitalized, but it's worth keeping in mind that a lot of our dataset consists of proper names that aren't words in the traditional sense. ChristianKl () 14:09, 24 December 2017 (UTC)
ChristianKl: Thank you. I will consider that. --Csisc (talk) 21:09, 24 December 2017 (UTC)

General feedback about the import process as a whole and the focus of the project

Thanks, John, for taking this forward. I've finally gotten around to having a look at the proposed improvements of the data import process.

Looking at the entire process, the following points should receive more attention in my opinion:

  • Ontology development: implementation of the target data model in Wikidata;
  • Ensuring the consistency of class and property descriptions across the various languages;
  • Data cleansing (both in the source file and in the target environment on Wikidata - every time I do a new kind of data ingest, I find plenty of inconsistent data already sitting on Wikidata);
  • Data maintenance plan: Which kinds of updates are expected in the future? How should they be dealt with? Will there be a permanent relationship with the data provider, with a corresponding community, or should the ingest be handled as a one-time data import?

Furthermore, I think we need to approach this whole issue of data imports in a more systemic manner in order to get the prioritization of the efforts for improvement right.

I've now been personally engaged in data import activities for about one year and a half; the main conclusion I'm drawing so far is that it is not about getting many open data sets that can be ingested by the existing community, but about building communities that are interested in building and curating a common pool of data (and the corresponding data models) in specific topic areas. - I believe that this would be the main thing to get across - both within the existing Wikidata community and with regard to external data providers.

The second most important point is that we need to get the data to be used on Wikipedia in order to get a feedback loop between data provision and data use. But here again, I guess the key to success will lie in building communities around topics. And we need to be pragmatic: If the English and German Wikipedia communities prove reluctant to include the data, go with the Catalan and others first.

By the way, I have recently been working on a White Paper on Linked Open Data, with a special focus on Swiss public-sector organizations and heritage institutions, which provides an overview of the data publication process. Maybe you want to have a look; feel free also to leave your comments in the Google Doc.

--Beat Estermann (talk) 08:14, 22 December 2017 (UTC)

Thanks very much Beat, this is really useful, please do add more if you think of it. I think I need to rewrite the introduction a little to make it clear that the purpose of this document is to envision where we ideally want to get to, and then spend time later breaking it down into very granular steps. --John Cummings (talk) 16:04, 22 December 2017 (UTC)

Comments on Section 1 (Provide information about Wikidata to data partners)

  • "they do not have to speak English to work with Wikidata": I'm not sure whether this is the right message to send out at the moment; English is the de facto lingua franca on Wikidata, and we are presently not equipped to change this anytime soon. In order to manage expectations in a reasonable way, it should be specified what can be done on Wikidata without a good working knowledge of English and what cannot.
  • "A way for data partners to get consistent and professional support for working with Wikidata." - This is something we need to build up outside Wikidata. I'm happy to share my thoughts and first hand experiences related to that. A first step could consist in holding bi-monthly online-meetups for people providing that kind of support. My impression is that we are presently only talking about a handful of people who are actually engaged in this. If we wanted to build up a broader network of Wikidata-savvy consultants proactively, this would probably require some substantial funding.

--Beat Estermann (talk) 08:24, 22 December 2017 (UTC)

English is the language for policy discussions, but that doesn't mean it's required for someone who submits data that can be expressed with existing properties. A property proposal about an external ID can also be successful when its author doesn't write it in English.
The French project chat is relatively active. The German project chat is also active enough that a person can ask for help and get a good answer. Russian might also work, but the time for getting an answer might be longer. ChristianKl () 13:23, 22 December 2017 (UTC)

Comments on Section 2 (Identify data to be imported)

  • Another approach would consist of referencing open datasets within the Wikidata structure itself by means of the DCAT vocabulary. See Wikidata:WikiProject_Datasets for a start. We could then send some data scouts out to systematically pull data from open data catalogs. This should however be planned carefully, because there is at present much more data available than the community can handle. - Therefore, identifying data should go hand in hand with building a community around the data.

--Beat Estermann (talk) 08:25, 22 December 2017 (UTC)

Many thanks for this. It looks like Wikidata:WikiProject_Heritage_institutions/Data_sources is doing the same job as the data import hub, but focused on a specific domain. Ultimately we want the data import hub to be good enough that you can use it for tracking upcoming, current and past imports instead of needing a separate area. We can use existing areas for reference as we make improvements.
Regarding the idea of using an external vocabulary and data scouts for systematically pulling data - IMO this is a very important area to look at, but agreed it all needs to be planned very carefully. I think it should go in the technical/tools area of the to-do list for the medium to long term. NavinoEvans (talk) 16:50, 30 December 2017 (UTC)
Moin moin everyone, a long time ago there was a plan for the project in several phases, something like Phase 1 = bring the interwiki links to Wikidata, Phase 2 = bring data from infoboxes to Wikidata, Phase 3 = bring lists to Wikidata. However, I cannot find the page anymore. No matter - I found the principle good and well structured. I would say Phase 1 has been completed. But for Phase 2 we need a plan for which data is still missing and then a plan for how we can import this data as simply as possible. For this purpose, a task force should make a list of what data is still missing, which properties belong to each dataset, and what is still needed for the import. Or is there such a plan already? Regards from Germany --Crazy1880 (talk) 20:40, 9 February 2018 (UTC)

Comments on Section 5 (Data importing)

This is the most challenging and baffling part for me. If this could be streamlined, and documented in a way that is easy to find and easy to understand, with tools that are easy to use, it would be the greatest help for Wikisource.

Part of our problem at Wikisource is that we have to create data items of two very different sorts, one for works and one for editions (or translations). Often, we have to create many nearly identical items, for all the editions or for all the parts of a single work or series, but the current input process requires first creating a data item with only a few fields available for entry, then creating and adding each property individually, one at a time, which is a very slow and cumbersome process.

If there were customizable input tools for books / plays / etc., where a user could pre-set data item creation so that a whole series of known and expected properties was already in place, ready to simply have those bits of data added, it would enormously increase the efficiency of adding those items. I could work ten times faster with a tool like that. --EncycloPetey (talk) 01:27, 25 December 2017 (UTC)

Have you tried using a spreadsheet program in combination with QuickStatements? - See, for example, Editing Data on Wikidata in Spreadsheet Mode. - As you will probably need to add statements containing Q-numbers of items that you are newly creating (work items pointing to edition items, and edition items pointing to work items), you'll need to set up an iterative process that allows you to capture the Q-numbers of newly created items (e.g. from your contributions log or via a SPARQL query).
One could of course imagine better tools to support this, e.g. by adding functionality to QuickStatements or a similar tool that lets you download the ingested data in spreadsheet format along with the respective Q-numbers of newly created items. Or maybe the ingest tools will at some point support using variables (unique identifiers) in the source data file, and the tool will take care of the interlinking between newly created WD items. --Beat Estermann (talk) 10:44, 26 December 2017 (UTC)
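
To make the suggestion above a little more concrete, a hedged QuickStatements sketch for creating one edition item and attaching a few statements might look like the following (V1 command format; Q50000000 is a placeholder for the existing work item, Q3331189 is assumed to be the "version, edition or translation" class, P629 = edition or translation of, P577 = publication date):

CREATE
LAST	Len	"Example Edition Title (1905 edition)"
LAST	P31	Q3331189
LAST	P629	Q50000000
LAST	P577	+1905-00-00T00:00:00Z/9

The reverse link from the work item (has edition or translation, P747, pointing at the newly created Q-number) is the part that needs the iterative Q-number capture described above, since that number is only known after the item has been created.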
Your response suggests (1) that you did not understand my request, and (2) that I have lots of software and experience using that software. I.e.: "Have you tried [jargon involving a platform and OS I don't have]". I don't have QuickStatements, and wouldn't have the slightest idea how it would help enter the data into WD. I do not have any database program or spreadsheets. And tracking all the relevant data via spreadsheets would not help if I didn't know the codes for the items to be linked.
What I need to be able to do is create a blank data item with a set of Property statements ready to load, rather than the current method of creating an empty data item, then having to manually select each property (AND its value) one at a time. If I could simply tell the system: "I will be entering values for THIS list of properties" and have all those property lines open, ready and waiting, then THAT is what would be helpful.
By contrast, your method would require me to do all the looking, copy-paste the data into a spreadsheet, then send all that information back, assuming that I had the software and enough experience to do that (which I don't). That method would be considerably slower for me. --EncycloPetey (talk) 21:06, 27 December 2017 (UTC)
So, I guess you are looking for a standard data input form for a given class of items. As far as I know, the tool that comes closest to that is the Monumental Tool with its editing function. But so far, it only works for entries about monuments. Having the possibility of creating customized input forms is definitely on my wish list, but I do not think we are quite there yet.
By the way, you don't need to install QuickStatements; it sits here and can be operated directly from within the browser. And mail-merge functionality can be found in various office suites; if you don't like Microsoft Office, then just use OpenOffice.org, which is available for Windows, OS X, and Linux (and a couple of other operating systems). - Whether you want to give the approach I suggested a try is obviously up to you. If you have a lot of data to enter, and especially if it is repetitive, investing a couple of hours to set things up and to learn how to operate the tools might save you a lot of time down the road... --Beat Estermann (talk) 23:44, 27 December 2017 (UTC)
As Beat Estermann mentioned, QuickStatements would be the least technical way to achieve that at present. Having a proper template creator is one for the to-do list at some point though. QuickStatements does require some basic spreadsheet skills for preparing the data, but complete beginners can get up and running relatively easily. Note you can also use Google Sheets or another free online spreadsheet if you don't have one installed. NavinoEvans (talk) 17:16, 30 December 2017 (UTC)
As I indicated before, spreadsheets will not speed up the process I am seeking to streamline. Spreadsheets do nothing related to the problem I am seeking to solve, so I am not sure why people keep pushing for spreadsheets. All spreadsheets would do is add an extra step to the process and slow down the procedure, as well as introduce an extra set of steps at which errors could occur. They can be used to organize information prior to entry, but my issue is with the process of entry into Wikidata, which is completely different. Spreadsheets are unhelpful and unrelated to the issue I have raised, and would make the process slower. So please stop filling this comment thread I started with mountebank hawkings on spreadsheets. Doing so obscures the actual request, and thereby does more disservice than help. --EncycloPetey (talk) 06:45, 31 December 2017 (UTC)
Hi EncycloPetey, I am working on such a tool and would like to understand your use case better. Can you describe in more detail the format of the data that you wish to import into Wikidata? Is it information stored in templates in Wikisource, maybe? − Pintoch (talk) 15:57, 23 February 2018 (UTC)
No. In most cases the information is not (yet) stored anywhere. And you might ask, "Why not store those properties at Wikisource/Commons/etc. first, and then import?" But much of what Wikidata wants is not ever stored on any other project. On Wikisource, we seldom (or never) mark instance of (P31), genre (P136), has edition or translation (P747), language of work or name (P407), etc. So this is not an issue of "importing" data from another location, but of adding it de novo.
"Import" may be too generous a word for what I would like to do. What I need is to be able to set (and adjust) a custom set of properties like I were in an edit window. So, when I create a new item, I can let the server know that it is a "book", and a set of expected properties will appear that would normally be added for such an item. Likewise for an "edition", a "translation", a "play", a "1911 Encyclopædia Britannica entry", etc. There are certain sorts of items regularly added by Wikisource editors, and properties normally available to be added, but in most instances that data is not yet stored anywhere. Hence, the need for a customizeable edit window with the option to set certain properties. Ideally, several common sets of properties could be set, and selected with a since toggle identifying which set of properties is wanted, based on the type of item. That way, the whole group of properties would not have to be reset each time. --EncycloPetey (talk) 19:36, 23 February 2018 (UTC)[reply]
Okay, so if I rephrase this in my language, I think it would be something like "provide a customized form to create an item of a certain type, with fields for each commonly used property." I agree that is something that could be useful (but yes, it's not strictly speaking about data import). − Pintoch (talk) 18:01, 24 February 2018 (UTC)[reply]

Comment on section 5

One should consider asking importers to supply reliable sources before importing their data, or, if the import comes from another Wikimedia project, to draw on the sources used for that information on the Wikimedia project. Data that isn't supported by anything isn't really useful for anyone and may open Wikidata up to manipulation attempts. Jo-Jo Eumerus (talk, contributions) 09:24, 25 December 2017 (UTC)

Any data user who doesn't want to use data that's imported from a Wikimedia project can filter out data that is only referenced by imported from Wikimedia project (P143). Many of the Wikipedia infoboxes that import data from Wikidata do so. ChristianKl () 12:15, 25 December 2017 (UTC)
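
For data re-users querying through the Wikidata Query Service, such filtering might look like the following sketch (date of birth, P569, is used purely as an example property):

SELECT ?item ?dob WHERE {
  ?item p:P569 ?statement .
  ?statement ps:P569 ?dob ;
             prov:wasDerivedFrom ?reference .
  # keep only statements with at least one reference that is not merely
  # "imported from Wikimedia project" (P143)
  FILTER NOT EXISTS { ?reference pr:P143 ?wikimediaProject . }
}
LIMIT 100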
I don't think we can expect every user to realize this, however. And sometimes importing the references along with the data is useful. Jo-Jo Eumerus (talk, contributions) 21:30, 25 December 2017 (UTC)
I agree, at least partly. Any larger data ingest into Wikidata should start by asking which are the reliable data sources for that ingest. If there are already pre-existing databases, we should try to tap them and properly reference them, and ideally get the data holders to cooperate with Wikidata in order to ensure the long-term maintenance of the ingested data. Still, there may be topic areas for which there is no authoritative database, but data can be scraped together from various websites, including Wikipedia. In this case, it could still make sense to ingest the data in order to get a first overview. Of course, someone will at some point need to sift through the data entries and add references. This may often require additional research. Hence the importance of creating communities around the ingested data. --Beat Estermann (talk) 10:53, 26 December 2017 (UTC)
I agree with Beat Estermann. I think we should strive for good references from the outset, but not enforce it as a rule. There definitely needs to be some proper guidance on this added to the data import pages. NavinoEvans (talk) 17:05, 30 December 2017 (UTC)

Methods to automatically add labels to proper entities in Arabic dialects

When I analyzed Wikidata (while participating in the AICCSA 2017 conference), I found that there are no labels for proper entities (people, places, trademarks, events...) in Arabic dialects. However, there are two rules that can be used to automatically add labels to proper entities in Arabic dialects when their labels in Modern Standard Arabic are known:

  • The label of a proper entity (person, place, trademark...) in Modern Standard Arabic is the same as that entity's label in the following Arabic dialects: South Levantine Arabic (ajp), Gulf Arabic (afb), Hejazi Arabic (acw), Najdi Arabic (ars), Hadhrami Arabic (ayh), Sanaani Arabic (ayn), Ta'izzi-Adeni Arabic (acq), and Mesopotamian Arabic (acm).
  • Labels of places and people (if they do not hold another citizenship) from Palestine, Jordan, Syria, Iraq, Kuwait, Yemen, Oman, Bahrain, Qatar, UAE, Saudi Arabia, Sudan, Djibouti, Comoros, Somalia, and Mauritania are the same in all Arabic dialects as in Modern Standard Arabic.

I ask if you can build bots or use QuickStatements for such purposes. This will help improve the situation of Arabic dialects in the Wikidata database. --Csisc (talk) 12:26, 27 December 2017 (UTC)
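
As an illustration of how candidate items for such a run could be listed, a query sketch along these lines could be used (the class and country here are examples only - humans with Saudi Arabian citizenship, per the second rule above):

SELECT ?item ?arLabel WHERE {
  ?item wdt:P31 wd:Q5 ;            # instance of: human
        wdt:P27 wd:Q851 ;          # country of citizenship: Saudi Arabia
        rdfs:label ?arLabel .
  FILTER(LANG(?arLabel) = "ar")    # has a Modern Standard Arabic label
  FILTER NOT EXISTS {              # ...but no South Levantine Arabic (ajp) label yet
    ?item rdfs:label ?ajpLabel .
    FILTER(LANG(?ajpLabel) = "ajp")
  }
}
LIMIT 100

The missing labels could then be written back with QuickStatements "Lajp" commands or by a bot, following the rules above.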

@Csisc: Hi, yes, it would make sense to automate this. You can file a bot request on the bot request page. Alternatively, if you want to give it a try yourself, you could also use common office tools and QuickStatements for this purpose (see: how to edit data in spreadsheet mode). --Beat Estermann (talk) 13:16, 27 December 2017 (UTC)
Beat Estermann: I do not think that using QuickStatements would be a good idea, as the number of items is large. I will request a bot. Thank you. --Csisc (talk) 13:23, 29 December 2017 (UTC)

Notifications on other Wikimedia projects

I have notified other Wikimedia projects here:

--John Cummings (talk) 14:43, 28 December 2017 (UTC)

Improving trust of Wikidata on other Wikimedia projects through a better data import process

I've gone through previous discussions on other Wikimedia projects about the use of Wikidata data, outlined at Wikidata_talk:Wikidata_in_Wikimedia_projects, and found the following:

A better, easier, more transparent and well-documented data import process has the potential to increase trust between contributors to Wikidata and contributors to other Wikimedia projects. Wikidata can increase trust with other Wikimedia projects when importing datasets through:

  • Providing clear information of how data is added to Wikidata and where it comes from.
  • Having a link between an item and the data import record for the data on the item.
  • Having a much higher percentage of referenced statements; this could be done by making it policy to reference statements wherever possible when importing datasets.
  • Being transparent about quality controls on Wikidata data
  • Being transparent about the data import process with documentation of decisions made when importing a dataset
  • Generally making Wikidata easier to learn through better documentation (e.g. more Wikidata tours)
  • Collating catalogues of datasets on different subjects has value in and of itself, and doing so is essential to understanding topic completeness. One way to do this is to have a Wikidata item for each database with the number of items in the database as a statement, allowing users to compare this against the number of items on Wikidata to understand completeness (see the query sketch below).

--John Cummings (talk) 19:00, 16 January 2018 (UTC)
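
As a sketch of the completeness check mentioned in the last point above, the number of Wikidata items carrying a given catalogue's external-ID property can be counted and compared with the record count the catalogue itself reports (Art UK artist ID, P1367, is used here purely as an example):

SELECT (COUNT(DISTINCT ?item) AS ?itemsOnWikidata) WHERE {
  ?item wdt:P1367 ?artUkId .    # items that already carry the catalogue's external ID
}

Dividing that count by the catalogue's stated number of records gives a rough coverage figure for the topic.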

Using systematic MnM searching for improving data quality (item #6)

As well as using Mix'n'match to help in the initial matching of a dataset, it may also be worth noting that Mix'n'match in 'search' mode can be a powerful tool for enriching a dataset and/or finding duplicates, especially when used on items that one might expect to have an external link but that don't have one.

I first became aware of how useful this could be when I started systematically doing MnM searches for Art UK painters that didn't have an RKD identifier, using pages like this one, which also functions as a classic 'redlink' check page, to drive the process.

That page had/has to be updated by a bot running a bespoke script, but User:Multichill has taken the process to a higher level, by using Listeria to automatically generate and re-generate such pages, producing pages such as Wikidata:WikiProject_sum_of_all_paintings/Creator_no_authority_control, for paintings in leading collections, whose creators appear not to have any of the standard authority control identifiers. Using a page like this to drive targeted MnM searches allows all (or at least, very many) such identifiers to be looked for, all at once.

Similarly, here's a page I've just created for members of the Royal Academy in London, who don't have all of VIAF, ULAN, RKD and Art UK identifiers, again creating a list of MnM searches to be run.

Multichill's search is very nice because, as soon as a search has been run that has found at least one external ID, the name will drop out of the list automatically at the next Listeria update. But even when that is not required, the MnM column is still valuable, particularly if the set is being focussed on by a small set of volunteers (or perhaps only one). It would be nice if there were an easy way to cross names off as done (perhaps a "most recently searched" field), but even without that, this is still a powerful way to work through cross-matching the items in one's data import against all the catalogues in MnM.

I know that #6 does already mention querying for missing external IDs. But it seems to me that using lists to systematically drive this work through MnM is such a useful technique that it might be worth noting. Jheald (talk) 00:19, 17 January 2018 (UTC)
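
As a sketch of the kind of query that can drive such lists, the following finds painting creators lacking a VIAF ID (P214); the class and identifier are examples only:

SELECT DISTINCT ?creator ?creatorLabel WHERE {
  ?painting wdt:P31 wd:Q3305213 ;       # instance of: painting
            wdt:P170 ?creator .         # creator
  FILTER NOT EXISTS { ?creator wdt:P214 ?viafId . }    # no VIAF ID yet
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
}
LIMIT 200

Each result is then a candidate for a targeted Mix'n'match search across the relevant catalogues, whether driven by hand or via a Listeria list as described above.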

Minor update. I changed the Royal Academy Listeria query slightly, so that it now lists all RAs with a count of their number of external IDs, and also moved it to en-wiki, to better reveal RAs that don't have articles there. By sorting the table, one can choose RAs from a particular period, ones with the fewest IDs, ones without Commons categories, etc., to investigate. Jheald (talk) 23:36, 17 January 2018 (UTC)
Page moved again, to en:Talk:List of Royal Academicians/RAs, as the most appropriate and findable location. Links above changed to point to the new location. Jheald (talk) 00:05, 28 January 2018 (UTC)

Other ways to spread data to other Wikimedia projects after an upload (Item #7)

Item #7 currently looks at how other Wikimedia projects can draw data directly from Wikidata, and issues around that. But in the context of data importing, and mapping current practice, it's perhaps also worth recalling some other ways that improvements can feed into other Wikimedia projects after a data upload here, and how practically to do that at present. In particular, a comprehensive data import here may allow us on Wikipedia or other projects to identify:

  • missing articles
  • missing or incorrect categorisation
  • missing or incorrect uses of templates

Missing articles can be identified directly from SPARQL, by including e.g. the following clause in queries:

FILTER NOT EXISTS {
    ?article schema:about ?item .
    ?article schema:isPartOf <https://en.wikipedia.org/>.
}
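
Put together, a complete query of this kind might look like the following sketch (paintings, Q3305213, chosen purely as an example class):

SELECT ?item ?itemLabel WHERE {
  ?item wdt:P31 wd:Q3305213 .           # instance of: painting (example class)
  FILTER NOT EXISTS {
    ?article schema:about ?item .
    ?article schema:isPartOf <https://en.wikipedia.org/> .
  }
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
}
LIMIT 500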

However, for driving a project to create those articles over time, it probably makes sense to create an automatically updating list with Listeria, like the one for Royal Academy painters in the section above, either to show just the items with no article, or to show all items together with an 'article' column that can be sorted to show those that still need an article. One can highlight items without a Commons category in the same way.

Sometimes the articles for items in a particular set should have a particular categorisation on Wikipedia. In this case the tool of choice is PetScan. The target category can be put in on the 'Categories' screen of the tool (including a depth of subcategories if required). A sparql query for membership of the target set can be put in on the 'Other sources' screen. (Note that currently the query must not include the underscore character '_'). Filling in "sparql NOT categories" in the 'Combination' field lower down the screen will then produce a list of items without the desired categorisation; while "categories NOT sparql" in that field will give a list of items with the categorisation (incorrectly?) in the articles but not in the set according to Wikidata. Setting the option "From categories" for the 'Use wiki' field immediately above will give the list as Wikipedia articles, rather than items. Going to the 'Output' screen and setting the 'Format' option to "PagePile" and re-running will then open the list in the PagePile tool, from where it can be downloaded as a plain-text list of article titles. This can then be loaded into AWB, which includes an option to go through the list and add or remove a particular category.

Things are slightly more involved for Commons, because PetScan isn't very good at taking a set of Wikidata items and mapping them to corresponding Commons categories. To get around this, one can create the list of Commons categories from a SPARQL query, by including

    ?item wdt:P373 ?commonscat
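
A full query of this kind might be the following sketch (with painting again used as the example class):

SELECT ?commonscat WHERE {
  ?item wdt:P31 wd:Q3305213 ;      # instance of: painting (example class)
        wdt:P373 ?commonscat .     # Commons category (P373)
}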

The results from this can then be copy-and-pasted into the 'Manual list' section of the 'Other sources' screen on PetScan, and one can then proceed as before.

For missing templates, the 'Templates&links' screen on PetScan can be used to generate all pages with a particular template. PetScan currently can't combine this with a SPARQL query. But this can be worked around by going to the 'Output' screen and saving the list of templated pages as a PagePile. PetScan can combine PagePiles with SPARQL queries, so now specifying the PagePile ID in the 'PagePile' field of the 'Other sources' screen, and then "sparql NOT pagepile" in the 'Combination' screen, can be used to produce a list of items with articles that do not have the desired template. Note that the 'Use wiki' field should still be set to "From categories" to get the output to be article names rather than items, even though categories are not being used in the combination. AWB can then be used to work through the list adding the template, though care may be needed as to where to place it on the page. An approximate placing can often be achieved using AWB's 'Find and replace' function. For example, setting find = (==\s*External (L|l)inks\s*==\s*\n) and replace with = $1* {{TEMPLATE}}\n will try to add the template as a bulleted item at the top of the article's 'External links' section (if present) -- though one might then want to move it down the links list manually, before saving. Alternatively, sometimes it may be appropriate to add the template just before the first Category entry. This approach with AWB will work if the template needs no parameters, drawing all the data it needs automatically from Wikidata. Templates which do need different parameter values to be supplied for each article cannot be added in quite such an easy way. For such cases it may make sense to produce a list of the articles and fully parametrised templates, use AWB to work through the list, inserting an unparametrised template in its edit screen, and then copy-and-paste over the parametrised version before saving.

At any rate, that's my understanding of how this can all be done at present. If people have further hints or tips or alternative workflows that work well for them, it would be good to hear them and share them. Jheald (talk) 11:52, 28 January 2018 (UTC)

Thanks very much @Jheald:, this will all be super useful when documentation starts to be written on this stuff. I can imagine a page called something like 'How to use Wikidata on your wiki to understand subject coverage' (but much more succinct). --John Cummings (talk) 17:12, 19 February 2018 (UTC)

License

Every data import must explicitly name the license of the dataset to be imported and provide a citation for the license. This should help us ensure that Wikidata remains legally clean. If the license is not compatible with Wikidata, we should reject it. But it seems there have been cases where the importer suggests an import and no one ever checks whether the data is actually licensed compatibly. By adding the license information as a required step for a data import, we ensure that this is not missed. --Denny (talk) 15:15, 19 May 2018 (UTC)

Right now, the term license is not even mentioned on the project page, which is disconcerting :( --Denny (talk) 15:15, 19 May 2018 (UTC)
@Denny: On the other hand, you have been arguing on various lists, justifying imports from e.g. Wikipedia, that the licence is irrelevant if only facts are being imported.
But this is all probably for the birds, because I'm rather doubtful that even 1% of imports will bother with the rigmarole being developed on this page. Jheald (talk) 16:05, 19 May 2018 (UTC)
And I still do, for sources that are text. If the source is a database, then database rights also apply, and there the situation is unfortunately different. --Denny (talk) 21:36, 19 May 2018 (UTC)
I would like to see this too. --LydiaPintscher (talk) 21:14, 22 May 2018 (UTC)

Closing this RFC

This is currently in our backlog of open RFCs that have remained open for a long time without any activity in the past few months. The initial question, "What would the ideal data import process look like?", does not request a decision, so I am closing it as is. − Pintoch (talk) 17:52, 7 November 2018 (UTC)