Wikidata talk:Requests for comment/Wikidata to use data schemas to standardise data structure on a subject


General reaction

  •  Support I feel this is absolutely needed and the lack of it is one of the things that currently make Wikidata much harder to use than it needs to be. --Reosarevok (talk) 12:31, 23 October 2019 (UTC)[reply]
  •  Support Also, I predict that we will see patterns in schemata, maybe leading to unifications/optimizations. --SCIdude (talk) 13:25, 23 October 2019 (UTC)[reply]
  •  Oppose the feature has too many unaddressed defects and the only available tool is hardly any help. --- Jura 13:35, 23 October 2019 (UTC)[reply]
    • While I don't doubt it has a fair amount of defects, they can't be improved if not mentioned :) What do you mean by "the only available tool is hardly any help"? Cradle? I doubt it's of much use to experienced Wikidatans, but I'm sure it can be quite useful for beginners to have a clear set of defined properties to set --Reosarevok (talk) 13:51, 23 October 2019 (UTC)[reply]
      • Oh, it seems that this proposal ignores Wikidata entity schemas. My error. So I guess I oppose this as it wants to rollout a fifth or sixth approach for the same. --- Jura 14:08, 23 October 2019 (UTC)[reply]
        • I've just added something about Shape Expressions now. My bad, I was supposed to expand on John Cummings' initial text before it was shared around too widely. The main issue is that they can only be created and maintained by technical editors. To make good models, we need subject experts to be able to discuss and record their ideas somewhere. Ultimately the resulting models should be made into Shape Expressions by more technical people. In my mind the missing piece is that there's nowhere to propose or discuss a new schema in a way most Wikidata editors would understand -- NavinoEvans (talk) 14:57, 23 October 2019 (UTC)[reply]
          • The EntitySchema extension to Wikidata in its current state is indeed not geared towards less technically oriented editors, or maybe even technical editors. It has a huge learning curve, but that is a learning curve one has to go through with any schema system. Anyone is more than welcome to join Wikidata:WikiProject_ShEx. Personally, I am really delighted with the EntitySchema extension, because we can now host schemas on a wiki. I have been a ShEx proponent since 2016, and until the release of this extension I hosted Wikidata schemas on GitHub. The extension now allows maintaining those schemas as a wiki, i.e. with revisions and talk pages. But now that this extension is in place, tooling needs to be developed around it. You can already apply those schemas from within the extension to check items on Wikidata for conformance, but the reporting needs to be made more user-friendly. Next, tools are needed that allow more user-friendly creation of these schemas. The problem here is not so much ShEx (or any schema language), but the fact that it is not a user-friendly drawing standard. Basically, how would you derive such a schema? There are already very promising developments here, such as wd-shex-infer, Wikishape and Shape Designer, all three of which are geared towards Wikidata. Basically, my point is not to create yet another schema extension, but to improve or extend the current EntitySchema extension. --Andrawaag (talk) 16:03, 23 October 2019 (UTC)[reply]


  • Did not understand the proposal. What IS a schema exactly? Where is it stored, and how? If it's a Cradle kind of thing, it seems doable with a simple use of properties for this type (P1963) to list all the possible properties used on a class; see Q5#P1963 for an example on human (Q5). Inheritance is then just handled through subclass of (P279): a painter can have all the properties a human has, and there is probably a "subclass of" chain between painter and human somehow. What about Shape Expressions? author  TomT0m / talk page 14:14, 23 October 2019 (UTC)[reply]
  •  Comment Seems rather inflexible to me, if these schemas are going to be normative. Also, as far as I can see, WikiProjects have already laid out models of data and properties, with annotations, for most of the areas you present as being of interest, with good feedback on their talk pages. These talk pages, followed by those who are interested in the particular types of items, are IMO probably better spaces than trying to centralise all such discussion in a central forum, where most of the discussions will not relate to the items that any individual user is most interested in. It all seems a bit overly bureaucratic. Therefore, as it stands, I would tend to oppose.
The best advice remains to find a few well-developed items similar to the items you want to create, and see what statements and qualifiers they use. If desired, queries like those at Wikidata:WikiProject_Maps/stats can be useful to give a systematic survey of what properties are being used on a particular group of items. Jheald (talk) 16:10, 23 October 2019 (UTC)[reply]
I personally would welcome an initiative that would allow me to spend less time correcting edits that do not conform to the standards of the wikiprojects I am involved in. If schemas are able to guide editors to conform to wikiproject-established standards, I would support it. Beleg Tâl (talk) 15:49, 11 November 2019 (UTC)[reply]
  •  Support This proposal makes complete sense to me, on the whole. I am relatively new to Wikidata but since the 1990s I have been active in the Dublin Core and W3C communities, where what we are calling here "data schemas" are called "application profiles" (Dublin Core) or "data profiles" (W3C). The proposal emphasizes prescriptive uses of profiles - e.g., as "sets of rules" that "govern" a database, providing a "standardised structure" so that "all the items on museums would use the same structure". But in my experience, profiles serve first and foremost as vehicles for formulating and recording community consensus. Since there is never just one best way to model anything, there might be variants - e.g., a "strict" schema for creating new items and a "tolerant" schema for validating existing items - or there might even be more than one way to model something. For many years, profiles were worked out on whiteboards and mailing lists, then recorded in PDFs or Web pages and handed off to developers for implementation. I have become involved with the ShEx initiative because ShEx finally provides a way to express a profile in a machine-actionable way, e.g., for validation - a goal we had pursued without much success in the 2000s. In DCMI, we are currently trying to bridge the gap between ShEx experts and "spreadsheet-enabled" content experts who prefer to work with tables. Our goal is a simplified language for describing profiles usable, for example, for column headings in a spreadsheet - a spreadsheet that could then be converted on the back-end into a ShEx schema for further development. I agree that it would be great to have a space where schemas could be discussed and developed as documents or tables before being fixed in a ShEx schema. One final comment re ShEx: while it is true that ShEx "allows you to define very precise schemas", it also allows you to define, with precision (!), schemas that are quite loose and tolerant. 
Tombakerii (talk) 11:16, 24 October 2019 (UTC)[reply]
  •  Support My experience is in the taxonomy part of Wikidata, and I feel that something like this is badly needed there. The taxonomy data model to define the taxon author citation information is very complicated, and it is defined only through the project tutorial page and through discussions on the project talk page and property talk pages; there is no rigorous schema. I tried to do an exercise to load some of this information into Wikidata from an external source based on my interpretation of the tutorial, and I ran into a conflict with another user with different ideas, and was not able to reach agreement on suitable rules to allow me to continue. In my opinion anyone else who tried to load the same information through software would run into a similar problem: the taxonomy part of Wikidata is broken. If it had been accepted practice to create a rigorous schema covering all possibilities for a complicated case like this, I think this could have been resolved. I think wherever the data model gets at all complicated, it must be a common situation in Wikidata (a database which anyone can edit) that the lack of a rigorous specification leads to sections of the database becoming unusable. Strobilomyces (talk) 17:49, 11 November 2019 (UTC)[reply]
  •  Support This proposal seems sensible to me. I'm still new on WD, but when I tried out the human schema check on Swedish humans I got useful feedback about what to fix. --So9q (talk) 08:29, 13 November 2019 (UTC)[reply]
  •  Comment The RFC has just been updated (14 Nov 2019) to clarify how the proposal relates to the existing work on Shape Expressions. The scope has also been refined to cover only the initial discussion/proposal of new schemas. NavinoEvans (talk) 15:52, 14 November 2019 (UTC)[reply]
  •  Support I think this proposal is a great idea :) Having a proper system for proposing, discussing, creating, and modifying data schemas would be really helpful. I've been running a site for a few months now that uses Wikidata items for video games as its data source. The data model is generally pretty consistent, but it would be really nice for newer users to be able to reference the existing schema to figure out what statements to add, what to have for new items, what to query for when they want to know about instances of a given type of item, etc. I'd definitely be interested in helping create a schema for video game items. Nicereddy (talk) 23:13, 14 November 2019 (UTC)[reply]
  •  Support We've always anticipated and commented on this need, as did and regularly do many newer contributors coming from a background with other structured or semi-structured data (e.g. librarians). This is a solid proposal, and we are at a good stage of maturity to embrace this and evolve our data modeling discipline. Asaf Bartov (talk) 21:18, 18 November 2019 (UTC)[reply]
  •  Support Support wholeheartedly as a creator of Cradle forms and documentation nerd. The revised RfC seems to correctly summarize the current state of play, acknowledge the questions to be resolved as we go, and show a productive way forward. - PKM (talk) 20:58, 21 November 2019 (UTC)[reply]
  •  Support as revised, per Asaf. Andy Mabbett (Pigsonthewing); Talk to Andy; Andy's edits 14:48, 22 November 2019 (UTC)[reply]
  •  Support This is a logical move to support the evolution of Wikidata. Islahaddow (talk) 14:55, 22 November 2019 (UTC)[reply]
  •  Support I believe this proposal fits in the general idea that the community can propose and discuss data, extending it to schemas. Especially when it comes to modeling the data structure, it is important to include a wide range of people and make sure they are able to discuss the representation of our data, independent of being able to write the schemas themselves. This is an important step in the right direction and can foster a good community experience as a side effect. --Frimelle (talk) 12:30, 23 November 2019 (UTC)[reply]
  •  Support I fail to recognize whatever technical proposal is happening here. The examples like EntitySchema:E42 do not seem useful to me and I am not sure how anyone is supposed to respond to these or make sense of them. However, it seems like we are discussing this socially, and considering whether we should begin to standardize what previously most WikiProjects have been calling "data models". I support that. We currently have no system for centralizing data models and many WikiProjects independently create similar ones without having easy access to the curation done in comparable cases. I think it is correct that this schema system would centralize data model discussion and it would make data models machine readable. Right now looking at examples this seems to not be possible or the case, but I think the discussion here is about whether it should be, and I think yes. Blue Rasberry (talk) 11:05, 24 November 2019 (UTC)[reply]
  •  Support Totally agree! We need this. May Hachem93 (talk) 14:08, 25 November 2019 (UTC)[reply]
  •  Support I agree. This will ease the manual editing process. Antoine2711 (talk) 04:36, 6 December 2019 (UTC)[reply]
  •  Support I agree. I think it will ease the manual editing process and as said above, potentially foster a good community experience. Jamie-NAL (talk) 18:18, 15 January 2020 (UTC)[reply]
  •  Support really need this and I would even go as far as to say bots should not be allowed to operate before there is at least some notice of the data model their edits will conform to. Iwan.Aucamp (talk) 23:43, 15 March 2020 (UTC)[reply]

Outstanding questions discussion

What to include as values in the example

If we want to store the schemas themselves in WD, my first instinct would be to just use the appropriate property with the special value "Unknown data" (rather than examples or "fake" items). --Reosarevok (talk) 12:31, 23 October 2019 (UTC)[reply]

  • Yes, I think that is a good idea, also maybe a Wikidata item that can be excluded from query results (I'm very aware that items for schemas could appear in query results). One other (maybe completely over-the-top) solution would be to have another class of items (like Qs and Ps) for schemas (I guess S?) --John Cummings (talk) 12:44, 23 October 2019 (UTC)[reply]
There is already the Es (e.g. this schema for [genes])
So there is already a special item type specifically for schema? Is there a place for the community to discuss and agree on a schema for a subject? Thanks --John Cummings (talk) 16:27, 23 October 2019 (UTC)[reply]
There are schemas such as EntitySchema:E10 and so on (generally: E-items in the EntitySchema namespace [1]). Tooling is about to be developed by volunteers on Toolforge and external servers, including tools to compare items against schemas. Some core components might come from WMDE as well, and we need technical improvements in some areas—see the phabricator tag for ShEx.
Right now, anyone can create schemas via Special:NewEntitySchema without prior discussion, but one needs to write the schema in ShEx notation. That approach can of course lead to competing, contradictory schemas for a particular scope, but until now there have been no complaints about that. We still need to arrange how to make ShEx schemas re-usable by introducing some naming conventions, however. --MisterSynergy (talk) 16:39, 23 October 2019 (UTC)[reply]
Agree that ShEx is the way to go, as it already addresses this. -- Fuzheado (talk) 19:26, 23 October 2019 (UTC)[reply]

How should schemas relate to each other

Would it be possible to make, say, Painter Schema be a subclass of Person Schema? Unless we write specific code for schemas, we might still need to enter all the appropriate properties on the schema by hand at first, though, even if there's some sort of "X is a specific subset of Y" property... --Reosarevok (talk) 12:31, 23 October 2019 (UTC)[reply]
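
The "X is a specific subset of Y" idea above can be sketched roughly as follows. This is only an illustration in Python, not how EntitySchema or ShEx actually implement anything: the `Schema` class and its fields are hypothetical, and the property lists for "Person" and "Painter" are made up for the example (the IDs are real Wikidata properties, but their selection here is arbitrary). The point is that a child schema would only need to list its own additions and could pick up the rest from its parent.

```python
from dataclasses import dataclass, field


@dataclass
class Schema:
    """Hypothetical, simplified schema: a name, its own properties, an optional parent."""
    name: str
    own_properties: set[str] = field(default_factory=set)
    parent: "Schema | None" = None

    def all_properties(self) -> set[str]:
        """Collect the properties of this schema and of every ancestor."""
        collected = set(self.own_properties)
        node = self.parent
        while node is not None:
            collected |= node.own_properties
            node = node.parent
        return collected


# Illustrative property choices only; not an agreed model.
person = Schema("Person", {"P31", "P569", "P19"})       # instance of, date of birth, place of birth
painter = Schema("Painter", {"P106", "P135"}, parent=person)  # occupation, movement

print(sorted(painter.all_properties()))
```

With something like this, the properties would not have to be re-entered by hand on every sub-schema, though whether the real tooling should work this way is exactly the open question.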

How could you define the most wanted or most valued statements in a schema?

I suggest using three groups as well, but borrow some language from TemplateData. That would then be "required" and "suggested" for the first two. Instead of "would be nice" I would go for the more neutral "optional" for the third. Ainali (talk) 21:26, 18 November 2019 (UTC)[reply]
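
As a rough illustration of the three groups, here is a minimal Python sketch. Everything in it is an assumption made for the example: the `Priority` enum, the `MUSEUM_SCHEMA` mapping and the `missing_by_priority` helper are hypothetical, and the assignment of properties to groups is invented rather than agreed anywhere.

```python
from enum import Enum


class Priority(Enum):
    REQUIRED = "required"
    SUGGESTED = "suggested"
    OPTIONAL = "optional"


# Hypothetical schema for a museum item: property ID -> priority group.
MUSEUM_SCHEMA = {
    "P31": Priority.REQUIRED,    # instance of
    "P17": Priority.REQUIRED,    # country
    "P625": Priority.SUGGESTED,  # coordinate location
    "P856": Priority.OPTIONAL,   # official website
}


def missing_by_priority(item_properties, schema):
    """Group the schema properties that are absent from an item by priority."""
    missing = {priority: [] for priority in Priority}
    for prop, priority in schema.items():
        if prop not in item_properties:
            missing[priority].append(prop)
    return missing


# An item that only has "instance of" and "official website" so far.
report = missing_by_priority({"P31", "P856"}, MUSEUM_SCHEMA)
for priority, props in report.items():
    print(f"{priority.value}: {props}")
```

A report like this could tell an editor what is still missing and how much it matters, without rejecting the item outright.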

Added {{Draft}}

Since the RfC for some reason completely ignores the ShEx technology we already have in Wikidata since this summer, I have added a {{Draft}} marker to this RfC. There is indeed need for discussion regarding Shape Expressions, but this RfC currently ignores it for whatever reason. I can provide some more input this evening, in case this is necessary. --MisterSynergy (talk) 14:26, 23 October 2019 (UTC)[reply]

Btw. Wikidata:WikiProject ShEx would be a starting point for those who have not heard of Shape Expressions. --MisterSynergy (talk) 14:27, 23 October 2019 (UTC)[reply]
I expect a lot of the ignoring this is because most people have never heard of this before. I certainly hadn't, and looking at this, it seems to be fairly useless for beginner users since it's even scarier than the rest of Wikidata :) Is there already a user-friendly hood over this that can be used to add an entity that conforms to the schema without needing to understand validation etc? --Reosarevok (talk) 15:12, 23 October 2019 (UTC)[reply]
Yes, I agree ShEx can be scary at first, but I am not convinced yet another schema language would solve this. There is a full landscape of similar efforts already active on Wikidata. E.g. the constraint violations. If you are at WikidataCon, there is a workshop planned to introduce Shape Expressions --Andrawaag (talk) 15:29, 23 October 2019 (UTC)[reply]
My main objective with this is to:
# Create an easy beginner friendly place to find out how to model data people want to add
# Have a place where the community can agree on how to model data on a subject
My understanding is that ShapeExpressions can't do either of these things?
--John Cummings (talk) 16:32, 23 October 2019 (UTC)[reply]
@John Cummings: Presumably the schema won't change a lot over time. I think it's OK if the documentation of the schema in layman's terms is separate from its implementation in code. That's certainly been the case for most databases I've worked with. - Jmabel (talk) 16:40, 23 October 2019 (UTC)[reply]
Yes, I assume that they won't change much over time. I think having a 'light' version would be really helpful. --John Cummings (talk) 16:52, 23 October 2019 (UTC)[reply]

Miscellany

I think this is generally a good idea. It might be possible to implement using one of our existing schema technologies (ShEx or another). I think we should keep this proposal focused on intention and requirements, and leave the technological issue aside until that is resolved.

I think it's a good idea to think of a schema providing a core set of data that would be uniform across a class such as museums. However, the spirit of Wikidata seems to me to imply that it should also be OK to hang on additional properties that are not included in the schema. There will be times when we need to record some unusual fact about a particular museum that is unusual enough that it oughtn't extend the schema, but it should be there.

(If anyone wants my further attention here, please ping, I don't use a watchlist on Wikidata.) - Jmabel (talk) 16:38, 23 October 2019 (UTC)[reply]

Thanks very much @Jmabel:, how do you think we could describe this without including technology? I agree about additional properties, just getting the core information consistently modelled. --John Cummings (talk) 17:22, 23 October 2019 (UTC)[reply]
@John Cummings: The examples I have where I've done things like this are all RDBMS rather than something like Wikibase, and they are proprietary work for my clients, so I can't post publicly online, but given that we've met in person, if you think it would be useful to see the sort of way I do this, and you're willing to promise non-disclosure of the content, I can send you via email some examples of the sort of thing involved. You won't be able to pass them on, but you can use them yourself as a model of what you might do to get agreement at the relevant level of abstraction. I don't know your email address offhand, so if you want to do this, email me via the wiki, and then I'll have an email address to reply to. - Jmabel (talk) 03:01, 24 October 2019 (UTC)[reply]

In light of ShEx, perhaps a re-orientation

Several other folks have pointed out what should be obvious now - we already have Shape Expressions (ShEx) to do the technical implementation of this. It is being resourced and developed and will have the internal hooks in Wikidata to guide users towards a particular schema even if the interface is rather crude and arcane now.

So perhaps we could reframe John's request as: how might the shapes of these schemas manifest themselves in the interface and modeling decisions and be better introduced to newcomers? If I might paraphrase what Denny said in his initial Wikidata blog post from 2013: Wikidata prefers guidance rather than restrictions. Because the world is complex, a unified database that combines the complexity of thousands of separate disciplines and fields of study means there's a lot to figure out.

That said, we have an odd smattering of tools to help folks figure out how a new addition to Wikidata fits into accepted norms and practices. In order, from light to heavy:

  • Recoin provides suggestions based on a history of previous items of the same instance.
  • Cradle provides a forms-based interface based on a definition of a model item.
  • Constraints in the form of particular values (eg. single value), relationships or formats (ie. "format as a regular expression").
  • ShEx in the future

That said, putting all these on a page to introduce folks to the "light hand" of Wikidata when it comes to schemas is a good idea. Right now, the only place I can think of where people are steered to is Help:Basic_membership_properties, and that has not been revamped in a long time. In the future, I could imagine a graphical tool that showed different Shape Expressions visually, like Wikidata Graph Builder, to help people better understand them. -- Fuzheado (talk) 19:54, 23 October 2019 (UTC)[reply]

Help:Modelling was aimed to be a landing page for this, initially. author  TomT0m / talk page
I think an expanded Help:Modelling could be the third leg of a tripod, with ShEx (extremely technical), Cradle forms based on the "must have"/"should have" sets of statements (extremely easy), and then Help:Modelling (complex but not technical), with links to model items as examples. - PKM (talk) 21:04, 21 November 2019 (UTC)[reply]

Ways to approach the RfC

Having a good way to record schemas would be valuable. On the other hand, the draft proposal currently doesn't seem to be fully thought out. In the past we often developed drafts for policy before putting them into an RfC where votes are cast. The proposal currently contains a non-answer when it comes to describing what a schema is supposed to be. If we make a decision about schemas, we have to be specific about what information we want to record in the schema. Before deciding on the abstract principles of a new proposal for schemas, I would like to see examples of how they would look. https://test.wikidata.org/wiki/Wikidata:Main_Page can be used for prototyping item-based schemas if that's the proposal. ChristianKl 20:53, 23 October 2019 (UTC)[reply]

Showcase items don't show best practice

For most of the time Wikidata has existed, showcase items didn't show best practice. They might have been best practice at the time they were created, but as we changed our rules and ideas about the importance of sources, the showcase items generally didn't change to keep up with what we consider to be best practice. In particular, our community agreed that ethnicity shouldn't be added without sources, and many of the showcase items did so. This experience suggests that when we design a new way to define data structures, we have to think hard to avoid an outcome that leaves a bunch of unused, out-of-date schemas lying around. ChristianKl 20:53, 23 October 2019 (UTC)[reply]

Rewriting

Hi all

Based on the conversations here and presentations at WikidataCon, it seems that Shape Expressions are suitable for recording schemas. I'll work on rewriting this RFC to take this into account.

Thanks

John Cummings (talk) 15:05, 1 November 2019 (UTC)[reply]

@John Cummings: Thanks for doing this. I think the RfC is on the right track. - PKM (talk) 21:10, 21 November 2019 (UTC)[reply]

Addressing the underlying WHS problem

Please see Wikidata:Project_chat#What's_a_World_Heritage_site?. --- Jura 10:27, 6 November 2019 (UTC)[reply]

Thanks, this would be an interesting early use case. Modelling World Heritage sites well requires quite a bit of in-depth knowledge about the sites.
--John Cummings (talk) 15:29, 6 November 2019 (UTC)[reply]
Let's try to address this now. Otherwise it appears that neither of you are actually interested in it. --- Jura 15:34, 6 November 2019 (UTC)[reply]
The issue is where to 'address it', if we discuss on the project chat it will just get archived and the same thing will happen again. I've corrected info on World Heritage many times and the same issues happen over and over again because people can't see how to model them. --John Cummings (talk) 10:35, 7 November 2019 (UTC)[reply]
Where did you write it down? --- Jura 08:51, 13 November 2019 (UTC)[reply]

Perhaps this should be part of Wikidata:WikiProject Built heritage? The “Data structures” tab there is a redlink. I’d like to see it built out something like Wikidata:WikiProject_Visual_arts/Item_structure. Maybe the first section can be World Heritage sites? Then the place to discuss the proposed model would be the talk page for that project. - PKM (talk) 02:49, 15 November 2019 (UTC)[reply]

Another thing we could do right now, tomorrow, is build a Cradle form for a World Heritage site. - PKM (talk) 03:10, 15 November 2019 (UTC)[reply]

I used a free software tool ShapeDesigner (Q65589138) to extract a draft schema for UNESCO World Heritage site. I then shared it in namespace E EntitySchema:E142. Perhaps this could be a starting point for discussion of how to refine the schema? YULdigitalpreservation (talk) 13:45, 15 November 2019 (UTC)[reply]
I think the people who are lost with the question at Wikidata:Project_chat#What's_a_World_Heritage_site? or a property listing on a Wikiproject are even less likely to be helped by this. --- Jura 14:07, 15 November 2019 (UTC)[reply]
@John Cummings, Jura1: Here's another thought: Perhaps a new tab on Help:Modelling for "protected areas and built heritage"? There are very specific best practices for places on the US NRHP as well. That allows us to address entities that combine built heritage and natural areas, and thus don't fit well within existing WikiProjects. - PKM (talk) 21:09, 21 November 2019 (UTC)[reply]
I think the questions are fairly trivial, but we need to sort them out first. Once done, it can be noted in the relevant places. --- Jura 15:17, 22 November 2019 (UTC)[reply]

It depends on your purpose

When you are adding co-authors to papers, the important bit is to indicate the author and any and all authorities you can find. The point of view is scientific papers and all that is relevant in the context is to ensure no duplication based on the authorities. When you add awards, details on people are secondary to completeness authorities, date of birth, death are secondary. When you work on science awards, it is first the list and because of the enforced lack of tools, authorities come second. The lack of tooling removed the incentive to do it at the same time because the gratification of a Scholia for the award the person is gone. This approach is the same approach that is used by the tooling; you add what is important to the tool, you do that well, the rest is extra. Thanks, GerardM (talk) 18:45, 11 November 2019 (UTC)[reply]

Comments on the "Outstanding questions" section

  • What format should the community-discussed, non-technical schemas be presented in when ready for a Shape Expression? Could we just use existing templates (e.g. Statement+), or tables in Visual Editor?
    • One question would be whether example statements should show references, or whether a generic note that "all statements should be referenced" is sufficient, with a link to best practices for referencing. (I prefer the latter, since it makes the modelling examples cleaner and easier to understand). I am agnostic between Statement+ and tables, though I will say I find Claim unnecessarily busy. I would dearly love someone to build a tool that you could point to a model item and have it generate a Statement+ or a tabular view or something. I would do a lot more documentation if it wasn't so hard to make examples. - PKM (talk) 21:34, 21 November 2019 (UTC)[reply]
  • How could you define the most wanted or most valued statements in a schema? Maybe just 3 groups, "must have", "should have", "would be nice"?
  • How to capture agreement on granularity of items? E.g. a museum as one item or as two items, the building and the legal entity.
    • There should be separate schemas for "museum (institution)" and "museum (building)". Help:Modelling would be the place to explain that an institution and the building where it is located should ideally be two separate items. We may always have items that combine these things, but the best practice that we promote should be two items. - PKM (talk) 21:34, 21 November 2019 (UTC)[reply]
  • Where do we draw the line from one model to the next? ShEx allows you to extend any model to include expectations about the items that are linked (e.g. a museum should have a location, but that location should also have coordinates). It would seem logical that we do not extend beyond the model's boundaries. Each connected item should have its own model to follow.
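
The last point, keeping each model within its own boundaries, can be illustrated with a small Python sketch. It is not ShEx and not how any existing validator works; the `SCHEMAS` and `ITEMS` structures, the item IDs and the `validate` helper are all hypothetical. The museum schema only requires that a location exists, and the location is then checked against its own schema in a separate step.

```python
# Hypothetical, simplified schemas: required properties, plus the schema that a
# linked item should be checked against (separately, not inline).
SCHEMAS = {
    "museum": {"required": {"P31", "P131"}, "links": {"P131": "place"}},
    "place": {"required": {"P31", "P625"}, "links": {}},
}

# Toy items with made-up IDs; each maps property -> value.
ITEMS = {
    "museum-1": {"P31": "museum", "P131": "place-1"},  # P131: located in the administrative territorial entity
    "place-1": {"P31": "city"},                        # P625 (coordinate location) is missing
}


def validate(item_id, schema_name):
    """Check one item against one schema only; return missing properties and follow-up checks."""
    schema = SCHEMAS[schema_name]
    item = ITEMS[item_id]
    missing = sorted(schema["required"] - set(item))
    # Linked items are queued for their own schema instead of being validated here.
    follow_ups = [
        (item[prop], linked_schema)
        for prop, linked_schema in schema["links"].items()
        if prop in item
    ]
    return missing, follow_ups


missing, follow_ups = validate("museum-1", "museum")
print(missing, follow_ups)
for linked_id, linked_schema in follow_ups:
    print(linked_id, validate(linked_id, linked_schema))
```

Here the museum itself passes, and the missing coordinates show up as a problem with the place item rather than with the museum, which seems closer to the "own model to follow" idea.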

Many thanks for all the input! I wholeheartedly agree with all of the suggestions you've added here. NavinoEvans (talk)

Cradle recently gained alpha support of ShEx

It parses the schema on the fly. See https://tools.wmflabs.org/cradle/#/shex/E10 (code at https://bitbucket.org/magnusmanske/cradle). --So9q (talk) 08:34, 22 March 2020 (UTC)[reply]