Wikidata talk:Lexicographical data/Archive/2022/10

This page is an archive. Please do not modify it. Use the current page, even to continue an old discussion.


Getting audio files from a lexeme

Hi,

We are working on a project that would allow getting pronunciation audio files for a word in a given language from a lexeme. The word, language, and lexeme values would be provided by the user. For instance, the user wants to be able to grab the audio file for "tomato" in language "en-au" from Lexeme:L7993. By looking at this JSON response, we are assuming that we need to loop through the forms, check if the form has an exact match for the word "tomato" in the list of representations, and if so, we would check if the form has an audio file that matches the language qualifier "en-au". Would this logic make sense? HMonroy (WMF) (talk) 19:09, 13 September 2022 (UTC)

@HMonroy (WMF): yes absolutely, audio files are stored on forms, so this is the right and only way to do it.
By the way, could you tell us a bit more about this project? It sounds very interesting and it could help make adding the pronunciation more attractive.
Cheers, VIGNERON (talk) 10:20, 9 October 2022 (UTC)
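
The logic confirmed above can be sketched roughly as follows. This is only an illustration, not the project's implementation. It assumes the lexeme has already been loaded as a Python dict in the usual Wikibase JSON shape, that the audio statement is pronunciation audio (P443), and that the language variety is given by a language of work or name (P407) qualifier pointing at an item. The function name, the sample lexeme and the "Q_VARIETY" identifier are hypothetical placeholders; mapping a code such as "en-au" to the right item is left to the caller.

```python
from typing import Optional

AUDIO_PROP = "P443"     # pronunciation audio (assumed property)
LANGUAGE_QUAL = "P407"  # language of work or name (assumed qualifier)


def find_audio(lexeme: dict, word: str, variety_qid: str) -> Optional[str]:
    """Return the first audio file name found on a form spelled `word`
    whose language qualifier points at `variety_qid`, or None."""
    for form in lexeme.get("forms", []):
        spellings = {rep["value"] for rep in form.get("representations", {}).values()}
        if word not in spellings:
            continue  # exact match on a representation is required
        for claim in form.get("claims", {}).get(AUDIO_PROP, []):
            mainsnak = claim.get("mainsnak", {})
            if mainsnak.get("snaktype") != "value":
                continue
            for qual in claim.get("qualifiers", {}).get(LANGUAGE_QUAL, []):
                if qual.get("snaktype") != "value":
                    continue
                if qual["datavalue"]["value"]["id"] == variety_qid:
                    return mainsnak["datavalue"]["value"]
    return None


# Hypothetical, trimmed-down lexeme; "Q_VARIETY" stands in for the item
# that represents the requested language variety.
sample_lexeme = {
    "id": "L0",
    "forms": [
        {
            "id": "L0-F1",
            "representations": {"en": {"language": "en", "value": "tomato"}},
            "claims": {
                "P443": [
                    {
                        "mainsnak": {
                            "snaktype": "value",
                            "property": "P443",
                            "datavalue": {"value": "Example-tomato.ogg", "type": "string"},
                        },
                        "qualifiers": {
                            "P407": [
                                {
                                    "snaktype": "value",
                                    "property": "P407",
                                    "datavalue": {
                                        "value": {"entity-type": "item", "id": "Q_VARIETY"},
                                        "type": "wikibase-entityid",
                                    },
                                }
                            ]
                        },
                    }
                ]
            },
        }
    ],
}

print(find_audio(sample_lexeme, "tomato", "Q_VARIETY"))
```

Returning None when no form or qualifier matches lets the caller decide whether to fall back to another variety or report that no recording exists.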

Attested in this sense

I'd like to cite examples from published media (typically newspapers) when a lexeme has been used in a particular sense, but the property I have tried, attested in (P5323), turns out not to be allowed on senses, only on lexemes and forms (plus Wikibase items), and this seems to be by intent, so I'm not challenging that.

Looking into related properties and their constraints, I find no other property that would suit my purpose better, but instead found some qualifiers that I think will resolve the issues I have, most notably subject sense (P6072).

I'm therefore currently using the sense citation format shown in the example below, but before I continue using this format, I'd like to invite your comments and potential objections. Am I missing some relevant qualifier? Which ones are unnecessary, fully redundant, or even inappropriate?

The sample case shown here pertains to the Swedish word ombudsman (L239133) (which is the etymological origin of the corresponding English word ombudsman (L299316)), and besides this one for the original and broader sense of the Swedish word (meaning "proxy"), I have added a similar citation for the well-known public office established after the political events of 1809:

Ok? SM5POR (talk) 17:51, 7 October 2022 (UTC)

@SM5POR: yes, I think this is the right way to do it (at least, it's how I do it and we did it for described by source (P1343) on lexemes like devezh (L627477)). Cheers, VIGNERON (talk) 10:13, 9 October 2022 (UTC)

How to model nouns that can be inflected by gender

Hello,

I'm talking about French (Q150) here, but I'm pretty sure it can be applied to (at least some) other languages.

We have nouns that can be inflected simply by gender, like occupations (computer scientist (Q82594): informaticien (L620286) / informaticienne (L620287)), animals (dog (Q144): chien (L241) / chienne (L29225)), etc.

At the moment, on Wikidata, the current model is applied by Metamorforme42 on several French lexemes (for instance informaticien (L620286) / informaticienne (L620287)). It is tedious (impossible?) to navigate from one gender to another (as hyperonym (P6593) can be used for disambiguation based on criteria other than gender). This model also leads to the creation of questionable items like computer scientist (Q113547263) and computer scientist (Q113547227), with many issues:

  • an obvious confusion between grammatical gender and other genders;
  • if we start creating items as a combination of their properties, it will quickly become unsustainable (soon, an item for Male / British / Computer scientist / born in the 20th century?).

I wonder if it would be better to merge lexemes like serveur (L17430) / serveuse (L673611). It's the choice made by several dictionaries (example for informaticien (L620286) / informaticienne (L620287) in Le Robert, TLFi). One advantage is that it is straightforward to get the feminine forms of a noun. It also avoids duplicating similar glosses.

Note that:

  • in French, the masculine form is also the "general" (there is no neutral grammatical gender);
  • merging lexemes would not always be possible: for instance, for horse (Q726), cheval (L19113) / jument (L25951) should obviously stay separate.

What do you think?

Ping @Denny: as, if I remember correctly, he talked about his experience at Google about that on the Telegram channel a few months ago.

Cheers, — Envlh (talk) 21:25, 17 August 2022 (UTC)

I like the idea and would suggest to merge them, but @Nikki: had a number of good arguments against it. --Denny (talk) 21:28, 17 August 2022 (UTC)
Hi @Envlh,
this model is adapted from Wikidata_talk:Lexicographical_data/Archive/2021/12#male and female variants of lexemes (original usecase with Lexemes in German).
Also, I seriously doubt there would be a notable Lexeme for Male / British / Computer scientist / born in the 20th century, so there is a limit in the creation of this kind of items. These two specific items were created to replace an inappropriate item for this sense, and because there are some scientific articles related to computer scientist (Q113547227) for example (structural need).
I agree about the navigation issue: it has been painful to add the specified by sense (P6719) qualifier on senses because of the separation of Lexemes.
About the confusion between grammatical gender and other genders, could you please elaborate as it is far from obvious to me?
Metamorforme42 (talk) 21:53, 17 August 2022 (UTC)
We have semantic gender (P10339) for specifying that a sense is specific to a particular gender, so there's no need to create items like computer scientist (Q113547263) and computer scientist (Q113547227) even with separate lexemes. (And on items, like for scientific articles, we'd normally use sex or gender (P21) as a qualifier, so I don't agree that there's a structural need for them). - Nikki (talk) 22:56, 17 August 2022 (UTC)
We had a lot of talks about this subject. See for instance Wikidata:Making_sense#Gendered_professions (by Denny) or this discussion I launched: Wikidata_talk:Lexicographical_data/Archive/2019/05#Lexemes_and_gender_of_noun (Lehrerin (L34168) was already an example on the beta test before the lexemes were created). Several people have a lot of arguments against merging, but merging also has its perks. My main question is: if we merge them, how would we indicate which forms refer to which set of senses? (and the other way round) Imagine merging meunière (L25640)/meunier (L306576) (where I put a lot of data specific to each lexeme).
I hear the navigation issue. Maybe we can revive this old proposal Wikidata:Property proposal/noun for other gender?
For the « One advantage is that it is straightforward to get the feminine forms of a noun. »: will it be? Many nouns have several feminine forms: autrice/auteure, docteure/doctoresse, and so on (and I'm not even talking about cases like "une docteur" or variants like "doctoresse" or "meusnière").
Right now, lexemes are pretty empty, but we need to think ahead and have a general model for when all forms and senses are there. Maybe there is a model where we can have only one lexeme, but I don't see it now and I'm leaning towards the separate lexemes.
PS: the TLFi has two URLs: https://www.cnrtl.fr/definition/informaticien and https://www.cnrtl.fr/definition/informaticienne (with same content but still two URLs) and their form tools also separate https://www.cnrtl.fr/morphologie/informaticien from https://www.cnrtl.fr/morphologie/informaticienne
PPS: Q113547263 and Q113547227 should most likely be deleted, or rather merged (and this has nothing to do with Lexemes, it breaks Wikidata general rules and customs).
Cheers, VIGNERON (talk) 08:48, 19 August 2022 (UTC)
I redirected Q113547263 and Q113547227 to Q82594. Feel free to delete them if required. The creation of these two items was retrospectively a bad idea; thanks for having pointed this out, I will pay more attention to notability before creating items from now. — Metamorforme42 (talk) 15:09, 19 August 2022 (UTC)
@metamorforme42: thanks, and sorry I meant merge not delete. All is good now. Cheers, VIGNERON (talk) 07:38, 20 August 2022 (UTC)

Thanks to Metamorforme42 for merging the items. For some explanations on genders, you can read Gender on Wikipedia or see the list of allowed values for sex or gender (P21) (to sum up, gender is not binary, and you can't just have male opposed to female).

For the auteure/autrice issue, I think this is a more general issue of marking forms that fit together within a lexeme, not limited to the discussion here. Sometimes, we can do it easily, for instance marking some forms with orthographic corrections of French in 1990 (Q486561) on chariot/charriot (L25948). But sometimes, it's not the case, like balayer (L689016), which can be written with an i or a y on several forms, with nothing to group similar forms together (and I don't think we should create a new lexeme to distinguish these forms).

About the URLs of TLFi, you realize that this is just a search engine? Its URLs don't contain IDs, but only what you input in the search engine. You have informaticien and informaticienne, but also informaticiens and informaticiennes (I don't think you want new lexemes for these plural forms?), and even garbage like informàtïcién, that all return, as you noted, the same unique entry.

It was repeated several times that there were lots of arguments in previous discussions against merging lexemes like informaticien (L620286) / informaticienne (L620287), but I was unable to find any. It is sometimes stated that the masculine and the feminine are two different concepts (we just proved the contrary with the merge) or that one is derived from the other. I disagree with these. A computer scientist is still a computer scientist, regardless of their gender: male, female, or anything else. In my opinion, gender is just a grammatical trait that can be used to inflect a noun (of course, not always, there is for instance no feminine for cheval (L19113)), like the number. And you don't create a new lexeme when a word has a plural form. As already stated, at least some sources (Le Robert, TLFi, and let's add another one: Larousse) seem to agree with that, as they have a single entry and a single definition for such lexemes.

For informaticien, this would give:

I don't see the point of having several senses for different genders. If we really want them, maybe we can create other senses, and use a combination of grammatical gender (P5185) and semantic gender (P10339) to specify that they are limited to some genders?

For meunier, this would give:

The advantages of this model:

  • you can easily find feminine variants of a word;
  • you don't duplicate senses for each existing gender.

With the models currently used in informaticien (L620286) and meunier (L306576) (if you look closely, you can see that they are not exactly the same), you duplicate senses (and everything that comes with them like properties, examples, etc.), and I've no idea on how to find their feminine variants. Maybe the revival of Wikidata:Property proposal/noun for other gender proposed by VIGNERON is a good idea.

Cheers, — Envlh (talk) 10:25, 20 August 2022 (UTC)

  1. With this model, how are we supposed to express that F3 and F4 cannot be used with meunier@fr-S1 to refer to male miller (Q694116)? We never say « une meunière » to describe a male miller (Q694116) (« Johann Georg Hiedler (Q385804) est une meunière. » is wrong) or « des meunières » to describe a group containing male miller (Q694116).
  2. Even if I agree that « informaticien » and « informaticienne » primarily refer to the occupation, I think there is a weakly defined concept combining this specific occupation and a certain kind of gender (but in a binary way, and not well defined), and we can find this concept expressed with a single word in some languages like French, German… In my opinion, this word is a sense and not only a form because it corresponds to a different concept; for example:
  3. Also, it would probably be relevant to ask some experienced contributors from frwikt to explain to us why they have separate entries for informaticienne and informaticien.
Metamorforme42 (talk) 14:20, 20 August 2022 (UTC)
Short answer for 1: it is implicit, like it is implicit that F2 is not valid for your example (Johann Georg Hiedler (Q385804) est des meuniers is wrong). Cheers, — Envlh (talk) 16:17, 20 August 2022 (UTC)
@Envlh: that could maybe work; the implicit inference could be missed by some re-users, but so are many subtleties... (and we could still document it to ease the re-use).
What about the other statements referring only to a specific set of senses or forms? For instance described by source (P1343) (or any identifiers) on meunière (L25640), where I specified the senses (to see when and how the word evolved, to see that "miller's wife" was originally the only sense and that at some point it disappeared and came back again... it allows a query like this: https://w.wiki/5bNz - I only added the basic data as a test but I would love to add dozens more sources to have better results). Could we use some qualifier? (semantic gender (P10339) and grammatical gender (P5185) again?) Even merged, I feel like keeping gendered senses would be simpler, no?
Cheers, VIGNERON (talk) 18:40, 20 August 2022 (UTC)


Hello. What you describe for French happens in exactly the same way in Spanish. On the one hand, it seems logical that the different gender inflections are in a single lexeme and are introduced as forms. On the other hand, if it were done that way, there would be a lot of information that I don't know how to add to Wikidata. I show some examples.

Lexeme gato (L34279) has two genders (masculine: gato; feminine: gata). However, not all senses have two genders, some are only masculine or only feminine.

In that specific lexeme as it is created right now, both genders are in a single lexeme, and the gender is indicated at the sense level, but I am not sure if this way is right.

In general, at this moment most noun lexemes in Spanish are created with a lexeme for each gender. For example, lexemes potro (L620468) and potra (L620476).

Advantages of this second way:

Disadvantages:

  • duplicate senses

--Hameryko (talk) 19:55, 22 August 2022 (UTC)

 Comment @Hameryko: for me it's obvious that 3 "gato" should be in 3 different Lexemes - they are homonyms (at least different etymology)...  – The preceding unsigned comment was added by Infovarius (talk • contribs).
@Hameryko: I agree with Infovarius, this is obviously a different case and different lexemes; gato (L34279) should be split.
@Envlh: could you try to simulate a merge of meunière (L25640) and meunier (L306576) in sandbox 2 (L1234)? I'd like to see how it would work exactly in such complex - but not unusual - cases (with a lot of qualifiers I guess).
Cheers, VIGNERON (talk) 11:38, 2 September 2022 (UTC)
@VIGNERON: done, with last edit being Special:Diff/1720989563. Please tell me if something is missing. Cheers, — Envlh (talk) 19:09, 4 September 2022 (UTC)
Hello @VIGNERON: did you have the time to review this model? Cheers, — Envlh (talk) 15:34, 9 October 2022 (UTC)
To your point about translations, translations are linked from sense to sense so you can still do this with shared-gender lexemes. If a lexeme has a completely different etymology I model it separately, or if it is a completely different part of speech. I have been adding lexemes in Punjabi with some of them having both grammatical genders and I can show some examples if it would be helpful. Modeling grammatical gender on separate lexemes is less viable for this language, because about a third of nouns have both masculine and feminine inflections, and the reason for this is different for each and often depends on the underlying meaning of the noun.
I will note that I do not use the grammatical gender property on senses as this is not really a semantic feature. Instead, I use multiple statements on the lexeme itself, and subject sense qualifiers. This way, it is easy to see that some senses work with both grammatical genders, while others only work with one. Semantic gender I put on senses if necessary on animate nouns.
  • Inanimate noun ਲਿੰਗ/لِنگ (L684192) has 3 senses meaning sexuality/gender (abstract concept), penis, or arm. If you are using it to mean penis or arm it has to be masculine, but you can use it as masculine or feminine for sexuality/gender. The masculine and feminine forms are identical in singular direct case (ਲਿੰਗ), but the surrounding sentence would have to change inflections for the gender. Both the Punjabi labels for grammatically feminine (ਇਸਤਰੀ ਲਿੰਗ) and masculine (ਪੁਲਿੰਗ) are derived from the masculine form of this word. Since the word for "feminine" here is derived from the masculine form of a word that allows a feminine form, we cannot assume anything about the relationship between semantics and grammar.
  • The first part of the word for grammatically feminine, ਇਸਤਰੀ, is a homograph that is a good example of an identical lexeme that should be separate. ਇਸਤਰੀ/اِستری (L700203) means clothes iron and is derived from Portuguese, but ਇਸਤਰੀ/اِستری (L700209) means woman or wife and is derived from Sanskrit.
  • Animate noun ਡੱਡੂ/ڈڈّو (L678986) has an unspecified gender sense for a frog that must be grammatically masculine. There are separate senses for male and female frog which are connected to masculine and feminine respectively, but the more common sense tied to the feminine forms is for tadpoles. (In common speech, I do not think people are thinking about the semantic gender of the tadpoles.) Then there is a common sense for a term of affection for small children, since people think of frogs as being cute. There is no semantic gender tied to this sense because even though this is usually used for boys you can use it for either gender. Since grammatical gender is already being used to indicate size and age in other senses, maybe somebody is calling their really big daughter the masculine form (in Punjabi culture, people love fat babies).
  • Proper noun ਰਫ਼ੀ/ਰਫੀ/رفیع (L691272) is an unambiguously male first name, or typically a surname inherited from someone's father as far as semantics are concerned. It has a feminine inflection though, as in Punjabi there are senses for rudely talking about female in-law relatives by inflecting the names of their husbands. This sense has been included on the lexeme for the male name because it does not mean anything without that context. ਰਫ਼ੀਆ/ਰਫੀਆ/رافعہ (L691273) is the actual female form of this name, which has a separate lexeme because these names were each derived individually from Arabic as male and female names. The actual female name is not part of the same lexical unit as the female inflection of the male name.
  • Inanimate noun ਅੰਬ/انب (L677644) means mango, and has both feminine and masculine forms. The feminine forms are used for particularly small mangoes, or particularly small mango plants, or particularly young mango plants. Anything food or agriculture related tends to be much more semantically rich in Punjabi, which means either complex gender/sense relationships like this, or simpler ones where all inflections for gender and number are eliminated as with ਸਾਗ/ساگ (L697738).
Middle river exports (talk) 01:04, 5 September 2022 (UTC)
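
The modelling described above (several grammatical gender (P5185) statements on the lexeme itself, each with subject sense (P6072) qualifiers) can be read back with a few lines of code. The following is a minimal sketch under stated assumptions: the lexeme is a dict in the Wikibase JSON shape, and the helper names, the lexeme ID "L0" and the "Q_MASCULINE" / "Q_FEMININE" identifiers are hypothetical placeholders, not real entities. The sample loosely mirrors a noun with one sense that accepts both genders and one that only accepts the masculine.

```python
from collections import defaultdict

GRAMMATICAL_GENDER = "P5185"
SUBJECT_SENSE = "P6072"


def genders_by_sense(lexeme: dict) -> dict:
    """Map each sense ID to the set of grammatical gender item IDs whose
    lexeme-level statement names that sense in a subject sense qualifier."""
    result = defaultdict(set)
    for claim in lexeme.get("claims", {}).get(GRAMMATICAL_GENDER, []):
        mainsnak = claim.get("mainsnak", {})
        if mainsnak.get("snaktype") != "value":
            continue
        gender = mainsnak["datavalue"]["value"]["id"]
        for qual in claim.get("qualifiers", {}).get(SUBJECT_SENSE, []):
            if qual.get("snaktype") == "value":
                result[qual["datavalue"]["value"]["id"]].add(gender)
    return dict(result)


def entity_snak(prop: str, entity_id: str) -> dict:
    """Build a minimal entity-valued snak (illustrative helper)."""
    return {
        "snaktype": "value",
        "property": prop,
        "datavalue": {"value": {"id": entity_id}, "type": "wikibase-entityid"},
    }


# Hypothetical lexeme: sense S1 works with both genders, S2 only with masculine.
sample = {
    "claims": {
        "P5185": [
            {
                "mainsnak": entity_snak("P5185", "Q_MASCULINE"),
                "qualifiers": {
                    "P6072": [entity_snak("P6072", "L0-S1"), entity_snak("P6072", "L0-S2")]
                },
            },
            {
                "mainsnak": entity_snak("P5185", "Q_FEMININE"),
                "qualifiers": {"P6072": [entity_snak("P6072", "L0-S1")]},
            },
        ]
    }
}

for sense_id, genders in sorted(genders_by_sense(sample).items()):
    print(sense_id, sorted(genders))
```

Senses that never appear in a qualifier are simply absent from the result, so a re-user would have to decide how to treat them (for example, as valid for every gender stated on the lexeme).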
@Infovarius:, @VIGNERON: The matter is, in the case of the lexeme gato (L34279), all the senses that I mention have the same etymology (although depending on the sense, the forms of one gender or another can be used) and in that case, I think that it would be more appropriate for it to be united in the same lexeme.
@Middle river exports: I like how the issue of genders is resolved in Punjabi. I hadn't thought of that option of using the property subject sense (P6072) to indicate which sense corresponds to each gender, but it seems quite useful. I think that this solution can be applied in the same way to Spanish for those lexemes that are of the same type and have the same etymology, but that change gender depending on the sense. – The preceding unsigned comment was added by Hameryko (talk • contribs).

Coming back to the original question. I'm wondering if the two models are really equivalent. For instance, since grammar has changed a lot over the years, I'm interested to know if we could use detailed recording (like I started on meunière (L25640) - only a beginning, obviously) to measure how "sexist" words are. For instance, when did a word shift from one meaning (like "miller's wife") to another ("female miller"), and when did the masculine shift from "generic masculine" (miller regardless of the gender) to "true masculine" (male miller)? This is just an example; I'm not sure if either of the proposed models could answer this question, and there are thousands more questions. I feel like I don't have enough data to make an enlightened and meaningful review right now (which is a choice for the status quo; it's suboptimal, I know, but still better than making a wrong choice). We really need more people to pitch in. Regards, VIGNERON (talk) 16:51, 9 October 2022 (UTC)

Hello everyone, even though I'm late to the discussion and haven't thought through all of it, some of the questions touched on reminded me of problems discussed in The World Atlas of Language Structures Online. So here are the links to two of its chapters, each with many further references, in case they might be useful in the discussion:

Best, --Marsupium (talk) 09:29, 20 October 2022 (UTC)

Obsolete spelling

How to say that свекла (L160967) is an old (and now incorrect) spelling/pronunciation of свёкла (L161458)? Some property at Lexeme level? Or at each Sense? Or try to merge them and create some variant of lemmas at each form? Infovarius (talk) 20:14, 28 September 2022 (UTC)

How about adding an end time (P582) statement to the lexeme for the old spelling, and start time (P580) to the one with the new spelling? I don't see a more specific property suitable for lexemes only. However, the constraints for end time (P582) should then be changed to allow lexeme type entities (it looks like an unintended omission, since start time (P580) has already been extended in that respect). Both properties can be used either as main values or as qualifiers (the latter could be used with IPA or audio file values on lexeme forms in case just one or a few of the forms have changed pronunciation, or on the item for this sense (P5137) statement if the meaning has changed over time).
Besides the changed properties, the two (or more) lexemes should have identical property/value sets. Add mutual synonym (P5973) links between the variant lexemes to find all spelling variants, past and present.
If the change has been gradual rather than instant, use an appropriate precision on the time value (such as decade or century), with some overlap between old and new. SM5POR (talk) 09:29, 30 September 2022 (UTC)
+1 but I would still merge them; if the spelling is the only variation, then it's the same lexeme. Cheers, VIGNERON (talk) 15:56, 10 October 2022 (UTC)
If the spelling change is in the root, and therefore appears also in each form, wouldn't merging them result in a pretty complex lexeme with 2 × 13 forms, of which 13 forms will have an end time (P582) statement and the other 13 forms a start time (P580) statement? Imagine a language engine trying to render a piece of text using 19th-century orthography; is it supposed to filter individual forms (and senses) based on the temporal properties? And if you enumerate all the spelling variants of the "same" word since medieval times, can you be sure they are all the "same" lexeme, also when an old word via spelling changes diverges into multiple modern words? God, good, goods? SM5POR (talk) 15:16, 11 October 2022 (UTC)
synonym (P5973) is intended for senses and I would not call these lexemes "synonyms" - I would say they are different "forms" of the same lexeme. I would prefer to link them at Lexeme level, but said to be the same as (P460)/permanent duplicated item (P2959) is for items... --Infovarius (talk) 07:56, 12 October 2022 (UTC)
Good point; I agree and retract my suggestion to use synonym (P5973) in these cases. Still, orthographic changes happen all the time, sometimes to individual words, but at other times to mere phonemes used throughout a written language. In Swedish, a lot of words changed spelling from "e" to "ä" (with no change in pronunciation) in the early 20th century, but later some of those words changed back to "e". A change may apply to all forms of a word, or just some of them. If we are to describe all those transitions within the same lexeme, we may need some qualifiers (temporal, dialectal etc) on the base lemma as well as on each form, but at other times separate lexemes may be the easiest solution (such as when a noun has changed grammatical gender, or when case declensions have been eliminated, leading to a different set of forms). SM5POR (talk) 09:21, 14 October 2022 (UTC)
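
To make the "filter individual forms based on the temporal properties" question from this thread more concrete, here is a rough sketch of what such a filter could look like if the merged-lexeme option were chosen. It is only an illustration under assumptions: start time (P580) and end time (P582) are taken to be main-value statements on each form, time values are read in the Wikibase "+YYYY-MM-DDT00:00:00Z" string format, only the year is compared, and the spellings, language code and years in the sample are invented placeholders rather than real data.

```python
from typing import Optional

START_TIME = "P580"
END_TIME = "P582"


def year_of(claims: dict, prop: str) -> Optional[int]:
    """Return the year of the first value statement for `prop`, or None."""
    for claim in claims.get(prop, []):
        snak = claim.get("mainsnak", {})
        if snak.get("snaktype") == "value":
            time_str = snak["datavalue"]["value"]["time"]  # e.g. "+1950-00-00T00:00:00Z"
            # Skip the leading sign when looking for the first date separator.
            return int(time_str[: time_str.index("-", 1)])
    return None


def forms_valid_in(lexeme: dict, year: int) -> list:
    """Return representations of forms whose start/end window contains `year`."""
    valid = []
    for form in lexeme.get("forms", []):
        claims = form.get("claims", {})
        start, end = year_of(claims, START_TIME), year_of(claims, END_TIME)
        if (start is None or start <= year) and (end is None or year <= end):
            valid.extend(rep["value"] for rep in form["representations"].values())
    return valid


def time_statement(prop: str, year: int) -> dict:
    """Build a minimal year-precision time statement (illustrative helper)."""
    return {
        "mainsnak": {
            "snaktype": "value",
            "property": prop,
            "datavalue": {
                "type": "time",
                "value": {"time": f"+{year:04d}-00-00T00:00:00Z", "precision": 9},
            },
        }
    }


# Hypothetical merged lexeme: the old spelling ends in 1950, the new one
# starts in 1940, so the two overlap for a decade (all values invented).
sample = {
    "forms": [
        {
            "representations": {"xx": {"language": "xx", "value": "old-spelling"}},
            "claims": {"P582": [time_statement("P582", 1950)]},
        },
        {
            "representations": {"xx": {"language": "xx", "value": "new-spelling"}},
            "claims": {"P580": [time_statement("P580", 1940)]},
        },
    ]
}

for year in (1900, 1945, 2000):
    print(year, forms_valid_in(sample, year))
```

A form without either statement is treated as valid for any year, which is one possible reading; the sketch also ignores the precision field, so a gradual change recorded with decade or century precision would need a looser comparison.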

Phonetic glue

Prompted by @Asaf (WMF) wondering how to model lexeme forms governed by rules about phonetic context (like a (L2767)) there is now a page at Wikidata:Lexicographical_data/Sandhi_rules for collecting examples of where this occurs (per a suggestion from @Nikki). If anybody is interested, it would be helpful to add any bullet points here to take a look at & work out how to indicate this information in a more sophisticated way than linking to entities representing letters. عُثمان (talk) 13:28, 19 October 2022 (UTC)

Where do we conduct discussions (open questions, answers and suggestions)? The corresponding Sandhi_rules Talk page hasn't been created yet; should it be? After P6712 (P6712) had been created there was a discussion at Property_talk:P6712 concerning the choice of data type for this property (three years ago). I'd like to continue that discussion (in short, I'd suggest linking to a phoneme or a sequence of phonemes rather than to a letter), but I'm not sure Property_talk:P6712 is the best place for such a discussion, especially if there are other properties to be considered as well in this subject context and we would like to take a broader approach on the issue.
"How to pronounce ghoti (Q1359881) or translate it into languages other than English"... :-) SM5POR (talk) 05:24, 20 October 2022 (UTC)
There is a talk page on the linked page now, feel free to add to it there. At some point I might ping people who have edited it or expressed interest if a lot of time passes since I know talk pages discussions can be kind of hard to keep track of on here. عُثمان (talk) 22:44, 20 October 2022 (UTC)