
Cannot enter multiple forms for the same language variant
Open, In Progress, Needs Triage · Public

Assigned To
None
Authored By
jhsoby
Oct 26 2019, 11:12 AM
Referenced Files
F30893490: Skjermdump fra 2019-10-26 13-03-17.png
Oct 26 2019, 11:12 AM

Description

When editing lexemes, I am not able to add multiple forms with the same spelling variant. However, in Norwegian there are thousands and thousands of words that have several official, accepted spelling variants of various forms, that are interchangeable – which one you use is mostly a stylistic choice.

I include here the conjugation paradigm for the word "parameter" in Norwegian:

| indefinite singular | definite singular | indefinite plural | definite plural |
|---------------------|-------------------|-------------------|-----------------|
| parameter           | parameteren       | parametere        | parameterne     |
| ––"––               | ––"––             | parametre         | parametrene     |
| ––"––               | ––"––             | parametere        | ––"––           |

The ideal (and in my mind correct) way to represent this would be to add them all to the same form – they are functionally identical, and none is more or less correct than another. However, this doesn't work because you – for some reason – can't add several forms with the same spelling variant:

Skjermdump fra 2019-10-26 13-03-17.png (641×947 px, 46 KB)

Event Timeline

I was sitting next to @Lucas_Werkmeister_WMDE right now, and he told me that we can use the same solution that is used for the plurals of English octopus, which is to add the different spellings as new forms with the same grammatical categories. However, this feels very counter-intuitive to me: I feel like the different spelling variants are what should have different forms, and then the various spellings in the same spelling variant should go together in the same form.

To put it in another way, the data model "hierarchy" (or whatever you would call it) is currently:

Lexeme » Form » spelling variant

While I really think it should be:

Lexeme » spelling variant » Form
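To make the two hierarchies concrete, here is a sketch loosely based on the shape of the Wikibase lexeme JSON format (field names simplified; the "proposed" shape with a `values` list is hypothetical and not part of the actual data model):

```python
# Current model: one form, and each spelling-variant code may carry
# only ONE representation, so "parametre" cannot be added under "nb".
current_form = {
    "grammaticalFeatures": ["indefinite", "plural"],
    "representations": {
        "nb": {"language": "nb", "value": "parametere"},
        # a second "nb" entry for "parametre" is rejected by the UI
    },
}

# Hypothetical variant-first shape: one spelling variant may hold
# several equally valid, interchangeable representations of a form.
proposed_form = {
    "grammaticalFeatures": ["indefinite", "plural"],
    "representations": {
        "nb": {"language": "nb", "values": ["parametere", "parametre"]},
    },
}

assert len(proposed_form["representations"]["nb"]["values"]) == 2
```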

As a note, this is desperately needed for a number of Indian languages (including Bengali) in which the standardization of spellings according to one standard or another does not preclude other spellings from being considered acceptable (even in the absence of some characteristic unifying the alternative spellings).

Thanks everyone!

@Denny and @daniel: I'd love to hear your thoughts.

@Asaf was this the same problem you were talking about for Hebrew?

Not quite, though it could accommodate the Hebrew case.

For the record, in Hebrew spelling is quite standardized, but there are two different orthographies -- one with diacritics (making matres lectionis unnecessary) and one without (having more matres lectionis to convey vowel information in the absence of diacritics). In modern usage, only poetry and the Bible are written in the orthography with diacritics, whereas everything else (prose books, newspapers, scientific articles) is written in the orthography without diacritics.

A couple of quick examples: the following words are all spelled with the exact same three letters, but the different diacritics combinations give distinctly different meanings:

סֵפֶר (a book)
סְפַר (a frontier/wilderness)
סַפָּר (a hairdresser)
סִפֵּר (he narrated)
סֻפָּר (it was narrated)

In the orthography without diacritics, these five would be rendered, in order:

ספר
ספר
ספר
סיפר
סופר

Note that the first three are indistinguishable in the non-diacriticized orthography, and can only be interpreted in context.

Ideally, one would be able to give the base and inflected forms of each lexeme twice -- once with diacritics, and once without.

I recall that we had long discussions about this when initially deciding on the data model. In technical terms, the question was whether we would allow only a single literal value for a spelling variant, or a list or set of words. Allowing a list or set would enable the kind of flexibility @jhsoby is asking for. But the downside is that it introduces ambiguity when listing forms (you would always have to list all of them, in undefined order) and when generating text (which one should you use?).

If I recall correctly, we decided that we want to give the consumer of the data maximum control over which variant they prefer, by forcing the producer to provide different variant codes for all different spellings. We had discussions about how to encode this in the variant (language) codes, and how to represent it in the UI, but decided to leave that for later.

So, the solution that we envisioned when originally discussing this about four years ago was: you make up a code for each of the spellings, in a way that allows the consumer to choose which variant they prefer. If that is done by encoding a region or a rhyme or a tradition or school or whatever will depend on the language. If it's a stylistic choice, name the style.

The same approach can be used for historical spellings. Codes could look something like de-x-hist-nd-15jh (this code is totally made up and probably linguistically nonsense).
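As an illustration of the consumer-side choice described above, a data reuser could walk a form's representations with an ordered preference list of variant codes. The codes and values below are made up for demonstration; `pick_representation` is a hypothetical helper, not part of any Wikibase API:

```python
def pick_representation(representations, preferred_codes):
    """Return the first representation matching the consumer's
    ordered preference list, falling back to a deterministic choice."""
    for code in preferred_codes:
        if code in representations:
            return representations[code]
    # no preferred code present: fall back to the lexically smallest key
    return representations[min(representations)]

# Made-up variant codes for two interchangeable Norwegian spellings.
reps = {
    "nb-x-long": "parametere",
    "nb-x-short": "parametre",
}

assert pick_representation(reps, ["nb-x-short", "nb"]) == "parametre"
assert pick_representation(reps, ["nb"]) == "parametere"  # fallback case
```

The point of forcing distinct codes per spelling is exactly that this lookup stays unambiguous: each code maps to at most one representation.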

I suppose this issue was inspired by the discussion after Lydia's talk at WikidataCon 2019: https://media.ccc.de/v/wikidatacon2019-2-wikidata_and_languages#t=2788 . I have now gone back to see how I actually handled these situations for Danish.

It is not possible to add two "Spelling variant"s with the same language code, e.g., "da". I get "It is not possible to enter multiple representations with the same spelling variant."

For the Danish word "aften" https://www.wikidata.org/wiki/Lexeme:L34765 I see that I have put in two forms like the English octopus case above for the case with variants with the same grammatical features.

In Danish dictionaries, the variants can be indicated with parentheses, e.g., "aft(e)ner", "aft(e)nerne", see https://dsn.dk/?retskriv=aften

I now see that the different variants need different property values. Hyphenation might be different, and pronunciation may (perhaps controversially) be different: aftnen is hyphenated aft‧nen and aftenen af‧ten‧en, while in SAMPA aftnen is "Afd$n@n and aftenen "Af$d@$n@n. "Describe in" and "Attested in" could be other properties where the property values should be different. So such Danish variants cannot/should not(!?) be represented by what Wikibase regards as "Spelling variant"s.

And so I guess that the Wikibase structure is ok'ish for Danish (and the Norwegian "parameter" example) with the help of the octopus trick.

Perhaps a property could be created that links forms that are spelling variants, e.g., the different forms of parameter, octopus and aften.


A slightly different issue arises for the words mor/moder and far/fader (mother and father in Danish). Here mor and moder should probably be regarded as two separate lexemes – one (mor) being derived as a short form of the other (moder). However, their plural forms are the same (mødre = mødre), so now there are two "mødre" forms: L36819-F3 = L184691-F3

Nearly every Vietnamese lexeme would be affected by this issue, because one of the two writing systems for the language is phonetic while the other is phonosemantic, resulting in a many-to-many relationship between the two writing systems.

So, the solution that we envisioned when originally discussing this about four years ago was: you make up a code for each of the spellings, in a way that allows the consumer to choose which variant they prefer. If that is done by encoding a region or a rhyme or a tradition or school or whatever will depend on the language. If it's a stylistic choice, name the style.

This isn’t always possible. Vietnamese chữ Nôm is unstandardized, so a single author may use multiple characters interchangeably for the same word (with the same pronunciation and meaning). There isn’t any “style” to speak of: an author’s choice to use one character for “and” has little if any bearing on their choice of character for “or”. If Wikidata had been in existence a century or more ago, we might’ve chosen to create a separate lexeme for each Nôm character, in which case it might be possible to model quốc ngữ spellings as dialectal representations of Nôm forms. But in the 21st century, Nôm characters must be subordinate to quốc ngữ words.

The ideal solution would be to allow (in the language code validator) arbitrary language codes including a rank identifier. For instance, for Vietnamese one should be able to use codes such as vi-x-Q8201-1, vi-x-Q8201-2, etc. Currently this doesn't pass the validation, as one gets the error Invalid Item ID "Q8201-1".

In T236593#8015993, @AGutman-WMF wrote:

The ideal solution would be to allow (in the language code validator) arbitrary language codes including a rank identifier. For instance, for Vietnamese one should be able to use codes such as vi-x-Q8201-1, vi-x-Q8201-2, etc. Currently this doesn't pass the validation, as one gets the error Invalid Item ID "Q8201-1".

It sounds like representations need the ability to have qualifiers…

In Danish, we are currently using multiple forms and linking them with https://www.wikidata.org/wiki/Property:P8530 (see also the discussion at https://www.wikidata.org/wiki/Wikidata:Property_proposal/Alternative_form ).

I think this is not ideal either, because it would mean validating by Lexical Masks becomes more difficult. I would argue that in such cases the right way is to use a distinct language code for the variant spelling (possibly arbitrarily selecting one as the variant), as I did for demonstration on https://www.wikidata.org/wiki/Lexeme:L229388 (though I only did it on the lexeme header, one could use the same system for each of the forms). If the pronunciation is different as well, I think they should be seen as different lexemes.

In T236593#8015993, @AGutman-WMF wrote:

The ideal solution would be to allow (in the language code validator) arbitrary language codes including a rank identifier. For instance, for Vietnamese one should be able to use codes such as vi-x-Q8201-1, vi-x-Q8201-2, etc. Currently this doesn't pass the validation, as one gets the error Invalid Item ID "Q8201-1".

It sounds like representations need the ability to have qualifiers…

To elaborate, each Nôm character needs a different set of "Han character in this lexeme" statements (multiple statements for compound words), different sources, and probably other things that aren’t coming to mind. It’s not that I don’t want to give the multiple-representation approach a try, but how else would hủy bỏ/huỷ bỏ and ký hiệu/kí hiệu be modeled but to keep the characters in separate forms?

In principle, each character should even get its own lexeme, but since each Nôm character is an alternative form of a quốc ngữ word, the various spellings of that word would need to be duplicated as lemmas of each such lexeme. It ends up being a lot of redundancy and room for error, even though most consumers of Vietnamese lexemes will only care about the quốc ngữ spellings, not the Nôm characters. I had tried this approach at one point, with very redundant lexemes for phở, 𬖾, and , but it seemed like needless complication for both editors and data consumers.

@mxn If these are purely orthographic variants (i.e. the pronunciation is the same) I would list them under a single lexeme. And in that case, the most natural way would be to list them as spelling variants rather than distinct forms.

To attach statements to specific variants, I believe that you can qualify statements using the "subject form" property (although, aside, I must admit I don't understand the need for the "Han character in this lexeme" property; what novel information does it bring on top of the orthography itself?)

I have entered this Danish lexeme today: https://www.wikidata.org/wiki/Lexeme:L348129. In authoritative works ( https://ordnet.dk/ddo/ordbog?query=m%C3%B8rkel%C3%A6gge&search=Den+Danske+Ordbog , https://dsn.dk/ordbog/ro/moerkelaegge/ and https://ordregister.dk/ ) they are regarded as one lexeme. The Ordregister has one identifier for the lexeme, and we have different Ordregister form identifiers for each of the variants. The hyphenation is different between the variants.

@Fnielsen given that the pronunciation of these forms is in fact different (according to the X-Sampa notation), and each has its own distinct inflection set, I would treat these as two distinct (synonymous) lexemes. I don't see the advantage of lumping all these forms in one entry. Of course, in a dictionary intended for human-consumption it is convenient to list them together, but in a machine-readable dictionary, such as Wikidata, these should really be treated as two distinct lexemes.

AGutman changed the task status from Open to In Progress. Jun 24 2022, 2:14 PM

I'm working on a patch to allow multiple forms associated with the same private language code.

@AGutman-WMF https://www.wikidata.org/wiki/Lexeme:L348129 does have the same inflection. The Ordregister is presumably also for machines, and it lumps forms together. For instance, https://ordregister.dk/id/COR.53473/ corresponding to https://www.wikidata.org/wiki/Lexeme:L250372 lumps 6 different forms together in the lexeme. And each has a separate COR identifier.

@Fnielsen as far as I see, each variant spelling forms its own set of inflected forms, so you have a paradigm related to mørklægge and another paradigm related to the variant spelling mørkelægge. So conceptually you don't have a single list of forms, but rather two distinct lists of forms. For this reason (and since the pronunciation slightly differs) it may make sense to separate them to two distinct lexemes.

However, if you want to follow the system of the Ordregister, please note that the identifiers have 3 parts: the first corresponds to the "lexeme", the second one to the "inflectional form", and the third one to the "spelling variant" level.
So if you want to follow this system, you shouldn't list the spelling variants as separate forms, but rather as spelling variants of the same forms. If you want to attach statements to the spelling variants, you could use the "subject form" property, as I suggested above.

In T236593#8025472, @AGutman-WMF wrote:

@mxn If these are purely orthographic variants (i.e. the pronunciation is the same) I would list them under a single lexeme. And in that case, the most natural way would be to list them as spelling variants rather than distinct forms.

This assumption is only valid in an environment with purely phonetic/alphabetic writing systems. But in Chinese, two characters that are “spelled” distinctly but carry the same semantics and pronunciation would still have distinct lexemes. This also makes it possible to indicate that the two characters are pronounced similarly in one dialect but differently in another.

Chữ Nôm is a Chinese-based writing system that adds a phonosemantic aspect. If not for its relationship to the quốc ngữ alphabet, every character would clearly get its own lexeme, just like in Chinese. Any similarity in pronunciation would be irrelevant, because this writing system makes finer semantic distinctions than any alphabet would. For example, the difference between 𬖾 and 頗 (both interchangeable written forms of phở) is that 𬖾 combines 頗 with the component 米 as a disambiguator, clarifying that it has to do with rice (because phở noodles are made of rice), as opposed to whatever 頗 originally meant in Chinese. This is only one of many possible ways in which characters may be used interchangeably but can carry different nuances. Yet all this is secondary to the fact that the two characters are equivalent to phở, which makes no such distinctions.

To further illustrate the difficulty, if you look at a quốc ngữ–to–chữ Nôm dictionary and a chữ Nôm–to–quốc ngữ dictionary by the same author, the entries will not line up, just as there isn’t a one-to-one correspondence between the English-to-German and German-to-English halves of an English–German dictionary. If you look up “bỏ” in this dictionary, you’ll get three characters from the source “vhn” corresponding to two different senses of bỏ. Any Vietnamese dictionary would have just one entry for these two senses of bỏ, because Vietnamese speakers no longer illustrate semantics in writing.

If it is so important that forms not be used for orthographic variants of a non-alphabetic writing system, then the alternative approach would be to store the quốc ngữ and chữ Nôm representations in separate lexemes, as though they’re different languages. We could link individual quốc ngữ and chữ Nôm senses together as translations. This would be broadly consistent with the approach taken on every Wiktionary and render this ticket moot for Vietnamese, but it bends the definition of a language quite a bit.

To attach statements to specific variants, I believe that you can qualify statements using the "subject form" property

This is for statements on senses. If we somehow combine all the Nôm characters into a single form, then it would make sense to qualify sources and P5425 statements by an “applies to representation” property, but even this would get messy with compounds.

(although, aside, I must admit I don't understand the need for the "Han character in this lexeme" property; what novel information does it bring on top of the orthography itself?)

Translingual data about a Han character is stored in an item. There’s a need to connect this translingual data to individual senses via language-specific forms.

@AGutman-WMF Spelling variants in Ordregister are each associated with a specific identifier. If the spelling variants are just a representation, then it is not possible to associate the identifier with the specific representation (unless a new property is proposed). On the other hand if the spelling variant is associated with a separate form then the Wikidata property can be used for the Ordregister spelling variant identifier.

If it is so important that forms not be used for orthographic variants of a non-alphabetic writing system, then the alternative approach would be to store the quốc ngữ and chữ Nôm representations in separate lexemes, as though they’re different languages. We could link individual quốc ngữ and chữ Nôm senses together as translations. This would be broadly consistent with the approach taken on every Wiktionary and render this ticket moot for Vietnamese, but it bends the definition of a language quite a bit.

I’ve implemented this approach, so this feature request is no longer of great importance to Vietnamese. A side benefit is that it’s now possible to say that a Nôm character is a “translation” of some senses of a quốc ngữ word but not others (because of semantic distinctions that were only necessary to indicate in chữ Nôm).

I recall that we had long discussions about this when initially deciding on the data model. In technical terms, the question was whether we would allow only a single literal value for a spelling variant, or a list or set of words. Allowing a list or set would enable the kind of flexibility @jhsoby is asking for. But the down side is that it introduces ambiguity when listing forms (you would always have to list all of them, in undefined order), and when generating text (which one should you use)?

If I recall correctly, we decided that we want to give the consumer of the data maximum control over which variant they prefer, by forcing the producer to provide different variant codes for all different spellings. We had discussions about how to encode this in the variant (language) codes, and how to represent it in the UI, but decided to leave that for later.

So, the solution that we envisioned when originally discussing this about four years ago was: you make up a code for each of the spellings, in a way that allows the consumer to choose which variant they prefer. If that is done by encoding a region or a rhyme or a tradition or school or whatever will depend on the language. If it's a stylistic choice, name the style.

The same approach can be used for historical spellings. codes could look something like de-x-hist-nd-15jh or something (this code is totally made up and probably linguistically nonsense).

The underlying assumption behind this decision is that different spellings must each be associated with a certain variant, or that some spellings are preferred over others, or that some spelling is more commonly used for a given spoken variant/sociolect/etc. than other spellings.

None of these assumptions is correct when it comes to non-Chinese languages that use Chinese characters, or even some Chinese languages that need to use Chinese characters.

Examples of Vietnamese chữ Nôm have already been presented above. Other examples include Japanese ateji, where kanji are used for native Japanese words except in cases where a fully established transliteration exists; its historical Korean equivalent; as well as languages like Cantonese, where non-Mandarin words need to be expressed in Chinese characters.

I've now created a patch that does allow associating several spelling variants with the same private language code.

If the patch gets merged, it will allow associating spelling variants of forms or lexemes with codes like da-x-Q123-1, da-x-Q123-2 (on top of an existing da-x-Q123), etc. In other words, the same private-use language code (qualified by some Q-id) can be reused with an arbitrary integer number following it. These numbers may represent some order of preference (for instance, according to frequency, or word length), but they can also be arbitrary if no distinguishing criterion is provided. I believe such a representation of variant spellings is better than duplicating the form with its grammatical features and other statements. If a statement should apply only to a specific spelling variant, it is possible to qualify it (using, for instance, the Subject Form property or some other tailor-made property).
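A minimal sketch of the code shape described here, as a validation rule: a base language code, followed by `-x-`, a Q-id, and an optional integer suffix. The actual regex used by the patch is not shown in this thread, so this is only an approximation of the pattern being discussed:

```python
import re

# Approximation of the proposed shape: <base>-x-Q<digits>[-<digits>]
# The real Wikibase language-code validator is more involved; this
# only checks the private-use pattern discussed in this task.
VARIANT_CODE = re.compile(
    r"^[a-z]{2,3}(-[a-zA-Z]{2,8})*-x-Q[1-9][0-9]*(-[1-9][0-9]*)?$"
)

def is_valid_variant_code(code):
    return VARIANT_CODE.match(code) is not None

assert is_valid_variant_code("da-x-Q123")
assert is_valid_variant_code("da-x-Q123-1")
assert is_valid_variant_code("vi-x-Q8201-2")
assert not is_valid_variant_code("da-x-Q123-")  # dangling suffix
assert not is_valid_variant_code("x-Q123-1")    # missing base language
```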

I believe this solution would solve the initial problem associated with this ticket ("Cannot enter multiple forms for the same language variant"), but is this still of relevance?

I believe the current situation, where multiple forms are added to account for spelling variations, goes against the spirit of the lexicographical data model, and in particular the idea that there should be exactly one form for each combination of grammatical features. Therefore I think it is important to unblock this situation, and I think my proposal is a simple way forward.

@mxn @Fnielsen @jhsoby @Ijon @daniel would you mind chiming in on this?

@daniel - this would work very well for Hebrew, for example, where the two orthographies have a formal name known to all speakers, but less well when the variations are due to lack of standardization, as in the Bangla case mentioned by @Mahir256.

@AGutman-WMF - yes, I think your approach makes sense. It would be good to auto-suggest those custom language codes in data-entry.

It’s still not clear to me which problem the -x-Q123-1 patch is trying to solve. Several languages have been mentioned in this task, but which of them would benefit from this system? I feel like for several of them, we’ve already reached the conclusion that separate forms are in fact the way to go.

I’d like to extract a general rule from @Fnielsen’s comment above (T236593#5610903): if you need separate statements, then you need separate forms or lexemes. (I think this is a sufficient condition, though it might not be a necessary one.) Pronunciation (whether pronunciation audio or IPA transcription) is probably the most significant kind of statement here: if a speaker would pronounce the spellings differently, then they should be different forms – regardless of whether the difference is a completely different ending as in octopuses/octopi, or just an extra schwa as in aft(e)nen. (I don’t find the hyphenation example as convincing… don’t you need a different hyphenation for every spelling variant, even for cases that really should just be multiple representations of one form? E.g. co‧lor/co‧lour – that could just be multiple statements on the same form, with different monolingual text language codes.)

I suspect this rule covers the Norwegian example that originally motivated this task: I feel like “parametere” and “parametre” are probably pronounced differently, much like “aftnen” and “aftenen” are pronounced differently in Danish according to Finn. For Vietnamese chữ Nôm, I feel like @mxn’s comment at T236593#8024999 goes in a similar direction, though I admit I find the whole Chinese-characters part of this discussion hard to follow.

For the cases where you really only want to have one form with multiple representations, I still agree with @daniel’s comment (T236593#5610378): “you make up a code for each of the spellings”. In practice, the only way to “make up a code” that we currently support is to append -x-Q12345 to an existing, established language code. As far as I understand, this solution works well for Hebrew: e.g. ספר/סֵפֶר (L67105) (the “book” word) uses the language codes he and he-x-Q21283070, where Q21283070 represents Tiberian vocalization, the orthography with diacritics. At some point, an editorial decision was made that the spelling without diacritics “deserves” the unsuffixed he language code (instead of both spellings using an -x-Q12345 language code), which I think is reasonable: data reusers who don’t care about the different spellings can use the most standard language code (he) and its single representation per form.

Allowing people to append an integer number to the item ID adds a second way to make up a code, and one that seems less useful to me: without knowing what the number means, how do I know which form representation to use? To me this runs counter to the goal of “allow[ing] the consumer to choose which variant they prefer”. For the languages that appear to need multiple representations for the same language code per form (e.g. the Indian languages @Mahir256 mentioned in T236593#5608530?), is it not possible to make the item ID approach work, by creating more special-purpose items? Wikidata editors would then make a decision which of the possible spellings “deserves” the standard language code, and which additional items need to be created (“spelling with character X”, “spelling with sequence Y”?). I understand that not all languages have standardized spellings where you can use a single item ID to refer to the spelling variants of a wide range of lexemes (like in Hebrew), but I think it should still be possible to describe different spellings using items that carry more meaning than just a number.

As an English example, some religious people might refuse to write the name "God" out directly, as doing so would constitute idolatry. For this we can tag the spelling as en-x-Qnnnn, where Qnnnn refers to the relevant religious group, but there is more than one alternative way to write "God": "G-d", "G*d", "G_d", "G-o-d", and so on. It makes no contextual difference whether a hyphen or an underscore is used, and the choice of which exact symbol replaces the original letter doesn't affect pronunciation or religious connotation. Hence all of these alternatives should be tagged en-x-Qnnnn, and with the patch it would be possible to have "en-x-Qnnnn-1" be "G-d" while "en-x-Qnnnn-2" is "G*d". I can't see how more specific labels could be useful in differentiating "G-d" and "G*d".

@LucasWerkmeister I agree with you that if two variants have two different pronunciations, they should probably be split into two different lexemes (in general, I think we should avoid having multiple forms with the same grammatical features within one lexeme). There is some leeway in this rule, however, since different dialects may have slightly different pronunciations which we still want to group into a single lexeme/form. For instance, American English "color" and British English "colour" are in fact pronounced slightly differently, but it would be overkill to split them, since the difference in pronunciation is systematic between the dialects.

Moreover, I agree that in general we should qualify variant spellings by a meaningful identifier (and indeed, my proposal requires this, as the integers can only qualify already Q-qualified language codes), but as @C933103 mentioned above, there are situations where there is no meaningful way to qualify two spellings (or at least, the editor hasn't thought of such a qualification yet). In these cases, the integer-qualified codes make it possible to list the variants as spelling variants nonetheless, instead of adding spurious forms or lexemes. If at a future point a more meaningful qualification is found (e.g. maybe use Q209316 for spellings with added e's in Norwegian?), the codes can easily be altered, while restructuring spurious forms as spelling variants is more difficult.

I apologize if I missed something, but if we do end up separating into different *lexemes*, how do we retain the value of all the descriptive work done on one lexeme (presumably the more common or standard form) that equally well describes the form in the other lexeme? Do we rely on some sameAs property and then on applications and re-users to consider that property and auto-merge/import statements from the other lexeme?

To give a concrete example, if the rich lexeme currently at https://www.wikidata.org/wiki/Lexeme:L189 were split, how would we make sure the sample sentences, etymology, etc., would be discoverable from the other lexeme?

To my mind, that's the main disadvantage of any solution that would involve separating into multiple lexemes.

@Asaf Insofar as two forms are considered distinct lexemes, it is probably the case that not all statements hold for both forms (e.g. the pronunciation may be different, and possibly other details such as etymology). If the two forms are close enough (e.g. differing only in minor dialectal pronunciation details), then we may indeed lump them together in one lexeme as if they were spelling variants (and then my suggested patch may become relevant). Even if we decide to split them, we may of course link the two lexemes to each other, using various properties such as "synonym of" or "derived from" etc. In any case, my suggested patch would make it easier to lump such variants together, as it allows re-using the same basic language code for several spelling variants.

> As an English example, some religious people might refuse to write the name "God" out directly, as doing so would constitute idolatry. For this we can tag the spelling as en-x-Qnnnn, where Qnnnn refers to a religious group, but there is more than one alternative way to write "God": "G-d", "G*d", "G_d", "G-o-d", and so on. It makes no contextual difference whether a hyphen or an underscore is used, and the choice of symbol substituted for the original letter doesn't affect pronunciation or religious connection. Hence all of these alternatives should be tagged en-x-Qnnnn, and with the patch it would be possible to have "en-x-Qnnnn-1" be "G-d" while "en-x-Qnnnn-2" is "G*d". I can't see how more specific labels could be useful in differentiating "G-d" and "G*d".

I don’t follow this example. If you think all of these potential forms are significant, and all of them should be tracked in Wikidata, then why do you want to combine them all under a single item ID where nobody can tell them apart? To me it makes more sense (assuming this data is notable at all) to have separate items like “bowdlerized using hyphens”, “bowdlerized using asterisks”, etc., which can be subclasses of a more general “avoiding idolatry” item, have other statements indicating which character is being used, and so on. (“Bowdlerized” definitely isn’t the right word here, but I don’t know what the right word is, sorry.)

In T236593#8097326, @AGutman-WMF wrote:

> @LucasWerkmeister I agree with you that if two variants have two different pronunciations, they should probably be split into two different lexemes (in general, I think we should avoid having multiple forms with the same grammatical features within one lexeme). There is some leeway in this rule, however, since different dialects may have slightly different pronunciations which we still want to group into a single lexeme/form. For instance, American English "color" and British English "colour" are in fact pronounced slightly differently, but it would be overkill to split them, since the difference in pronunciation is systematic between the dialects.

That’s fair, and I actually almost wrote “if the same speaker would pronounce them…” in my comment :) I’m not sure how exactly to phrase the rule, but mainly I’m glad to have found some rule at all (which I’m not sure I really understood, at least consciously, back in 2019 when I was apparently sitting next to @jhsoby).

This may be verging on pedantry, but I will say that the principle of "one form per combination of grammatical features" does not sound broadly applicable enough to follow in every language. Maybe I am missing something and this is just a convention for certain languages.

In any case, here are some examples which illustrate where this would not be a helpful model. In Punjabi, an alternate form with identical grammatical features could represent any combination of the following:

  • An alternative pronunciation of the same form, represented by mutual "alternative form" property links without mutual "homophone form" links
  • An alternative spelling of the same form in any or all of the represented spelling variants/orthographies, indicated by mutual "alternative form" links and mutual "homophone form" links.
    • If the spelling varies only for one representation--which actually is not as common as I initially expected--the other representation(s) are duplicated exactly. This may seem somewhat tedious, but for the time being it is an effective way to store the useful information that where spelling varies in one writing system, only one spelling is accepted in the other.
  • Dialectal or regional variants of the same form, most often simply indicated with "variety of form" set to "unknown value," as usually no empirical evidence exists to assign the form to a specific named dialect or say anything more specific than "this form will vary depending on who you talk to."
  • Shortened or contracted variants of the same form, indicated with mutual "alternative form" property links and "short form" as a grammatical feature on the shorter form.
  • Versions of forms which are only for use in spoken language / dialogue as opposed to versions of forms which are only used in writing. For example, for some forms on a Punjabi verb, the form will get inflected twice for grammatical number and/or person, once on an infixed part of the form, and once on the suffixed ending of the form, but in spoken/colloquial language it is acceptable to use a form which is only inflected once.

Notably, all of the above will only apply to particular inflections of a given lexeme. If we take this verb for example, https://www.wikidata.org/wiki/Lexeme:L688582 , there are so far 30 forms with "alternate forms" sharing grammatical features with another form, out of the 99 forms documented. If we were to create 30 separate lexemes to represent this 1 word, how would we represent the rest of the context that is important for understanding what these inflections represent, or indicate for example that ਹਸਾਏਂਗੀ and ਹਸਾਵੇਂਗੀ are interchangeable spelling + pronunciation options for second person + feminine + singular + additive + causative + subjunctive + definite, but that only ਹਸਾਵਾਂਗੀ is acceptable as a spelling + pronunciation option for first person + feminine + singular + additive + causative + subjunctive + definite? On other lexemes, the same grammatical feature combination may permit variation. (This is ultimately governed by a rule about the final phoneme of the root in a verb which only ever applies to the gender-inflected, written/formal first person subjunctive definite forms.) That would be an unsustainable model.

I am relatively conservative about what constitutes a separate lexeme; I tend to base it primarily on a combination of part of speech + mode of derivation rather than pronunciation or spelling variation, especially since the latter factors generally don't have any bearing on how and where a lexeme can be used according to the internal logic of the language.
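The interchangeability question above can be sketched programmatically: given a lexeme's forms, group them by their grammatical feature sets to find which representations are alternatives for the same slot. The dict shapes below only loosely resemble the Wikibase lexeme JSON model, and the IDs, spellings, and feature sets are invented placeholders, not the real data of L688582:

```python
from collections import defaultdict

# Hypothetical simplified forms: each has an ID, one representation, and a
# tuple of grammatical features (a real lexeme stores feature item IDs and
# per-spelling-variant representations; this is a deliberate simplification).
forms = [
    {"id": "L1-F1", "representation": "spelling-a",
     "features": ("2nd person", "feminine", "singular")},
    {"id": "L1-F2", "representation": "spelling-b",
     "features": ("2nd person", "feminine", "singular")},
    {"id": "L1-F3", "representation": "spelling-c",
     "features": ("1st person", "feminine", "singular")},
]

def group_by_features(forms):
    """Return feature sets that are claimed by more than one form,
    i.e. the slots where interchangeable alternatives exist."""
    groups = defaultdict(list)
    for form in forms:
        groups[frozenset(form["features"])].append(form["id"])
    return {features: ids for features, ids in groups.items() if len(ids) > 1}
```

Here F1 and F2 come out as alternatives for the same feature combination, while F3's slot permits no variation; splitting F1 and F2 into separate lexemes would discard exactly this within-paradigm relationship.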

I am inclined to agree that the specific purpose of the numbered Q-item language code patch is hard to discern. I think what may be the case here is that each of the concerns brought up in this thread has a different solution. Theoretically, there is no upper limit on the number of variations a form can have, and it could become confusing if languages started to have long vertical strips of representations, some governed by a consistent heuristic and some arbitrary. What may be productive is the addition of various properties for use on lexeme forms which offer more nuanced ways to model the different languages discussed here.

The approach I had to take with Vietnamese (separate lexemes per word per writing system, “translations” from one writing system to another) does have some downsides. For one thing, the criteria for a translation between vi and vi-Hani must be stricter than the criteria for a translation between vi and en; otherwise there would be no way to distinguish these transcriptions from translations more generally. In principle, it would follow that every simplified Chinese character should also have a separate lexeme from the corresponding traditional character(s), as on Wiktionary, and we could even take this to the extreme that “colour” is the en-GB “translation” of “color” in en-US. Maybe this wouldn’t be such a slippery slope with a dedicated “transcriptions” or “readings” property alongside the “translations” property, but such a property would be less discoverable by data consumers.

On a practical level, this separate lexeme approach means any Wiktionary template similar to https://en.wiktionary.org/wiki/Template:vi-readings would need to look up translations, while a template generating a table of translations of an English sense would need to know to ignore vi-Hani statements or merge them with vi statements. In a Vietnamese dictionary, it's also normal to list the other words represented by the same characters. As things stand, such a listing cannot be automated on Wiktionary using the Wikibase Scribunto extensions; it requires a SPARQL query. Even if we were to revive this rejected property proposal, the template would need to make a series of expensive calls to build up the table.
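For illustration, the SPARQL lookup mentioned above might look roughly like the query built below: follow sense-level "translation" links from one lexeme to its counterparts in the other writing system. The property ID P5972 ("translation"), the graph shape, and the example lexeme ID are assumptions here; the live Wikidata lexeme RDF model should be checked before relying on this, and further filtering by the counterpart's spelling-variant code is omitted:

```python
def readings_query(lexeme_id):
    """Build a SPARQL query (as a string) that follows sense-level
    translation links from one lexeme to others, e.g. vi -> vi-Hani.
    Assumes the Wikidata lexeme RDF mapping (ontolex:sense,
    wikibase:lemma) and the 'translation' property P5972."""
    return f"""
SELECT ?other ?otherLemma WHERE {{
  wd:{lexeme_id} ontolex:sense ?sense .
  ?sense wdt:P5972 ?otherSense .
  ?other ontolex:sense ?otherSense ;
         wikibase:lemma ?otherLemma .
}}
"""
```

Running one such query per page load is exactly the kind of expense the paragraph above describes; a dedicated "readings" property queryable through Scribunto would avoid it.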

It would be nice to be able to more strongly link representations in the two Vietnamese writing systems, but allowing multiple representations to have the same language code would only be a partial solution anyways. A full solution would be able to limit some statements to certain representations of a form. Otherwise, how would one indicate that one representation is now rare, having been supplanted by the other, independently of any broader linguistic shift, or that two sources disagree about whether that change has even occurred?

For now, I’m sticking to the separate lexeme approach because it affords more flexibility. If the data model evolves so that a consolidated lexeme becomes more workable, merging lexemes will be easier than splitting them apart anyways.