Panlingua, by Chaumont Devin, May 8, 1998.

Chapter 3, The Lexicon.

What is a lexicon? There will be many answers, but in the context of Panlingua we have the following definition:

A lexicon is a collection of symbols, lexlinks, and parts of speech. Each lexicon entry is tied to a particular symbol, without regard for morphology. No lexicon entry can ever be a prefix, suffix, or other part of a symbol. Thus the lexicon must contain separate entries for all of the following: eat, eats, eating, eaten, ate, eatable, etc. Each lexicon entry must be linked to at least one semnod but may be linked to many; in other words, it must have at least one lexlink but may have more.

Or, from a modular systems design perspective, the lexicon is the gateway through which internal linguistic systems are linked to the outside world. In other words, on a Panlingua-based system, all internal functions could hum along just fine without any lexicon, but it would be impossible for the system to communicate with the outside world in either direction. So Panlingua cannot function without an ontology, but it CAN function just fine without a lexicon.
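As a rough illustration of the definition of a lexicon entry given earlier in this chapter, the following Python sketch models entries and lexlinks. It is only a sketch under stated assumptions, not the actual Panlingua implementation: the class names LexLink and LexiconEntry, the integer semnod ids, and the choice to store the part of speech on each lexlink are all invented for this example.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass(frozen=True)
class LexLink:
    """A link from a symbol to a semnod (hypothetical layout)."""
    part_of_speech: str  # e.g. "verb", "adjective"
    semnod_id: int       # assumed integer id; semnods carry no names


@dataclass
class LexiconEntry:
    """One whole symbol, never a prefix or suffix, with its lexlinks."""
    symbol: str
    lexlinks: List[LexLink] = field(default_factory=list)

    def __post_init__(self):
        # Every entry must have at least one lexlink.
        if not self.lexlinks:
            raise ValueError(f"entry {self.symbol!r} needs at least one lexlink")


EAT_SEMNOD = 1      # assumed id for illustration
EATABLE_SEMNOD = 2  # assumed id for illustration

# Each inflected form gets its own entry; here the forms share one semnod.
lexicon = {
    form: LexiconEntry(form, [LexLink("verb", EAT_SEMNOD)])
    for form in ("eat", "eats", "eating", "eaten", "ate")
}
lexicon["eatable"] = LexiconEntry("eatable", [LexLink("adjective", EATABLE_SEMNOD)])
```

Note that nothing in the sketch strips endings from a symbol: "eats" is looked up as "eats," exactly as the definition requires.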

A symbol is a sound, image, or other physical pattern that represents something else. Spoken and written words are symbols, but the atoms of Panlingua are NOT. Nor are the semnods of the ontology. In the following paragraphs I will generally use "symbol" instead of "word" for spoken or written words. This is because I have also called the atoms of Panlingua "words," and I wish to distinguish spoken and written words from them.

Parts of speech are word classes that seem to have evolved naturally in spoken languages over time. Most languages have such classes as nouns, verbs, adjectives, and adverbs, but there is no guarantee that all the parts of speech found in one language will be found in another. For our purposes, we will adhere to the following definition:

Part of speech is a classification that limits a symbol to a particular set of synlinks and lexlinks.

As an example, let us take "noun." If we know that a symbol is a noun, then we know that this symbol can be linked to its regent (dependency grammar) by a synlink of type agent, patient, noun modifier (noun modifying another noun), etc. Part of speech doesn't tell us which of these synlink types a noun will have in a particular sentence. It only tells us what kinds of synlink a noun CAN have and what kinds of synlink it CANNOT have.
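The idea that a part of speech only limits the possible synlink types can be pictured as a lookup table. The sketch below is a hypothetical illustration: the table name, the function name, and the fact that the noun set stops at the three types named above are assumptions, since the real list continues ("etc.").

```python
# Hypothetical table: synlink types that each part of speech permits.
# Only the three noun types named in the text are listed; the real set is larger.
ALLOWED_SYNLINKS = {
    "noun": {"agent", "patient", "noun modifier"},
}


def can_have_synlink(part_of_speech: str, synlink_type: str) -> bool:
    """Say whether a part of speech permits a synlink type at all.

    This does not choose the synlink used in a given sentence; it only
    reports what the part of speech allows.
    """
    return synlink_type in ALLOWED_SYNLINKS.get(part_of_speech, set())


noun_may_be_patient = can_have_synlink("noun", "patient")
```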

Unfortunately for us, traditional grammars tend to be pretty coarse-grained. For example, in the eight or so parts of speech commonly recognized for English, no distinction is made between ordinary nouns and property nouns, such as meekness, propriety, durability, etc. So our English dictionaries fail to tell us whether or not a noun is of this "property" type. Luckily, in English one can usually tell that a noun is a property noun from the highly regular property-noun endings -ness and -ity. Unfortunately for Panlingua implementations, such distinctions must appear in the lexicon. Thus for computer implementations, until lexicographers become aware of these finer points of language, it may be necessary not only to create more part-of-speech classes than those already established for any surface languages involved, but also to manually enter special lexlink types such as that of "property" for nouns, as explained above.
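The ending test mentioned above is easy to sketch, and doing so shows why it cannot replace an explicit lexicon entry. The function name and suffix tuple below are assumptions made for this illustration.

```python
# The two regular endings named in the text.
PROPERTY_SUFFIXES = ("ness", "ity")


def looks_like_property_noun(noun: str) -> bool:
    """Guess from the ending alone; a heuristic, not a lexicon lookup."""
    return noun.lower().endswith(PROPERTY_SUFFIXES)


guesses = {
    noun: looks_like_property_noun(noun)
    for noun in ("meekness", "durability", "propriety", "city", "iron")
}
```

The guess is right for "meekness," "durability," and "iron," but it misses "propriety" (which ends in -ety) and wrongly accepts "city." This is the sense of "usually" above, and it is why the "property" distinction still has to be entered in the lexicon by hand.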

My reason for classing both Panlingua atoms and spoken or written symbols as words runs as follows:

A word is a linguistic node having one and only one synlink to a regent and one and only one lexlink to a semnod.

When words are defined in this way, both the universal atoms of meaning of Panlingua and the physical symbols of the external world are words. As an example let us take the sentence:

Rust eats iron.

A quick check of the lexicon will tell us a multitude of things, among them that:

Rust is a noun linked to a semnod linked by synonymy to another semnod linked to corrosion.

Rust is a noun linked to a semnod linked by holonymy to a semnod linked to fungus.

Rust is a verb linked to a semnod linked to rusty.

Eat is linked to a semnod to which another semnod linked to animal is linked by a link of type CAN (animals can eat).

Eat is linked to a semnod linked to corrode.


But as if by magic we immediately know that this kind of rust is not a fungus, and that this kind of eating is not something that animals do. How our brains do this at all, and how they can do this with such great accuracy and speed, is probably the greatest mystery in linguistics today. We know part of how this is done, but only part, and there are almost certainly some very important things we are missing.

Whatever the case, as speakers of English we immediately know that in the case of "Rust eats iron," we have the following two links for each word:

A lexlink links RUST to the semnod linked to CORROSION, and a synlink of type "doer" links RUST to EATS.

A lexlink of type "transitional repetitive" links EATS to a semnod linked to CORRODES, and a synlink of type "declarative" links EATS to some regent word.

A lexlink links IRON to a semnod linked by hypernymy to METAL, and a synlink of type "patient" links IRON to EATS.

Because we intuitively know these invisible links are there, by the above definition for "word," we are able to call these written and spoken symbols "words."
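The three pairs of links listed above for "Rust eats iron" can be written down as a small structure in which every word has exactly one lexlink and one synlink. This is a sketch only: the Word class, its field names, and the use of descriptive strings in place of real semnods are assumptions for readability.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Word:
    symbol: str
    lexlink_target: str                 # readable stand-in for the semnod
    synlink_type: str                   # the one synlink to the regent
    regent: Optional["Word"] = None     # None: the regent lies outside the sentence
    lexlink_type: Optional[str] = None  # given in the text only for EATS


eats = Word("eats", "semnod linked to CORRODES", "declarative",
            regent=None, lexlink_type="transitional repetitive")
rust = Word("rust", "semnod linked to CORROSION", "doer", regent=eats)
iron = Word("iron", "semnod linked by hypernymy to METAL", "patient", regent=eats)


def dependents(regent: Word, words: List[Word]) -> List[Word]:
    """Return the words whose single synlink points at the given regent."""
    return [w for w in words if w.regent is regent]


sentence = [rust, eats, iron]
eats_dependents = [w.symbol for w in dependents(eats, sentence)]
```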

As we have seen, Panlingua has no symbols and consists only of synlinks, lexlinks, and nodes. Yet the words of Panlingua, also called universal atoms of meaning, have essentially the same synlinks and lexlinks as the ones we can recognize in spoken and written words. Thus Panlingua may be thought of as the purest distillation of language: language without the symbols and linear word order characteristic of texts. Or again, Panlingua may be seen as the structural framework upon which all surface languages are formed. And the fortunate thing for US is that Panlingua can easily be modeled in computers using structures representing links and nodes.

Because each word in a sentence like "Rust eats iron" may have more than one potential lexlink and more than one potential synlink, such sentences are said to contain "ambiguity," and to require "disambiguation." The process of disambiguation, or selecting just one and only one synlink and just one and only one lexlink for each word, is also known as "parsing." Thus parsing is the conversion of written or spoken texts to Panlingua representations by means of disambiguation.
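To give a feel for the size of the problem, the sketch below enumerates every combination of candidate lexlinks for "Rust eats iron," using only the lexicon facts listed earlier. It does not perform disambiguation; it only shows the search space from which a parser must select. The dictionary layout, the descriptive labels, the single candidate for "iron," and the storing of the EAT facts under the symbol "eats" are assumptions for this example.

```python
from itertools import product

# Candidate (part of speech, semnod description) pairs per symbol.
# Labels are for readability only; real semnods have no names.
CANDIDATES = {
    "rust": [("noun", "corrosion"), ("noun", "fungus"), ("verb", "rusty")],
    "eats": [("verb", "what animals can do"), ("verb", "corrode")],
    "iron": [("noun", "kind of metal")],
}


def readings(sentence: str):
    """List every way of picking one candidate lexlink per symbol."""
    symbols = sentence.lower().rstrip(".").split()
    return list(product(*(CANDIDATES[s] for s in symbols)))


all_readings = readings("Rust eats iron.")
intended = (("noun", "corrosion"), ("verb", "corrode"), ("noun", "kind of metal"))
```

Even with this tiny lexicon there are six candidate readings (3 x 2 x 1), only one of which is the intended one, and this is before the choice of synlink for each word is considered.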

The need for a lexicon in parsing is obvious. Without the lexicon there would be no linkage between the external symbols of written and spoken language and the semnods of the ontology, and thus no means of converting these symbols to Panlingua representations. Through the lexicon each external symbol is linked to one or more semnods. During the parsing process, all extraneous lexlinks are culled leaving one, and one of the possible synlinks is selected based on the part of speech returned for the symbol by the lexicon.

The same is true of the process of creating strings of written or spoken words from Panlingua representations. For each Panlingua atom a search is made of the lexicon in order to find an external symbol linked to the same semnod as the Panlingua atom by the same kind of lexlink and having a part of speech that would allow the symbol to be used in a way reflecting its position and role in the Panlingua structure. This process, known as "text generation," is much simpler than parsing because Panlingua structures are essentially unambiguous.
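The lexicon search used in text generation can be sketched as a reverse lookup. In this minimal illustration the lexicon is a flat list, the semnod ids are invented integers, and the lexlink type and part of speech are collapsed into a single code (sng, plr, adj); all of these are simplifying assumptions.

```python
# (symbol, part-of-speech code, semnod id); ids are assumed for illustration.
LEXICON = [
    ("rose", "sng", 1),
    ("roses", "plr", 1),  # shares its semnod with "rose"
    ("red", "adj", 2),
]


def generate_symbol(semnod_id: int, part_of_speech: str):
    """Find an external symbol for a Panlingua atom, or None if there is none."""
    for symbol, pos, semnod in LEXICON:
        if semnod == semnod_id and pos == part_of_speech:
            return symbol
    return None


plural_form = generate_symbol(1, "plr")
```

Because the atom already carries a single semnod and a single role, the lookup needs no disambiguation step, which is the sense in which generation is simpler than parsing.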

A major problem for the lexicon is that of dealing with compound words. I will not attempt to do a thorough analysis of this problem here, but I will attempt to outline some basic ways of dealing with the problem. Some collocations such as "gray matter," meaning the material of the brain, will admit of no intervening words: There is never any "gray something matter," but only "gray matter." These are the easiest collocations to deal with, and may be entered in a computer lexicon using an underline character or some other connector between the two words. For some others the solution may not be so simple. For example consider the following combinations:

come about
comes about
coming about
came about

For each of these we might also have constructions like:

come slowly and carefully about
comes quickly about
coming awkwardly about
came quietly about
etc.

It might prove uneconomical to include each form of "come" with some other word in the lexicon. A better solution might be to create a couple of special semlink types and deal with such collocations in the ontology. The two semlink types I have created for this purpose are MAY and MST, meaning respectively "may have" and "must have." Thus from one of the semnods linked to COME, COMES, COMING, and CAME, a semlink of type MST runs to a semnod linked to ABOUT. Using this system it might then be possible to correctly parse phrases like "she came quickly about," "about she came quickly," etc., with reliable accuracy. Unfortunately I am forced to admit that these improvisations are only kludges that I have set up to deal with the problem. In fact after lengthy consideration I have never been able to find any way of determining what the linguistic apparatus of the human brain does in these cases. I only hope that someday perhaps someone else will be able to figure this out, either by further deduction or by more direct means.
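Both approaches to collocations described above can be sketched briefly: joining a fixed pair such as "gray matter" with an underline character, and using an MST semlink to decide whether the "come about" sense is available when other words intervene. Everything named here is hypothetical: the tables FIXED, LEXLINKS, and MST, the semnod labels COME, COME_ABOUT, and ABOUT (used only for readability), and both function names.

```python
# Fixed collocations that admit no intervening words.
FIXED = {("gray", "matter")}

# Each form of "come" is linked to two semnods; one of them carries an
# MST ("must have") semlink to the semnod linked to ABOUT.
LEXLINKS = {form: ["COME", "COME_ABOUT"]
            for form in ("come", "comes", "coming", "came")}
LEXLINKS["about"] = ["ABOUT"]
MST = {"COME_ABOUT": ["ABOUT"]}


def join_fixed(tokens):
    """Merge adjacent tokens that form a fixed collocation, e.g. gray_matter."""
    out, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) in FIXED:
            out.append(tokens[i] + "_" + tokens[i + 1])
            i += 2
        else:
            out.append(tokens[i])
            i += 1
    return out


def viable_semnods(symbol, tokens):
    """Keep the semnods of a symbol whose MST requirements the sentence meets."""
    present = {s for t in tokens for s in LEXLINKS.get(t, [])}
    return [s for s in LEXLINKS.get(symbol, [])
            if all(req in present for req in MST.get(s, []))]


joined = join_fixed("the gray matter of the brain".split())
with_about = viable_semnods("came", "she came quickly about".split())
without_about = viable_semnods("came", "she came quickly".split())
```

In this sketch the MST check ignores word order entirely, so "about she came quickly" passes just as "she came quickly about" does. Whether that is too permissive is exactly the kind of question the kludge leaves open.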

As you have probably noticed by now, this kind of lexicon and ontology arrangement takes the problem of word sense in stride, so I will not elaborate further on this point here.

It also obviates the need for any naming of semnods. This is why I have stressed that Panlingua-based systems have no internal symbols. Each semnod can be identified completely by the words to which it is linked and the link types of the lexlinks that link it to each word. Nothing else is required. But suppose we are using a computer Panlingua-based lexicon/ontology, and we make the following entry at the keyboard:

roses are red

What happens?

First of all, for the computer even to accept this input it will be necessary that "roses," "are," and "red" already be defined. "Are" is a mnemonic operator (see Chapter 2), and not a word but a semlink type. Roses and red have been defined as follows:

rose sng (ROSE is a singular noun. Create a semnod for it.)
roses plr * (ROSES is a plural noun. Use same semnod as last entry.)
red adj (RED is an adjective. Create a new semnod for it.)

The entry:

roses are red

then causes the following operations:

Find a noun lexlink for ROSES in the lexicon.
Find the semnod to which this lexlink connects.
Find an adjective lexlink for RED.
Find the semnod to which this lexlink connects.
Forge a semlink of type ARE from the first semnod to this one.

Then when the following query is entered:

are roses red

for each semnod to which ROSES is linked, a search is made to see if a semlink of type ARE exists linking it to another semnod linked to RED. If such a semlink is found, the system returns, "Yes." If no such semlink is found, the system returns, "No."
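The whole "roses are red" sequence (definitions, entry, and query) can be sketched in a few lines of Python. This is an illustrative toy, not the actual system: the class TinySystem, its method names, the integer semnod ids, and the decision to use the first matching lexlink when making an entry are all assumptions.

```python
class TinySystem:
    NOUN_KINDS = {"sng", "plr"}  # the singular and plural noun codes used above

    def __init__(self):
        self.lexicon = {}      # symbol -> list of (part-of-speech code, semnod id)
        self.semlinks = set()  # (from semnod, semlink type, to semnod)
        self.last_semnod = 0   # semnods are bare integers; they have no names

    def define(self, symbol, pos, reuse_last=False):
        """Add a lexicon entry; reuse_last plays the role of the '*' marker."""
        if reuse_last and self.last_semnod == 0:
            raise ValueError("no earlier entry to share a semnod with")
        if not reuse_last:
            self.last_semnod += 1  # create a new semnod
        self.lexicon.setdefault(symbol, []).append((pos, self.last_semnod))

    def _semnods(self, symbol, kinds):
        return [n for pos, n in self.lexicon.get(symbol, []) if pos in kinds]

    def enter_are(self, subject, attribute):
        """Handle an entry such as 'roses are red'."""
        subjects = self._semnods(subject, self.NOUN_KINDS)
        attributes = self._semnods(attribute, {"adj"})
        if not subjects or not attributes:
            raise KeyError("both symbols must already be defined")
        # Assumption: take the first noun lexlink and first adjective lexlink.
        self.semlinks.add((subjects[0], "ARE", attributes[0]))

    def query_are(self, subject, attribute):
        """Handle a query such as 'are roses red'."""
        targets = self._semnods(attribute, {"adj"})
        for s in self._semnods(subject, self.NOUN_KINDS):
            if any((s, "ARE", t) in self.semlinks for t in targets):
                return "Yes"
        return "No"


system = TinySystem()
system.define("rose", "sng")
system.define("roses", "plr", reuse_last=True)
system.define("red", "adj")
system.enter_are("roses", "red")
answer = system.query_are("roses", "red")
```

Because "rose" and "roses" share one semnod in this sketch, the query "are rose red" would also find the ARE semlink. No semnod is ever given a name; each is identified only by the lexlinks that reach it.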

As you may have gathered from the foregoing, the implementation of such systems of links and nodes on a binary computer can be less than straightforward. It may be that someday we will have computers better adapted to this kind of modeling. And yet, even with current computer technology there is no appreciable delay when all these components are linked correctly.

As far as I can understand them at this point, these are the basic components of systems using Panlingua, namely (1) Panlingua itself, (2) the ontology, and (3) the lexicon. To these must be added a parser and text generator for interaction with the outside world. All further components of the human linguistic apparatus, as best I can understand them, are coded in Panlingua.