Panlingua, by Chaumont Devin, May 10, 1998.
Chapter 8, Machine Translation.
But how, you may ask, can Panlingua be used to translate between languages? Here are the steps required to translate from Language A to Language B:
1. Create separate lexicon-ontologies for Language A and Language B.
2. Create a translation table linking the semnods of Language A to the semnods of Language B.
3. Create the ad hoc algorithms to convert Panlingua representations using the semnods of Language A to Panlingua representations using the semnods of Language B.
4. Create a parser for Language A and a text generator for Language B.
It may be easy to understand why separate lexicons will be required for Language A and Language B. After all, every language employs a different set of symbols for communication. But you may wonder why two ontologies are needed as well. Wouldn't it be better to create a single ontology to handle both languages? After all, the external symbols of language may differ, but aren't things still the same things, and aren't the states these things can be in still the same states? It may well be possible to do this, at least for a limited number of languages at a time, especially if those languages belong to people of similar cultures. But cultures tend to differ dramatically, so that what is considered a good thing by the speakers of one language may be considered a terrible thing by the speakers of another.
For example, in the Moluccan Islands of Indonesia, the idea of rain falling while the sun is shining may strike terror into the heart. When this happens, the devil comes out looking for victims to curse with vile afflictions. So "hujan panas" is a very bad thing. But in the Hawaiian Islands many people love the misty showers of Manoa because they are thought of as sweet and refreshing, and often come in the company of brilliant rainbows, so in the Hawaiian Islands such a thing would be pleasant, sweet, and good. And in the Moluccan Islands people might say, "Let's cross the street. The light is blue," because to many people in the Moluccas blue and green are the same thing, and this was so even before biagara. But if you said the street light was blue in Hawaii, people would think you were complaining about the color of the light. And while in Hawaii a basket is just a basket, I have never been able to find a word for just "basket" in Maluku. A basket is never just a basket, but always some kind of a basket, perhaps a "bakul," or a "keranjang," or a "kamboti," etc.
For some of these difficulties the solutions would seem simple enough. For example, if in the Moluccas green lights are blue lights, then all we would have to do is make the lexlink from the word "green" link to the same semnod as the lexlink from "blue," and all would be well. Well, that is, until we tried to translate from Moluccan Malay into English. Then we would be faced with a problem, because "blue" and "green" link to separate semnods in English, so that a choice would have to be made, and this would require special processing. If both Moluccan Malay and English were made to share the same ontology, then besides a semnod linked to the English word "green" and another linked to "blue," there would have to be another one that linked to the equivalents of both these words in Moluccan Malay. But why do both of these words even exist in Moluccan Malay? Why not just some word like "greenblue"? I cannot answer this question, but for some strange reason they do, and this is the reality with which we must deal. As you can see from this example, having a common ontology, even for two languages, will necessitate additional nodes. This means that if an ontology were to be developed to include all the semnods of all the languages known to man, then this ontology might have to be very large. I am not saying that it would have to be larger than the capacity of a modern computer, but it would have to be large.
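The blue/green difficulty can be illustrated with a minimal sketch in Python, assuming separate lexicons and a simple table between semnods. Everything named here is hypothetical and chosen only for illustration: the semnod identifiers, the dictionaries, and the helper functions are not part of Panlingua itself, and "hijau" and "biru" are simply the standard Malay words for green and blue.

```python
# Hypothetical lexlinks: each word points at the identifier of one semnod.
ENGLISH_LEXLINKS = {"green": "EN_GREEN", "blue": "EN_BLUE"}
# Assumption following the text: both Malay colour words link to one semnod.
MALAY_LEXLINKS = {"hijau": "MM_GREENBLUE", "biru": "MM_GREENBLUE"}

# Hypothetical English-to-Malay table: source semnod -> destination semnods.
EN_TO_MM = {"EN_GREEN": ["MM_GREENBLUE"], "EN_BLUE": ["MM_GREENBLUE"]}


def invert(table):
    """Build the reverse table by collecting every source for each destination."""
    inverse = {}
    for source, destinations in table.items():
        for destination in destinations:
            inverse.setdefault(destination, []).append(source)
    return inverse


MM_TO_EN = invert(EN_TO_MM)


def candidates(semnod, table):
    """Return every destination semnod the table offers for this semnod."""
    return table.get(semnod, [])


def needs_special_processing(semnod, table):
    """True when the table cannot pick a single destination on its own."""
    return len(candidates(semnod, table)) != 1


if __name__ == "__main__":
    print(candidates(ENGLISH_LEXLINKS["green"], EN_TO_MM))
    print(candidates(MALAY_LEXLINKS["biru"], MM_TO_EN))
    print(needs_special_processing("MM_GREENBLUE", MM_TO_EN))
```

In this sketch the English-to-Malay direction always finds exactly one destination, while the reverse direction finds two, which is the point at which a choice, and therefore special processing, becomes necessary.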
And for the basket case, if both languages shared the same ontology, then there would simply be no lexlink to the semnod for English "basket" from Malay. So far so good. But what happens to the hypernym links of the common ontology? To the American fellow, a "bakul" is a basket, and a basket is a container, but to the Moluccan guy a "bakul" is just a container. Can we mark semlinks for language? If we could, then the Moluccan semlink from the thing called a "bakul" would jump directly to the semnod for "container," but the English semlinks would go from "bakul" to "basket" to "container" (notice that I have gotten lazy and stopped writing "the semnod linked to" before each word). Now things are getting uncomfortable, because it suddenly becomes necessary to change all our software to make it handle different semlinks for multiple languages, and this would seem to be an instant nightmare.
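To make the discomfort concrete, here is a rough Python sketch of what marking semlinks for language might look like in a shared ontology. The table layout, the language codes "en" and "mm", and the function name are all assumptions made for this illustration, and the words stand in for the semnods linked to them.

```python
# Hypothetical hypernym semlinks in a shared ontology, each marked with the
# language it is valid for: (semnod, language) -> hypernym semnod.
HYPERNYM_SEMLINKS = {
    ("bakul", "en"): "basket",      # to the American, a bakul is a basket
    ("basket", "en"): "container",  # and a basket is a container
    ("bakul", "mm"): "container",   # to the Moluccan, a bakul is just a container
}


def hypernym_chain(semnod, language):
    """Follow the hypernym semlinks marked for one language, starting at semnod."""
    chain = [semnod]
    while (chain[-1], language) in HYPERNYM_SEMLINKS:
        chain.append(HYPERNYM_SEMLINKS[(chain[-1], language)])
    return chain


if __name__ == "__main__":
    print(hypernym_chain("bakul", "en"))
    print(hypernym_chain("bakul", "mm"))
```

Even in this toy form, every routine that walks the ontology has to carry a language argument, which gives some idea of how far the change would spread through real software.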
I will not even bother with the problem of sunny rain, because in both English and Moluccan Malay "sunny rain" takes two words and therefore requires more than just a single semlink. In short, it can be seen that common ontologies may not be the right way to go.
And in fact if we return to the human model, it is probable that common ontologies are not being used even there. People who speak many languages fluently often seem to adopt whole different personalities for each language they speak. They have found that they fit into a certain personality type for English, another for French, another for German, etc. The link between personality and ontology is clear. The ontology holds all the key perceptions of the individual about his/her world.
It may well be that people have a common ontology for many "core" relations, such as "houses are big," "kittens are tiny," etc., and then use limited language-specific ontologies for certain special things. This might seem to be a logical approach, but I have failed to recognize any evidence for it thus far.
So the safest and easiest approach must be to create two ontologies and a translation table. Each entry in this table would then be a link from a semnod of Language A to a semnod of Language B. Ideally these links would be bidirectional, so that translation would be possible in both directions, but for simplicity in this chapter I will assume that they are directed from Language A to Language B. Each such link would then have the following components:
source semnod of Language A
destination semnod of Language B
special function identifier
Because we are translating from Language A to Language B, the identity of the source semnod, which is of Language A, will always be known. In most cases, the destination semnod will also be known, because the two languages will have more semnods that map cleanly than those that do not. A special function may or may not be given in cases where the destination semnod is known, but definitely must be provided where no destination semnod is known. In cases where a special function is provided, this special function will be part of the ad hoc code (step 3 in the list at the start of this chapter). The function will be called with three arguments: the address of the current translation link, the current word in the source Panlingua representation, and that word's regent. It will be designed specifically to deal with the problems known to arise when trying to translate this semnod, and will generate a circumlocution in the target Panlingua representation, or do anything else that may be required to make the translation work well. The farther Language A is from Language B, obviously, the greater the number of such ad hoc functions that will be required. Thus the number of ad hoc functions required to translate from English to Japanese would probably be very large, whereas the number of such functions required to translate from English to Dutch or German would be very small.
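The translation link and its special function might be sketched in Python roughly as follows. This is only one possible arrangement under stated assumptions: the class names, the semnod identifiers, and the function describe_basket are invented for the example, and the circumlocution it produces (a "container" atom governing a "woven" atom) is merely a stand-in for whatever a real ad hoc function would build.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Atom:
    """A word in a Panlingua representation (hypothetical, much simplified)."""
    semnod: str
    regent: Optional["Atom"] = None


@dataclass
class TranslationLink:
    """One table entry: source semnod, destination semnod, special function."""
    source: str
    destination: Optional[str] = None
    special: Optional[Callable[["TranslationLink", Atom, Optional[Atom]], List[Atom]]] = None

    def __post_init__(self):
        # The rule from the text: no destination means a function is mandatory.
        if self.destination is None and self.special is None:
            raise ValueError("link for %s needs a destination or a special function" % self.source)


def describe_basket(link, word, regent):
    """Invented ad hoc function: build a two-atom circumlocution.

    Attaching the new head to the translated regent is left out of this sketch.
    """
    head = Atom("B_CONTAINER")
    modifier = Atom("B_WOVEN", regent=head)
    return [head, modifier]


TABLE = {
    "A_HOUSE": TranslationLink("A_HOUSE", "B_HOUSE"),
    "A_BASKET": TranslationLink("A_BASKET", special=describe_basket),
}


def translate_atom(word, table):
    """Translate one source atom into one or more target atoms."""
    link = table[word.semnod]
    if link.special is not None:
        # Called with the link, the current word, and the word's regent.
        return link.special(link, word, word.regent)
    return [Atom(link.destination)]


if __name__ == "__main__":
    for semnod in ("A_HOUSE", "A_BASKET"):
        print([atom.semnod for atom in translate_atom(Atom(semnod), TABLE)])
```

A link that maps cleanly costs nothing more than a table lookup here, so the amount of ad hoc code grows only with the number of semnods that fail to map.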
But although the ad hoc functions required for translation will not be easy to write, those required for parsing will be much worse. The parser is by far the most difficult component to develop for any such system. The problem, of course, is that texts are ambiguous. The texts of surface languages employ strings of symbols to represent Panlingua atoms for communication. Each such symbol is linked to an entry of a lexicon for that language, and for each such symbol the lexicon may provide many potential synlink-lexlink pairs. In order to parse sentences, all but one of the potential synlink-lexlink pairs for each word must be discarded, and the destination of the remaining synlink must be determined. After this has been accomplished the original symbol for the word as well as its linear position in the word sequence can also be discarded. At this point the word has been parsed, and is represented in Panlingua.
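The bookkeeping side of parsing, though not the hard part, can be sketched in Python. In this illustration the lexicon entries, the synlink labels, and the semnod identifiers are all invented, and the decision about which pair to keep and where its synlink points is supplied by a hand-written oracle, because making that decision automatically is exactly the difficulty described above.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Candidate:
    """One potential synlink-lexlink pair offered by the lexicon (hypothetical)."""
    synlink: str  # invented label for the kind of syntactic link
    lexlink: str  # identifier of the semnod the lexlink points to


@dataclass
class ParsedAtom:
    """What remains of a word after parsing: no symbol, no linear position."""
    synlink: str
    semnod: str
    regent: Optional[int]  # index of the regent in the atom list, None for the root


# Toy lexicon: each symbol offers more than one candidate pair.
LEXICON = {
    "dogs": [Candidate("subject", "DOG"), Candidate("verb", "DOG_FOLLOW")],
    "bark": [Candidate("verb", "BARK_SOUND"), Candidate("subject", "TREE_BARK")],
}


def parse(words, decide):
    """Keep one candidate per word and record the destination of its synlink."""
    atoms = []
    for position, symbol in enumerate(words):
        chosen, regent = decide(position, symbol, LEXICON[symbol])
        atoms.append(ParsedAtom(chosen.synlink, chosen.lexlink, regent))
    return atoms


def hand_written_oracle(position, symbol, candidates):
    """Stand-in for real disambiguation: answers fixed in advance for 'dogs bark'."""
    answers = {0: (0, 1), 1: (0, None)}  # position -> (candidate index, regent index)
    index, regent = answers[position]
    return candidates[index], regent


if __name__ == "__main__":
    for atom in parse(["dogs", "bark"], hand_written_oracle):
        print(atom)
```

The regent is stored as an index into the list of atoms only for convenience; it stands in for a pointer to the regent atom rather than for the word's position in the sentence.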
For text generation, a grammar of the target language is used to rearrange the Panlingua atoms in a linear sequence. Then the lexicon is consulted to find a symbol with which to replace each Panlingua atom, and the process is complete.
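These two steps of text generation can be sketched in Python as follows. The "grammar" here is reduced to a single assumed rule, namely which kinds of dependents are placed before their regent, and the atoms, synlink labels, and symbols are invented for the example.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Atom:
    """A Panlingua atom with its dependents (hypothetical, much simplified)."""
    semnod: str
    synlink: str
    dependents: List["Atom"] = field(default_factory=list)


# Toy target-language lexicon: semnod -> symbol.
SYMBOLS = {"DOG": "dogs", "CHASE": "chase", "CAT": "cats"}


def linearize(atom, before_head):
    """Arrange atoms in sequence; before_head names synlinks that precede the regent."""
    sequence = []
    for dependent in atom.dependents:
        if dependent.synlink in before_head:
            sequence.extend(linearize(dependent, before_head))
    sequence.append(atom)
    for dependent in atom.dependents:
        if dependent.synlink not in before_head:
            sequence.extend(linearize(dependent, before_head))
    return sequence


def generate(root, before_head):
    """Replace each atom in the linear sequence with a symbol from the lexicon."""
    return " ".join(SYMBOLS[atom.semnod] for atom in linearize(root, before_head))


if __name__ == "__main__":
    sentence = Atom("CHASE", "root", [Atom("DOG", "subject"), Atom("CAT", "object")])
    print(generate(sentence, {"subject"}))            # subject-verb-object order
    print(generate(sentence, {"subject", "object"}))  # subject-object-verb order
```

The same atoms yield a different word order when only the grammar rule changes, which is why the linear position of each word can be discarded during parsing without loss.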
Of course these are oversimplified descriptions of parsing and text generation, but they do tell the basic steps that must be taken. There must be some gimmick in the parsing process that we are missing completely, but so far no one has ever been able to see it. Thus for a single programmer to write even a moderately reliable parser using current knowledge may take years. No one has ever been able to write a parser that can even come close to human performance, and thus it is very difficult to get materials converted from surface languages to Panlingua. Clearly this issue of parsing remains the major impediment to progress in the field of artificial intelligence today, because without getting materials into Panlingua it is virtually impossible to do anything really meaningful with them automatically using a computer. But a vast new horizon awaits us once this difficulty has been overcome.