Virtuelles Handbuch Informationswissenschaft

Heinz-Dirk Luckhardt

Approaches to sense disambiguation with respect to automatic indexing and machine translation

CONTENTS

1. Introduction
2. The general linguistic approach
3. The morpho-syntactic approach to automatic tagging
4. The sublanguage approach: how can different special domains be dealt with?
5. The semantic relations approach: towards a semantic interlingua
6. The semantic (text) knowledge approach: classification and thesauri and their use in NLP
References

The morpho-syntactic approach to automatic tagging

When we 'tag' a text, we give every word in the text its grammatical description: by determining the word's function in the present sentence, we select the correct reading from the several readings a word form may have, i.e. we disambiguate it. This was formerly done manually; now there are many tools that do it automatically.

Any parsing component of a machine translation system can serve as such a tool, because disambiguation is the foremost task of these parsers. It may be achieved by morphological, syntactic, or semantic means. The approach to disambiguation shown here is based on morphology and syntax (partial parsing) and has been employed in the SUSY MT system. SUSY has, in fact, never been used for tagging texts, but its parsing results could be used for that purpose, among others. The table below shows all homographs in the German sentence

Trotz der schwierigen konjunkturellen Rahmenbedingungen in wichtigen Marktsegmenten unseres Unternehmens erhoeht sich das Ergebnis der gewoehnlichen Geschaeftstaetigkeit um 58 Prozent gegenueber dem Vorjahr.

that result from morphological analysis and dictionary look-up (all the output given is real output of the SUSY parser and MT system):

glossary of acronyms used

TEXT WORD             WKL   LEMMA NAME          STW
-------------------------------------------------------------
Trotz                 FIV   TROTZEN             VRB
                      SUB   TROTZ               SUB
                      PRP   TROTZ               FWK
der                   REL   D- (REL)            FWK
                      ARTB  D- (ARTB)           FWK
                      PER   D- (PER)            FWK
schwierigen           ADJ   SCHWIERIG           ADJ
konjunkturellen       ADJ   KONJUNKTURELL       ADJ
Rahmenbedingungen     SUB   /RAHMEN/BEDINGUNG   SUB
in                    PRP   IN (DATIV)          FWK
                      PRP   IN (AKKUSATIV)      FWK
wichtigen             ADJ   WICHTIG             ADJ
Marktsegmenten        SUB   /MARKT/SEGMENT      SUB
unseres               POSS  UNSR-               FWK
Unternehmens          SUB   UNTERNEHMEN (SUB)   SUB
                      SBI   UNTERNEHMEN         VRB
                      SBI   UNTERNEHMEN         VRB
erhoeht               ADP   ERHOEHEN            VRB
                      PTZ2  ERHOEHEN            VRB
                      FIV   ERHOEHEN            VRB
sich                  REF   ER/SIE/ES/SIE (REF) FWK
das                   REL   D- (REL)            FWK
                      ARTB  D- (ARTB)           FWK
                      PER   D-                  FWK
Ergebnis              SUB   ERGEBNIS            SUB
der                   REL   D- (REL)            FWK
                      ARTB  D- (ARTB)           FWK
                      PER   D- (PER)            FWK
gewoehnlichen         ADJ   GEWOEHNLICH         ADJ
Geschaeftstaetigkeit  SUB   GESCHAEFTSTAETIGKEIT SUB
um                    UOA   UM                  FWK
                      PRP   UM (HERUM)          FWK
                      VZS   UM                  FWK
                      PRP   UM (WILLEN)         FWK
                      PRP   UM (PRP)            FWK
58                    NUM   58                  FWK
Prozent               SUB   PROZENT             SUB
gegenueber            PRP   GEGENUEBER (PRP)    FWK
                      ADV   GEGENUEBER (ADV)    FWK
                      VZS   GEGENUEBER (VZS)    FWK
dem                   REL   D- (REL)            FWK
                      ARTB  D- (ARTB)           FWK
                      PER   D-                  FWK
Vorjahr               SUB   VORJAHR             SUB

*                           *


SUSY follows a strategy of resolving as many ambiguities as possible at an early stage of the parsing process, which makes parsing faster, but without taking premature decisions. This is achieved by a hybrid system of routines that assigns weights to pairs of parts of speech and employs partial parses. For example, it gives the pair 'preposition + article' a higher weight than 'noun + relative pronoun', as in the sample sentence (Trotz der ...). It also eliminates readings on syntactic grounds: for example, it eliminates the 'conjunction' reading of 'um' because 'um' is not followed by an infinitive verb. These criteria suffice to disambiguate all the homographs in this sentence and produce an unambiguous basis for tagging the sentence:

TEXT WORD             WKL   LEMMA NAME           STW
-------------------------------------------------------------
Trotz                 PRP   TROTZ                FWK
der                   ARTB  D- (ARTB)            FWK
schwierigen           ADJ   SCHWIERIG            ADJ
konjunkturellen       ADJ   KONJUNKTURELL        ADJ
Rahmenbedingungen     SUB   /RAHMEN/BEDINGUNG    SUB
in                    PRP   IN (DAT)             FWK
wichtigen             ADJ   WICHTIG              ADJ
Marktsegmenten        SUB   /MARKT/SEGMENT       SUB
unseres               POSS  UNSR-                FWK
Unternehmens          SUB   UNTERNEHMEN (SUB)    SUB
erhoeht               FIV   ERHOEHEN             VRB
sich                  PER   ER/SIE/ES/SIE (REF)  FWK
das                   ARTB  D- (ARTB)            FWK
Ergebnis              SUB   ERGEBNIS             SUB
der                   ARTB  D- (ARTB)            FWK
gewoehnlichen         ADJ   GEWOEHNLICH          ADJ
Geschaeftstaetigkeit  SUB   GESCHAEFTSTAETIGKEIT SUB
um                    PRP   UM (PRP)             FWK
58                    NUM   58                   FWK
Prozent               SUB   PROZENT              SUB
gegenueber            PRP   GEGENUEBER (PRP)     FWK
dem                   ARTB  D- (ARTB)            FWK
Vorjahr               SUB   VORJAHR              SUB

*                           *
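
The two criteria described above, pair weights and syntactic elimination, can be illustrated with a small sketch. It is not SUSY's actual routine: the numeric weights are invented for illustration, the search is a plain brute-force comparison of all tag sequences rather than a hybrid of weighting and partial parsing, and the names `PAIR_WEIGHTS`, `best_tagging`, `eliminate_um_conjunction`, and the tag `INF` are hypothetical. The sketch also assumes that `UOA` is the tag of the 'conjunction' reading of 'um'; the glossary is not reproduced here, so this is a guess.

```python
from itertools import product

# Candidate readings (WKL tags) per text word, taken from the first table.
READINGS = [
    ("Trotz", ["FIV", "SUB", "PRP"]),
    ("der", ["REL", "ARTB", "PER"]),
    ("schwierigen", ["ADJ"]),
]

# Hypothetical weights for adjacent tag pairs (values are invented).
# 'preposition + article' outweighs 'noun + relative pronoun'.
PAIR_WEIGHTS = {
    ("PRP", "ARTB"): 3.0,
    ("SUB", "REL"): 1.0,
    ("ARTB", "ADJ"): 2.0,
}
DEFAULT_WEIGHT = 0.0

CONJ_TAG = "UOA"  # assumption: tag of the conjunction reading of 'um'
INF_TAG = "INF"   # hypothetical tag for an infinitive verb reading


def score(tags):
    """Sum the weights of all adjacent tag pairs in one tag sequence."""
    return sum(PAIR_WEIGHTS.get(pair, DEFAULT_WEIGHT)
               for pair in zip(tags, tags[1:]))


def best_tagging(readings):
    """Return the highest-scoring tag sequence (brute force, short inputs only)."""
    candidates = product(*(tags for _, tags in readings))
    return max(candidates, key=score)


def eliminate_um_conjunction(readings):
    """Drop the conjunction reading of 'um' if no infinitive reading follows."""
    result = []
    for i, (word, tags) in enumerate(readings):
        if word.lower() == "um" and CONJ_TAG in tags and len(tags) > 1:
            followed_by_inf = any(INF_TAG in later
                                  for _, later in readings[i + 1:])
            if not followed_by_inf:
                tags = [t for t in tags if t != CONJ_TAG]
        result.append((word, tags))
    return result


print(best_tagging(READINGS))  # ('PRP', 'ARTB', 'ADJ')
```

The elimination step only removes a reading when at least one other reading remains, which mirrors the requirement of not taking premature decisions. For a full sentence, a dynamic-programming search over adjacent pairs would be preferable to enumerating every sequence.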


As mentioned above, there is no actual SUSY tagger, but a good programmer could extract everything necessary from the parser's output tables. Of course, only a small portion of the output is shown here. It may safely be assumed that much more detailed information could be assigned to the text words than a tagger would normally produce. But as tagging is not the focus of interest here, I shall leave it at that.
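
As a rough illustration of how tags might be read off such output, the following sketch turns a disambiguated table of the kind shown above into word/tag pairs. It assumes that columns are separated by at least two spaces, that each text word has exactly one remaining reading, and that no row is wrapped across lines; the function names `tags_from_table` and `render` are hypothetical, and the `word/TAG` output notation is a common tagging convention rather than anything SUSY produces.

```python
import re

# A shortened excerpt of the disambiguated output shown above.
SAMPLE = """\
TEXT WORD             WKL   LEMMA NAME           STW
-------------------------------------------------------------
Trotz                 PRP   TROTZ                FWK
der                   ARTB  D- (ARTB)            FWK
schwierigen           ADJ   SCHWIERIG            ADJ
*                           *
"""


def tags_from_table(table):
    """Extract (text word, WKL tag) pairs from a disambiguated table."""
    pairs = []
    for line in table.splitlines():
        stripped = line.strip()
        # Skip blank lines and the dashed separator line.
        if not stripped or set(stripped) == {"-"}:
            continue
        # Columns are assumed to be separated by two or more spaces.
        fields = re.split(r"\s{2,}", stripped)
        # Skip the header row and the '*' end marker (too few fields).
        if len(fields) < 3 or fields[0] == "TEXT WORD":
            continue
        pairs.append((fields[0], fields[1]))
    return pairs


def render(pairs):
    """Format pairs in the common word/TAG notation."""
    return " ".join(f"{word}/{tag}" for word, tag in pairs)


print(render(tags_from_table(SAMPLE)))  # Trotz/PRP der/ARTB schwierigen/ADJ
```

Only the first two columns are used here; the lemma and STW columns could be carried along in the same way if a richer tag were wanted.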

TEXT WORD                    WKL   LEMMA NAME                     STW
----------------------------------------------------------------------
Kostensenkungen              SUB   /KOSTEN/SENKUNG                SUB
und                          NKO   UND                            FWK
Produktivitaetsfortschritte  SUB   /PRODUKTIVITAET*S/FORTSCHRITT  SUB
gehen                        FIV   GEHEN                          VRB
nicht                        ADV   NICHT                          FWK
zu Lasten                    PRP   ZU LASTEN                      FWK
der                          ARTB  D- (ARTB)                      FWK
Qualitaet                    SUB   QUALITAET                      SUB
*                                  *