Search Results for “morphological dictionary”

Source: Morphological dictionary

In the fields of computational linguistics and applied linguistics, a morphological dictionary is a linguistic resource that contains correspondences between the surface forms and the lexical forms of words. Surface forms of words are those found in natural language text. The lexical form corresponding to a surface form is the lemma followed by grammatical information (for example, the part of speech, gender and number). In English, give, gives, giving, gave and given are surface forms of the verb give; each of them has the lemma "give" and the part of speech verb in its lexical form. There are two kinds of morphological dictionaries: morpheme-aligned dictionaries and full-form (non-aligned) dictionaries.

Notable examples and formalisms

= Universal Morphologies


= Finite State Transducers


Aachen

Aal
Aarau

nom<>:e<>:n
nom

nom

= Interlinear Glossed Text editors

Interlinear Glossed Text (IGT) is a popular formalism in language documentation, linguistic typology, and other branches of linguistics and philology. Although IGT can be created with a conventional text editor, specialized software has been developed for it; notable examples include Toolbox, the FieldWorks Language Explorer (FLEx), and open-source alternatives such as Xigt. Toolbox and FLEx support semi-automated annotation by means of an internal morphological dictionary: whenever a morphological segment is encountered for which an annotation can be found in the dictionary, this annotation is applied, and whenever a morphological segment is newly annotated, the annotation is stored in the dictionary. FLEx and Toolbox provide different editor functionalities for annotating text and editing dictionaries, so that information beyond that found in the annotations can be added, but at their core, their formats provide aligned morphological dictionaries.
FLEx and Xigt are based on XML formats, whereas Toolbox uses a plain-text format with idiosyncratic "markers". FLEx and Toolbox are not directly interoperable with each other, but a semi-automated converter from Toolbox to FLEx does exist. Xigt comes with FLEx and Toolbox importers but is less widely used than either FLEx or Toolbox. The formats of FLEx and Toolbox are not intended for human consumption, nor are they well supported by any processing software other than their native tools.
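The look-up-and-store cycle described above can be illustrated with a minimal Python sketch. It does not reflect the actual data model or file format of Toolbox or FLEx; the class `GlossDictionary`, its `annotate` method, and the sample glosses are hypothetical names chosen only for illustration.

```python
class GlossDictionary:
    """Hypothetical in-memory store mapping morphological segments to glosses."""

    def __init__(self):
        self._glosses = {}

    def annotate(self, segments, manual=None):
        """Return (segment, gloss) pairs for a segmented word.

        `manual` maps segments to glosses supplied by the annotator. These are
        stored, so later occurrences of the same segment are glossed
        automatically. Segments without any known gloss are paired with None.
        """
        manual = manual or {}
        result = []
        for seg in segments:
            if seg in manual:
                # A newly annotated segment is stored in the dictionary.
                self._glosses[seg] = manual[seg]
            # A known segment receives its stored annotation.
            result.append((seg, self._glosses.get(seg)))
        return result


glosses = GlossDictionary()
first = glosses.annotate(["house", "-s"], manual={"house": "house", "-s": "PL"})
second = glosses.annotate(["dog", "-s"])
print(first)   # [('house', 'house'), ('-s', 'PL')]
print(second)  # [('dog', None), ('-s', 'PL')]
```

In the second call the plural suffix is glossed from the dictionary, while the unseen stem "dog" is left for the annotator.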

= OntoLex-Morph: A community standard for morphological dictionaries

OntoLex is a community standard for machine-readable dictionaries on the web. In 2019, the OntoLex-Morph module was proposed to facilitate the data modelling of morphology in lexicography and to provide a data model for morphological dictionaries used in natural language processing. OntoLex-Morph supports both aligned and non-aligned morphological dictionaries. A specific goal is to establish interoperability among IGT dictionaries, FST lexicons, and morphological dictionaries used for machine learning.

Types and structure of morphological dictionaries

= Aligned morphological dictionaries

In an aligned morphological dictionary, the correspondence between the surface form and the lexical form of a word is aligned at the character level, for example:

(h,h) (o,o) (u,u) (s,s) (e,e) (s,⟨n⟩) (θ,⟨pl⟩)
where θ is the empty symbol, ⟨n⟩ signifies "noun", and ⟨pl⟩ signifies "plural".
In the example, the left-hand side of each pair is the surface form (input) and the right-hand side is the lexical form (output). This order is used in morphological analysis, where a lexical form is generated from a surface form. In morphological generation, this order would be reversed.
Formally, if $\Sigma$ is the alphabet of the input symbols and $\Gamma$ is the alphabet of the output symbols, an aligned morphological dictionary is a subset $A \subseteq L^{*}$ (equivalently, $A \in 2^{L^{*}}$), where

$$L = ((\Sigma \cup \{\theta\}) \times \Gamma) \cup (\Sigma \times (\Gamma \cup \{\theta\}))$$

is the alphabet of all possible alignments, including those with the empty symbol θ on one side. That is, an aligned morphological dictionary is a set of strings in $L^{*}$.
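As a minimal sketch of this definition, the aligned entry for "houses" can be represented in Python as a sequence of (surface symbol, lexical symbol) pairs. The helper names (`is_valid_alignment`, `surface`, `lexical`, `invert`) are hypothetical, and the sketch assumes that θ is not itself a member of either alphabet.

```python
THETA = "θ"  # the empty symbol; assumed not to occur in either alphabet

# One aligned entry: a string over L, i.e. a sequence of
# (surface symbol, lexical symbol) pairs.
houses = [("h", "h"), ("o", "o"), ("u", "u"), ("s", "s"), ("e", "e"),
          ("s", "⟨n⟩"), (THETA, "⟨pl⟩")]


def is_valid_alignment(entry):
    """Check that every pair lies in L: θ may fill one side, never both."""
    return all(pair != (THETA, THETA) for pair in entry)


def surface(entry):
    """Project an aligned entry onto its surface (input) string."""
    return "".join(s for s, _ in entry if s != THETA)


def lexical(entry):
    """Project an aligned entry onto its lexical (output) string."""
    return "".join(l for _, l in entry if l != THETA)


def invert(entry):
    """Swap input and output, as needed for morphological generation."""
    return [(l, s) for s, l in entry]


print(is_valid_alignment(houses))  # True
print(surface(houses))             # houses
print(lexical(houses))             # house⟨n⟩⟨pl⟩
```

An aligned dictionary in this representation would simply be a collection of such entries.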

= Non-aligned morphological dictionaries (full-form dictionaries)

A non-aligned morphological dictionary (or full-form dictionary) is simply a set $U \subseteq \Sigma^{*} \times \Gamma^{*}$ (equivalently, $U \in 2^{\Sigma^{*} \times \Gamma^{*}}$) of pairs of input and output strings. A non-aligned morphological dictionary would represent the previous example as:

(houses, house⟨n⟩⟨pl⟩)
It is possible to convert a non-aligned dictionary into an aligned dictionary. Besides trivial alignments to the left or to the right, linguistically motivated alignments which align characters to their corresponding morphemes are possible.
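The trivial alignment to the left can be sketched in Python as follows. This is one plausible reading of "alignment to the left" (pair symbols position by position from the start and pad the shorter side with θ), not a reference implementation; the function names and the assumption that grammatical tags are written between ⟨ and ⟩ are illustrative.

```python
import re
from itertools import zip_longest

THETA = "θ"  # the empty symbol


def tokenize_lexical(lexical_form):
    """Split a lexical form into symbols, keeping each ⟨tag⟩ as one symbol."""
    return re.findall(r"⟨[^⟩]*⟩|.", lexical_form)


def align_left(surface_form, lexical_form):
    """Pair symbols from the left, padding the shorter side with θ."""
    return list(zip_longest(surface_form,
                            tokenize_lexical(lexical_form),
                            fillvalue=THETA))


aligned = align_left("houses", "house⟨n⟩⟨pl⟩")
print(aligned)
# [('h', 'h'), ('o', 'o'), ('u', 'u'), ('s', 's'), ('e', 'e'),
#  ('s', '⟨n⟩'), ('θ', '⟨pl⟩')]
```

For this entry the left alignment happens to coincide with the aligned example given earlier; a linguistically motivated alignment would require knowledge of the morpheme boundaries, which this sketch does not attempt.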

= Lexical ambiguities

Frequently there exists more than one lexical form associated with a surface form of a word. For example, "house" may be a noun in the singular, /haʊs/, or may be a verb in the present tense, /haʊz/. As a result of this it is necessary to have a function which relates input strings with their corresponding output strings.
If we define the set $E \subseteq \Sigma^{*}$ of input words such that

$$E = \{w : (w, w') \in U\},$$

the correspondence function would be $\tau : E \rightarrow 2^{\Gamma^{*}}$, defined as

$$\tau(w) = \{w' : (w, w') \in U\}.$$
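A minimal Python sketch of the correspondence function follows, built from a small full-form dictionary U. The entries and the tag names (⟨sg⟩, ⟨v⟩, ⟨pres⟩) are illustrative assumptions and are not taken from any particular resource.

```python
from collections import defaultdict

# A toy non-aligned dictionary U: (surface form, lexical form) pairs.
U = {
    ("houses", "house⟨n⟩⟨pl⟩"),
    ("house", "house⟨n⟩⟨sg⟩"),
    ("house", "house⟨v⟩⟨pres⟩"),
}


def build_tau(pairs):
    """Map each input word w to the set of all w' with (w, w') in the dictionary."""
    tau = defaultdict(set)
    for w, w_prime in pairs:
        tau[w].add(w_prime)
    return dict(tau)


tau = build_tau(U)
E = set(tau)  # the set of input words covered by U

print(sorted(E))             # ['house', 'houses']
print(sorted(tau["house"]))  # ['house⟨n⟩⟨sg⟩', 'house⟨v⟩⟨pres⟩']
```

Because the function returns a set, an ambiguous surface form such as "house" yields every analysis recorded in the dictionary; choosing among them is left to a later disambiguation step.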