DisCoCat (Categorical Compositional Distributional) is a mathematical framework for natural language processing which uses category theory to unify distributional semantics with the principle of compositionality. The grammatical derivations in a categorial grammar (usually a pregroup grammar) are interpreted as linear maps acting on the tensor product of word vectors to produce the meaning of a sentence or a piece of text. String diagrams are used to visualise information flow and reason about natural language semantics.
History
The framework was first introduced by Bob Coecke, Mehrnoosh Sadrzadeh, and Stephen Clark as an application of categorical quantum mechanics to natural language processing. It started with the observation that pregroup grammars and quantum processes share a common mathematical structure: both form a rigid category (also known as a non-symmetric compact closed category). As such, both benefit from a graphical calculus, which allows purely diagrammatic reasoning. Although the analogy with quantum mechanics was kept informal at first, it eventually led to the development of quantum natural language processing.
Definition
There are multiple definitions of DisCoCat in the literature, depending on the choice made for the compositional aspect of the model. What all existing versions have in common is a categorical definition of DisCoCat as a structure-preserving functor from a category of grammar to a category of semantics, where the latter usually encodes the distributional hypothesis.
The original paper used the categorical product of FinVect with a pregroup seen as a posetal category. This approach has a shortcoming: all parallel arrows of a posetal category are equal, which means that pregroups cannot distinguish between different grammatical derivations of the same syntactically ambiguous sentence. Put more intuitively, grammatical derivations are described with diagrams rather than with partial orders.
This problem is overcome when one considers the free rigid category $\mathbf{G}$ generated by the pregroup grammar. That is, $\mathbf{G}$ has generating objects for the words and the basic types of the grammar, and generating arrows $w \to t$ for the dictionary entries which assign a pregroup type $t$ to a word $w$. The arrows $f : w_1 \dots w_n \to s$ are grammatical derivations for the sentence $w_1 \dots w_n$, which can be represented as string diagrams with cups and caps, i.e. adjunction units and counits.
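As a small worked illustration of such a derivation, consider a pregroup grammar with basic types $n$ (noun) and $s$ (sentence). The example sentence "Alice loves Bob" and the type assignments below are illustrative assumptions, not taken from the text above: the nouns are given type $n$ and the transitive verb type $n^{r} s n^{l}$. The two contractions $n \cdot n^{r} \leq 1$ and $n^{l} \cdot n \leq 1$ each correspond to one cup in the string diagram.

```latex
\underbrace{n}_{\text{Alice}} \cdot \underbrace{n^{r}\, s\, n^{l}}_{\text{loves}} \cdot \underbrace{n}_{\text{Bob}}
\;=\; (n \cdot n^{r}) \cdot s \cdot (n^{l} \cdot n)
\;\leq\; 1 \cdot s \cdot 1 \;=\; s
```

In the free rigid category, the corresponding arrow is the composite of the three dictionary entries followed by the two cups.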
With this definition of pregroup grammars as free rigid categories, DisCoCat models can be defined as strong monoidal functors $F : \mathbf{G} \to \mathbf{FinVect}$. Spelling things out in detail, they assign a finite-dimensional vector space $F(x)$ to each basic type $x$ and a vector $F(w) \in F(t) = F(t_1) \otimes \dots \otimes F(t_n)$ in the appropriate tensor product space to each dictionary entry $w \to t$ where $t = t_1 \dots t_n$ (objects for words are sent to the monoidal unit, i.e. $F(w) = 1$). The meaning of a sentence $f : w_1 \dots w_n \to s$ is then given by a vector $F(f) \in F(s)$ which can be computed as the contraction of a tensor network.
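The tensor contraction can be sketched numerically. The following is a minimal illustration, not the implementation of any particular library: it assumes a hypothetical two-dimensional noun space and a two-dimensional sentence space with basis ("true", "false"), and the vectors for "Alice", "Bob" and "loves" are made-up toy values. The verb, of type $n^{r} s n^{l}$, is a tensor with three indices, and the two cups of the derivation become sums over the shared noun indices.

```python
import numpy as np


def sentence_meaning(subject, verb, obj):
    """Contract subject (N), verb (N x S x N) and object (N) into a vector in S.

    The two cups of the pregroup derivation correspond to summing over the
    indices i (subject with verb) and j (verb with object).
    """
    return np.einsum("i,isj,j->s", subject, verb, obj)


# Toy word vectors in a hypothetical 2-dimensional noun space.
alice = np.array([1.0, 0.0])
bob = np.array([0.0, 1.0])

# Toy verb tensor; the sentence space has basis ("true", "false").
loves = np.zeros((2, 2, 2))
loves[:, 1, :] = 1.0          # default: every (subject, object) pair maps to "false"
loves[0, :, 1] = [1.0, 0.0]   # except (Alice, Bob), which maps to "true"

print(sentence_meaning(alice, loves, bob))   # meaning of "Alice loves Bob"
print(sentence_meaning(bob, loves, alice))   # meaning of "Bob loves Alice"
```

With these toy values, "Alice loves Bob" lands on the first basis vector of the sentence space and "Bob loves Alice" on the second. In practice the word tensors would be estimated from corpus data rather than written by hand.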
The reason behind the choice of $\mathbf{FinVect}$ as the category of semantics is that vector spaces are the usual setting of distributional reading in computational linguistics and natural language processing. The underlying idea of the distributional hypothesis, "A word is characterized by the company it keeps", is particularly relevant when assigning meaning to words like adjectives or verbs, whose semantic connotation is strongly dependent on context.
Variations
Variations of DisCoCat have been proposed with a different choice for the grammar category. The main motivation behind this lies in the fact that pregroup grammars have been proved to be weakly equivalent to context-free grammars. One such variation uses combinatory categorial grammar as the grammar category.
List of linguistic phenomena
The DisCoCat framework has been used to study the following phenomena from linguistics.
Entailment
Coordination
Hyponymy and hypernymy
Ambiguity with density matrices
Discourse analysis
Anaphora and ellipsis
Language evolution
Applications in NLP
The DisCoCat framework has been applied to solve the following tasks in natural language processing.
Word-sense disambiguation
Semantic similarity
Question answering
Machine translation
Anaphora resolution
See also
Lambek calculus
Pregroup grammar
Distributional semantics
Principle of compositionality
String diagram
Categorical quantum mechanics
Quantum natural language processing
External links
DisCoPy, a Python toolkit for computing with string diagrams
lambeq, a Python library for quantum natural language processing