- Source: Attention (machine learning)
Attention is a machine learning method that determines the relative importance of each component in a sequence relative to the other components in that sequence. In natural language processing, importance is represented by "soft" weights assigned to each word in a sentence. More generally, attention encodes vectors called token embeddings across a fixed-width sequence that can range from tens to millions of tokens in size.
Unlike "hard" weights, which are computed during the backwards training pass, "soft" weights exist only in the forward pass and therefore change with every step of the input. Earlier designs implemented the attention mechanism in a serial recurrent neural network language translation system, but the later transformer design removed the slower sequential RNN and relied more heavily on the faster parallel attention scheme.
Inspired by ideas about attention in humans, the attention mechanism was developed to address the weaknesses of leveraging information from the hidden layers of recurrent neural networks. Recurrent neural networks favor more recent information contained in words at the end of a sentence, while information earlier in the sentence tends to be attenuated. Attention allows a token equal access to any part of a sentence directly, rather than only through the previous state.
History
Academic reviews of the history of the attention mechanism are provided in Niu et al. and Soydaner.
= Predecessors =
Selective attention in humans had been well studied in neuroscience and cognitive psychology. In 1953, Colin Cherry studied selective attention in the context of audition, known as the cocktail party effect.
In 1958, Donald Broadbent proposed the filter model of attention. Selective attention in vision was studied in the 1960s with George Sperling's partial report paradigm. It was also noticed that saccade control is modulated by cognitive processes, insofar as the eye moves preferentially towards areas of high salience. As the fovea of the eye is small, the eye cannot sharply resolve the entire visual field at once. Saccade control allows the eye to quickly scan the important features of a scene.
These research developments inspired algorithms such as the Neocognitron and its variants. Meanwhile, developments in neural networks had inspired circuit models of biological visual attention. One well-cited network from 1998, for example, was inspired by the low-level primate visual system. It produced saliency maps of images using handcrafted (not learned) features, which were then used to guide a second neural network in processing patches of the image in order of decreasing saliency.
A key aspect of the attention mechanism can be written (schematically) as
$$\sum _{i}\langle ({\text{query}})_{i},({\text{key}})_{i}\rangle \,({\text{value}})_{i}$$
where the angled brackets denote dot product. This shows that it involves a multiplicative operation. Multiplicative operations within artificial neural networks had been studied under the names of Group Method of Data Handling (1965) (where Kolmogorov-Gabor polynomials implement multiplicative units or "gates"), higher-order neural networks, multiplication units, sigma-pi units, fast weight controllers, and hyper-networks.
In the fast weight controller (Schmidhuber, 1992), one of its two networks has "fast weights" or "dynamic links" (1981). A slow neural network learns by gradient descent to generate keys and values for computing the weight changes of the fast neural network, which computes answers to queries. This was later shown to be equivalent to the unnormalized linear Transformer. A follow-up paper developed a similar system with active weight changing.
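The equivalence can be illustrated with a short Python/NumPy sketch: the fast weight matrix is accumulated as a sum of outer products of the (value, key) pairs emitted by the slow network, and answering a query with it amounts to attention without the softmax normalization. This is a schematic illustration under those assumptions, not the original 1992 formulation; the function name and shapes are chosen only for exposition.

import numpy as np

def fast_weight_query(keys, values, queries):
    # Fast weights as a sum of outer products: W_fast = sum_i v_i k_i^T.
    # Answering a query q with W_fast q equals unnormalized linear attention,
    # i.e. sum_i (q . k_i) v_i, which is attention without the softmax.
    W_fast = np.zeros((values.shape[1], keys.shape[1]))
    for k, v in zip(keys, values):
        W_fast += np.outer(v, k)      # additive "weight change" produced by the slow network
    return queries @ W_fast.T         # one answer vector per query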
= Recurrent attention =
During the deep learning era, the attention mechanism was developed to solve similar problems in encoder-decoder models.
In machine translation, the seq2seq model, as it was proposed in 2014, would encode an input text into a fixed-length vector, which would then be decoded into an output text. If the input text is long, the fixed-length vector would be unable to carry enough information for accurate decoding. An attention mechanism was proposed to solve this problem.
An image captioning model was proposed in 2015, citing inspiration from the seq2seq model, that would encode an input image into a fixed-length vector. Xu et al. (2015), citing Bahdanau et al. (2014), applied the attention mechanism as used in the seq2seq model to image captioning.
= Transformer =
One problem with seq2seq models was their use of recurrent neural networks, which are not parallelizable, as both the encoder and the decoder must process the sequence token-by-token. Decomposable attention attempted to solve this problem by processing the input sequence in parallel before computing a "soft alignment matrix" ("alignment" is the terminology used by Bahdanau et al.), thereby allowing for parallel processing.
The idea of using the attention mechanism for self-attention, instead of in an encoder-decoder (cross-attention), was also proposed during this period, such as in differentiable neural computers and neural Turing machines. It was termed intra-attention where an LSTM is augmented with a memory network as it encodes an input sequence.
These strands of development were brought together in 2017 with the Transformer architecture, published in the Attention Is All You Need paper.
seq2seq
The seq2seq method developed in the early 2010s uses two neural networks: an encoder network converts an input sentence into numerical vectors, and a decoder network converts those vectors to sentences in the target language. The Attention mechanism was grafted onto this structure in 2014, and later refined into the Transformer design.
= Problem statement =
Consider the seq2seq English-to-French language translation task. To be concrete, let us consider the translation of "the zone of international control …".
An input sequence of text $x_{0},x_{1},\dots$ is processed by a neural network (which can be an LSTM, a Transformer encoder, or some other network) into a sequence of real-valued vectors $h_{0},h_{1},\dots$, where $h$ stands for "hidden vector".
After the encoder has finished processing, the decoder starts operating over the hidden vectors to produce an output sequence $y_{0},y_{1},\dots$, autoregressively. That is, it always takes as input both the hidden vectors produced by the encoder and what the decoder itself has produced before, in order to produce the next output word:
$$(h_{0},h_{1},\dots ,{\text{"<start>"}})\to y_{0}$$
$$(h_{0},h_{1},\dots ,{\text{"<start>"}},y_{0})\to y_{1}$$
$$(h_{0},h_{1},\dots ,{\text{"<start>"}},y_{0},y_{1})\to y_{2}$$
...
Here, we use the special <start> token as a control character marking the start of decoding.
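The autoregressive loop can be sketched in Python as below; encode and decode_step stand in for a hypothetical encoder and single-step decoder (they are placeholders for exposition, not functions of any particular library).

def greedy_decode(source_tokens, encode, decode_step, start_token, end_token, max_len=100):
    # Greedy autoregressive decoding: each produced word is fed back as input.
    hidden = encode(source_tokens)          # the encoder's hidden vectors h_0, h_1, ...
    outputs = [start_token]
    for _ in range(max_len):
        y = decode_step(hidden, outputs)    # next word from encoder states + previous outputs
        if y == end_token:
            break
        outputs.append(y)
    return outputs[1:]                      # translated sentence, without the start token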
= Word alignment =
In translating between languages, alignment is the process of matching words from the source sentence to words of the translated sentence. In the I love you example above, the second word love is aligned with the third word aime. Stacking the soft row vectors together for je, t', and aime yields an alignment matrix.
Sometimes, alignment can be multiple-to-multiple. For example, the English phrase look it up corresponds to cherchez-le. Thus, "soft" attention weights work better than "hard" attention weights (setting one attention weight to 1, and the others to 0), as we would like the model to make a context vector consisting of a weighted sum of the hidden vectors, rather than "the best one", as there may not be a best hidden vector.
This view of the attention weights addresses some of the neural network explainability problem. Networks that perform verbatim translation without regard to word order would show the highest scores along the (dominant) diagonal of the matrix. The off-diagonal dominance shows that the attention mechanism is more nuanced. On the first pass through the decoder, 94% of the attention weight is on the first English word I, so the network offers the word je. On the second pass of the decoder, 88% of the attention weight is on the third English word you, so it offers t'. On the last pass, 95% of the attention weight is on the second English word love, so it offers aime.
= Attention weights =
As hand-crafting weights defeats the purpose of machine learning, the model must compute the attention weights on its own. Taking an analogy from the language of database queries, we make the model construct a triple of vectors: key, query, and value. The rough idea is that we have a "database" in the form of a list of key-value pairs. The decoder sends in a query and obtains a reply in the form of a weighted sum of the values, where the weight is proportional to how closely the query resembles each key.
The decoder first processes the "<start>" input partially, to obtain an intermediate vector $h_{0}^{d}$, the 0th hidden vector of the decoder. Then, the intermediate vector is transformed by a linear map $W^{Q}$ into a query vector $q_{0}=h_{0}^{d}W^{Q}$. Meanwhile, the hidden vectors output by the encoder are transformed by another linear map $W^{K}$ into key vectors $k_{0}=h_{0}W^{K},k_{1}=h_{1}W^{K},\dots$. The linear maps are useful for providing the model with enough freedom to find the best way to represent the data.
Now, the query and keys are compared by taking dot products: $q_{0}k_{0}^{T},q_{0}k_{1}^{T},\dots$. Ideally, the model should have learned to compute the keys and values such that $q_{0}k_{0}^{T}$ is large, $q_{0}k_{1}^{T}$ is small, and the rest are very small. This can be interpreted as saying that the attention weight should be mostly applied to the 0th hidden vector of the encoder, a little to the 1st, and essentially none to the rest.
In order to make a properly weighted sum, we need to transform this list of dot products into a probability distribution over $0,1,\dots$. This can be accomplished by the softmax function, thus giving us the attention weights:
$$(w_{00},w_{01},\dots )=\mathrm{softmax}(q_{0}k_{0}^{T},q_{0}k_{1}^{T},\dots )$$
This is then used to compute the context vector:
$$c_{0}=w_{00}v_{0}+w_{01}v_{1}+\cdots$$
where $v_{0}=h_{0}W^{V},v_{1}=h_{1}W^{V},\dots$ are the value vectors, linearly transformed by another matrix to provide the model with freedom to find the best way to represent values. Without the matrices $W^{Q},W^{K},W^{V}$, the model would be forced to use the same hidden vector for both key and value, which might not be appropriate, as these two tasks are not the same.
This is the dot-attention mechanism. The particular version described in this section is "decoder cross-attention": the output context vector is used by the decoder, the keys and values come from the encoder, and the query comes from the decoder, hence "cross-attention".
More succinctly, we can write it as
$$c_{0}=\mathrm{Attention}(h_{0}^{d}W^{Q},HW^{K},HW^{V})=\mathrm{softmax}\big((h_{0}^{d}W^{Q})\,(HW^{K})^{T}\big)\,(HW^{V})$$
where the matrix $H$ is the matrix whose rows are $h_{0},h_{1},\dots$. Note that the querying vector, $h_{0}^{d}$, is not necessarily the same as the key-value vector $h_{0}$. In fact, it is theoretically possible for query, key, and value vectors to all be different, though that is rarely done in practice.
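The decoder cross-attention described in this section can be transcribed into a short Python/NumPy sketch. It is a minimal illustration of the formula above, not a reference implementation; shapes and names are chosen for exposition, and the softmax helper is reused by the later examples in this article.

import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(h_dec, H_enc, W_Q, W_K, W_V):
    # h_dec: (d_dec,) decoder hidden vector (the query source).
    # H_enc: (n, d_enc) encoder hidden vectors stacked as rows (the key/value source).
    q = h_dec @ W_Q              # query vector q_0
    K = H_enc @ W_K              # key vectors k_0, k_1, ... as rows
    V = H_enc @ W_V              # value vectors v_0, v_1, ... as rows
    w = softmax(q @ K.T)         # attention weights w_00, w_01, ...
    return w @ V                 # context vector c_0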
Overview
Terminology
This attention scheme has been compared to the Query-Key analogy of relational databases. That comparison suggests an asymmetric role for the Query and Key vectors, where one item of interest (the Query vector "that") is matched against all possible items (the Key vectors of each word in the sentence). However, attention's parallel calculations match every word of a sentence against every other word; in that sense, the roles of these vectors are symmetric. Possibly because the simplistic database analogy is flawed, much effort has gone into understanding attention further by studying its role in focused settings, such as in-context learning, masked language tasks, stripped-down transformers, bigram statistics, N-gram statistics, pairwise convolutions, and arithmetic factoring.
Variants
Many variants of attention implement soft weights, such as
fast weight programmers, or fast weight controllers (1992). A "slow" neural network outputs the "fast" weights of another neural network through outer products. The slow network learns by gradient descent. It was later renamed as "linearized self-attention".
Bahdanau-style attention, also referred to as additive attention,
Luong-style attention, which is known as multiplicative attention,
highly parallelizable self-attention introduced in 2016 as decomposable attention and successfully used in transformers a year later,
positional attention and factorized positional attention.
For convolutional neural networks, attention mechanisms can be distinguished by the dimension on which they operate, namely: spatial attention, channel attention, or combinations.
These variants recombine the encoder-side inputs to redistribute those effects to each target output. Often, a correlation-style matrix of dot products, analogous to the matrix of context attention weights in the formulas above, provides the re-weighting coefficients.
= Self-attention =
Self-attention is essentially the same as cross-attention, except that the query, key, and value vectors are all derived from the same sequence of hidden vectors. Both the encoder and the decoder can use self-attention, but with subtle differences.
For encoder self-attention, we can start with a simple encoder without self-attention, such as an "embedding layer", which simply converts each input word into a vector by a fixed lookup table. This gives a sequence of hidden vectors $h_{0},h_{1},\dots$. These can then be fed into a dot-product attention mechanism, to obtain
$$\begin{aligned}h_{0}'&=\mathrm {Attention} (h_{0}W^{Q},HW^{K},HW^{V})\\h_{1}'&=\mathrm {Attention} (h_{1}W^{Q},HW^{K},HW^{V})\\&\cdots \end{aligned}$$
or more succinctly, $H'=\mathrm {Attention} (HW^{Q},HW^{K},HW^{V})$. This can be applied repeatedly to obtain a multilayered encoder. This is "encoder self-attention", sometimes called "all-to-all attention", as the vector at every position can attend to every other.
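In matrix form, the same computation yields every position's new hidden vector at once. A minimal Python/NumPy sketch, reusing the softmax helper defined earlier:

def self_attention(H, W_Q, W_K, W_V):
    # All-to-all encoder self-attention over a sequence H of shape (n, d_model).
    Q = H @ W_Q                      # one query per position
    K = H @ W_K
    V = H @ W_V
    W = softmax(Q @ K.T, axis=-1)    # (n, n) attention weights; each row sums to 1
    return W @ V                     # H': new hidden vectors, one per position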
= Masking =
For decoder self-attention, all-to-all attention is inappropriate, because during the autoregressive decoding process the decoder cannot attend to future outputs that have yet to be decoded. This can be solved by forcing the attention weights $w_{ij}=0$ for all $i<j$, which is called "causal masking". This attention mechanism is the "causally masked self-attention".
Optimizations
= Flash attention =
The size of the attention matrix is proportional to the square of the number of input tokens. Therefore, when the input is long, calculating the attention matrix requires a lot of GPU memory. Flash attention is an implementation that reduces the memory needs and increases efficiency without sacrificing accuracy. It achieves this by partitioning the attention computation into smaller blocks that fit into the GPU's faster on-chip memory, reducing the need to store large intermediate matrices and thus lowering memory usage while increasing computational efficiency.
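The central algorithmic idea, processing the keys and values block by block while maintaining a running maximum and normalizer (an "online" softmax) so that the full attention matrix is never materialized, can be sketched in Python/NumPy. This only illustrates the blocking idea on the CPU and is not the actual fused GPU kernel; the block size and names are illustrative.

def blocked_attention(Q, K, V, block=64):
    # Computes softmax(Q K^T / sqrt(d_k)) V without forming the full (m, n) score matrix.
    m, d_k = Q.shape
    out = np.zeros((m, V.shape[1]))
    running_max = np.full(m, -np.inf)       # running row-wise maximum of the scores
    running_sum = np.zeros(m)               # running softmax normalizer
    for start in range(0, K.shape[0], block):
        Kb, Vb = K[start:start + block], V[start:start + block]
        S = (Q @ Kb.T) / np.sqrt(d_k)                  # scores for this block only
        new_max = np.maximum(running_max, S.max(axis=1))
        scale = np.exp(running_max - new_max)          # rescales previously accumulated results
        P = np.exp(S - new_max[:, None])               # unnormalized weights for this block
        out = out * scale[:, None] + P @ Vb
        running_sum = running_sum * scale + P.sum(axis=1)
        running_max = new_max
    return out / running_sum[:, None]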
Mathematical representation
= Standard Scaled Dot-Product Attention =
For matrices $\mathbf {Q} \in \mathbb {R} ^{m\times d_{k}}$, $\mathbf {K} \in \mathbb {R} ^{n\times d_{k}}$ and $\mathbf {V} \in \mathbb {R} ^{n\times d_{v}}$, the scaled dot-product, or QKV attention is defined as:
$${\text{Attention}}(\mathbf {Q} ,\mathbf {K} ,\mathbf {V} )={\text{softmax}}\left({\frac {\mathbf {Q} \mathbf {K} ^{T}}{\sqrt {d_{k}}}}\right)\mathbf {V} \in \mathbb {R} ^{m\times d_{v}}$$
where ${}^{T}$ denotes transpose and the softmax function is applied independently to every row of its argument. The matrix $\mathbf {Q}$ contains $m$ queries, while matrices $\mathbf {K} ,\mathbf {V}$ jointly contain an unordered set of $n$ key-value pairs. Value vectors in matrix $\mathbf {V}$ are weighted using the weights resulting from the softmax operation, so that the rows of the $m$-by-$d_{v}$ output matrix are confined to the convex hull of the points in $\mathbb {R} ^{d_{v}}$ given by the rows of $\mathbf {V}$.
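A direct Python/NumPy transcription of this definition, reusing the softmax helper from the earlier examples (illustrative only):

def scaled_dot_product_attention(Q, K, V):
    # Q: (m, d_k), K: (n, d_k), V: (n, d_v)  ->  output: (m, d_v).
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)       # (m, n) scaled similarity scores
    weights = softmax(scores, axis=-1)    # each row is a probability distribution
    return weights @ V                    # rows are convex combinations of the rows of V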
To understand the permutation invariance and permutation equivariance properties of QKV attention, let $\mathbf {A} \in \mathbb {R} ^{m\times m}$ and $\mathbf {B} \in \mathbb {R} ^{n\times n}$ be permutation matrices, and $\mathbf {D} \in \mathbb {R} ^{m\times n}$ an arbitrary matrix. The softmax function is permutation equivariant in the sense that:
$${\text{softmax}}(\mathbf {A} \mathbf {D} \mathbf {B} )=\mathbf {A} \,{\text{softmax}}(\mathbf {D} )\mathbf {B}$$
By noting that the transpose of a permutation matrix is also its inverse, it follows that:
$${\text{Attention}}(\mathbf {A} \mathbf {Q} ,\mathbf {B} \mathbf {K} ,\mathbf {B} \mathbf {V} )=\mathbf {A} \,{\text{Attention}}(\mathbf {Q} ,\mathbf {K} ,\mathbf {V} )$$
which shows that QKV attention is equivariant with respect to re-ordering the queries (rows of $\mathbf {Q}$), and invariant to re-ordering of the key-value pairs in $\mathbf {K} ,\mathbf {V}$. These properties are inherited when applying linear transforms to the inputs and outputs of QKV attention blocks. For example, a simple self-attention function defined as:
$$\mathbf {X} \mapsto {\text{Attention}}(\mathbf {X} \mathbf {T} _{q},\mathbf {X} \mathbf {T} _{k},\mathbf {X} \mathbf {T} _{v})$$
is permutation equivariant with respect to re-ordering the rows of the input matrix $X$ in a non-trivial way, because every row of the output is a function of all the rows of the input. Similar properties hold for multi-head attention, which is defined below.
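These identities are easy to check numerically. The short sketch below applies the permutations by row indexing rather than by explicit permutation matrices and reuses scaled_dot_product_attention from above (values are random and purely illustrative):

rng = np.random.default_rng(0)
m, n, d_k, d_v = 4, 6, 3, 5
Q, K, V = rng.normal(size=(m, d_k)), rng.normal(size=(n, d_k)), rng.normal(size=(n, d_v))
perm_q = rng.permutation(m)    # re-ordering of the queries (the matrix A)
perm_kv = rng.permutation(n)   # re-ordering of the key-value pairs (the matrix B)

lhs = scaled_dot_product_attention(Q[perm_q], K[perm_kv], V[perm_kv])
rhs = scaled_dot_product_attention(Q, K, V)[perm_q]
print(np.allclose(lhs, rhs))   # True: equivariant in Q, invariant in (K, V)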
= Masked Attention =
When QKV attention is used as a building block for an autoregressive decoder, and when at training time all input and output matrices have $n$ rows, a masked attention variant is used:
$${\text{Attention}}(\mathbf {Q} ,\mathbf {K} ,\mathbf {V} )={\text{softmax}}\left({\frac {\mathbf {Q} \mathbf {K} ^{T}}{\sqrt {d_{k}}}}+\mathbf {M} \right)\mathbf {V}$$
where the mask $\mathbf {M} \in \mathbb {R} ^{n\times n}$ is a strictly upper triangular matrix, with zeros on and below the diagonal and $-\infty$ in every element above the diagonal. The softmax output, also in $\mathbb {R} ^{n\times n}$, is then lower triangular, with zeros in all elements above the diagonal. The masking ensures that for all $1\leq i<j\leq n$, row $i$ of the attention output is independent of row $j$ of any of the three input matrices. The permutation invariance and equivariance properties of standard QKV attention do not hold for the masked variant.
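The independence property can be verified numerically: perturbing a later row of the inputs leaves the earlier output rows unchanged. A small sketch, reusing the helpers defined above (the mask construction mirrors the causal_self_attention example):

def masked_attention(Q, K, V):
    n, d_k = Q.shape[0], Q.shape[-1]
    M = np.triu(np.full((n, n), -np.inf), k=1)    # the mask: -inf strictly above the diagonal
    return softmax(Q @ K.T / np.sqrt(d_k) + M, axis=-1) @ V

n, d = 5, 4
Q, K, V = rng.normal(size=(n, d)), rng.normal(size=(n, d)), rng.normal(size=(n, d))
K2, V2 = K.copy(), V.copy()
K2[-1] += 10.0
V2[-1] += 10.0                                     # change only the last key-value pair
print(np.allclose(masked_attention(Q, K, V)[:-1],
                  masked_attention(Q, K2, V2)[:-1]))   # True: earlier output rows are unaffected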
= Multi-Head Attention =
Multi-head attention is defined as:
$${\text{MultiHead}}(\mathbf {Q} ,\mathbf {K} ,\mathbf {V} )={\text{Concat}}({\text{head}}_{1},\dots ,{\text{head}}_{h})\mathbf {W} ^{O}$$
where each head is computed with QKV attention as:
$${\text{head}}_{i}={\text{Attention}}(\mathbf {Q} \mathbf {W} _{i}^{Q},\mathbf {K} \mathbf {W} _{i}^{K},\mathbf {V} \mathbf {W} _{i}^{V})$$
and $\mathbf {W} _{i}^{Q},\mathbf {W} _{i}^{K},\mathbf {W} _{i}^{V}$, and $\mathbf {W} ^{O}$ are parameter matrices.
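A compact Python/NumPy sketch of this definition, with the per-head projection matrices kept in lists and reusing scaled_dot_product_attention from above (illustrative, not an optimized implementation):

def multi_head_attention(Q, K, V, W_Qs, W_Ks, W_Vs, W_O):
    # W_Qs, W_Ks, W_Vs: lists of per-head projection matrices; W_O mixes the concatenated heads.
    heads = [scaled_dot_product_attention(Q @ W_q, K @ W_k, V @ W_v)
             for W_q, W_k, W_v in zip(W_Qs, W_Ks, W_Vs)]
    return np.concatenate(heads, axis=-1) @ W_O    # (m, h*d_v) @ (h*d_v, d_out)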
The permutation properties of (standard, unmasked) QKV attention apply here also. For permutation matrices $\mathbf {A} ,\mathbf {B}$:
$${\text{MultiHead}}(\mathbf {A} \mathbf {Q} ,\mathbf {B} \mathbf {K} ,\mathbf {B} \mathbf {V} )=\mathbf {A} \,{\text{MultiHead}}(\mathbf {Q} ,\mathbf {K} ,\mathbf {V} )$$
from which we also see that multi-head self-attention:
$$\mathbf {X} \mapsto {\text{MultiHead}}(\mathbf {X} \mathbf {T} _{q},\mathbf {X} \mathbf {T} _{k},\mathbf {X} \mathbf {T} _{v})$$
is equivariant with respect to re-ordering of the rows of input matrix $X$.
= Bahdanau (Additive) Attention =
$${\text{Attention}}(Q,K,V)={\text{softmax}}(e)V$$
where $e=\tanh(W_{Q}Q+W_{K}K)$ and $W_{Q}$ and $W_{K}$ are learnable weight matrices.
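A common concrete form of additive attention scores each query-key pair as $v_{a}^{T}\tanh(W_{Q}q_{i}+W_{K}k_{j})$, where the extra vector $v_{a}$ reduces the tanh output to a scalar score; that vector is an assumption not written in the compact formula above. A Python/NumPy sketch of this form, reusing the earlier softmax helper:

def additive_attention(Q, K, V, W_Q, W_K, v_a):
    # Q: (m, d_q), K: (n, d_k), V: (n, d_v); W_Q: (d_q, d_a), W_K: (d_k, d_a), v_a: (d_a,).
    e = np.tanh((Q @ W_Q)[:, None, :] + (K @ W_K)[None, :, :])   # (m, n, d_a) pairwise features
    scores = e @ v_a                                             # (m, n) alignment scores
    return softmax(scores, axis=-1) @ V                          # (m, d_v)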
= Luong Attention (General) =
$${\text{Attention}}(Q,K,V)={\text{softmax}}(QW_{a}K^{T})V$$
where $W_{a}$ is a learnable weight matrix.
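A one-line Python/NumPy transcription of the general (multiplicative) form, reusing the earlier softmax helper (illustrative only):

def luong_general_attention(Q, K, V, W_a):
    # Q: (m, d_q), K: (n, d_k), V: (n, d_v), W_a: (d_q, d_k).
    return softmax(Q @ W_a @ K.T, axis=-1) @ V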
See also
Recurrent neural network
seq2seq
Transformer (deep learning architecture)
Attention
Dynamic neural network
References
External links
Olah, Chris; Carter, Shan (September 8, 2016). "Attention and Augmented Recurrent Neural Networks". Distill. 1 (9). Distill Working Group. doi:10.23915/distill.00001.
Dan Jurafsky and James H. Martin (2022) Speech and Language Processing (3rd ed. draft, January 2022), ch. 10.4 Attention and ch. 9.7 Self-Attention Networks: Transformers
Alex Graves (4 May 2020), Attention and Memory in Deep Learning (video lecture), DeepMind / UCL, via YouTube