Source: Reduction operator

In computer science, the reduction operator is a type of operator that is commonly used in parallel programming to reduce the elements of an array into a single result. Reduction operators are associative and often (but not necessarily) commutative. The reduction of sets of elements is an integral part of programming models such as MapReduce, where a reduction operator is applied (mapped) to all elements before they are reduced. Other parallel algorithms use reduction operators as primary operations to solve more complex problems. Many reduction operators can be used for broadcasting to distribute data to all processors.

Theory

= Example

= Nonexample

Algorithms

= Binomial tree algorithms

$x_i \gets x_i \oplus^{\star} x_{i+2^k}$

The binary operator for vectors is defined element-wise such that

$$\begin{pmatrix} e_i^0 \\ \vdots \\ e_i^{m-1} \end{pmatrix} \oplus^{\star} \begin{pmatrix} e_j^0 \\ \vdots \\ e_j^{m-1} \end{pmatrix} = \begin{pmatrix} e_i^0 \oplus e_j^0 \\ \vdots \\ e_i^{m-1} \oplus e_j^{m-1} \end{pmatrix}.$$
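The element-wise definition can be illustrated with a minimal Python sketch. The names `lift_elementwise`, `vector_add` and `vector_max` are hypothetical and chosen only for this example; vectors are assumed to be plain lists of equal length $m$.

```python
from operator import add


def lift_elementwise(op):
    """Lift a binary operator on single elements to an operator on vectors."""
    def vector_op(u, v):
        if len(u) != len(v):
            raise ValueError("both vectors must have the same length m")
        # Apply the scalar operator position by position.
        return [op(a, b) for a, b in zip(u, v)]
    return vector_op


# Two example instances of the lifted operator.
vector_add = lift_elementwise(add)
vector_max = lift_elementwise(max)
```

For instance, `vector_add([1, 2, 3], [10, 20, 30])` combines the two vectors position by position. The lifted operator is associative whenever the underlying scalar operator is.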

The algorithm further assumes that initially $x_i = v_i$ for all $i$, that $p$ is a power of two, and that the processing units $p_0, p_1, \dots, p_{p-1}$ are used. In every iteration, half of the active processing units become inactive and do not contribute to further computations. The figure shows a visualization of the algorithm using addition as the operator. Vertical lines represent the processing units where the computation of the elements on that line takes place. The eight input elements are located at the bottom, and every animation step corresponds to one parallel step in the execution of the algorithm. An active processor $p_i$ evaluates the given operator on the element $x_i$ it is currently holding and $x_j$, where $j$ is the minimal index fulfilling $j > i$ such that $p_j$ becomes an inactive processor in the current step. $x_i$ and $x_j$ are not necessarily elements of the input set $X$, as the fields are overwritten and reused for previously evaluated expressions. To coordinate the roles of the processing units in each step without causing additional communication between them, the fact that the processing units are indexed with numbers from $0$ to $p-1$ is used. Each processor looks at its $k$-th least significant bit and decides whether to become inactive or to apply the operator to its own element and the element whose index additionally has the $k$-th bit set. The underlying communication pattern of the algorithm is a binomial tree, hence the name of the algorithm.
Only $p_0$ holds the result in the end, therefore it is the root processor. For an Allreduce operation the result has to be distributed, which can be done by appending a broadcast from $p_0$. Furthermore, the number $p$ of processors is restricted to be a power of two. This restriction can be lifted by padding the number of processors to the next power of two. There are also algorithms that are more tailored for this use-case.
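The bit-based coordination described above can be sketched as a sequential Python simulation. This is an illustration under stated assumptions, not a parallel implementation: the function name `binomial_tree_reduce` is hypothetical, the list `x` stands in for the shared memory cells, and the inner loop plays the role of "do in parallel". The bounds check `i + 2^k < p` lets the sketch also run when the number of units is not a power of two.

```python
from operator import add


def binomial_tree_reduce(values, op=add):
    """Simulate the binomial tree reduction; returns the value held by p_0."""
    x = list(values)              # x[i] is the element held by unit p_i
    p = len(x)
    if p == 0:
        raise ValueError("at least one element is required")
    active = [True] * p
    rounds = (p - 1).bit_length()  # equals ceil(log2(p)) for p >= 1
    for k in range(rounds):
        for i in range(p):         # conceptually executed in parallel
            if not active[i]:
                continue
            if (i >> k) & 1:
                # Bit k of i is set: this unit drops out.
                active[i] = False
            elif i + (1 << k) < p:
                # The partner index has bit k set, so it does not write
                # in this step and reading x[i + 2^k] is conflict-free.
                x[i] = op(x[i], x[i + (1 << k)])
    return x[0]
```

Because operands are always combined in index order, the sketch only relies on associativity; for example, string concatenation over `list("abcdefgh")` yields the characters in their original order.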

= Runtime analysis

The main loop is executed $\lceil \log_2 p \rceil$ times, and the time needed for the part done in parallel is in $\mathcal{O}(m)$, as a processing unit either combines two vectors or becomes inactive. Thus the parallel time $T(p,m)$ for the PRAM is $T(p,m) = \mathcal{O}(\log(p) \cdot m)$. The strategy for handling read and write conflicts can be chosen as restrictive as exclusive read and exclusive write (EREW). The speedup $S(p,m)$ of the algorithm is $S(p,m) \in \mathcal{O}\left(\frac{T_{\text{seq}}}{T(p,m)}\right) = \mathcal{O}\left(\frac{p}{\log(p)}\right)$ and therefore the efficiency is $E(p,m) \in \mathcal{O}\left(\frac{S(p,m)}{p}\right) = \mathcal{O}\left(\frac{1}{\log(p)}\right)$. The efficiency suffers because half of the active processing units become inactive after each step, so $\frac{p}{2^i}$ units are active in step $i$.
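The loss of efficiency can be made concrete with a small Python sketch. The function names are hypothetical, $p$ is assumed to be a power of two, and "utilization" is defined here (as an assumption of this example) as the share of all unit-steps in which a unit actually applies the operator.

```python
def _check_power_of_two(p):
    if p < 2 or p & (p - 1):
        raise ValueError("p must be a power of two and at least 2")


def active_units_per_step(p):
    """Number of active units at the start of step i, for i = 0 .. log2(p) - 1."""
    _check_power_of_two(p)
    steps = p.bit_length() - 1       # log2(p) for a power of two
    return [p // 2**i for i in range(steps)]


def average_utilization(p):
    """Operator applications divided by the p * log2(p) available unit-steps."""
    _check_power_of_two(p)
    steps = p.bit_length() - 1
    combines = p - 1                 # reducing p elements takes p - 1 applications
    return combines / (p * steps)
```

For eight units the active counts are 8, 4 and 2, and only 7 of the 24 unit-steps perform useful work. The ratio shrinks roughly like $1/\log_2 p$ as $p$ grows.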

Distributed memory algorithm

In contrast to the PRAM algorithm, in the distributed memory model memory is not shared between processing units. Data therefore has to be exchanged explicitly between processing units, as can be seen in the following algorithm.

for $k \gets 0$ to $\lceil \log_2 p \rceil - 1$ do
    for $i \gets 0$ to $p - 1$ do in parallel
        if $p_i$ is active then
            if bit $k$ of $i$ is set then
                send $x_i$ to $p_{i-2^k}$
                set $p_i$ to inactive
            else if $i + 2^k < p$ then
                receive $x_{i+2^k}$
                $x_i \gets x_i \oplus^{\star} x_{i+2^k}$

The only difference between the distributed algorithm and the PRAM version is the inclusion of explicit communication primitives; the operating principle stays the same.
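A minimal Python sketch of the distributed variant follows. It only simulates message passing inside one process: the dictionary `mailbox` is a hypothetical stand-in for the network, and the function name `distributed_tree_reduce` is likewise an assumption of this example. Each unit touches only its own element and whatever arrives in its mailbox.

```python
from operator import add


def distributed_tree_reduce(values, op=add):
    """Simulate the distributed binomial tree reduction.

    Returns the result held by unit 0 and the number of messages sent.
    """
    x = list(values)               # x[i] is private to unit i
    p = len(x)
    if p == 0:
        raise ValueError("at least one element is required")
    active = [True] * p
    messages_sent = 0
    for k in range((p - 1).bit_length()):   # ceil(log2(p)) rounds
        mailbox = {}               # destination index -> payload in flight
        # Send phase: units with bit k set ship their element and drop out.
        for i in range(p):
            if active[i] and (i >> k) & 1:
                mailbox[i - (1 << k)] = x[i]
                active[i] = False
                messages_sent += 1
        # Receive phase: remaining active units combine what arrived.
        for i in range(p):
            if active[i] and i in mailbox:
                x[i] = op(x[i], mailbox[i])
    return x[0], messages_sent
```

Every unit except unit 0 sends exactly one message, so the simulation reports $p - 1$ messages in total, spread over $\lceil \log_2 p \rceil$ rounds.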

= Runtime analysis

The communication between units leads to some overhead. A simple analysis for the algorithm uses the BSP model and incorporates the time $T_{\text{start}}$ needed to initiate communication and the time $T_{\text{byte}}$ needed to send a byte. The resulting runtime is then $\Theta((T_{\text{start}} + n \cdot T_{\text{byte}}) \cdot \log(p))$, as the $m$ elements of a vector are sent in each iteration and have size $n$ in total.
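As a rough illustration, the cost expression can be evaluated directly. The function below is a sketch with a hypothetical name; it takes the constant hidden by the $\Theta$ notation to be 1 and uses $\lceil \log_2 p \rceil$ rounds, both of which are assumptions of this example rather than statements of the analysis.

```python
def tree_reduce_time(p, n, t_start, t_byte):
    """Estimate (T_start + n * T_byte) * ceil(log2(p)) with constant factor 1."""
    if p < 1:
        raise ValueError("p must be at least 1")
    rounds = (p - 1).bit_length()   # ceil(log2(p)) for p >= 1
    # Each round pays one startup plus the transfer of a full n-byte vector.
    return (t_start + n * t_byte) * rounds
```

With 8 units, vectors of 100 bytes, a startup cost of 10 and a per-byte cost of 2 (arbitrary time units), each of the three rounds costs 210, giving 630 in total.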

= Pipeline-algorithm

For distributed memory models, it can make sense to use pipelined communication. This is especially the case when $T_{\text{start}}$ is small in comparison to $T_{\text{byte}}$. Usually, linear pipelines split data or tasks into smaller pieces and process them in stages. In contrast to the binomial tree algorithms, the pipelined algorithm uses the fact that the vectors are not inseparable, but that the operator can be evaluated for single elements:
for $k \gets 0$ to $p + m - 3$ do
    for $i \gets 0$ to $p - 1$ do in parallel
        if $i \leq k < i + m \land i \neq p - 1$ then
            send $x_i^{k-i}$ to $p_{i+1}$
        if $i - 1 \leq k < i - 1 + m \land i \neq 0$ then
            receive $x_{i-1}^{k-i+1}$ from $p_{i-1}$
            $x_i^{k-i+1} \gets x_i^{k-i+1} \oplus x_{i-1}^{k-i+1}$

It is important to note that the send and receive operations have to be executed concurrently for the algorithm to work. The result vector is stored at $p_{p-1}$ at the end. The associated animation shows an execution of the algorithm on vectors of size four with five processing units. Two steps of the animation visualize one parallel execution step.
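The pipelined scheme can be sketched as a sequential Python simulation. The function name `pipeline_reduce` and the dictionary `in_flight` are hypothetical. All messages of a step are collected before any unit updates its vector, which models the concurrent send and receive. The sketch assumes that in step $k$ unit $i$ forwards element $k - i$ to unit $i + 1$, which then combines it into its own element at the same position.

```python
from operator import add


def pipeline_reduce(vectors, op=add):
    """Simulate the linear pipeline reduction.

    Returns the vector held by the last unit and the number of parallel steps.
    """
    x = [list(v) for v in vectors]   # x[i] is the vector held by unit i
    p = len(x)
    if p == 0 or any(len(v) != len(x[0]) for v in x) or not x[0]:
        raise ValueError("need p >= 1 non-empty vectors of equal length")
    m = len(x[0])
    steps = 0
    for k in range(p + m - 2):       # k = 0 .. p + m - 3
        # Send phase: unit i forwards element k - i to its right neighbour.
        in_flight = {}
        for i in range(p - 1):
            if i <= k < i + m:
                in_flight[i + 1] = (k - i, x[i][k - i])
        # Receive phase: each receiver combines the element into its own.
        for i, (j, value) in in_flight.items():
            x[i][j] = op(x[i][j], value)
        steps += 1
    return x[p - 1], steps
```

With five units and vectors of size four, as in the animation described above, the simulation takes $5 + 4 - 2 = 7$ parallel steps. The operand order follows the update rule of the algorithm, so the sketch should only be used with commutative operators such as addition.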

Runtime analysis

The number of steps in the parallel execution is $p + m - 2$: it takes $p - 1$ steps until the last processing unit receives its first element and an additional $m - 1$ steps until all elements are received. Therefore, the runtime in the BSP model is $T(n,p,m) = \left(T_{\text{start}} + \frac{n}{m} \cdot T_{\text{byte}}\right)(p + m - 2)$, assuming that $n$ is the total byte-size of a vector.
Although $m$ has a fixed value, it is possible to logically group elements of a vector together and reduce $m$. For example, a problem instance with vectors of size four can be handled by splitting the vectors into the first two and last two elements, which are always transmitted and computed together. In this case, double the volume is sent in each step, but the number of steps is roughly halved. This means that the parameter $m$ is halved, while the total byte-size $n$ stays the same. The runtime $T(p)$ for this approach depends on the value of $m$, which can be optimized if $T_{\text{start}}$ and $T_{\text{byte}}$ are known. It is optimal for $m = \sqrt{\frac{n \cdot (p-2) \cdot T_{\text{byte}}}{T_{\text{start}}}}$, assuming that this results in a smaller $m$ that divides the original one.
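The optimal value can be checked with a short derivation. The sketch below treats $m$ as a continuous variable, which is an assumption of this example; in practice $m$ has to be an integer that divides the original vector length.

```latex
\begin{align*}
T(m) &= \left(T_{\text{start}} + \frac{n}{m} \cdot T_{\text{byte}}\right)(p + m - 2) \\
     &= T_{\text{start}}(p - 2) + T_{\text{start}} \cdot m
        + \frac{n \cdot (p - 2) \cdot T_{\text{byte}}}{m} + n \cdot T_{\text{byte}} \\
\frac{dT}{dm} &= T_{\text{start}} - \frac{n \cdot (p - 2) \cdot T_{\text{byte}}}{m^{2}} = 0
  \quad\Longrightarrow\quad
  m = \sqrt{\frac{n \cdot (p - 2) \cdot T_{\text{byte}}}{T_{\text{start}}}} \\
\frac{d^{2}T}{dm^{2}} &= \frac{2 \cdot n \cdot (p - 2) \cdot T_{\text{byte}}}{m^{3}} > 0
  \quad \text{for } p > 2
\end{align*}
```

The positive second derivative indicates that the stationary point is a minimum. A large $T_{\text{start}}$ pushes the optimum towards fewer, larger pieces, while a large $T_{\text{byte}}$ favours more, smaller ones.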

Applications

Reduction is one of the main collective operations implemented in the Message Passing Interface, where performance of the used algorithm is important and evaluated constantly for different use cases.
Operators can be used as parameters for MPI_Reduce and MPI_Allreduce, with the difference that the result is available at one (root) processing unit or all of them.
OpenMP offers a reduction clause for describing how the results from parallel operations are collected together.
MapReduce relies heavily on efficient reduction algorithms to process big data sets, even on huge clusters.
Some parallel sorting algorithms use reductions to be able to handle very big data sets.
