Stochastic approximation

Source: Stochastic approximation

Stochastic approximation methods are a family of iterative methods typically used for root-finding problems or for optimization problems. The recursive update rules of stochastic approximation methods can be used, among other things, for solving linear systems when the collected data is corrupted by noise, or for approximating extreme values of functions which cannot be computed directly, but only estimated via noisy observations.
In a nutshell, stochastic approximation algorithms deal with a function of the form $f(\theta) = \operatorname{E}_{\xi}[F(\theta, \xi)]$, which is the expected value of a function depending on a random variable $\xi$. The goal is to recover properties of such a function $f$ without evaluating it directly. Instead, stochastic approximation algorithms use random samples of $F(\theta, \xi)$ to efficiently approximate properties of $f$ such as zeros or extrema.
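To make the setup concrete, the following is a minimal sketch of root-finding from noisy samples, using the classical Robbins–Monro update $\theta_{n+1} = \theta_n - a_n F(\theta_n, \xi_n)$ with step sizes $a_n = 1/n$. The function `noisy_F`, its linear mean $f(\theta) = 2\theta - 4$, and the Gaussian noise model are assumptions chosen only for illustration; they do not come from the text.

```python
import random


def noisy_F(theta, rng):
    """One sample F(theta, xi); its mean is f(theta) = 2*theta - 4 (assumed example)."""
    xi = rng.gauss(0.0, 1.0)  # hypothetical noise model: standard Gaussian
    return 2.0 * theta - 4.0 + xi


def robbins_monro(theta0, n_steps, rng):
    """Approximate the zero of f using only noisy samples of F."""
    theta = theta0
    for n in range(1, n_steps + 1):
        a_n = 1.0 / n                       # decaying step size
        theta -= a_n * noisy_F(theta, rng)  # never evaluates f itself
    return theta


theta_hat = robbins_monro(theta0=0.0, n_steps=20000, rng=random.Random(0))
print(theta_hat)  # should land close to the root theta = 2
```

The loop only ever sees individual noisy samples, yet the decaying step size averages the noise out over time, which is the core idea behind the whole family of methods.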
Recently, stochastic approximations have found extensive applications in statistics and machine learning, especially in settings with big data. These applications range from stochastic optimization methods and algorithms to online forms of the EM algorithm, reinforcement learning via temporal differences, and deep learning, among others.
Stochastic approximation algorithms have also been used in the social sciences to describe collective dynamics: fictitious play in learning theory and consensus algorithms can be studied using their theory.
The earliest, and prototypical, algorithms of this kind are the Robbins–Monro and Kiefer–Wolfowitz algorithms introduced respectively in 1951 and 1952.

Robbins–Monro algorithm

= Example

= Complexity results

= Subsequent developments and Polyak–Ruppert averaging

= Application in stochastic optimization

Kiefer–Wolfowitz algorithm

There exist $\rho > 0$ and $R > 0$ such that

$|x' - x''| < \rho \quad \Longrightarrow \quad |M(x') - M(x'')| < R$

For every $\delta > 0$, there exists some $\pi(\delta) > 0$ such that

$|z - \theta| > \delta \quad \Longrightarrow \quad \inf_{\delta/2 > \varepsilon > 0} \frac{|M(z + \varepsilon) - M(z - \varepsilon)|}{\varepsilon} > \pi(\delta)$

The selected sequences $\{a_n\}$ and $\{c_n\}$ must be infinite sequences of positive numbers such that

$c_n \to 0 \quad \text{as} \quad n \to \infty$

$\sum_{n=0}^{\infty} a_n = \infty$

$\sum_{n=0}^{\infty} a_n c_n < \infty$

$\sum_{n=0}^{\infty} a_n^2 c_n^{-2} < \infty$

A suitable choice of sequences, as recommended by Kiefer and Wolfowitz, would be $a_n = 1/n$ and $c_n = n^{-1/3}$.
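With the recommended sequences, $a_n c_n = n^{-4/3}$ and $a_n^2 c_n^{-2} = n^{-4/3}$ are both summable, while $\sum 1/n$ diverges, so the conditions hold. The sketch below shows how these sequences might drive a one-dimensional iteration, assuming the commonly stated central finite-difference form $x_{n+1} = x_n + a_n \, \frac{N(x_n + c_n) - N(x_n - c_n)}{2 c_n}$, where $N$ is a noisy observation of $M$. The test function `noisy_M`, its maximum at $x = 3$, and the noise level are hypothetical choices for illustration.

```python
import random


def noisy_M(x, rng, noise_std=0.5):
    """Hypothetical noisy observation of M(x) = -(x - 3)^2, maximised at x = 3."""
    return -(x - 3.0) ** 2 + rng.gauss(0.0, noise_std)


def kiefer_wolfowitz(x0, n_steps, rng):
    """Stochastic ascent using a finite-difference slope estimate."""
    x = x0
    for n in range(1, n_steps + 1):      # indices start at 1 so that 1/n is defined
        a_n = 1.0 / n                    # step size
        c_n = n ** (-1.0 / 3.0)          # finite-difference half-width
        slope = (noisy_M(x + c_n, rng) - noisy_M(x - c_n, rng)) / (2.0 * c_n)
        x += a_n * slope                 # move uphill, since M is being maximised
    return x


x_hat = kiefer_wolfowitz(x0=0.0, n_steps=20000, rng=random.Random(0))
print(x_hat)  # should be near the maximiser x = 3
```

The shrinking width $c_n$ reduces the bias of the slope estimate, but it also amplifies the observation noise by a factor of $1/c_n$; the condition on $\sum a_n^2 c_n^{-2}$ is what keeps that amplified noise under control.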

= Subsequent developments and important issues

The Kiefer–Wolfowitz algorithm requires at least $d + 1$ different parameter values to be simulated for each gradient computation, that is, in every iteration of the algorithm, where $d$ is the dimension of the search space. Consequently, when $d$ is large, the Kiefer–Wolfowitz algorithm requires substantial computational effort per iteration, leading to slow convergence.
To address this problem, Spall proposed the use of simultaneous perturbations to estimate the gradient. This method requires only two simulations per iteration, regardless of the dimension $d$.
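The two-simulation idea can be sketched as follows: all coordinates are perturbed at once by a random $\pm 1$ vector $\Delta$, and the $i$-th gradient component is estimated as $\frac{y(\theta + c_n \Delta) - y(\theta - c_n \Delta)}{2 c_n \Delta_i}$. This is an illustrative sketch rather than Spall's exact recipe: the objective `noisy_loss`, the gain sequences, and the offset of 10 in the step size are assumptions made here to keep the early iterations stable.

```python
import random


def noisy_loss(theta, rng, noise_std=0.1):
    """Hypothetical noisy objective, minimised at theta = (1, ..., 1)."""
    return sum((t - 1.0) ** 2 for t in theta) + rng.gauss(0.0, noise_std)


def spsa_minimize(theta0, n_steps, rng):
    """Simultaneous-perturbation descent; returns (theta, loss evaluations used)."""
    theta = list(theta0)
    n_evals = 0
    for n in range(1, n_steps + 1):
        a_n = 1.0 / (n + 10)             # assumed gain; the offset damps early steps
        c_n = n ** (-1.0 / 3.0)          # perturbation size
        delta = [rng.choice((-1.0, 1.0)) for _ in theta]  # random +/-1 directions
        plus = [t + c_n * d for t, d in zip(theta, delta)]
        minus = [t - c_n * d for t, d in zip(theta, delta)]
        diff = noisy_loss(plus, rng) - noisy_loss(minus, rng)
        n_evals += 2                     # two simulations, whatever the dimension
        theta = [t - a_n * diff / (2.0 * c_n * d) for t, d in zip(theta, delta)]
    return theta, n_evals


theta_hat, n_evals = spsa_minimize([0.0] * 5, n_steps=5000, rng=random.Random(0))
print(theta_hat, n_evals)
```

Here the search space has dimension 5, yet each iteration still costs two evaluations of the objective, whereas a coordinate-wise finite-difference scheme would need at least six.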
Among the conditions required for convergence, it can be difficult to specify in advance a compact set on which the function is strongly convex (or concave) and which contains the unique solution. In real-world applications where the domain is large, these assumptions can be restrictive and unrealistic.

Further developments

An extensive theoretical literature has grown up around these algorithms, concerning conditions for convergence, rates of convergence, multivariate and other generalizations, proper choice of step size, possible noise models, and so on. These methods are also applied in control theory, in which case the unknown function that we wish to optimize, or whose zero we wish to find, may vary in time. In this case, the step size $a_n$ should not converge to zero but should be chosen so as to track the function.
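The effect of the step size on tracking can be illustrated with a small experiment, sketched under assumed conditions: the root of the function drifts linearly over time, and a decaying step $a_n = 1/n$ is compared with a constant step. The drift rate, noise level, and constant step value 0.1 are hypothetical choices, not taken from the text.

```python
import random


def track(step_size, n_steps, rng, drift=0.01, noise_std=0.1):
    """Follow the moving root of f_n(theta) = theta - drift * n from noisy samples.

    step_size maps the iteration index n to a_n.
    Returns the final distance between theta and the current root.
    """
    theta = 0.0
    for n in range(1, n_steps + 1):
        root_n = drift * n                                   # root at time n
        sample = theta - root_n + rng.gauss(0.0, noise_std)  # noisy value of f_n
        theta -= step_size(n) * sample
    return abs(theta - drift * n_steps)


err_decaying = track(lambda n: 1.0 / n, 2000, random.Random(0))
err_constant = track(lambda n: 0.1, 2000, random.Random(0))
print(err_decaying, err_constant)
```

With $a_n = 1/n$ the iterate is effectively a running average of all past targets, so it falls further behind as the root keeps moving. The constant step forgets old information at a fixed rate and stays within a bounded lag of the root, at the cost of never fully averaging out the noise.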
C. Johan Masreliez and R. Douglas Martin were the first to apply stochastic approximation to robust estimation.
The main tool for analyzing stochastic approximation algorithms (including the Robbins–Monro and the Kiefer–Wolfowitz algorithms) is a theorem by Aryeh Dvoretzky published in 1956.
