In mathematics, a smooth maximum of an indexed family $x_1, \ldots, x_n$ of numbers is a smooth approximation to the maximum function $\max(x_1,\ldots,x_n)$, meaning a parametric family of functions $m_\alpha(x_1,\ldots,x_n)$ such that for every $\alpha$ the function $m_\alpha$ is smooth, and the family converges to the maximum function, $m_\alpha \to \max$ as $\alpha \to \infty$. The concept of smooth minimum is similarly defined. In many cases a single family approximates both: the maximum as the parameter goes to positive infinity and the minimum as the parameter goes to negative infinity; in symbols, $m_\alpha \to \max$ as $\alpha \to \infty$ and $m_\alpha \to \min$ as $\alpha \to -\infty$. The term can also be used loosely for a specific smooth function that behaves similarly to a maximum, without necessarily being part of a parametrized family.
Examples
Figure: smoothmax of (−x, x) versus x for various parameter values; very smooth for $\alpha = 0.5$, sharper for $\alpha = 8$.
For large positive values of the parameter $\alpha > 0$, the following formulation is a smooth, differentiable approximation of the maximum function. For negative values of the parameter that are large in absolute value, it approximates the minimum.

$$\mathcal{S}_\alpha(x_1,\ldots,x_n) = \frac{\sum_{i=1}^{n} x_i e^{\alpha x_i}}{\sum_{i=1}^{n} e^{\alpha x_i}}$$
$\mathcal{S}_\alpha$ has the following properties:
- $\mathcal{S}_\alpha \to \max$ as $\alpha \to \infty$
- $\mathcal{S}_0$ is the arithmetic mean of its inputs
- $\mathcal{S}_\alpha \to \min$ as $\alpha \to -\infty$
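As an illustration of these properties, here is a minimal Python sketch (standard library only; the function name smooth_max and the test values are ours, not standard notation) that evaluates $\mathcal{S}_\alpha$ and shows the limiting behavior for large positive and negative $\alpha$:

```python
import math

def smooth_max(xs, alpha):
    """Exp-weighted average of the inputs, i.e. S_alpha from the formula above."""
    # Subtract a constant inside the exponentials (they appear in both numerator
    # and denominator, so the value is unchanged) to avoid overflow.
    shift = max(xs) if alpha >= 0 else min(xs)
    weights = [math.exp(alpha * (x - shift)) for x in xs]
    return sum(w * x for w, x in zip(weights, xs)) / sum(weights)

xs = [1.0, 2.0, 3.0]
print(smooth_max(xs, 0.0))    # 2.0  (arithmetic mean)
print(smooth_max(xs, 20.0))   # ~3.0 (approaches the maximum)
print(smooth_max(xs, -20.0))  # ~1.0 (approaches the minimum)
```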
The gradient of $\mathcal{S}_\alpha$ is closely related to softmax and is given by

$$\nabla_{x_i}\mathcal{S}_\alpha(x_1,\ldots,x_n) = \frac{e^{\alpha x_i}}{\sum_{j=1}^{n} e^{\alpha x_j}}\left[1 + \alpha\bigl(x_i - \mathcal{S}_\alpha(x_1,\ldots,x_n)\bigr)\right].$$
This differentiability makes $\mathcal{S}_\alpha$ useful for optimization techniques that use gradient descent.
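To connect the closed-form gradient to the softmax weighting, the following sketch (repeating the smooth_max helper from above for self-containment; the names and test inputs are illustrative only) compares the formula against a central finite difference:

```python
import math

def smooth_max(xs, alpha):
    shift = max(xs) if alpha >= 0 else min(xs)
    w = [math.exp(alpha * (x - shift)) for x in xs]
    return sum(wi * xi for wi, xi in zip(w, xs)) / sum(w)

def smooth_max_grad(xs, alpha):
    """Closed-form gradient: softmax weight of x_i times [1 + alpha * (x_i - S_alpha)]."""
    s = smooth_max(xs, alpha)
    shift = max(xs) if alpha >= 0 else min(xs)
    w = [math.exp(alpha * (x - shift)) for x in xs]
    total = sum(w)
    return [wi / total * (1.0 + alpha * (xi - s)) for wi, xi in zip(w, xs)]

xs, alpha, h = [0.5, -1.0, 2.0], 3.0, 1e-6
analytic = smooth_max_grad(xs, alpha)
for i in range(len(xs)):
    up, down = xs.copy(), xs.copy()
    up[i] += h
    down[i] -= h
    numeric = (smooth_max(up, alpha) - smooth_max(down, alpha)) / (2 * h)
    print(f"x_{i}: analytic {analytic[i]:.6f}, finite difference {numeric:.6f}")
```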
LogSumExp
Another smooth maximum is LogSumExp:

$$\mathrm{LSE}_\alpha(x_1,\ldots,x_n) = \frac{1}{\alpha}\log\bigl(\exp(\alpha x_1) + \cdots + \exp(\alpha x_n)\bigr)$$
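A minimal Python sketch of $\mathrm{LSE}_\alpha$, using the usual shift-by-the-maximum trick for numerical stability (the function name lse and the sample values are ours):

```python
import math

def lse(xs, alpha):
    """LogSumExp smooth maximum: (1/alpha) * log(sum(exp(alpha * x_i))).

    Factoring exp(alpha * shift) out of the sum leaves the value unchanged
    but keeps the exponentials from overflowing.
    """
    shift = max(xs) if alpha > 0 else min(xs)
    return shift + math.log(sum(math.exp(alpha * (x - shift)) for x in xs)) / alpha

xs = [1.0, 2.0, 3.0]
print(lse(xs, 1.0))    # ~3.41: for alpha > 0 it overestimates max by at most log(n)/alpha
print(lse(xs, 100.0))  # ~3.00: the bias shrinks as alpha grows
```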
This can also be normalized if the $x_i$ are all non-negative, yielding a function with domain $[0,\infty)^n$ and range $[0,\infty)$:

$$g(x_1,\ldots,x_n) = \log\bigl(\exp(x_1) + \cdots + \exp(x_n) - (n-1)\bigr)$$

The $(n-1)$ term corrects for the fact that $\exp(0)=1$ by canceling out all but one zero exponential, and $\log 1 = 0$ if all $x_i$ are zero.
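A sketch of this normalized variant, valid only for non-negative inputs (the function name g_norm is illustrative):

```python
import math

def g_norm(xs):
    """Normalized LogSumExp for non-negative inputs: subtracting (n - 1) cancels
    the extra exp(0) = 1 contributions, so g(0, ..., 0) = log(1) = 0."""
    n = len(xs)
    return math.log(sum(math.exp(x) for x in xs) - (n - 1))

print(g_norm([0.0, 0.0, 0.0]))  # 0.0 exactly
print(g_norm([0.0, 0.0, 5.0]))  # 5.0: the zero entries cancel exactly against (n - 1)
```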
p-Norm
Main article: P-norm
Another smooth maximum is the p-norm:

$$\|(x_1,\ldots,x_n)\|_p = \bigl(|x_1|^p + \cdots + |x_n|^p\bigr)^{1/p}$$

which converges to $\|(x_1,\ldots,x_n)\|_\infty = \max_{1\leq i\leq n} |x_i|$ as $p \to \infty$.
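A sketch showing the p-norm approaching the largest absolute value as $p$ grows (factoring out the largest $|x_i|$ to avoid overflow; the name p_norm and the sample vector are ours):

```python
def p_norm(xs, p):
    """p-norm of a vector, factoring out the largest |x_i| for numerical stability."""
    m = max(abs(x) for x in xs)
    if m == 0.0:
        return 0.0
    return m * sum((abs(x) / m) ** p for x in xs) ** (1.0 / p)

xs = [1.0, -2.0, 3.0]
for p in (1, 2, 8, 64, 512):
    print(p, p_norm(xs, p))  # tends to 3.0 = max |x_i| as p grows
```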
An advantage of the p-norm is that it is a norm. As such it is "scale invariant" (homogeneous), $\|(\lambda x_1,\ldots,\lambda x_n)\|_p = |\lambda| \cdot \|(x_1,\ldots,x_n)\|_p$, and it satisfies the triangle inequality.
Other choices of smoothing function
$$\mathrm{max}_\alpha(x_1, x_2) = \frac{(x_1 + x_2) + \sqrt{(x_1 - x_2)^2 + \alpha}}{2}$$ [1]

where $\alpha > 0$ is a smoothing parameter: since $\max(x_1, x_2) = \bigl((x_1 + x_2) + |x_1 - x_2|\bigr)/2$ exactly, the approximation approaches the exact maximum of two arguments as $\alpha \to 0$.
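A sketch of this two-argument smoothing (the name smooth_max2 and the test values are ours); it slightly overestimates the true maximum, and the overestimate shrinks as $\alpha \to 0$:

```python
import math

def smooth_max2(x1, x2, alpha):
    """Smooth max of two numbers: the exact max is ((x1 + x2) + |x1 - x2|) / 2,
    and sqrt((x1 - x2)**2 + alpha) is a smooth stand-in for |x1 - x2|."""
    return ((x1 + x2) + math.sqrt((x1 - x2) ** 2 + alpha)) / 2.0

for alpha in (1.0, 0.01, 1e-6):
    print(alpha, smooth_max2(1.0, 2.0, alpha))  # approaches max(1, 2) = 2 as alpha -> 0
```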
References
https://www.johndcook.com/soft_maximum.pdf
M. Lange, D. Zühlke, O. Holz, and T. Villmann, "Applications of lp-norms and their smooth approximations for gradient based learning vector quantization," in Proc. ESANN, Apr. 2014, pp. 271–276. (https://www.elen.ucl.ac.be/Proceedings/esann/esannpdf/es2014-153.pdf)