In mathematics, a smooth maximum of an indexed family $x_1, \ldots, x_n$ of numbers is a smooth approximation to the maximum function $\max(x_1,\ldots,x_n)$, meaning a parametric family of functions $m_\alpha(x_1,\ldots,x_n)$ such that for every $\alpha$ the function $m_\alpha$ is smooth, and the family converges to the maximum function, $m_\alpha \to \max$ as $\alpha \to \infty$. The concept of smooth minimum is similarly defined. In many cases a single family approximates both: the maximum as the parameter goes to positive infinity and the minimum as the parameter goes to negative infinity; in symbols, $m_\alpha \to \max$ as $\alpha \to \infty$ and $m_\alpha \to \min$ as $\alpha \to -\infty$. The term can also be used loosely for a specific smooth function that behaves similarly to a maximum, without necessarily being part of a parametrized family.
Examples
Figure: smoothmax of (−x, x) versus x for various parameter values; very smooth for $\alpha = 0.5$, sharper for $\alpha = 8$.
For large positive values of the parameter $\alpha > 0$, the following formulation is a smooth, differentiable approximation of the maximum function. For negative values of the parameter that are large in absolute value, it approximates the minimum.

$$\mathcal{S}_\alpha(x_1,\ldots,x_n) = \frac{\sum_{i=1}^{n} x_i e^{\alpha x_i}}{\sum_{i=1}^{n} e^{\alpha x_i}}$$
$\mathcal{S}_\alpha$ has the following properties:
- $\mathcal{S}_\alpha \to \max$ as $\alpha \to \infty$
- $\mathcal{S}_0$ is the arithmetic mean of its inputs
- $\mathcal{S}_\alpha \to \min$ as $\alpha \to -\infty$
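As an illustration of these properties, here is a minimal Python sketch (standard library only; the function name smooth_max and the test values are ours, not standard notation) that evaluates $\mathcal{S}_\alpha$ and shows the limiting behavior for large positive and negative $\alpha$:

```python
import math

def smooth_max(xs, alpha):
    """Exp-weighted average of the inputs, i.e. S_alpha from the formula above."""
    # Subtract a constant inside the exponentials (they appear in both numerator
    # and denominator, so the value is unchanged) to avoid overflow.
    shift = max(xs) if alpha >= 0 else min(xs)
    weights = [math.exp(alpha * (x - shift)) for x in xs]
    return sum(w * x for w, x in zip(weights, xs)) / sum(weights)

xs = [1.0, 2.0, 3.0]
print(smooth_max(xs, 0.0))    # 2.0  (arithmetic mean)
print(smooth_max(xs, 20.0))   # ~3.0 (approaches the maximum)
print(smooth_max(xs, -20.0))  # ~1.0 (approaches the minimum)
```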
The gradient of $\mathcal{S}_\alpha$ is closely related to softmax and is given by

$$\nabla_{x_i}\mathcal{S}_\alpha(x_1,\ldots,x_n) = \frac{e^{\alpha x_i}}{\sum_{j=1}^{n} e^{\alpha x_j}}\left[1 + \alpha\bigl(x_i - \mathcal{S}_\alpha(x_1,\ldots,x_n)\bigr)\right].$$
This differentiability makes $\mathcal{S}_\alpha$ useful for optimization techniques that use gradient descent.
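To connect the closed-form gradient to the softmax weighting, the following sketch (repeating the smooth_max helper from above for self-containment; the names and test inputs are illustrative only) compares the formula against a central finite difference:

```python
import math

def smooth_max(xs, alpha):
    shift = max(xs) if alpha >= 0 else min(xs)
    w = [math.exp(alpha * (x - shift)) for x in xs]
    return sum(wi * xi for wi, xi in zip(w, xs)) / sum(w)

def smooth_max_grad(xs, alpha):
    """Closed-form gradient: softmax weight of x_i times [1 + alpha * (x_i - S_alpha)]."""
    s = smooth_max(xs, alpha)
    shift = max(xs) if alpha >= 0 else min(xs)
    w = [math.exp(alpha * (x - shift)) for x in xs]
    total = sum(w)
    return [wi / total * (1.0 + alpha * (xi - s)) for wi, xi in zip(w, xs)]

xs, alpha, h = [0.5, -1.0, 2.0], 3.0, 1e-6
analytic = smooth_max_grad(xs, alpha)
for i in range(len(xs)):
    up, down = xs.copy(), xs.copy()
    up[i] += h
    down[i] -= h
    numeric = (smooth_max(up, alpha) - smooth_max(down, alpha)) / (2 * h)
    print(f"x_{i}: analytic {analytic[i]:.6f}, finite difference {numeric:.6f}")
```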
LogSumExp
Another smooth maximum is LogSumExp:

$$\mathrm{LSE}_\alpha(x_1,\ldots,x_n) = \frac{1}{\alpha}\log\bigl(\exp(\alpha x_1) + \cdots + \exp(\alpha x_n)\bigr)$$
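A minimal Python sketch of $\mathrm{LSE}_\alpha$, using the usual shift-by-the-maximum trick for numerical stability (the function name lse and the sample values are ours):

```python
import math

def lse(xs, alpha):
    """LogSumExp smooth maximum: (1/alpha) * log(sum(exp(alpha * x_i))).

    Factoring exp(alpha * shift) out of the sum leaves the value unchanged
    but keeps the exponentials from overflowing.
    """
    shift = max(xs) if alpha > 0 else min(xs)
    return shift + math.log(sum(math.exp(alpha * (x - shift)) for x in xs)) / alpha

xs = [1.0, 2.0, 3.0]
print(lse(xs, 1.0))    # ~3.41: for alpha > 0 it overestimates max by at most log(n)/alpha
print(lse(xs, 100.0))  # ~3.00: the bias shrinks as alpha grows
```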
This can also be normalized if the $x_i$ are all non-negative, yielding a function with domain $[0,\infty)^n$ and range $[0,\infty)$:

$$g(x_1,\ldots,x_n) = \log\bigl(\exp(x_1) + \cdots + \exp(x_n) - (n-1)\bigr)$$

The $(n-1)$ term corrects for the fact that $\exp(0)=1$ by canceling out all but one zero exponential, and $\log 1 = 0$ if all $x_i$ are zero.
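A sketch of this normalized variant, valid only for non-negative inputs (the function name g_norm is illustrative):

```python
import math

def g_norm(xs):
    """Normalized LogSumExp for non-negative inputs: subtracting (n - 1) cancels
    the extra exp(0) = 1 contributions, so g(0, ..., 0) = log(1) = 0."""
    n = len(xs)
    return math.log(sum(math.exp(x) for x in xs) - (n - 1))

print(g_norm([0.0, 0.0, 0.0]))  # 0.0 exactly
print(g_norm([0.0, 0.0, 5.0]))  # 5.0: the zero entries cancel exactly against (n - 1)
```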
p-Norm
Main article: P-norm
Another smooth maximum is the p-norm:

$$\|(x_1,\ldots,x_n)\|_p = \bigl(|x_1|^p + \cdots + |x_n|^p\bigr)^{1/p}$$

which converges to $\|(x_1,\ldots,x_n)\|_\infty = \max_{1\leq i\leq n} |x_i|$ as $p \to \infty$.
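A sketch showing the p-norm approaching the largest absolute value as $p$ grows (factoring out the largest $|x_i|$ to avoid overflow; the name p_norm and the sample vector are ours):

```python
def p_norm(xs, p):
    """p-norm of a vector, factoring out the largest |x_i| for numerical stability."""
    m = max(abs(x) for x in xs)
    if m == 0.0:
        return 0.0
    return m * sum((abs(x) / m) ** p for x in xs) ** (1.0 / p)

xs = [1.0, -2.0, 3.0]
for p in (1, 2, 8, 64, 512):
    print(p, p_norm(xs, p))  # tends to 3.0 = max |x_i| as p grows
```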
An advantage of the p-norm is that it is a norm. As such it is "scale invariant" (homogeneous), $\|(\lambda x_1,\ldots,\lambda x_n)\|_p = |\lambda| \cdot \|(x_1,\ldots,x_n)\|_p$, and it satisfies the triangle inequality.
Other choices of smoothing function
$$\mathrm{max}_\alpha(x_1, x_2) = \frac{(x_1 + x_2) + \sqrt{(x_1 - x_2)^2 + \alpha}}{2}$$ [1]

where $\alpha > 0$ is a smoothing parameter: since $\max(x_1, x_2) = \bigl((x_1 + x_2) + |x_1 - x_2|\bigr)/2$ exactly, the approximation approaches the exact maximum of two arguments as $\alpha \to 0$.
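A sketch of this two-argument smoothing (the name smooth_max2 and the test values are ours); it slightly overestimates the true maximum, and the overestimate shrinks as $\alpha \to 0$:

```python
import math

def smooth_max2(x1, x2, alpha):
    """Smooth max of two numbers: the exact max is ((x1 + x2) + |x1 - x2|) / 2,
    and sqrt((x1 - x2)**2 + alpha) is a smooth stand-in for |x1 - x2|."""
    return ((x1 + x2) + math.sqrt((x1 - x2) ** 2 + alpha)) / 2.0

for alpha in (1.0, 0.01, 1e-6):
    print(alpha, smooth_max2(1.0, 2.0, alpha))  # approaches max(1, 2) = 2 as alpha -> 0
```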
References
https://www.johndcook.com/soft_maximum.pdf
M. Lange, D. Zühlke, O. Holz, and T. Villmann, "Applications of lp-norms and their smooth approximations for gradient based learning vector quantization," in Proc. ESANN, Apr. 2014, pp. 271–276. (https://www.elen.ucl.ac.be/Proceedings/esann/esannpdf/es2014-153.pdf)