# Universal approximation of neural networks

Universal approximation of Neural Networks

You may wonder why neural networks are so powerful and seem to successfully solve a wide range of data challenges like a panacea. One mathematical justification may lie in the fact that neural networks provide a rich enough family of models to describe the complex functional relationship between data. This property is precisely described by the following universality theorem of neural networks (UAT).

# Universality of Shallow Neural Networks

Let us focus on shallow neural networks for now. Suppose that we have a continuous map (F) to learn from data, e.g. the relationship between past price movements and the next daily price change. The following theorem states no matter what the function (F) is, for any given error tolerance level (epsilon>0), there exists some shallow neural network (F_{Theta}) such as to approximate (F) up to (epsilon).

Theorem 1 (UAT for Shallow Neural Network [1]) Let (sigma) be any continuous discriminant function (e.g. sigmoid function). The finite sum of the form

[sum_{i=1}^{N} alpha_{j}sigmaleft(sum_{j =1} beta_{j}^{T}x + theta_{j}right)]

are dense in (C(I_{n})). In other words, given any (F in C(I_{n})) and (varepsilon >0), there is a sum of, (F_{Theta}(x)) of the above form, for which

[max_{x in I_n}vert F_{Theta}(x) – F(x) vert < varepsilon,]

where (Theta:= left{alpha_{j}, theta_{j}, beta_{j}right}_{j = 1}^{N}).

In the following, I will offer an explanation for the intuition of neural network approximation. The piecewise constant functions, as a standard approach, can be used to approximate any continuous function shown in Figure 1 (see Theorem 2 for UAT of the Step Function).

Figure 1: The piecewise constant approximation of a continuous function.

Theorem 2 (UAT for Step Function) The finite sum of the form

[sum_{i = 1}^{n-1}C_{n} mathbf{1}( t_{i}leq x <t_{i+1})]

are dense in (C(I_{n})). In other words, given any (F in C(I_{n})) and (varepsilon>0), there is a sum of the above form, denoted by (F_{C}), such that for every (varepsilon >0), there exists ((x_{i})_{i = 1}^{n}) such that (x_{1}<x_{2}<cdots<x_{n}) and (x_{i} in J, forall i in {1, 2, cdots, n}), and

[max_{x in J} vert F_{C}(x) – F(x) vert leq varepsilon,]

where (C:= (C_{i}, x_{i})_{i=1}^{n}).

Another key observation is that the step function can be arbitrarily well approximated by the Sigmoid function with different parameter (beta), i.e., (lim_{beta uparrow infty}sigma( x vert beta, a ) = mathbf{1}(x>a)). Hence it is not difficult to prove the universality of shallow neural networks (Theorem 1).

Figure 2: The approximation of the Sigmoid function to a step function.

# Reference

[1] Gybenko, G., 1989. Approximation by superposition of sigmoidal functions. Mathematics of Control, Signals and Systems, 2(4), pp.303-314.