Training A Neural Network¶

Neural network(NN) has many applications. It has large amount of parameters we can train. Usually we train a NN to minimize loss function and make it perform well on some tasks. In this notebook, we are going to walk through a simple NN that can classify data into binary classes.

In [1]:

# import libraries
import seaborn as sns
import numpy as np
import matplotlib.pyplot as plt

preliminary¶

A neural network usually has many hidden layers and weights matrix as connection. Each unit must pass through activation function to decide fire or not. $$ H_2 = \sigma (H_1 * W_1) $$ $H_1$ and $H_2$ are hidden layers and $W_1$ is weight matrix and $\sigma$ is activation function. $\sigma$ is applied element-wisely.

There are many choices for activation functions, i.e., Sigmoid: $\sigma(x) = \frac{1}{1+e^{-x}} $
The higher output, the higher probability of being 'activated'.

In [2]:

def sigmoid(x):
    return 1. / (1+np.exp(-x))


# a plot sigmoid function
line = np.linspace(-5, 5, 100)
plt.plot(line, sigmoid(line), '.')
plt.ylim(0, 1)
plt.xlabel("input")
plt.ylabel("output")
plt.title("sigmoid")
plt.show()

Steepest gradient descent:
Usually we go a little step to update weight matrix using gradient: $W_1 = W_1 - \alpha \cdot \frac{d L}{d W_1}$. This update rule is justified by steepest gradient descent.

$W_1$ update using derivative $\frac{d L}{d W_1}$(or gradient):
Firstly, we specify our notations:
$H_1^{(i)(j)}$: $i$ th row $j$ th colume in $H_1$, $I$ rows X $J$ columes in total;
$W_1^{(j)(k)}$: $j$ th row $k$ th colume in $W_1$, $J$ rows X $K$ columes in total;
$H_2^{(i)(k)}$: $i$ th row $k$ th colume in $H_2$, $I$ rows X $K$ columes in total;
Let's start by partial derivative of certain element of $W_1$, say $\frac{\partial L}{\partial W_1^{(j)(k)}}$. According to chain rule:
$$ \frac{\partial L}{\partial W_1^{(j)(k)}} = \frac{\partial H_2}{\partial W_1^{(j)(k)}} \cdot \frac{\partial L}{\partial H_2} $$
So if $W_1^{(j)(k)}$ is changed, $H_2$ will also change; As $H_2$ has changed, so does $L$. We want to sum up all such changes in $L$, our objective is to minimize it:
$$ \frac{\partial L}{\partial W_1^{(j)(k)}} = \sum_{i=1}^{I} \frac{\partial H_2^{(i)(k)}}{\partial W_1^{(j)(k)}} \frac{\partial L}{\partial H2^{(i)(k)}} \dots \dots (1) $$ Notice that those are coefficeints of linear function:
$$ \sum_{j=1}^{J} H_1^{(i)(j)} W_1^{(j)(k)} = H_2^{(i)(k)} $$ Hence, $\frac{\partial H_2^{(i)(k)}}{\partial W_1^{(j)(k)}} = H_1^{(i)(j)}$. Plug it into equation $(1)$ above:
$$ \frac{\partial L}{\partial W_1^{(j)(k)}} = \sum_{i=1}^{I} H_1^{(i)(j)} \frac{\partial L}{\partial H2^{(i)(k)}} $$ Here is an illustration:

In the image above, $j$ and $k$ is fixed for partial derivative of $W_1^{(j)(k)}$; $\sum$ is over $i$, which can be seen as dot-product between $j$th in $H_1$ and $k$th colume in $\frac{\partial L}{\partial H_2}$.

So far we get $\frac{\partial L}{\partial W_1^{(j)(k)}}$, which is one element in $\frac{\partial L}{\partial W_1}$, but we need the all elements. The interesting thing is that , if we do the same operation to every element using one colume in $H_1$ and one colume in $\frac{\partial L}{\partial H_2}$ and do dot-product, the result is just matrix multiplication between $H_1$ reversed and $\frac{\partial L}{\partial H_2}$: $$ \frac{\partial L}{\partial W_1} = H_1^T * \frac{\partial L}{\partial H_2} $$ Where $*$ is matrix multiplication.
Here is an illustration:
mul

You can shift $j$ and $k$ to get any $\frac{\partial L}{\partial W_1^{(j)(k)}}$ of $\frac{\partial L}{\partial W_1}$.
Recall that we want to update our weight matrix $W_1$ using steepest gradient descent: $W_1 = W_1 - \alpha \cdot \frac{d L}{d W_1}$; Now we only need to choose a proper $\alpha$ and train our network!(not really, actually)

Training and Visualization¶

Step 1: preparing training data.
We are going to train our neural net to do binary classification. So our data consists of 10000 samples from gaussian distribution $g_0\sim N(-6,1)$, and another 10000 samples from $g_1\sim N(6,1)$. We expect our neural net to output probability close to 1 when given new samples from $g_1$ and close to 0 when given samples from $g_0$

In [3]:

# guassian1
g0 = np.random.randn(10000) - 6
g0 = np.expand_dims(g0, 1)
# guassian2
g1 = np.random.randn(10000) + 6
g1 = np.expand_dims(g1, 1)

X = np.vstack([g0, g1])
y = np.vstack([np.zeros_like(g0),\
               np.ones_like(g1)])
# pack data and corresponding label
data_and_label = np.array(list(zip(X, y)))
np.random.shuffle(data_and_label)

X_train = data_and_label[:, 0]
y_train = data_and_label[:, 1]  # true labels

# take a look at firs 5th elements
# in X_train and y_train
print(X_train[:5])
print(y_train[:5])

[[ 5.06516143]
 [-5.83762382]
 [-5.34864671]
 [ 6.01076143]
 [ 3.96512735]]
[[1.]
 [0.]
 [0.]
 [1.]
 [1.]]

Step 2: train our neural net!

In [4]:

# define cross-entropy loss function
def binary_logistic_loss(prediction, label):
    y = prediction
    y_hat = label
    epsilon = 1e-4
    error = - y * np.log(y_hat + epsilon) -\
        (1 - y) * np.log(1 - y_hat + epsilon)
    return error

In [5]:

# initialization
Wxh = np.random.randn(1, 32)
bxh = np.zeros(32)
Whh = np.random.randn(32, 1)
bhh = np.zeros(1)

# set step size
step = 1

# train
for i in range(1000):
    # forward:
    h = X_train.dot(Wxh) + bxh
    h_sigmoid = sigmoid(h)
    # hh means h->out
    out = h_sigmoid.dot(Whh) + bhh
    y_hat = sigmoid(out)
    error = binary_logistic_loss(y_hat, y_train)
    L = np.sum(error)/X_train.shape[0]

    # backforward:
    derror = 1./X_train.shape[0]
    # backprop through cross-entropy loss
    dy_hat2error = y_train * 1./(-y_hat) + \
                  (1 - y_train) * (1./(1-y_hat))
    # backpro through sigmoid
    dout2y_hat = (1 - y_hat) * y_hat
    # chain rule
    dout = dout2y_hat * dy_hat2error * derror
    # dW_1 = H_1.T * H_2
    dWhh = h_sigmoid.T.dot(dout)
    dhh = np.sum(dout, 0)
    dh_sigmoid = dout.dot(Whh.T)
    dh2h_sigmoid = (1 - h_sigmoid) * h_sigmoid
    dh = dh2h_sigmoid * dh_sigmoid
    dWxh = X_train.T.dot(dh)
    dbxh = np.sum(dh, 0)

    # update
    Whh -= step * dWhh
    bhh -= step * dhh
    Wxh -= step * dWxh
    bxh -= step * dbxh

    if i % 100 == 0:
        print("epoch: {}; loss: {}".format(i, L))

epoch: 0; loss: 7.026175126442632
epoch: 100; loss: 0.010359224724804477
epoch: 200; loss: 0.005612259087126668
epoch: 300; loss: 0.003831209976110012
epoch: 400; loss: 0.0028971872797366183
epoch: 500; loss: 0.0023218305238442496
epoch: 600; loss: 0.0019317631092453376
epoch: 700; loss: 0.001649870281502195
epoch: 800; loss: 0.0014366244938535712
epoch: 900; loss: 0.0012696687144646771

Step 3: plot result

In [6]:

# X_predict is 10000 real numbers from [-10, 10]
line = np.linspace(-10, 10, 10000)
X_predict = np.expand_dims(line, 1)
# make prediction
h = X_predict.dot(Wxh) + bxh
h_sigmoid = sigmoid(h)
# hh means h->out
out = h_sigmoid.dot(Whh) + bhh
y_hat = sigmoid(out)

In [7]:

# plot
plt.figure(figsize=(6, 6))
sns.distplot(g0, hist=False, kde=True,
             color='darkblue',
             kde_kws={'linewidth': 1})
sns.distplot(g1, hist=False, kde=True,
             color='green',
             kde_kws={'linewidth': 1})

y = np.random.uniform(0, 0.005, size=g0.shape)
plt.scatter(g0, y, marker=".", linewidths=0.0001,\
            alpha=0.05, c='darkblue')
plt.scatter(g1, y, marker=".", linewidths=0.0001,\
            alpha=0.05, c='green')
plt.plot(line, y_hat, color='r')

plt.legend(["Decision Boundary"])
plt.xlabel('x')
plt.ylabel('density')
plt.title('Histogram')
plt.text(60, .025, r'$\mu,\ \sigma$')
plt.axis([-10, 10, 0, 1])
plt.grid(True)
plt.show()

/Users/ellise/anaconda2/envs/py36/lib/python3.6/site-packages/scipy/stats/stats.py:1713: FutureWarning: Using a non-tuple sequence for multidimensional indexing is deprecated; use `arr[tuple(seq)]` instead of `arr[seq]`. In the future this will be interpreted as an array index, `arr[np.array(seq)]`, which will result either in an error or a different result.
  return np.add.reduce(sorted[indexer] * weights, axis=axis) / sumval

From the picture above, we can see that if given samples greater than 2.5, our NN will predict close to 1; if given samples less than -2.5, it will predict close to 0. Hence, if our NN predict larger than 0.5, we can label input sample as coming from $g_1$; if less than 0.5, from $g_1$. In this way, we can give NN new samples as input and do classification.

Supplementary¶

Derivative of Sigmoid function: $$ \begin{align} \sigma(x) = \frac{1}{1+e^{-x}}; \\ \frac{d \sigma(x)}{d x} = \frac{-1}{(1+e^{-x})^2}\cdot(-e^{-x}) \\ = \frac{e^{-x}}{(1+e^{-x})^2} \\ = \frac{1}{(1+e^{-x}} \cdot (1-\frac{1}{1+e^{-x}}) \\ = \sigma(x) \cdot (1-\sigma(x)) \end{align} $$

Cross Entropy: cross-entropy between two distribution $p$ and $q$ is defined as $H[p,q] = -\sum_i p(x_i)\log q(x_i)$. We can expand it as follow:
$$ \begin{align} H[p,q] = -\sum_i p(x_i)\log q(x_i)\\ = \sum_i p(x_i)\log[ \frac{p(x_i)}{q(x_i)} \cdot \frac{1}{p(x_i)}] \\ = - \sum_i p(x_i)\log p(x_i) + \sum_i p(x_i)\log\frac{p(x_i}{q(x_i)} \\ = H[p] + D_{KL}(p||q) \end{align} $$
For classification task where we have correct labels, our label distribution is $p(x)$; we want our prediction distribution $q(x)$ as close to $p(x)$ as possible. Minimizing $H[p,q]$ can be seen as reducing KL-divergence between label distribution and prediction distribution: if two distributions are equal, then $D_{KL}(p||q)=0$ and $H[p,q] = H[p]$ exactly. We can not change $H[p]$.

For binary classification, our $x$ can only take two values: 0, 1 $$ H[p,q] = - p(x=0)\log q(x=0) - p(x=1)\log q(x=1) $$ The loss function for such idea is: $$ L = -y\log \hat y - (1-y) \log (1-\hat y) $$ Where $y$ is true label and $\hat y$ is prediction. $y$ can only be 0 or 1. If 1, the second term is 0 regardless of $\hat y$, only first term remains; if 0, the first term is always 0, only second term remains.

Derivative of Cross Entropy Function: take derivative w.r.t $\hat y$, should be straight forward.