Chapter 1: Artificial Neural Network Fundamentals

Feedforward

Activations
Linear
Python
def linear(x):
    return x
Sigmoid
Python
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))
Tanh
Python
def tanh(x):
    return (np.exp(x) - np.exp(-x)) / (np.exp(x) + np.exp(-x))
ReLU (rectified linear unit)
Python
def relu(x):
    return np.where(x > 0, x, 0)
Softmax: s(x_i)=\frac{e^{x_i}}{\sum_j e^{x_j}}, applied to an entire array of values
Python
def softmax(x):
    return np.exp(x) / np.sum(np.exp(x))
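Putting the activations together: a minimal sketch of feedforward propagation through a small network (2 inputs, 3 sigmoid hidden units, 1 linear output), mirroring the example used in the backpropagation section below; the input and weight values are random placeholders, not values from the chapter.
Python
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

x = np.array([1.0, 2.0])            # inputs x1, x2 (placeholder values)
W1 = np.random.randn(2, 3)          # weights between input and hidden layer
W2 = np.random.randn(3, 1)          # weights between hidden layer and output

h = x @ W1                          # hidden pre-activations h11, h12, h13
a = sigmoid(h)                      # hidden activations a11, a12, a13
y_hat = a @ W2                      # predicted output (linear output layer)
print(y_hat)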
Loss
Continuous: MSE
L(p,y)=\frac{1}{m}\sum_{i=1}^m(p_i-y_i)^2
Python
def mse(p, y):
    return np.mean(np.square(p - y))
Continuous: MAE
Python
def mae(p, y):
    return np.mean(np.abs(p - y))
Categorical: Binary Cross-Entropy
L(p,y)=-\frac{1}{m}\sum_{i=1}^m\left[y_i\log(p_i)+(1-y_i)\log(1-p_i)\right]
Python
def binary_cross_entropy(p, y):
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))
Categorical: Categorical Cross-Entropy
L(p,y)=-\frac{1}{m}\sum_{i=1}^m\sum_{j=1}^C y_{ij}\log(p_{ij})
Python
def categorical_cross_entropy(p, y):
    # sum over the C classes for each sample, then average over the m samples
    return -np.mean(np.sum(y * np.log(p), axis=1))
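A quick sanity check of the loss functions above on made-up predictions and targets; the numbers are arbitrary illustrative values, not taken from the chapter.
Python
import numpy as np

p = np.array([0.9, 0.2, 0.8])              # predictions (arbitrary values)
y = np.array([1.0, 0.0, 1.0])              # targets

print(mse(p, y))                           # 0.03
print(mae(p, y))                           # ~0.167
print(binary_cross_entropy(p, y))          # ~0.184

# one-hot targets and softmax-style predictions for categorical cross-entropy
p_cat = np.array([[0.7, 0.2, 0.1],
                  [0.1, 0.8, 0.1]])
y_cat = np.array([[1.0, 0.0, 0.0],
                  [0.0, 1.0, 0.0]])
print(categorical_cross_entropy(p_cat, y_cat))   # ~0.290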

Backpropagation

Batch size: when calculating the loss value, the incremental contribution of each additional data point follows the law of diminishing returns.
A typical batch size is between 32 and 1,024, much smaller than the total number of data points.
We apply gradient descent (after feedforward propagation) using one batch at a time until we exhaust all data points within one epoch of training, as sketched below.
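A minimal sketch of that mini-batch loop, assuming a placeholder dataset X, y; train_on_batch is a hypothetical helper standing in for feedforward propagation, loss calculation, backpropagation, and the weight update.
Python
import numpy as np

X = np.random.randn(10_000, 2)                    # placeholder dataset
y = np.random.randn(10_000, 1)                    # placeholder targets
batch_size = 32                                   # typically between 32 and 1,024
n_epochs = 10

def train_on_batch(x_batch, y_batch):
    # hypothetical helper: feedforward propagation, loss calculation,
    # backpropagation, and a gradient-descent weight update would go here
    pass

for epoch in range(n_epochs):
    indices = np.random.permutation(len(X))       # shuffle once per epoch
    for start in range(0, len(X), batch_size):    # one batch at a time
        batch_idx = indices[start:start + batch_size]
        train_on_batch(X[batch_idx], y[batch_idx])
    # all data points used once -> one epoch of training is complete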
Chain rule
MSE loss: L=(y-\hat{y})^2
\hat{y}=a_{11}*w_{31}+a_{12}*w_{32}+a_{13}*w_{33}
a_{11}=\frac{1}{1+e^{-h_{11}}}
h_{11}=x_1*w_{11}+x_2*w_{21}
\frac{\partial L}{\partial w_{11}}=\frac{\partial L}{\partial \hat{y}}\frac{\partial \hat{y}}{\partial a_{11}}\frac{\partial a_{11}}{\partial h_{11}}\frac{\partial h_{11}}{\partial w_{11}}
so
\frac{\partial L}{\partial \hat{y}}=-2(y-\hat{y})
\frac{\partial \hat{y}}{\partial a_{11}}=w_{31}
\frac{\partial a_{11}}{\partial h_{11}}=\frac{e^{-h_{11}}}{(1+e^{-h_{11}})^2}=a_{11}*(1-a_{11})
\frac{\partial h_{11}}{\partial w_{11}}=x_1
then
\frac{\partial L}{\partial w_{11}}=-2(y-\hat{y})*w_{31}*a_{11}*(1-a_{11})*x_1
finally
w_{11}=w_{11}-lr*\frac{\partial L}{\partial w_{11}}
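The same derivation as a small code sketch: it computes ∂L/∂w11 with the chain-rule expression above and checks it against a finite-difference estimate; the inputs, weights, target, and learning rate are made-up values for illustration.
Python
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

# made-up values for a 2-input, 3-hidden-unit, 1-output network
x1, x2 = 1.0, 2.0
W1 = np.array([[0.1, 0.2, 0.3],     # w11, w12, w13
               [0.4, 0.5, 0.6]])    # w21, w22, w23
w3 = np.array([0.7, 0.8, 0.9])      # w31, w32, w33
y, lr = 1.0, 0.01

def forward(W1):
    h = np.array([x1, x2]) @ W1     # h11, h12, h13
    a = sigmoid(h)                  # a11, a12, a13
    return h, a, a @ w3             # y_hat

h, a, y_hat = forward(W1)
loss = (y - y_hat) ** 2

# chain rule: dL/dw11 = dL/dy_hat * dy_hat/da11 * da11/dh11 * dh11/dw11
dL_dw11 = -2 * (y - y_hat) * w3[0] * a[0] * (1 - a[0]) * x1

# finite-difference check of the same gradient
eps = 1e-6
W1_eps = W1.copy()
W1_eps[0, 0] += eps
_, _, y_hat_eps = forward(W1_eps)
print(dL_dw11, ((y - y_hat_eps) ** 2 - loss) / eps)   # should agree closely

# gradient-descent update of w11
W1[0, 0] -= lr * dL_dw11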
As we update parameters across all layers, the whole process of updating them can be parallelized, which lets us take advantage of the many cores of a GPU.