Feedforward
Activations
Linear
import numpy as np  # shared by all snippets below

def linear(x):
    # Identity activation: output equals input
    return x
Sigmoid
def sigmoid(x):
    # Squashes inputs into the range (0, 1)
    return 1 / (1 + np.exp(-x))
Tanh
def tanh(x):
    # Squashes inputs into the range (-1, 1); equivalent to np.tanh(x)
    return (np.exp(x) - np.exp(-x)) / (np.exp(x) + np.exp(-x))
ReLU (short for rectified linear unit)
def relu(x):
    # Zeroes out negative inputs, passes positive inputs through unchanged
    return np.where(x > 0, x, 0)
Softmax: softmax(x_i) = exp(x_i) / Σ_j exp(x_j), applied to an entire array of values so the outputs form a probability distribution that sums to 1
def softmax(x):
    # Exponentiate each value, then normalize so the outputs sum to 1
    return np.exp(x) / np.sum(np.exp(x))
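One practical refinement, not in the original snippet: shifting the inputs by their maximum before exponentiating prevents overflow for large values and leaves the result unchanged, since the shift cancels in the ratio.

def softmax_stable(x):
    # Subtracting the max keeps np.exp from overflowing on large inputs
    shifted = np.exp(x - np.max(x))
    return shifted / np.sum(shifted)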
Loss
Continuous: MSE (mean squared error)
def mse(p, y):
    # Mean of squared differences between predictions p and targets y
    return np.mean(np.square(p - y))
Continuous: MAE (mean absolute error)
def mae(p, y):
    # Mean of absolute differences between predictions p and targets y
    return np.mean(np.abs(p - y))
Categorical: Binary Cross-Entropy
def binary_cross_entropy(p, y):
    # y holds 0/1 labels, p holds predicted probabilities
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))
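One practical guard, not in the original: clipping p away from exactly 0 or 1 avoids np.log(0) evaluating to -inf.

def binary_cross_entropy_safe(p, y):
    # Keep probabilities inside [eps, 1 - eps] so np.log never sees 0
    eps = 1e-7
    p = np.clip(p, eps, 1 - eps)
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))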
Categorical: Categorical Cross-Entropy
def categorical_cross_entropy(p, y):
    # y holds one-hot labels, p holds predicted probabilities
    # (rows = samples, columns = classes)
    return -np.mean(np.sum(y * np.log(p), axis=1))
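A quick sanity check of the loss functions above, using illustrative values (not from the source):

y_true = np.array([1.0, 0.0, 1.0])
p_pred = np.array([0.9, 0.2, 0.7])
print(mse(p_pred, y_true))                   # ≈ 0.047
print(mae(p_pred, y_true))                   # = 0.2
print(binary_cross_entropy(p_pred, y_true))  # ≈ 0.228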
Backpropagation
Batch size: the incremental contribution of each additional data point to the loss calculation follows the law of diminishing returns, so a typical batch size sits between 32 and 1,024, much smaller than the total number of data points. We apply gradient descent (after feedforward propagation) one batch at a time until all data points have been used, which completes one epoch of training, as sketched below.
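A minimal sketch of that batching loop; forward(), compute_gradients(), and update() are hypothetical helpers assumed here for illustration, not functions from the source:

def train_epoch(X, Y, batch_size=32):
    # Shuffle once per epoch so batches differ between epochs
    indices = np.random.permutation(len(X))
    for start in range(0, len(X), batch_size):
        batch = indices[start:start + batch_size]
        x, y = X[batch], Y[batch]
        p = forward(x)                    # hypothetical feedforward step
        grads = compute_gradients(p, y)   # hypothetical backpropagation step
        update(grads)                     # hypothetical gradient descent update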
Chain rule: write the pre-activation as z = w*x + b and the activation as p = f(z), so the loss L depends on the weight w only through z and p; then the gradient factors into local derivatives, dL/dw = dL/dp * dp/dz * dz/dw; finally, each factor is evaluated during the backward pass and the products are accumulated layer by layer, from the output back to the input (see the sketch below).
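A sketch of those three steps for a single sigmoid neuron with MSE loss, a minimal case assumed here for illustration (the source's network is not specified):

def backprop_single_neuron(x, y, w, b, lr=0.1):
    # Forward pass: z = w*x + b, prediction p = sigmoid(z)
    z = w * x + b
    p = 1 / (1 + np.exp(-z))
    # Chain rule: dL/dw = dL/dp * dp/dz * dz/dw
    dL_dp = 2 * (p - y)    # derivative of the squared error
    dp_dz = p * (1 - p)    # derivative of the sigmoid
    dz_dw = x
    dL_dw = dL_dp * dp_dz * dz_dw
    dL_db = dL_dp * dp_dz  # dz/db = 1
    # Gradient descent update on both parameters
    return w - lr * dL_dw, b - lr * dL_db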
As we update parameters across all layers, the whole process of updating parameters can be parallelized, which is what allows the many cores of a GPU to be used effectively.