Basics


Optimization methods

  • Stochastic gradient descent (standard)
  • Gradient descent on the cross-entropy loss
  • Gradient descent on the quadratic cost

Cost functions

  • Cross-entropy loss (used by scikit-learn's MLPClassifier)
  • Quadratic cost (minimized via stochastic gradient descent; see the sketch below)
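
As a quick illustration, here is a minimal numpy sketch of both cost functions on a toy set of labels y and predicted probabilities p (the values and variable names are my own, not from any particular library):

import numpy as np

y = np.array([1, 0, 1, 1])           # true labels
p = np.array([0.9, 0.2, 0.7, 0.6])   # predicted probabilities

# Quadratic cost: mean squared error between predictions and targets
quadratic = np.mean((p - y) ** 2)

# Cross-entropy loss: the cost minimized by sklearn's MLPClassifier
cross_entropy = -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

print(quadratic, cross_entropy)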

Unsupervised Learning

Clustering

  • K-means clustering (see the sketch after this list)
    • Assign a centroid to each group
    • K is the desired number of clusters
  • Principal Component Analysis (PCA)
    • Find the principal axes of the data
  • Singular Value Decomposition (SVD)
    • Break down data into composite parts
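
A minimal K-means sketch with scikit-learn; the toy 2-D data and the choice of n_clusters=2 are illustrative assumptions:

import numpy as np
from sklearn.cluster import KMeans

# Toy 2-D data: two visibly separated groups
X = np.vstack([np.random.randn(50, 2), np.random.randn(50, 2) + 5])

# K (n_clusters) is the desired number of clusters
kmeans = KMeans(n_clusters=2, n_init=10).fit(X)

print(kmeans.cluster_centers_)  # one centroid per group
print(kmeans.labels_[:10])      # cluster assignment for each point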

Anomaly Detection

  • K-means clustering (see the sketch below)
    • Flag points that lie far from every cluster centroid as outliers
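
One common way to use K-means for anomaly detection is to score each point by its distance to the nearest centroid; a sketch, where the 99th-percentile threshold is an arbitrary choice:

import numpy as np
from sklearn.cluster import KMeans

X = np.random.randn(200, 2)
X[0] = [8, 8]  # plant an obvious outlier

kmeans = KMeans(n_clusters=3, n_init=10).fit(X)

# Distance from each point to its nearest centroid
dist = kmeans.transform(X).min(axis=1)

# Flag the points farthest from any centroid as anomalies
threshold = np.percentile(dist, 99)
print(np.where(dist > threshold)[0])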

Supervised Learning

Regression

Make predictions on a continuous range of real numbers from a set of input features (a minimal sketch follows the list below)

  • Linear Regression
    • Ex) Support Vector Regression (SVR)
    • Ex) Ridge/Lasso regressors
  • Tree-based regressors
    • Ex) Random Forest Regressor
  • Gradient-boosting algorithms
    • Ex) AdaBoost
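
A hedged sketch comparing two of the regressor families above on toy data (the dataset and all hyperparameters are placeholders):

import numpy as np
from sklearn.linear_model import Ridge
from sklearn.ensemble import RandomForestRegressor

# Toy regression problem: y is a noisy linear function of x
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = 2.0 * X[:, 0] + rng.normal(scale=0.5, size=200)

for model in (Ridge(alpha=1.0), RandomForestRegressor(n_estimators=100)):
    model.fit(X, y)
    print(type(model).__name__, model.score(X, y))  # R^2 on the training data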

Classification

Classify a group of events or objects into two or more groups (see the sketch after this list)

  • Linear Discriminators
    • Ex) Support Vector Machine (SVM)
    • Ex) Linear Discriminant Analysis (LDA)
  • Tree-based classifiers
    • Ex) Random Forest Classifier
  • Neural Networks
    • Ex) Multi-layer Perceptron
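
A sketch running one classifier from each family above on scikit-learn's built-in iris data; the hyperparameters are illustrative, not tuned:

from sklearn.datasets import load_iris
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.neural_network import MLPClassifier

X, y = load_iris(return_X_y=True)

models = [
    SVC(kernel="linear"),          # linear discriminator
    RandomForestClassifier(),      # tree-based classifier
    MLPClassifier(max_iter=2000),  # neural network
]
for model in models:
    model.fit(X, y)
    print(type(model).__name__, model.score(X, y))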

Model Selection

Across Different Types of Models

  • Supervised vs. Unsupervised Learning
    • Regression vs. Classification (Supervised)
    • Clustering vs. Anomaly Detection (Unsupervised)
  • Methods: SVM, KNN, Linear Regressors, Neural Networks, Random Forests

Within a Specific Model Type

  • Hyperparameter tuning: Train/Test/Validation Split (see the sketch below)
  • Loss Function Selection and Regularization Techniques
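
A minimal sketch of a train/validation/test split with scikit-learn, done as two successive splits; the 60/20/20 ratio and the grid of C values are arbitrary examples:

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

# First split off the test set, then carve a validation set out of the rest
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, test_size=0.25, random_state=0)

# Tune a hyperparameter on the validation set, report once on the test set
best = max((SVC(C=C).fit(X_train, y_train) for C in (0.1, 1.0, 10.0)),
           key=lambda m: m.score(X_val, y_val))
print(best.C, best.score(X_test, y_test))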

Considerations:

  • Accuracy
  • Reliability
  • Speed
  • Explainability (Simplicity)

Learning

Learning is finding the weights and biases that minimize the cost, a process carried out via backpropagation:

  1. Feed forward: give an input, get the output, calculate the cost
  2. Backpropagation: work backwards, calculating the gradient for each neuron, one layer at a time
  3. Update weights: move each weight in the opposite direction of increasing "blame"; this decreases the cost
  4. Repeat 1-3 until satisfied with the cost (a minimal sketch follows)
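
A minimal numpy sketch of this loop for a one-hidden-layer network with sigmoid activations and the quadratic cost; the architecture, toy data, and learning rate are all illustrative assumptions:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
y = (X[:, 0] * X[:, 1] > 0).astype(float).reshape(-1, 1)  # toy target

# One hidden layer of 4 neurons, one output neuron
W1, b1 = rng.normal(size=(2, 4)), np.zeros(4)
W2, b2 = rng.normal(size=(4, 1)), np.zeros(1)
lr = 0.5

for step in range(2000):
    # 1. Feed forward: input -> hidden -> output, then the quadratic cost
    h = sigmoid(X @ W1 + b1)
    out = sigmoid(h @ W2 + b2)
    cost = np.mean((out - y) ** 2)
    # 2. Backpropagation: gradients layer by layer, output layer first
    d_out = 2 * (out - y) * out * (1 - out) / len(y)
    d_h = (d_out @ W2.T) * h * (1 - h)
    # 3. Update weights opposite the direction of increasing "blame"
    W2 -= lr * (h.T @ d_out)
    b2 -= lr * d_out.sum(axis=0)
    W1 -= lr * (X.T @ d_h)
    b1 -= lr * d_h.sum(axis=0)
# 4. Repeated for a fixed number of steps; the cost should have decreased
print(cost)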

Jupyter Notebook (Simple gradient descent): implements a simple application of gradient descent and investigates how different learning rates and numbers of steps affect the learning results.

Loss Functions

A loss function measures how bad the model's guesses are.

MSE & SSE

Sum of Squared Errors: $SSE=\sum_i(y^*_i-y_i)^2$

Mean Squared Error: $MSE=SSE/n$, where $n$ is the number of samples (y.size)
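
In numpy, with y_pred playing the role of $y^*$ (toy values):

import numpy as np

y = np.array([1.0, 2.0, 3.0])
y_pred = np.array([1.1, 1.9, 3.3])

sse = np.sum((y_pred - y) ** 2)  # Sum of Squared Errors
mse = sse / y.size               # Mean Squared Error
print(sse, mse)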

Train/Gradient Descent

Training uses the gradient: the vector of partial derivatives of the loss function with respect to the weights.

e.g. $SSE=\sum_i(w_0+w_1x_i-y_i)^2$

then the gradient is $G=\left[\frac{\partial SSE}{\partial w_0}, \frac{\partial SSE}{\partial w_1}\right]=\sum_i 2(w_0+w_1x_i-y_i)\,[1,x_i]$
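
That gradient translates directly into numpy; a sketch assuming 1-D inputs x, targets y, and weights w = [w0, w1]:

import numpy as np

def sse_gradient(w, x, y):
    # residuals of the linear model w0 + w1*x
    r = w[0] + w[1] * x - y
    # G = sum over i of 2*(w0 + w1*x_i - y_i) * [1, x_i]
    return np.array([np.sum(2 * r), np.sum(2 * r * x)])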

Update Weights

The weights are usually initialized randomly and then updated according to the gradient:

import numpy as np

# First, normalize the gradient to unit length
gradient = gradient / np.linalg.norm(gradient)
# Then scale by the learning rate (typically 0.0001-0.01)
gradient *= lr
# Finally, apply the update to the weights
weights -= gradient
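
Putting the gradient and the update rule together into a complete loop; the learning rate and step count are the two knobs the notebook above experiments with (values here are illustrative):

import numpy as np

x = np.array([0.0, 1.0, 2.0, 3.0])
y = 1.0 + 2.0 * x                  # data generated by w0=1, w1=2
weights = np.random.randn(2)       # random initialization
lr = 0.01                          # learning rate

for _ in range(5000):
    r = weights[0] + weights[1] * x - y                 # residuals
    gradient = np.array([np.sum(2 * r), np.sum(2 * r * x)])
    gradient = gradient / np.sqrt(np.sum(gradient**2))  # normalize
    weights -= lr * gradient                            # update
print(weights)  # should end up close to [1, 2]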
