I finished watching Andrew Ng’s Machine Learning courses! They’re a really nice starting point for beginners.
Overview
- Supervised Machine Learning: Regression and Classification
  - Linear regression
  - Logistic regression
  - Gradient descent
- Advanced Learning Algorithms
  - Neural networks
  - Decision trees
  - Advice for ML
- Unsupervised Learning, Recommenders, Reinforcement learning
  - Clustering
  - Anomaly detection
  - Collaborative filtering
  - Content-based filtering
  - Reinforcement learning
1️⃣ Machine Learning
🌟 Algorithms
Supervised learning
- Maps input x to output y
- Learns from being given the “right answers”
- 2 major types:
  - Regression:
    - Predicts a number
    - Infinitely many possible outputs
  - Classification:
    - Predicts categories (they don’t have to be numbers)
    - A small, finite number of possible outputs, such as 0 or 1
    - With two or more inputs: find the boundary line that separates the categories
Unsupervised learning
Supervised learning learns from data labeled with the right answers; unsupervised learning finds something interesting in unlabeled data.
The data comes with inputs x but no output labels y, and the algorithm finds structure in the data.
Main types:
- Clustering
- Anomaly detection
- Dimensionality reduction
Reinforcement learning
🌟 Supervised Machine Learning Models
Linear regression (one variable)
- Cost function
- Gradient descent
- Learning rate
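To make the three pieces above concrete, here is a minimal sketch of one-variable linear regression trained with batch gradient descent. The toy data, the learning rate `alpha=0.05`, and the step count are assumptions picked for illustration, not values from the course.

```python
def compute_cost(xs, ys, w, b):
    """Squared-error cost J(w, b) = (1 / 2m) * sum((w*x + b - y)^2)."""
    m = len(xs)
    return sum((w * x + b - y) ** 2 for x, y in zip(xs, ys)) / (2 * m)


def gradient_descent(xs, ys, alpha=0.05, steps=5000):
    """Fit f(x) = w*x + b by batch gradient descent, starting from w = b = 0."""
    m = len(xs)
    w, b = 0.0, 0.0
    for _ in range(steps):
        errors = [w * x + b - y for x, y in zip(xs, ys)]
        dj_dw = sum(e * x for e, x in zip(errors, xs)) / m  # dJ/dw
        dj_db = sum(errors) / m                             # dJ/db
        # Update both parameters simultaneously; alpha is the learning rate.
        w -= alpha * dj_dw
        b -= alpha * dj_db
    return w, b


# Toy data generated from y = 2x + 1 (made up for illustration).
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]
w, b = gradient_descent(xs, ys)
print(round(w, 3), round(b, 3), compute_cost(xs, ys, w, b))
```

If the learning rate is too large the cost can grow instead of shrink, so it is worth printing the cost every few steps when trying other values.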
Multiple linear regression (multiple variables)
- Vectorization
- Feature scaling (normalization)
  - Can make gradient descent converge faster
- Feature engineering
- Polynomial regression
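Feature scaling can be done in several ways; one common option is z-score normalization. A minimal sketch, assuming a single feature column with non-zero spread (the `sizes_sqft` values are hypothetical):

```python
import statistics


def zscore_scale(column):
    """Rescale one feature column to mean 0 and standard deviation 1."""
    mu = statistics.mean(column)
    sigma = statistics.pstdev(column)  # population standard deviation
    return [(v - mu) / sigma for v in column], mu, sigma


sizes_sqft = [2104.0, 1416.0, 852.0, 1534.0]  # hypothetical feature values
scaled, mu, sigma = zscore_scale(sizes_sqft)
print(scaled)
```

Keep `mu` and `sigma` around: new inputs have to be scaled with the same values before prediction.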
Logistic regression → classification
- Sigmoid function
- Decision boundary
- Apply gradient descent to logistic regression
- Loss & cost
  - To keep the logistic cost function convex, use the loss $-\log(f)$ when $y=1$ and $-\log(1-f)$ when $y=0$, then sum (average) the loss over the training examples to get the cost
- Compare linear regression & logistic regression
  - Same
    - Learning curve
    - Vectorized implementation
    - Feature scaling
  - Different
    - Linear regression: $f=\vec{w} \cdot \vec{x}+b$
    - Logistic regression: $f = \frac{1}{1+e^{-(\vec{w} \cdot \vec{x} + b)}}$
- Overfitting
  - Addressing overfitting
    - Collect more training examples
    - Select features
    - Regularization: reduce the size of parameters $w_j$
  - Regularization
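A minimal sketch of the logistic regression pieces listed above: the sigmoid, the per-example loss, and a cost with an L2 regularization term. The term $\frac{\lambda}{2m}\sum_j w_j^2$ is the usual L2 form and the name `lambda_` is my own; `b` is left unregularized here by assumption.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def predict(w, b, x):
    """f(x) = sigmoid(w . x + b)."""
    z = sum(wj * xj for wj, xj in zip(w, x)) + b
    return sigmoid(z)


def logistic_loss(f, y):
    """-log(f) when y = 1, -log(1 - f) when y = 0."""
    return -y * math.log(f) - (1 - y) * math.log(1 - f)


def regularized_cost(X, Y, w, b, lambda_=0.0):
    """Average logistic loss plus (lambda / 2m) * sum(w_j^2)."""
    m = len(X)
    data_term = sum(logistic_loss(predict(w, b, x), y) for x, y in zip(X, Y)) / m
    reg_term = lambda_ / (2 * m) * sum(wj ** 2 for wj in w)
    return data_term + reg_term


# Tiny made-up dataset with two features per example.
X = [(0.5, 1.5), (1.0, 1.0), (3.0, 0.5), (2.0, 2.0)]
Y = [0, 0, 1, 1]
print(regularized_cost(X, Y, w=[1.0, 1.0], b=-3.0, lambda_=1.0))
```

A larger `lambda_` pushes the weights toward zero, which is the "reduce the size of parameters" idea in the list above.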
2️⃣ Neural Network in TensorFlow
Numpy arrays & Tensors
x = np.array([200.0, 17.0])
Building a neural network architecture
model = Sequential([
Training a neural network in TensorFlow
import tensorflow as tf
Choosing an activation function
Output layer
For binary classification, use sigmoid
For regression where y can be positive or negative, use linear
For regression where y can only be non-negative, use ReLU
Hidden layer
ReLU is the most common choice
Multi-class classification: Softmax
import tensorflow as tf
But don’t use this version!
More numerically accurate implementation
For logistic loss
import tensorflow as tf
For softmax
import tensorflow as tf
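One common reason a rearranged implementation is more accurate: exponentials of large scores overflow (in plain Python, `math.exp(1000)` raises `OverflowError`), while shifting by the largest score first keeps every exponent at or below zero. The sketch below illustrates that idea in plain Python. It is not TensorFlow's internal code, and the function names are my own.

```python
import math


def stable_softmax(z):
    """Softmax computed after subtracting max(z), so exp() never overflows."""
    shift = max(z)
    exps = [math.exp(v - shift) for v in z]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy_from_scores(z, label):
    """-log(softmax(z)[label]) via log-sum-exp, without forming probabilities."""
    shift = max(z)
    log_sum_exp = shift + math.log(sum(math.exp(v - shift) for v in z))
    return log_sum_exp - z[label]


scores = [1000.0, 1001.0, 1002.0]  # large raw scores from a linear output layer
print(stable_softmax(scores))
print(cross_entropy_from_scores(scores, label=2))
```

Computing the loss directly from the raw scores, instead of first turning them into probabilities and then taking a log, avoids one extra round of rounding error.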
Adam algorithm
Adam: Adaptive Moment Estimation
Adjusts the learning rate adaptively for each parameter, so training usually converges faster than with plain gradient descent
# Model
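To show what "adaptive" means, here is a minimal single-parameter sketch of the Adam update rule. The defaults `beta1=0.9`, `beta2=0.999`, `eps=1e-8` are the commonly quoted ones; the toy objective $(w-3)^2$, `alpha=0.1`, and the step count are assumptions for illustration.

```python
import math


def adam_minimize(grad, w0, alpha=0.1, beta1=0.9, beta2=0.999, eps=1e-8, steps=500):
    """Minimize a 1-D function given its gradient, using the Adam update."""
    w = w0
    m = 0.0  # running average of gradients (first moment)
    v = 0.0  # running average of squared gradients (second moment)
    for t in range(1, steps + 1):
        g = grad(w)
        m = beta1 * m + (1 - beta1) * g
        v = beta2 * v + (1 - beta2) * g * g
        m_hat = m / (1 - beta1 ** t)  # bias correction for the zero start
        v_hat = v / (1 - beta2 ** t)
        # The effective step size is scaled by the recent gradient magnitude.
        w -= alpha * m_hat / (math.sqrt(v_hat) + eps)
    return w


# Toy objective J(w) = (w - 3)^2, so dJ/dw = 2 * (w - 3).
w_min = adam_minimize(lambda w: 2 * (w - 3), w0=0.0)
print(w_min)
```

In a real network the same update runs independently for every weight, which is why each parameter ends up with its own effective learning rate.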
Convolutional layer
Each neuron only looks at part of the previous layer’s input
Faster computation; need less training data (less prone to overfitting)
Evaluating a model
Split the data into 3 parts:
- Training set
- Cross validation set (validation / development / dev set)
- Test set
When you’re making decisions about the model, use only the training set and the dev set, and don’t look at the test set at all. Once you’ve decided on the final model, evaluate it on the test set.
Bias / variance
High bias shows up as a large gap between the training error and the baseline performance
High variance shows up as a large gap between the cross validation error and the training error
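A small sketch of that diagnosis: compare the training error with the baseline, and the cross validation error with the training error. The `tolerance` threshold for calling a gap "large" is a hypothetical knob, and the error numbers in the usage lines are made up.

```python
def diagnose(baseline_error, train_error, cv_error, tolerance=0.02):
    """Return (high_bias, high_variance) from three error rates.

    tolerance is a hypothetical threshold for what counts as a large gap.
    """
    high_bias = (train_error - baseline_error) > tolerance
    high_variance = (cv_error - train_error) > tolerance
    return high_bias, high_variance


# Made-up error rates, as fractions of misclassified examples.
print(diagnose(baseline_error=0.106, train_error=0.108, cv_error=0.148))
print(diagnose(baseline_error=0.106, train_error=0.150, cv_error=0.155))
```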
Regularization
layer = Dense(units=25, activation="relu", kernel_regularizer=L2(0.01))
Transfer learning
- Download neural network parameters pretrained on a large dataset with same input type (e.g., images, audio, text) as your application
- Fine tune the network on your own data
Full cycle of a machine learning project
- Define project
- Define and collect data
- Training, error analysis & iterative improvement
- Deploy, monitor and maintain system
Confusion matrix, precision, recall
Precision = true positives / (predicted positives) = true pos / (true pos + false pos)
Recall = true positives / (actual positives) = true pos / ( true pos + false neg)
Tradeoff:
High threshold → higher precision, lower recall;
Low threshold → lower precision, higher recall
F1 score = $\frac{1}{\frac{1}{2}(\frac{1}{P}+\frac{1}{R})} = \frac{2PR}{P+R}$ is the harmonic mean of P and R, an average that pays more attention to whichever of the two is lower
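A quick worked example of these three formulas, using made-up confusion-matrix counts:

```python
def precision_recall_f1(true_pos, false_pos, false_neg):
    """Compute precision, recall and F1 from confusion-matrix counts."""
    precision = true_pos / (true_pos + false_pos)
    recall = true_pos / (true_pos + false_neg)
    f1 = 2 * precision * recall / (precision + recall)  # harmonic mean
    return precision, recall, f1


# Hypothetical counts: 15 true positives, 5 false positives, 10 false negatives.
p, r, f1 = precision_recall_f1(true_pos=15, false_pos=5, false_neg=10)
print(p, r, f1)
```

Here precision is 0.75 and recall is 0.6, and F1 comes out at about 0.667, closer to the lower of the two than the plain average of 0.675.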
Decision tree model
A decision tree predicts a category
A regression tree predicts a number
Tree ensembles (more robust than a single tree; each tree is trained on a sample drawn with replacement):
- Random forest
- XGBoost (eXtreme Gradient Boosting): when sampling for the next tree, it is more likely to pick examples that the previously trained trees misclassify
# Classification
Decision trees vs Neural networks:
Aspect | Decision trees | Neural networks |
---|---|---|
Works well on | Structured (tabular) data | Structured data and unstructured data (images, audio, text) |
Speed | Fast | May be slower |
Transfer learning | No | Yes |
Interpretability | Sometimes (small trees) | No |
Building a system of multiple models | Hard | Easy |
3️⃣ Unsupervised Learning
Clustering
K-means
Step 1: Randomly initialize K cluster centroids;
Step 2: Assign each point to its closest centroid;
Step 3: Recompute each centroid as the mean of the points assigned to it; repeat steps 2 and 3 until the centroids converge.
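The three steps can be sketched in plain Python as follows. This is a minimal version under a few assumptions of mine: points are tuples, the initial centroids are K randomly chosen training points, and a centroid that ends up with no points simply stays where it is.

```python
import random


def squared_distance(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))


def k_means(points, k, max_iters=100, seed=0):
    rng = random.Random(seed)
    # Step 1: pick K training points at random as the initial centroids.
    centroids = rng.sample(points, k)
    clusters = []
    for _ in range(max_iters):
        # Step 2: assign each point to its closest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda j: squared_distance(p, centroids[j]))
            clusters[nearest].append(p)
        # Step 3: move each centroid to the mean of its assigned points.
        new_centroids = []
        for j, members in enumerate(clusters):
            if members:
                new_centroids.append(tuple(sum(c) / len(members) for c in zip(*members)))
            else:
                new_centroids.append(centroids[j])  # empty cluster: keep old centroid
        if new_centroids == centroids:  # converged: nothing moved
            break
        centroids = new_centroids
    return centroids, clusters


# Two well-separated toy blobs (made up for illustration).
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
centroids, clusters = k_means(points, k=2)
print(sorted(centroids))
```

On harder data the result depends on the random start, so it is common to run K-means several times and keep the run with the lowest total distance.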
How to choose K?
- Elbow method
- Evaluate K-means based on how well it performs for the downstream purpose
Anomaly detection
Step 1: Choose n features that you think might be indicative of anomalous examples;
Step 2: Fit a Gaussian distribution;
Step 3: Given new example x, compute p(x), anomaly if p(x) < epsilon
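A minimal sketch of these steps, assuming one independent Gaussian per feature so that p(x) is the product of the per-feature densities. The training data and the `epsilon` value are made up; in practice epsilon would be tuned on a labeled validation set.

```python
import math


def fit_gaussian(X):
    """Estimate a mean and a variance for each feature."""
    m = len(X)
    n = len(X[0])
    mu = [sum(x[j] for x in X) / m for j in range(n)]
    var = [sum((x[j] - mu[j]) ** 2 for x in X) / m for j in range(n)]
    return mu, var


def p(x, mu, var):
    """Product of per-feature Gaussian densities."""
    prob = 1.0
    for xj, mj, vj in zip(x, mu, var):
        prob *= math.exp(-((xj - mj) ** 2) / (2 * vj)) / math.sqrt(2 * math.pi * vj)
    return prob


def is_anomaly(x, mu, var, epsilon):
    return p(x, mu, var) < epsilon


# Hypothetical "normal" examples with two features each.
X_train = [(1.0, 10.0), (1.2, 9.5), (0.8, 10.5), (1.1, 10.2), (0.9, 9.8)]
mu, var = fit_gaussian(X_train)
epsilon = 1e-3  # hypothetical threshold
print(is_anomaly((1.0, 10.0), mu, var, epsilon))
print(is_anomaly((3.0, 10.0), mu, var, epsilon))
```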
Anomaly detection | Supervised learning |
---|---|
Large number of negative (y=0) examples, very small number of positive examples (y=1) | Large number of positive and negative examples |
Future positive examples: previously unseen, brand new | Future positive examples: previously seen, similar |
Recommender Systems
Collaborative filtering
- Finds related items
- TensorFlow can find the derivatives automatically (automatic differentiation)
w = tf.Variable(3.0)
You can also use the Adam optimization algorithm
Content-based filtering
user_NN = tf.keras.models.Sequential([
Aspect | Collaborative filtering | Content-based filtering |
---|---|---|
Recommend items to you based on | Ratings of users who gave similar ratings as you | Features of users and items to find a good match |
Mean normalization makes the algorithm run faster and gives more reasonable predictions for users who have rated few or no items
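A minimal sketch of mean normalization on a ratings table. The layout is my assumption: one row per item, one column per user, and `None` where a user has not rated the item.

```python
def mean_normalize(ratings):
    """Subtract each item's mean rating from its known ratings.

    ratings[i][j] is item i's rating by user j, or None if unrated.
    """
    means = []
    normalized = []
    for row in ratings:
        rated = [r for r in row if r is not None]
        mu = sum(rated) / len(rated) if rated else 0.0
        means.append(mu)
        normalized.append([None if r is None else r - mu for r in row])
    return normalized, means


# Hypothetical ratings: 2 items, 3 users; the last user has rated nothing.
ratings = [[5.0, 4.0, None],
           [1.0, 3.0, None]]
normalized, means = mean_normalize(ratings)
print(normalized, means)
```

After training on the normalized ratings, the item mean is added back to each prediction, so a user with no ratings gets the item's average instead of a score near zero.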
Reinforcement Learning
Reward function
“Good dog”, “bad dog”
Return
The return is the sum of the rewards, weighted by the discount factor
It depends on the actions you take
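As a small worked example, the return for a sequence of rewards $R_1, R_2, R_3, \dots$ is $R_1 + \gamma R_2 + \gamma^2 R_3 + \dots$. The reward sequence and `gamma = 0.5` below are made up for illustration.

```python
def discounted_return(rewards, gamma):
    """Sum of rewards, where the reward at step t is weighted by gamma**t."""
    return sum((gamma ** t) * r for t, r in enumerate(rewards))


# Three steps with no reward, then a reward of 100 at the fourth step.
print(discounted_return([0.0, 0.0, 0.0, 100.0], gamma=0.5))  # 12.5
```

A different sequence of actions produces a different reward sequence, and therefore a different return.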
Policy
A policy is a function pi(s) = a that maps states to actions
The goal of reinforcement learning is to find a policy pi so as to maximize the return
Markov Decision Process (MDP)
“Markov” means that the future depends only on the current state (where you are now), not on anything that happened before you reached it (how you got here)
State action value function (Q-function)
Q(s, a) = return, if you:
- Start in state s
- Take action a (once)
- Then behave optimally after that
The best possible return from state s is max_a Q(s, a)
The best possible action in state s is the action a that maximizes Q(s, a)
Bellman equation
Q(s, a) = R(s) + gamma * max_a’ Q(s’, a’)
- R(s) is the reward you get right away
- gamma * max_a’ Q(s’, a’) is the discounted return from behaving optimally starting from the next state s’
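To see the Bellman equation at work, here is a minimal sketch that applies it repeatedly on a tiny deterministic world: six states in a row, two actions (left, right), terminal states at both ends. The layout, the rewards, and `gamma = 0.5` are assumptions chosen for illustration.

```python
def solve_q(rewards, terminal, gamma, sweeps=100):
    """Apply Q(s, a) = R(s) + gamma * max_a' Q(s', a') until values settle.

    States are 0..n-1 in a row; action -1 moves left, +1 moves right.
    In a terminal state the episode ends, so Q(s, a) is just R(s).
    """
    n = len(rewards)
    actions = (-1, +1)
    q = {(s, a): 0.0 for s in range(n) for a in actions}
    for _ in range(sweeps):
        for s in range(n):
            for a in actions:
                if s in terminal:
                    q[(s, a)] = rewards[s]
                else:
                    s_next = s + a
                    best_next = max(q[(s_next, a2)] for a2 in actions)
                    q[(s, a)] = rewards[s] + gamma * best_next
    return q


rewards = [100.0, 0.0, 0.0, 0.0, 0.0, 40.0]  # hypothetical rewards per state
q = solve_q(rewards, terminal={0, 5}, gamma=0.5)
for s in range(1, 5):
    print(s, q[(s, -1)], q[(s, +1)])
```

Reading off the larger of the two values in each state gives the best action there: in state 4, for example, going right (20) beats going left (6.25).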
DQN (Deep Q-Network) algorithm
Initialize the neural network randomly as a guess of Q(s, a).
Refinement
- Epsilon-greedy policy
E.g. epsilon = 0.05 means:
With probability 0.95, pick the action a that maximizes Q(s, a) (greedy, “exploitation”)
With probability 0.05, pick an action a randomly (“exploration”)
It’s good to set epsilon high at the beginning and gradually decrease it
- Mini batch
Each training step uses only a subset of the examples, so it runs faster
- Soft update
Change “set Q = Q_new” (i.e. W,B = W_new,B_new) to
W = 0.01W_new + 0.99W
B = 0.01B_new + 0.99B
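Two of these refinements are short enough to sketch in plain Python. The function names and the list-of-floats representation of the parameters are my own simplifications; a real implementation would apply the same update to every weight tensor of the network.

```python
import random


def epsilon_greedy(q_values, epsilon, rng=random):
    """Pick an action index given Q(s, a) for every action in the current state."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))  # exploration: random action
    return max(range(len(q_values)), key=lambda a: q_values[a])  # exploitation


def soft_update(current, new, tau=0.01):
    """Move each parameter a small fraction tau toward its newly trained value."""
    return [tau * w_new + (1 - tau) * w for w, w_new in zip(current, new)]


print(epsilon_greedy([1.0, 5.0, 2.0], epsilon=0.05))
print(soft_update([1.0, -2.0], [2.0, 0.0]))
```

With `tau=0.01` this is the "0.01 new + 0.99 old" rule above, which keeps one bad training step from overwriting a good Q estimate.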