Andrew Ng's Machine Learning

I finished watching Andrew Ng’s Machine Learning courses! They’re a really nice starting point for beginners.

Overview


  • Supervised Machine Learning: Regression and Classification
    • Linear regression
    • Logistic regression
    • Gradient descent
  • Advanced Learning Algorithms
    • Neural networks
    • Decision trees
    • Advice for ML
  • Unsupervised Learning, Recommenders, Reinforcement learning
    • Clustering
    • Anomaly detection
    • Collaborative filtering
    • Content-based filtering
    • Reinforcement learning

1️⃣ Machine Learning


🌟 Algorithms

Supervised learning

  • Maps input x to output y
  • Learns from being given the “right answers”
  • 2 major types:
    • Regression:
      • Predict a number
      • Infinitely many possible outputs
    • Classification:
      • Predict categories (don’t have to be numbers)
      • A small, finite number of possible outputs, like (0, 1)
      • With two or more inputs: find the boundary line that separates the categories

Unsupervised learning

  • Supervised learning learns from data labeled with the right answers;
    unsupervised learning finds something interesting in unlabeled data

  • Given inputs x but no output labels y, it finds structure in the data

  • Main types:

    • Clustering
    • Anomaly detection
    • Dimensionality reduction

Reinforcement learning

🌟 Supervised Machine Learning Models

Linear regression (one variable)

  • Cost function
  • Gradient descent
  • Learning rate
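
As a minimal sketch of how the cost function, gradient descent and the learning rate fit together, the NumPy code below fits $f(x) = wx + b$ to a few points. The function names, the toy data and the values of `alpha` and `num_iters` are assumptions chosen for illustration, not taken from the course.

```python
import numpy as np

def compute_cost(x, y, w, b):
    """Squared-error cost J(w, b) = (1 / 2m) * sum((w*x + b - y)^2)."""
    m = x.shape[0]
    return np.sum((w * x + b - y) ** 2) / (2 * m)

def gradient_descent(x, y, alpha=0.1, num_iters=1000):
    """Batch gradient descent for f(x) = w*x + b, starting from w = b = 0."""
    w, b = 0.0, 0.0
    m = x.shape[0]
    for _ in range(num_iters):
        err = w * x + b - y            # prediction error for every example
        dj_dw = np.sum(err * x) / m    # partial derivative of J w.r.t. w
        dj_db = np.sum(err) / m        # partial derivative of J w.r.t. b
        w -= alpha * dj_dw             # alpha is the learning rate
        b -= alpha * dj_db
    return w, b

# Toy data generated from y = 2x + 1 (made up for illustration)
x_train = np.array([0.0, 1.0, 2.0, 3.0])
y_train = 2.0 * x_train + 1.0
w, b = gradient_descent(x_train, y_train)
final_cost = compute_cost(x_train, y_train, w, b)
```

With a learning rate that is too large the updates can overshoot and diverge; with one that is too small, convergence just takes more iterations.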

Multiple linear regression (multiple variables)

  • Vectorization
  • Feature scaling (normalization)
    • Can make gradient descent converge faster
  • Feature engineering
  • Polynomial regression

Logistic regression → classification

  • Sigmoid function
  • Decision boundary
  • Apply gradient descent to logistic regression
  • Loss & cost
    • To make the logistic cost function convex, use the loss $-\log(f)$ when $y=1$ and $-\log(1-f)$ when $y=0$, then sum over the training examples
  • Compare linear regression & logistic regression
    • Same
      • Learning curve
      • Vectorized implementation
      • Feature scaling
    • Different
      • Linear regression: $f=\vec{w} \cdot \vec{x}+b$
      • Logistic regression: $f = \frac{1}{1+e^{-(\vec{w} \cdot \vec{x} + b)}}$
  • Overfitting
    • Addressing overfitting
      • Collect more training examples
      • Select features
      • Regularization: reduce the size of parameters $w_j$
    • Regularization
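
The sigmoid model and the log loss from the logistic regression notes above can be sketched in NumPy as follows. This is only an illustration: the function names and the tiny dataset are made up, and regularization is left out.

```python
import numpy as np

def sigmoid(z):
    """Map any real number to the range (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def logistic_cost(X, y, w, b):
    """Average logistic loss over the m training examples."""
    f = sigmoid(X @ w + b)                            # f = sigmoid(w . x + b)
    loss = -y * np.log(f) - (1 - y) * np.log(1 - f)   # -log(f) if y=1, -log(1-f) if y=0
    return np.mean(loss)

# Made-up data: 3 examples with 2 features each
X = np.array([[0.5, 1.5], [1.0, 1.0], [3.0, 0.5]])
y = np.array([0, 0, 1])
w = np.zeros(2)
b = 0.0
cost = logistic_cost(X, y, w, b)   # equals log(2) when w = 0 and b = 0
```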

2️⃣ Neural Network in TensorFlow


Numpy arrays & Tensors

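import numpy as np
from tensorflow.keras.layers import Dense

x = np.array([[200.0, 17.0]])  # 2D input: 1 example with 2 features
layer_1 = Dense(units=3, activation='sigmoid')
a1 = layer_1(x)
# print(a1)
# -> tf.Tensor([[0.2 0.7 0.3]], shape=(1,3), dtype=float32)
a1_np = a1.numpy()  # convert the Tensor back to a NumPy array
# print(a1_np)
# -> array(…)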

Building a neural network architecture

model = Sequential([
    Dense(units=25, activation="sigmoid"),
    Dense(units=15, activation="sigmoid"),
    Dense(units=1, activation="sigmoid")])
x = np.array(...)
y = np.array(...)
model.compile(...)
model.fit(x, y)
model.predict(x_new)

Training a neural network in TensorFlow

import tensorflow as tf
from tensorflow.keras import Sequential
from tensorflow.keras.layers import Dense

# Step 1: specify the model
model = Sequential([
    Dense(units=25, activation='relu'),
    Dense(units=15, activation='relu'),
    Dense(units=1, activation='sigmoid')
])

# Step 2: compile the model
from tensorflow.keras.losses import BinaryCrossentropy
model.compile(loss=BinaryCrossentropy())

# Step 3: train the model (X, Y are the training inputs and labels)
model.fit(X, Y, epochs=100)

Choosing an activation function

Output layer

For binary classification, use sigmoid

For regression where y can be positive or negative, use linear

For regression where y is never negative, use ReLU

Hidden layer

ReLU is the most common choice

Multi-class classification: Softmax

import tensorflow as tf
from tensorflow.keras import Sequential
from tensorflow.keras.layers import Dense

model = Sequential([
    Dense(units=25, activation='relu'),
    Dense(units=15, activation='relu'),
    Dense(units=10, activation='softmax')
])

from tensorflow.keras.losses import SparseCategoricalCrossentropy
model.compile(loss=SparseCategoricalCrossentropy())

model.fit(X, Y, epochs=100)

But don’t use this version!

More numerically accurate implementation

For logistic loss

import tensorflow as tf
from tensorflow.keras import Sequential
from tensorflow.keras.layers import Dense

model = Sequential([
    Dense(units=25, activation='relu'),
    Dense(units=15, activation='relu'),
    Dense(units=1, activation='linear')   # output the logit z, not sigmoid(z)
])

from tensorflow.keras.losses import BinaryCrossentropy
model.compile(loss=BinaryCrossentropy(from_logits=True))

model.fit(X, Y, epochs=100)

For softmax

import tensorflow as tf
from tensorflow.keras import Sequential
from tensorflow.keras.layers import Dense

model = Sequential([
    Dense(units=25, activation='relu'),
    Dense(units=15, activation='relu'),
    Dense(units=10, activation='linear')   # output logits, not probabilities
])

from tensorflow.keras.losses import SparseCategoricalCrossentropy
model.compile(loss=SparseCategoricalCrossentropy(from_logits=True))

model.fit(X, Y, epochs=100)

# predict: the model outputs logits, so apply softmax to get probabilities
logits = model(X)
f_x = tf.nn.softmax(logits)
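
To see why working from logits is more numerically accurate, here is a small NumPy sketch (not TensorFlow's actual code) that compares the two ways of computing the logistic loss for an extreme logit. The function names and the value `z = -800` are assumptions picked to trigger the problem; the rearranged formula is the standard max(z, 0) - z*y + log(1 + exp(-|z|)) form.

```python
import numpy as np

def naive_logistic_loss(z, y):
    """Compute the sigmoid first, then take logs (can hit log(0))."""
    a = 1.0 / (1.0 + np.exp(-z))
    return -y * np.log(a) - (1 - y) * np.log(1 - a)

def stable_logistic_loss(z, y):
    """The same loss, rearranged to work on the logit z directly."""
    return np.maximum(z, 0) - z * y + np.log1p(np.exp(-np.abs(z)))

z, y = -800.0, 1.0
with np.errstate(all="ignore"):          # silence the overflow / log(0) warnings
    naive = naive_logistic_loss(z, y)    # sigmoid rounds to 0, so the loss becomes inf
stable = stable_logistic_loss(z, y)      # stays finite: 800.0

# For moderate logits both versions agree
moderate_naive = naive_logistic_loss(2.0, 1.0)
moderate_stable = stable_logistic_loss(2.0, 1.0)
```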

Adam algorithm

Adam: Adaptive Moment estimation

Adapts the learning rate automatically during training, so it usually converges faster than plain gradient descent

import tensorflow as tf
from tensorflow.keras import Sequential
from tensorflow.keras.layers import Dense

# Model
model = Sequential([
    Dense(units=25, activation='sigmoid'),
    Dense(units=15, activation='sigmoid'),
    Dense(units=10, activation='linear')
])

# Compile
model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=1e-3),
              loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True))

# Fit
model.fit(X, Y, epochs=100)

Convolutional layer

Each neuron only looks at part of the previous layer’s output

Faster computation; needs less training data (less prone to overfitting)

Evaluating a model

Split the data into 3 parts:

  • Training set
  • Cross validation set (validation / development / dev set)
  • Test set

When you’re making decisions about the model, use only the training set and the dev set, and don’t look at the test set at all. After you decide on the final model, evaluate it on the test set.
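
A minimal sketch of such a three-way split in NumPy is shown below. The helper name `split_data`, the 60/20/20 proportions and the toy arrays are assumptions for illustration only.

```python
import numpy as np

def split_data(X, y, train_frac=0.6, cv_frac=0.2, seed=0):
    """Shuffle the examples, then split into training / cross validation / test sets."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(X))            # shuffled example indices
    n_train = int(train_frac * len(X))
    n_cv = int(cv_frac * len(X))
    train, cv, test = np.split(idx, [n_train, n_train + n_cv])
    return (X[train], y[train]), (X[cv], y[cv]), (X[test], y[test])

# Toy data: 10 examples with 2 features each
X = np.arange(20).reshape(10, 2)
y = np.arange(10)
(X_train, y_train), (X_cv, y_cv), (X_test, y_test) = split_data(X, y)
```

Model choices (architecture, regularization strength, and so on) would then be compared on `X_cv`, with `X_test` touched only once at the end.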

Bias / variance

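Bias is the gap between the training error and the baseline (e.g., human-level performance): a large gap means high bias

Variance is the gap between the cross validation error and the training error: a large gap means high variance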

Regularization

from tensorflow.keras.regularizers import L2
layer = Dense(units=25, activation="relu", kernel_regularizer=L2(0.01))

Transfer learning

  1. Download neural network parameters pretrained on a large dataset with same input type (e.g., images, audio, text) as your application
  2. Fine-tune the network on your own data

Full cycle of a machine learning project

  1. Define project
  2. Define and collect data
  3. Training, error analysis & iterative improvement
  4. Deploy, monitor and maintain system

Confusion matrix, precision, recall

Precision = true positives / (predicted positives) = true pos / (true pos + false pos)

Recall = true positives / (actual positives) = true pos / ( true pos + false neg)

Tradeoff:
High threshold → higher precision, lower recall;
Low threshold → lower precision, higher recall

F1 score = $\frac{1}{\frac{1}{2}(\frac{1}{P}+\frac{1}{R})} = \frac{2PR}{P+R}$ is the harmonic mean of P and R: an average that pays more attention to whichever is lower
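
The definitions above translate directly into a few lines of NumPy. This is a sketch with a hypothetical helper name and made-up labels, where 1 marks the positive class.

```python
import numpy as np

def precision_recall_f1(y_true, y_pred):
    """Compute precision, recall and F1 for binary labels (1 = positive)."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    tp = np.sum((y_pred == 1) & (y_true == 1))   # true positives
    fp = np.sum((y_pred == 1) & (y_true == 0))   # false positives
    fn = np.sum((y_pred == 0) & (y_true == 1))   # false negatives
    precision = tp / (tp + fp) if (tp + fp) > 0 else 0.0
    recall = tp / (tp + fn) if (tp + fn) > 0 else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom > 0 else 0.0
    return precision, recall, f1

y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 0, 1, 1, 0, 0, 0, 0]
p, r, f1 = precision_recall_f1(y_true, y_pred)
# Here tp = 2, fp = 1, fn = 2, so p = 2/3, r = 1/2 and f1 = 4/7
```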

Decision tree model

Decision tree predicts category

Regression tree predicts number

Tree ensembles (more robust; use sampling with replacement):

  • Random forest
  • XGBoost (eXtreme Gradient Boosting): more likely to pick examples that the previously trained trees misclassify
# Classification
from xgboost import XGBClassifier
model = XGBClassifier()
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

# Regression
from xgboost import XGBRegressor
model = XGBRegressor()
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

Decision trees vs Neural networks:

| | Decision trees | Neural networks |
|---|---|---|
| Works well on | Structured (tabular) data | Structured data and unstructured (images, audio, text) data |
| Speed | Fast | May be slower |
| Transfer learning | No | Yes |
| Interpretability | Maybe (small trees) | No |
| Building a system of multiple models | Hard | Easy |

3️⃣ Unsupervised Learning


Clustering

K-means

Step 1: Randomly initialize K cluster centroids;

Step 2: Assign each point to its closest centroid;

Step 3: Recompute each centroid as the mean of the points assigned to it; repeat steps 2 and 3 until the centroids converge.
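
The three steps above can be sketched in NumPy as follows. This is an illustrative version, not the course's implementation: it assumes the initial centroids are K randomly chosen training points, and the function name and toy data are made up.

```python
import numpy as np

def k_means(X, K, num_iters=100, seed=0):
    """Return (centroids, labels) after running K-means on the rows of X."""
    rng = np.random.default_rng(seed)
    # Step 1: pick K distinct training points as the initial centroids
    centroids = X[rng.choice(len(X), size=K, replace=False)].astype(float)
    for _ in range(num_iters):
        # Step 2: assign each point to its closest centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = np.argmin(dists, axis=1)
        # Step 3: move each centroid to the mean of its assigned points
        new_centroids = centroids.copy()
        for k in range(K):
            if np.any(labels == k):          # leave empty clusters where they are
                new_centroids[k] = X[labels == k].mean(axis=0)
        if np.allclose(new_centroids, centroids):
            break                            # converged
        centroids = new_centroids
    return centroids, labels

# Two well-separated toy blobs
X = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
              [5.0, 5.0], [5.1, 4.9], [4.9, 5.2]])
centroids, labels = k_means(X, K=2)
```

Because the result depends on the random initialization, it is common to run K-means several times and keep the run with the lowest cost.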

How to choose K?

  • Elbow method
  • Evaluate K-means based on how well it performs for the later (downstream) purpose

Anomaly detection

Step 1: Choose n features that you think might be indicative of anomalous examples;

Step 2: Fit a Gaussian distribution;

Step 3: Given new example x, compute p(x), anomaly if p(x) < epsilon
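
A minimal NumPy sketch of these three steps follows, assuming each feature is modeled by its own independent Gaussian. The function names, the synthetic training data and the threshold `epsilon = 1e-4` are assumptions for illustration; in practice epsilon would be tuned on a cross validation set.

```python
import numpy as np

def fit_gaussian(X):
    """Estimate a mean and a variance for each feature independently."""
    mu = X.mean(axis=0)
    var = X.var(axis=0)
    return mu, var

def density(X, mu, var):
    """p(x) as the product of the per-feature Gaussian densities."""
    coef = 1.0 / np.sqrt(2 * np.pi * var)
    return np.prod(coef * np.exp(-((X - mu) ** 2) / (2 * var)), axis=1)

# Synthetic "normal" examples with 2 features (made up for illustration)
rng = np.random.default_rng(0)
X_train = rng.normal(loc=[5.0, 10.0], scale=[1.0, 2.0], size=(500, 2))
mu, var = fit_gaussian(X_train)

epsilon = 1e-4                                  # hypothetical threshold
X_new = np.array([[5.0, 10.0], [12.0, 30.0]])   # a typical point and an outlier
is_anomaly = density(X_new, mu, var) < epsilon
```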

| Anomaly detection | Supervised learning |
|---|---|
| Large number of negative (y=0) examples, very small number of positive (y=1) examples | Large number of both positive and negative examples |
| Future positive examples may be previously unseen, brand new | Future positive examples are similar to ones previously seen |

Collaborative filtering

  • Finds related items
  • TensorFlow can find the derivatives of the cost automatically (Auto Diff)
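import tensorflow as tf

w = tf.Variable(3.0)
x = 1.0
y = 1.0        # target value
alpha = 0.01   # learning rate

# Auto Diff: the tape records the operations and computes dJ/dw for us
iterations = 30
for _ in range(iterations):
    with tf.GradientTape() as tape:
        fwb = w * x
        costJ = (fwb - y) ** 2
    [dJdw] = tape.gradient(costJ, [w])
    w.assign_add(-alpha * dJdw)   # gradient descent step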

You can also use the Adam optimization algorithm

Content-based filtering

import tensorflow as tf

user_NN = tf.keras.models.Sequential([
    tf.keras.layers.Dense(256, activation='relu'),
    tf.keras.layers.Dense(128, activation='relu'),
    tf.keras.layers.Dense(32)
])

item_NN = tf.keras.models.Sequential([
    tf.keras.layers.Dense(256, activation='relu'),
    tf.keras.layers.Dense(128, activation='relu'),
    tf.keras.layers.Dense(32)
])

# num_user_features / num_item_features: number of input features per user / item
input_user = tf.keras.layers.Input(shape=(num_user_features,))
vu = user_NN(input_user)
vu = tf.linalg.l2_normalize(vu, axis=1)   # unit-length user vector

input_item = tf.keras.layers.Input(shape=(num_item_features,))
vm = item_NN(input_item)
vm = tf.linalg.l2_normalize(vm, axis=1)   # unit-length item vector

# The prediction is the dot product of the user and item vectors
output = tf.keras.layers.Dot(axes=1)([vu, vm])

model = tf.keras.Model([input_user, input_item], output)

cost_fn = tf.keras.losses.MeanSquaredError()

| | Collaborative filtering | Content-based filtering |
|---|---|---|
| Recommends items to you based on | Ratings of users who gave similar ratings as you | Features of users and items to find a good match |

Mean normalization makes the algorithm run faster and give more reasonable predictions, especially for users who have rated few or no items

Reinforcement Learning

Reward function

“Good dog”, “bad dog”

Return

The return is the sum of the rewards, weighted by the discount factor

It depends on the actions you take
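
As a small worked example of the return, the sketch below sums a sequence of rewards weighted by powers of the discount factor. The reward sequence and `gamma = 0.5` are made up for illustration.

```python
def discounted_return(rewards, gamma):
    """Return = R1 + gamma * R2 + gamma^2 * R3 + ..."""
    return sum((gamma ** t) * r for t, r in enumerate(rewards))

# Rewards collected along one path: nothing for three steps, then 100
rewards = [0, 0, 0, 100]
g = discounted_return(rewards, gamma=0.5)   # 0.5**3 * 100 = 12.5
```

A different sequence of actions would produce a different reward sequence, and therefore a different return.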

Policy

A policy is a function $\pi(s) = a$ that maps states to actions

The goal of reinforcement learning is to find a policy pi so as to maximize the return

Markov Decision Process (MDP)

“Markov” means that the future depends only on the current state, not on anything that happened before reaching the current state

In MDP, the future only depends on where you are now, not on how you got here

State action value function (Q-function)

Q(s, a) = the return if you:

  • Start in state s
  • Take action a (once)
  • Then behave optimally after that

The best possible return from state s is $\max_a Q(s, a)$

The best possible action in state s is the action a that gives $\max_a Q(s, a)$

Bellman equation

$Q(s, a) = R(s) + \gamma \max_{a'} Q(s', a')$

  • $R(s)$ is the reward you get right away
  • $\gamma \max_{a'} Q(s', a')$ is the discounted return from behaving optimally starting from the next state s’
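
To make the Bellman equation concrete, here is a sketch that applies it repeatedly to a tiny deterministic world: six states in a row, where the two end states are terminal. The rewards, `gamma = 0.5` and the layout are assumptions made up for illustration.

```python
import numpy as np

rewards = np.array([100.0, 0.0, 0.0, 0.0, 0.0, 40.0])   # R(s) for states 0..5
terminal = [0, 5]                                        # episode ends in these states
gamma = 0.5
LEFT, RIGHT = 0, 1

Q = np.zeros((6, 2))
for _ in range(50):                       # repeat until the values stop changing
    Q_new = np.zeros_like(Q)
    for s in range(6):
        for a, step in ((LEFT, -1), (RIGHT, +1)):
            if s in terminal:
                Q_new[s, a] = rewards[s]  # nothing happens after a terminal state
            else:
                s_next = s + step
                # Bellman equation: Q(s, a) = R(s) + gamma * max_a' Q(s', a')
                Q_new[s, a] = rewards[s] + gamma * Q[s_next].max()
    Q = Q_new

best_actions = Q.argmax(axis=1)           # greedy policy: pi(s) = argmax_a Q(s, a)
# Q[1] is [50.0, 12.5]: going left from state 1 reaches the reward of 100 in one step
```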

DQN (Deep Q-Network) algorithm

Initialize neural network randomly as guess of Q(s, a).
Repeat {
    Take actions in the lunar lander. Get (s, a, R(s), s’).
    Store the 10,000 most recent (s, a, R(s), s’) tuples.

    Train neural network:
        Create training set of 10,000 examples using
            x = (s, a) and y = R(s) + gamma * max_a’ Q(s’, a’)
        Train Q_new such that Q_new(s, a) ≈ y.

    Set Q = Q_new.
}

Refinement

  • Epsilon-greedy policy

E.g., epsilon = 0.05 means:

With probability 0.95, pick the action a that maximizes Q(s, a) (greedy, “exploitation”)

With probability 0.05, pick an action a randomly (“exploration”)

It’s good to set epsilon high at the beginning and gradually decrease it

  • Mini batch

Faster

  • Soft update

Change “set Q = Q_new” (i.e. W,B = W_new,B_new) to

W = 0.01W_new + 0.99W
B = 0.01B_new + 0.99B
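
Two of the refinements above, the epsilon-greedy policy and the soft update, can be sketched in a few lines of NumPy. The function names, the example Q values and the weight arrays are hypothetical; a real DQN would apply the soft update to the weights of its Q-network.

```python
import numpy as np

def epsilon_greedy(q_values, epsilon, rng):
    """With probability epsilon pick a random action, otherwise the greedy one."""
    if rng.random() < epsilon:
        return int(rng.integers(len(q_values)))   # exploration
    return int(np.argmax(q_values))               # exploitation

def soft_update(weights, new_weights, tau=0.01):
    """W = tau * W_new + (1 - tau) * W, applied to every weight array."""
    return [tau * w_new + (1 - tau) * w
            for w, w_new in zip(weights, new_weights)]

rng = np.random.default_rng(0)
q_values = np.array([1.0, 5.0, 2.0, 0.5])         # made-up Q(s, a) for 4 actions
greedy_action = epsilon_greedy(q_values, epsilon=0.0, rng=rng)   # always action 1
random_action = epsilon_greedy(q_values, epsilon=1.0, rng=rng)   # always random

W, W_new = [np.zeros(3)], [np.ones(3)]
W = soft_update(W, W_new)                          # each entry moves 1% of the way to W_new
```

The soft update changes the network only gradually, which makes training less likely to jump to a worse Q estimate after one unlucky batch.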
