Andrew Ng's Machine Learning

Apr 11 2023

I finished watching Andrew Ng’s Machine Learning courses! They’re a really nice starting point for beginners.

Overview

Supervised Machine Learning: Regression and Classification
- Linear regression
- Logistic regression
- Gradient descent
Advanced Learning Algorithms
- Neural networks
- Decision trees
- Advice for ML
Unsupervised Learning, Recommenders, Reinforcement learning
- Clustering
- Anomaly detection
- Collaborative filtering
- Content-based filtering
- Reinforcement learning

1️⃣ Machine Learning

🌟 Algorithms

Supervised learning

Maps input x to output y
Learns from being given “right answers”
2 major types:
- Regression:
  - Predict a number
  - Infinitely many possible outputs
- Classification:
  - Predict categories (they don’t have to be numbers)
  - A small, finite number of possible outputs, such as 0 or 1
  - With two or more inputs: find the decision boundary

Unsupervised learning

Supervised learning learns from data labeled with the right answers;
Unsupervised learning finds something interesting in unlabeled data
The data has inputs x but no output labels y, so the algorithm has to find structure in the data
Main types:
- Clustering
- Anomaly detection
- Dimensionality reduction

Reinforcement learning

🌟 Supervised Machine Learning Models

Linear regression (one variable)

Cost function
Gradient descent
Learning rate
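
To make these three pieces concrete, here is a minimal sketch in plain Python. It assumes the squared-error cost and batch gradient descent; the data, the learning rate `alpha=0.1` and the step count are made-up values for illustration only.

```python
def compute_cost(xs, ys, w, b):
    """Squared-error cost J(w, b) = (1 / 2m) * sum((w*x + b - y)^2)."""
    m = len(xs)
    return sum((w * x + b - y) ** 2 for x, y in zip(xs, ys)) / (2 * m)


def gradient_descent(xs, ys, alpha=0.1, steps=2000):
    """Batch gradient descent with learning rate alpha, starting from w = b = 0."""
    m = len(xs)
    w, b = 0.0, 0.0
    for _ in range(steps):
        errors = [w * x + b - y for x, y in zip(xs, ys)]
        dj_dw = sum(e * x for e, x in zip(errors, xs)) / m
        dj_db = sum(errors) / m
        # Update both parameters simultaneously.
        w, b = w - alpha * dj_dw, b - alpha * dj_db
    return w, b


# Made-up data that lies exactly on y = 2x + 1.
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]
w, b = gradient_descent(xs, ys)
final_cost = compute_cost(xs, ys, w, b)
```

With a learning rate that is too large the updates can overshoot and diverge, and with one that is too small they converge slowly, so it is worth watching the cost while tuning `alpha`.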

Multiple linear regression (multiple variables)

Vectorization
Feature scaling (normalization)
- Can make gradient descent converge faster
Feature engineering
Polynomial regression
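
As an illustration of feature scaling, here is a small sketch of z-score normalization for a single feature column. The helper name `zscore_normalize` and the sample values are assumptions made up for this example.

```python
from statistics import mean, pstdev


def zscore_normalize(column):
    """Rescale one feature column to mean 0 and standard deviation 1."""
    mu = mean(column)
    sigma = pstdev(column)  # population standard deviation
    return [(value - mu) / sigma for value in column]


sizes = [1000.0, 1500.0, 2000.0]  # made-up feature values
scaled = zscore_normalize(sizes)
```

After scaling, features with very different ranges end up on a comparable scale, which is what lets gradient descent take more direct steps.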

Logistic regression → classification

Sigmoid function
Decision boundary
Apply gradient descent to logistic regression
Loss & cost
- To keep the logistic cost function convex, use the loss $-\log(f)$ when $y=1$ and $-\log(1-f)$ when $y=0$, then sum it over the training examples (divided by m) to get the cost
Compare linear regression & logistic regression
- Same
  - Learning curve
  - Vectorized implementation
  - Feature scaling
- Different
  - Linear regression: $f=\vec{w} \cdot \vec{x}+b$
  - Logistic regression: $f = \frac{1}{1+e^{-(\vec{w} \cdot \vec{x} + b)}}$
Overfitting
- Addressing overfitting
  - Collect more training examples
  - Select features
  - Regularization: reduce the size of the parameters $w_j$
- Regularization
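
The logistic loss and the regularization idea can be sketched together in plain Python. This assumes L2 regularization with a penalty of `lambda_ / (2m) * sum(w_j^2)`; the function names and the tiny dataset are made up for illustration.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def predict(w, b, x):
    """Logistic regression output f = sigmoid(w . x + b)."""
    return sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b)


def regularized_cost(X, y, w, b, lambda_=0.0):
    """Average logistic loss plus an L2 penalty on the weights (not on b)."""
    m = len(X)
    total = 0.0
    for x_i, y_i in zip(X, y):
        f = predict(w, b, x_i)
        # Loss is -log(f) when y = 1 and -log(1 - f) when y = 0.
        total += -math.log(f) if y_i == 1 else -math.log(1.0 - f)
    penalty = lambda_ / (2 * m) * sum(wj ** 2 for wj in w)
    return total / m + penalty


X = [[1.0, 2.0], [3.0, 4.0]]  # made-up examples
y = [1, 0]
cost = regularized_cost(X, y, [0.0, 0.0], 0.0)
```

A larger `lambda_` pushes the weights toward zero, which trades a little training accuracy for a smoother, less overfit model.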

2️⃣ Neural Network in TensorFlow

Numpy arrays & Tensors

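import numpy as np
from tensorflow.keras.layers import Dense

x = np.array([[200.0, 17.0]])  # 2D: one example with two features
layer_1 = Dense(units=3, activation='sigmoid')
a1 = layer_1(x)
# print(a1)
# -> tf.Tensor([[0.2 0.7 0.3]], shape=(1, 3), dtype=float32)
a1.numpy()  # convert the tensor back to a NumPy array
# print(a1.numpy())
# -> array(…)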

Building a neural network architecture

model = Sequential([
	Dense(units=25, activation="sigmoid"),
	Dense(units=15, activation="sigmoid"),
	Dense(units=1, activation="sigmoid")])
x = np.array(...)
y = np.array(...)
model.compile(...)
model.fit(x, y)
model.predict(x_new)

Training a neural network in TensorFlow

import tensorflow as tf
from tensorflow.keras import Sequential
from tensorflow.keras.layers import Dense

# Step 1: specify the model
model = Sequential([
    Dense(units=25, activation='relu'),
    Dense(units=15, activation='relu'),
    Dense(units=1, activation='sigmoid'),
])

# Step 2: compile the model
from tensorflow.keras.losses import BinaryCrossentropy
model.compile(loss=BinaryCrossentropy())

# Step 3: train the model
model.fit(X, Y, epochs=100)

Choosing an activation function

Output layer

For binary classification, use sigmoid

For regression where y can be positive or negative, use linear

For regression where y can only be non-negative, use ReLU

Hidden layer

ReLU is the most common choice

Multi-class classification: Softmax

import tensorflow as tf
from tensorflow.keras import Sequential
from tensorflow.keras.layers import Dense

model = Sequential([
    Dense(units=25, activation='relu'),
    Dense(units=15, activation='relu'),
    Dense(units=10, activation='softmax'),
])

from tensorflow.keras.losses import SparseCategoricalCrossentropy
model.compile(loss=SparseCategoricalCrossentropy())

model.fit(X, Y, epochs=100)

But don’t use this version!

More numerically accurate implementation

For logistic loss

import tensorflow as tf
from tensorflow.keras import Sequential
from tensorflow.keras.layers import Dense

model = Sequential([
    Dense(units=25, activation='relu'),
    Dense(units=15, activation='relu'),
    Dense(units=1, activation='linear'),  # output logits, not probabilities
])

from tensorflow.keras.losses import BinaryCrossentropy
model.compile(loss=BinaryCrossentropy(from_logits=True))

model.fit(X, Y, epochs=100)

For softmax

import tensorflow as tf
from tensorflow.keras import Sequential
from tensorflow.keras.layers import Dense

model = Sequential([
    Dense(units=25, activation='relu'),
    Dense(units=15, activation='relu'),
    Dense(units=10, activation='linear'),  # output logits, not probabilities
])

from tensorflow.keras.losses import SparseCategoricalCrossentropy
model.compile(loss=SparseCategoricalCrossentropy(from_logits=True))

model.fit(X, Y, epochs=100)

# predict
logits = model(X)
f_x = tf.nn.softmax(logits)
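
A rough sketch of why working from logits is more accurate: the loss can be computed with a shifted log-sum-exp, so very large logits never go through a bare `exp`. This is an illustration in plain Python, not TensorFlow's actual implementation, and both function names are made up.

```python
import math


def softmax(logits):
    """Softmax with the max logit subtracted first to avoid overflow."""
    shift = max(logits)
    exps = [math.exp(z - shift) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sparse_cross_entropy_from_logits(logits, label):
    """-log(softmax(logits)[label]) computed directly from the logits."""
    shift = max(logits)
    log_sum_exp = shift + math.log(sum(math.exp(z - shift) for z in logits))
    return log_sum_exp - logits[label]


probs = softmax([1000.0, 1000.0])
loss = sparse_cross_entropy_from_logits([1000.0, 0.0], 1)
```

Computing `math.exp(1000.0)` directly would overflow, while the shifted version stays finite.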

Adam algorithm

Adam: Adaptive Moment estimation

Adjusts the learning rate adaptively for each parameter, which usually makes training faster

# Model
model = Sequential([
    Dense(units=25, activation='sigmoid'),
    Dense(units=15, activation='sigmoid'),
    Dense(units=10, activation='linear'),
])

# Compile
model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=1e-3),
loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True))

# Fit
model.fit(X, Y, epochs=100)

Convolutional layer

Each neuron only looks at part of the previous layer’s input

Faster computation; need less training data (less prone to overfitting)

Evaluating a model

Split the data into 3 parts:

Training set
Cross validation set (validation / development / dev set)
Test set

When you’re making decisions about the model, use only the training set and the dev set, and don’t look at the test set at all. After you decide on the final model, evaluate it on the test set.
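
A minimal sketch of such a three-way split, assuming a 60/20/20 ratio and a fixed shuffle seed (both are choices made for this example, as is the helper name `split_dataset`):

```python
import random


def split_dataset(examples, train_frac=0.6, cv_frac=0.2, seed=0):
    """Shuffle the examples and split them into train / cross validation / test."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)
    n_train = round(len(shuffled) * train_frac)
    n_cv = round(len(shuffled) * cv_frac)
    train = shuffled[:n_train]
    cv = shuffled[n_train:n_train + n_cv]
    test_set = shuffled[n_train + n_cv:]  # only touched for the final evaluation
    return train, cv, test_set


train, cv, test_set = split_dataset(range(10))
```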

Bias / variance

Bias is the gap between the training error and the baseline (such as human-level performance)

Variance is the gap between the cross validation error and the training error
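
One way to turn these gaps into a quick check, comparing the training error with the baseline for bias and the cross validation error with the training error for variance. The `tolerance` threshold and the error values are assumptions for illustration; what counts as a large gap depends on the application.

```python
def diagnose(baseline_error, train_error, cv_error, tolerance=0.02):
    """Flag high bias / high variance from the two error gaps."""
    bias_gap = train_error - baseline_error
    variance_gap = cv_error - train_error
    return {
        "high_bias": bias_gap > tolerance,
        "high_variance": variance_gap > tolerance,
    }


# Made-up errors: training is close to baseline, cross validation is much worse.
result = diagnose(baseline_error=0.106, train_error=0.108, cv_error=0.148)
```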

Regularization

from tensorflow.keras.regularizers import L2
layer = Dense(units=25, activation="relu", kernel_regularizer=L2(0.01))

Transfer learning

Download neural network parameters pretrained on a large dataset with the same input type (e.g., images, audio, text) as your application
Fine-tune the network on your own data

Full cycle of a machine learning project

Define project
Define and collect data
Training, error analysis & iterative improvement
Deploy, monitor and maintain system

Confusion matrix, precision, recall

Precision = true positives / (predicted positives) = true pos / (true pos + false pos)

Recall = true positives / (actual positives) = true pos / ( true pos + false neg)

Tradeoff:
High threshold → higher precision, lower recall;
Low threshold → lower precision, higher recall

F1 score = $\frac{1}{\frac{1}{2}(\frac{1}{P}+\frac{1}{R})}$ calculates an average that pays more attention to whichever is lower (the harmonic mean of P and R)
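
The three formulas above, written out as a small helper. The function name and the counts are made up for illustration; note that the harmonic mean simplifies to 2PR / (P + R).

```python
def precision_recall_f1(true_pos, false_pos, false_neg):
    """Compute precision, recall and F1 from confusion-matrix counts."""
    precision = true_pos / (true_pos + false_pos)
    recall = true_pos / (true_pos + false_neg)
    f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of P and R
    return precision, recall, f1


# Made-up counts: 15 true positives, 5 false positives, 10 false negatives.
precision, recall, f1 = precision_recall_f1(15, 5, 10)
```

This sketch does not guard against division by zero, which happens when the model predicts no positives at all.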

Decision tree model

A decision tree predicts a category

A regression tree predicts a number

Tree ensembles (more robust; use sampling with replacement):

Random forest
XGBoost (eXtreme Gradient Boosting): more likely to pick examples that the previously trained trees misclassify

# Classification
from xgboost import XGBClassifier
model = XGBClassifier()
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

# Regression
from xgboost import XGBRegressor
model = XGBRegressor()
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

Decision trees vs Neural networks:

| | Decision trees | Neural networks |
| --- | --- | --- |
| Works well on which data | Structured (tabular) data | Structured data and unstructured (images, audio, text) data |
| Speed | Fast | May be slower |
| Transfer learning | No | Yes |
| Interpretability | Maybe yes | No |
| Building a system of multiple models | Hard | Easy |

3️⃣ Unsupervised Learning

Clustering

K-means

Step 1: Randomly initialize K cluster centroids;

Step 2: Assign each point to its closest centroid;

Step 3: Recompute each centroid as the mean of the points assigned to it, then repeat steps 2 and 3 until the centroids converge.
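
The three steps can be sketched in plain Python as follows. This is a minimal version that assumes points are tuples of numbers and that the initial centroids are drawn from the data; the function names, the seed and the sample points are made up.

```python
import random


def squared_distance(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))


def k_means(points, k, seed=0, max_iters=100):
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # Step 1: random initial centroids
    clusters = []
    for _ in range(max_iters):
        # Step 2: assign each point to its closest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: squared_distance(p, centroids[i]))
            clusters[nearest].append(p)
        # Step 3: move each centroid to the mean of its assigned points.
        new_centroids = []
        for i, cluster in enumerate(clusters):
            if cluster:
                new_centroids.append(tuple(sum(c) / len(cluster) for c in zip(*cluster)))
            else:
                new_centroids.append(centroids[i])  # keep an empty cluster's centroid
        if new_centroids == centroids:  # converged
            break
        centroids = new_centroids
    return centroids, clusters


# Two well-separated made-up groups of points.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centroids, clusters = k_means(points, k=2)
```

On less cleanly separated data the result depends on the random initialization, so it is common to run K-means several times and keep the run with the lowest cost.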

How to choose K?

Elbow method
Choose K based on how well K-means performs for the later (downstream) purpose

Anomaly detection

Step 1: Choose n features that you think might be indicative of anomalous examples;

Step 2: Fit a Gaussian distribution to each feature (estimate its mean and variance from the training set);

Step 3: Given new example x, compute p(x), anomaly if p(x) < epsilon
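
A small sketch of steps 2 and 3, assuming the features are modelled as independent Gaussians so that p(x) is the product of the per-feature densities. The function names, the data and the value of `epsilon` are made up; in practice epsilon would be tuned on a cross validation set.

```python
import math


def fit_gaussian(examples):
    """Estimate the mean and variance of each feature."""
    m = len(examples)
    n = len(examples[0])
    mu = [sum(x[j] for x in examples) / m for j in range(n)]
    var = [sum((x[j] - mu[j]) ** 2 for x in examples) / m for j in range(n)]
    return mu, var


def density(x, mu, var):
    """p(x) as a product of one-dimensional Gaussian densities."""
    p = 1.0
    for xj, mj, vj in zip(x, mu, var):
        p *= math.exp(-((xj - mj) ** 2) / (2 * vj)) / math.sqrt(2 * math.pi * vj)
    return p


def is_anomaly(x, mu, var, epsilon):
    return density(x, mu, var) < epsilon


normal_examples = [(1.0, 2.0), (1.2, 1.9), (0.9, 2.1), (1.1, 2.0)]  # made-up data
mu, var = fit_gaussian(normal_examples)
flagged = is_anomaly((5.0, 2.0), mu, var, epsilon=0.01)
```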

| Anomaly detection | Supervised learning |
| --- | --- |
| Large number of negative (y=0) examples, very small number of positive (y=1) examples | Large number of positive and negative examples |
| Future positive examples: previously unseen, brand new | Future positive examples: previously seen, similar |

Recommender Systems

Collaborative filtering

Finds related items
TensorFlow can find the derivative automatically (auto diff), as in the gradient descent loop below

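import tensorflow as tf

w = tf.Variable(3.0)
x = 1.0
y = 1.0  # target value
alpha = 0.01  # learning rate

# Auto Diff
iterations = 30
for _ in range(iterations):
    # Record the operations that compute the cost J
    with tf.GradientTape() as tape:
        fwb = w * x
        costJ = (fwb - y) ** 2
    # Let TensorFlow compute dJ/dw from the recorded operations
    [dJdw] = tape.gradient(costJ, [w])
    # Gradient descent step
    w.assign_add(-alpha * dJdw)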

You can also use the Adam optimization algorithm

Content-based filtering

user_NN = tf.keras.models.Sequential([
    tf.keras.layers.Dense(256, activation='relu'),
    tf.keras.layers.Dense(128, activation='relu'),
    tf.keras.layers.Dense(32)
])

item_NN = tf.keras.models.Sequential([
    tf.keras.layers.Dense(256, activation='relu'),
    tf.keras.layers.Dense(128, activation='relu'),
    tf.keras.layers.Dense(32)
])

input_user = tf.keras.layers.Input(shape=(num_user_features,))
vu = user_NN(input_user)
vu = tf.linalg.l2_normalize(vu, axis=1)

input_item = tf.keras.layers.Input(shape=(num_item_features,))
vm = item_NN(input_item)
vm = tf.linalg.l2_normalize(vm, axis=1)

output = tf.keras.layers.Dot(axes=1)([vu, vm])

model = tf.keras.Model([input_user, input_item], output)

cost_fn = tf.keras.losses.MeanSquaredError()

| | Collaborative filtering | Content-based filtering |
| --- | --- | --- |
| Recommend items to you based on | Ratings of users who gave similar ratings as you | Features of users and items to find a good match |

Mean normalization makes the algorithm run faster and give more reasonable predictions for users who have rated few or no items

Reinforcement Learning

Reward function

“Good dog”, “bad dog”

Return

The return is the sum of the rewards, weighted by the discount factor

It depends on the actions you take
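
As a quick worked example, the return for a sequence of rewards can be computed like this. The helper name and the reward sequence are made up; the first reward is weighted by gamma to the power 0, the next by gamma to the power 1, and so on.

```python
def discounted_return(rewards, gamma):
    """Sum of rewards, where the reward at step t is weighted by gamma ** t."""
    return sum(gamma ** t * r for t, r in enumerate(rewards))


# Three steps with no reward, then a reward of 100.
patient = discounted_return([0, 0, 0, 100], gamma=0.9)
impatient = discounted_return([0, 0, 0, 100], gamma=0.5)
```

A smaller discount factor makes the agent more impatient, since rewards far in the future count for less.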

Policy

A policy is a function $\pi(s) = a$ that maps states to actions

The goal of reinforcement learning is to find a policy $\pi$ that maximizes the return

Markov Decision Process (MDP)

“Markov” means that the future depends only on the current state, not on anything that happened before reaching it

In MDP, the future only depends on where you are now, not on how you got here

State action value function (Q-function)

Q(s, a) = return, if you:

Start in state s
Take action a (once)
Then behave optimally after that

The best possible return from state s is $\max_a Q(s, a)$

The best possible action in state s is the action a that gives $\max_a Q(s, a)$

Bellman equation

$Q(s, a) = R(s) + \gamma \max_{a'} Q(s', a')$

R(s) is the reward you get right away
$\gamma \max_{a'} Q(s', a')$ is the discounted return from behaving optimally starting from the next state s’
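
To see the Bellman equation at work, here is a sketch that applies it repeatedly to a tiny made-up environment: six states in a row, terminal states at both ends, and two actions (move left or move right). The rewards, the discount factor and the number of sweeps are all assumptions chosen for illustration.

```python
REWARDS = [100.0, 0.0, 0.0, 0.0, 0.0, 40.0]  # R(s); both end states are terminal
ACTIONS = (-1, +1)  # move left / move right
GAMMA = 0.5


def is_terminal(s):
    return s == 0 or s == len(REWARDS) - 1


def q_value_iteration(sweeps=50):
    """Apply Q(s, a) = R(s) + gamma * max_a' Q(s', a') until the values settle."""
    q = {(s, a): 0.0 for s in range(len(REWARDS)) for a in ACTIONS}
    for _ in range(sweeps):
        for s in range(len(REWARDS)):
            for a in ACTIONS:
                if is_terminal(s):
                    q[(s, a)] = REWARDS[s]  # nothing happens after a terminal state
                else:
                    s_next = s + a
                    best_next = max(q[(s_next, a2)] for a2 in ACTIONS)
                    q[(s, a)] = REWARDS[s] + GAMMA * best_next
    return q


q = q_value_iteration()
best_return = [max(q[(s, a)] for a in ACTIONS) for s in range(len(REWARDS))]
```

In this setup, going left from state 3 gives 12.5 while going right gives 10, so the best action there is to head for the larger but more distant reward.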

DQN (Deep Q-Network) algorithm

Initialize neural network randomly as guess of Q(s,a).
Repeat{

    Take actions in the lunar lander. Get (s, a, R(s), s’).

    Store 10000 most recent (s, a, R(s), s’) tuples.

    Train neural network:

        Create training set of 10000 examples using

        x = (s, a) and y = R(s) + gamma * maxQ(s’, a’)

        Train Q_new such that Q_new (s, a) = y.

    Set Q = Q_new. 
}

Refinement

Epsilon-greedy policy

E.g. epsilon = 0.05. It means

With probability 0.95, pick the action a that maximizes Q(s, a): greedy, “exploitation”

With probability 0.05, pick an action a randomly: “exploration”

It’s good to start with a high epsilon and gradually decrease it
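
A minimal sketch of the action choice, assuming the Q-values for the current state are already available as a list indexed by action. The function name and the sample values are made up.

```python
import random


def epsilon_greedy(q_values, epsilon, rng=random):
    """Pick a random action with probability epsilon, otherwise the greedy one."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))  # exploration
    return max(range(len(q_values)), key=lambda a: q_values[a])  # exploitation


q_values = [1.0, 5.0, 2.0]  # made-up Q(s, a) for three actions
greedy_action = epsilon_greedy(q_values, epsilon=0.0)
```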

Mini batch

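Train on a random subset (mini-batch) of the stored tuples at each step instead of all of them, so each step is faster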

Soft update

Change “set Q = Q_new” (i.e. W,B = W_new,B_new) to

W = 0.01W_new + 0.99W
B = 0.01B_new + 0.99B
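
The same update written as a small helper over flat lists of weights. The name `soft_update` and the parameter name `tau` are assumptions for this sketch; `tau=0.01` matches the 0.01 / 0.99 split above.

```python
def soft_update(current_weights, new_weights, tau=0.01):
    """Move each weight a small step (tau) toward its newly trained value."""
    return [
        tau * w_new + (1.0 - tau) * w
        for w, w_new in zip(current_weights, new_weights)
    ]


updated = soft_update([1.0, 2.0], [2.0, 4.0])
```

Taking only a small step toward the new weights keeps one bad training round from overwriting a good Q estimate.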
