Summaries for machine learning course (5)

 Classification with logistic regression



Logistic regression:

(Figure: example of a non-linear decision boundary; the inner region is y = 1, the outer region is y = 0.)
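
As a quick illustration of such a boundary, here is a minimal sketch (my own, not course code): feeding the squared features x_1^2 and x_2^2 into scikit-learn's LogisticRegression yields a roughly circular boundary with the inner region classified as y = 1 and the outer region as y = 0.

# Hypothetical sketch: a circular decision boundary emerges when logistic
# regression is trained on the squared features x1^2 and x2^2.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.uniform(-2, 2, size=(400, 2))             # random points in the plane
y = (X[:, 0]**2 + X[:, 1]**2 < 1.0).astype(int)   # inside the unit circle -> y=1

Z = X**2                                          # engineered features x1^2, x2^2
clf = LogisticRegression().fit(Z, y)
print(clf.score(Z, y))                            # training accuracy near 1.0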



Cost function for logistic regression


The squared-error cost function used for linear regression is convex. However, if you apply that same cost function to logistic regression, the result is not convex: the non-linear sigmoid turns the squared error into a function of the parameters with many local minima.

Definition of Convexity

A function f(x) is convex if, for any two points x_1 and x_2 in its domain and any λ ∈ [0,1], the following inequality holds:

f(\lambda x_1 + (1 - \lambda) x_2) \leq \lambda f(x_1) + (1 - \lambda) f(x_2)

This means that the function has no local minima other than the global minimum, which makes optimization easier.

Convex vs. Non-Convex Functions

  • Convex Function: Has a bowl-like shape, ensuring gradient-based optimization methods (e.g., gradient descent) reliably converge to the global minimum.
  • Non-Convex Function: Can have multiple local minima, making optimization more difficult since gradient descent might get stuck in a local minimum.
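
To make the non-convexity concrete, here is a small numeric check (my own sketch, not from the course): for the squared-error cost of a one-parameter sigmoid model, the convexity inequality above fails for a suitable pair of points.

# Sketch: squared error composed with the sigmoid violates the convexity
# inequality. Consider f(theta) = (sigmoid(theta) - 1)^2, i.e. the squared
# error on a single example with x = 1 and y = 1.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def f(theta):
    return (sigmoid(theta) - 1.0) ** 2

theta1, theta2, lam = -4.0, -2.0, 0.5
lhs = f(lam * theta1 + (1 - lam) * theta2)        # value at the midpoint
rhs = lam * f(theta1) + (1 - lam) * f(theta2)     # value on the chord
print(lhs, rhs, lhs <= rhs)                       # ~0.907, ~0.870, False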






Cost function for logistic regression:

Maximum Likelihood Estimation (MLE) for the Cost Function of Logistic Regression

1. Understanding the Likelihood Function

Logistic regression is used for binary classification, where the target variable y takes values 0 or 1. The model predicts the probability of y = 1 given input features x using the sigmoid function:

h_{\theta}(x) = \frac{1}{1 + e^{-\theta^T x}}

where:
  • h_θ(x) is the predicted probability that y = 1,
  • θ is the vector of parameters,
  • x is the feature vector.

Given a dataset of m independent observations (x_i, y_i), we assume the outputs y_i are Bernoulli distributed:

P(y_i | x_i; \theta) = h_{\theta}(x_i)^{y_i} (1 - h_{\theta}(x_i))^{1 - y_i}

2. Constructing the Likelihood Function

The likelihood function L(θ) represents the probability of observing the given data under the model:

L(\theta) = \prod_{i=1}^{m} P(y_i | x_i; \theta) = \prod_{i=1}^{m} h_{\theta}(x_i)^{y_i} (1 - h_{\theta}(x_i))^{1 - y_i}

3. Taking the Log-Likelihood

Since likelihood calculations involve products, it's more convenient to work with the log-likelihood function:

\log L(\theta) = \sum_{i=1}^{m} \left[ y_i \log h_{\theta}(x_i) + (1 - y_i) \log (1 - h_{\theta}(x_i)) \right]

This is the function that logistic regression aims to maximize to find the best parameters θ. However, since optimization is typically done by minimization, we use the negative log-likelihood as the cost function.

4. Cost Function for Logistic Regression

By negating the log-likelihood, we obtain the log loss function, which is the cost function for logistic regression:

J(\theta) = - \frac{1}{m} \sum_{i=1}^{m} \left[ y_i \log h_{\theta}(x_i) + (1 - y_i) \log (1 - h_{\theta}(x_i)) \right]

This function is:
  • Convex, ensuring a single global minimum.
  • Differentiable, allowing optimization using gradient descent.
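
A minimal NumPy sketch of this cost function (the names log_loss and sigmoid and the toy data below are my own, not from the course):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def log_loss(theta, X, y):
    # J(theta): negative average log-likelihood over the m examples.
    h = sigmoid(X @ theta)                 # h_theta(x_i) for every row of X
    h = np.clip(h, 1e-15, 1 - 1e-15)       # guard against log(0)
    return -np.mean(y * np.log(h) + (1 - y) * np.log(1 - h))

# Tiny usage example; the first column of X is the intercept term.
X = np.array([[1.0, 0.5, 1.2], [1.0, -1.0, 0.3], [1.0, 2.0, -0.7]])
y = np.array([1.0, 0.0, 1.0])
print(log_loss(np.zeros(3), X, y))         # log(2) ≈ 0.693 at theta = 0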


Gradient descent for logistic regression

Taking the derivative of J(θ) gives the update rule, applied simultaneously for all j:

\theta_j := \theta_j - \alpha \frac{1}{m} \sum_{i=1}^{m} \left( h_{\theta}(x_i) - y_i \right) x_{ij}

where α is the learning rate and x_{ij} is the j-th feature of example i. This update has the same form as the one for linear regression; the difference is that h_θ(x) is now the sigmoid of θ^T x rather than θ^T x itself.
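The update above translates directly into a vectorized loop. Here is a sketch under my own naming (alpha and num_iters are assumptions, not course values):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gradient_descent(X, y, alpha=0.1, num_iters=1000):
    # Batch gradient descent for logistic regression.
    m, n = X.shape
    theta = np.zeros(n)
    for _ in range(num_iters):
        h = sigmoid(X @ theta)             # predictions for all m examples
        grad = X.T @ (h - y) / m           # gradient of the log loss
        theta -= alpha * grad              # simultaneous update of all theta_j
    return theta

# Usage: X includes a leading column of ones for the intercept.
X = np.array([[1.0, 0.5], [1.0, -1.5], [1.0, 2.0], [1.0, -0.3]])
y = np.array([1.0, 0.0, 1.0, 0.0])
print(sigmoid(X @ gradient_descent(X, y)))  # fitted probabilities
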
The problem of overfitting

Regularization to reduce overfitting

What does overfitting refer to?

1. The model does not fit the training set well: underfit / high bias.
2. The model fits the training set pretty well.
3. The model fits the training set extremely well: overfit / high variance. If the training set were just a little bit different, the fitted function could end up totally different.





Addressing overfitting

Option 1: Collect more training data.

Option 2: Select features; using a smaller set of the most relevant features reduces the chance of fitting noise.

Option 3: Regularization, which shrinks the parameters w_j without forcing them to exactly 0. Regularization is applied to the weights w_j; there is no need to apply it to the bias b.



Cost function with regularization:

J(w, b) = \frac{1}{2m} \sum_{i=1}^{m} \left( f_{w,b}(x^{(i)}) - y^{(i)} \right)^2 + \frac{\lambda}{2m} \sum_{j=1}^{n} w_j^2

  • If λ = 0, the penalty term disappears, and minimizing the cost function can still cause overfitting.
  • If λ = 10^10, minimizing the cost function forces every w_j to be close to 0, which causes underfitting.
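
A small sketch (mine, not the course's) of this regularized cost makes the two extremes visible: with lam = 0 a perfect fit costs nothing even if w is large, while lam = 1e10 makes the penalty dwarf everything unless each w_j is near 0.

import numpy as np

def regularized_cost(w, b, X, y, lam):
    # Squared-error cost plus the L2 penalty on w; b is not penalized.
    m = X.shape[0]
    err = X @ w + b - y
    return err @ err / (2 * m) + lam / (2 * m) * (w @ w)

X = np.array([[1.0], [2.0], [3.0]])
y = np.array([1.0, 2.0, 3.0])
w, b = np.array([1.0]), 0.0                     # a perfect fit: y = 1*x + 0
print(regularized_cost(w, b, X, y, lam=0.0))    # 0.0: no penalty at all
print(regularized_cost(w, b, X, y, lam=1e10))   # huge: only w ≈ 0 avoids it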



Regularized linear regression:

Gradient descent now updates the parameters as follows (simultaneously for all j):

w_j := w_j - \alpha \left[ \frac{1}{m} \sum_{i=1}^{m} \left( f_{w,b}(x^{(i)}) - y^{(i)} \right) x_j^{(i)} + \frac{\lambda}{m} w_j \right]

b := b - \alpha \frac{1}{m} \sum_{i=1}^{m} \left( f_{w,b}(x^{(i)}) - y^{(i)} \right)

Rearranging the w_j update as w_j := w_j (1 - \alpha \frac{\lambda}{m}) - \alpha \frac{1}{m} \sum_{i=1}^{m} ( f_{w,b}(x^{(i)}) - y^{(i)} ) x_j^{(i)} shows that regularization shrinks w_j a little bit on every iteration compared to the usual update.
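
One update step written out in code (my own sketch) makes the shrink factor explicit: w is multiplied by (1 - alpha * lam / m) before the usual gradient term is subtracted.

import numpy as np

def regularized_linear_step(w, b, X, y, alpha, lam):
    # One gradient descent step for regularized linear regression.
    m = X.shape[0]
    err = X @ w + b - y                        # f_wb(x_i) - y_i for all i
    w = w * (1 - alpha * lam / m) - alpha * X.T @ err / m
    b = b - alpha * np.mean(err)               # b is updated without a penalty
    return w, b

# Brief usage with toy data:
w, b = regularized_linear_step(np.array([3.0]), 0.5,
                               np.array([[1.0], [2.0]]), np.array([2.0, 4.0]),
                               alpha=0.1, lam=1.0)
print(w, b)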




Regularized logistic regression:

The cost function adds the same L2 penalty to the log loss:

J(w, b) = -\frac{1}{m} \sum_{i=1}^{m} \left[ y^{(i)} \log f_{w,b}(x^{(i)}) + (1 - y^{(i)}) \log \left( 1 - f_{w,b}(x^{(i)}) \right) \right] + \frac{\lambda}{2m} \sum_{j=1}^{n} w_j^2

The gradient descent updates look exactly like those for regularized linear regression; the only difference is that f_{w,b}(x) is now the sigmoid of w · x + b.
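Putting the pieces together, a sketch (my own) of one update step for regularized logistic regression; compared with the linear case, only the prediction changes:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def regularized_logistic_step(w, b, X, y, alpha, lam):
    # One gradient descent step for regularized logistic regression.
    m = X.shape[0]
    err = sigmoid(X @ w + b) - y        # the only change vs. linear regression
    w = w * (1 - alpha * lam / m) - alpha * X.T @ err / m
    b = b - alpha * np.mean(err)        # b is not regularized
    return w, b
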