Summaries for machine learning course (5)

 Classification with logistic regression



Logistic regression:

(Figure: example of a non-linear decision boundary; the inner region is y = 1, the outer region is y = 0.)
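
As a quick illustration of such a boundary, here is a minimal sketch (my own, not course code): feeding the squared features x_1^2 and x_2^2 into scikit-learn's LogisticRegression yields a roughly circular boundary with the inner region classified as y = 1 and the outer region as y = 0.

# Hypothetical sketch: a circular decision boundary emerges when logistic
# regression is trained on the squared features x1^2 and x2^2.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.uniform(-2, 2, size=(400, 2))             # random points in the plane
y = (X[:, 0]**2 + X[:, 1]**2 < 1.0).astype(int)   # inside the unit circle -> y=1

Z = X**2                                          # engineered features x1^2, x2^2
clf = LogisticRegression().fit(Z, y)
print(clf.score(Z, y))                            # training accuracy near 1.0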



Cost function for logistic regression


The squared-error cost function used for linear regression is convex. However, if you apply that same cost function to logistic regression, the result is not convex: the non-linear sigmoid turns the squared error into a function of the parameters with many local minima.

Definition of Convexity

A function f(x) is convex if, for any two points x_1 and x_2 in its domain and any λ ∈ [0,1], the following inequality holds:

f(\lambda x_1 + (1 - \lambda) x_2) \leq \lambda f(x_1) + (1 - \lambda) f(x_2)

This means that the function has no local minima other than the global minimum, which makes optimization easier.

Convex vs. Non-Convex Functions

  • Convex Function: Has a bowl-like shape, ensuring gradient-based optimization methods (e.g., gradient descent) reliably converge to the global minimum.
  • Non-Convex Function: Can have multiple local minima, making optimization more difficult since gradient descent might get stuck in a local minimum.
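
To make the non-convexity concrete, here is a small numeric check (my own sketch, not from the course): for the squared-error cost of a one-parameter sigmoid model, the convexity inequality above fails for a suitable pair of points.

# Sketch: squared error composed with the sigmoid violates the convexity
# inequality. Consider f(theta) = (sigmoid(theta) - 1)^2, i.e. the squared
# error on a single example with x = 1 and y = 1.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def f(theta):
    return (sigmoid(theta) - 1.0) ** 2

theta1, theta2, lam = -4.0, -2.0, 0.5
lhs = f(lam * theta1 + (1 - lam) * theta2)        # value at the midpoint
rhs = lam * f(theta1) + (1 - lam) * f(theta2)     # value on the chord
print(lhs, rhs, lhs <= rhs)                       # ~0.907, ~0.870, False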






Cost function for logistic regression:

Maximum Likelihood Estimation (MLE) for the Cost Function of Logistic Regression

1. Understanding the Likelihood Function

Logistic regression is used for binary classification, where the target variable y takes values 0 or 1. The model predicts the probability of y = 1 given input features x using the sigmoid function:

h_{\theta}(x) = \frac{1}{1 + e^{-\theta^T x}}

where:
  • h_θ(x) is the predicted probability that y = 1,
  • θ is the vector of parameters,
  • x is the feature vector.

Given a dataset of m independent observations (x_i, y_i), we assume the outputs y_i are Bernoulli distributed:

P(y_i | x_i; \theta) = h_{\theta}(x_i)^{y_i} (1 - h_{\theta}(x_i))^{1 - y_i}

2. Constructing the Likelihood Function

The likelihood function L(θ) represents the probability of observing the given data under the model:

L(\theta) = \prod_{i=1}^{m} P(y_i | x_i; \theta) = \prod_{i=1}^{m} h_{\theta}(x_i)^{y_i} (1 - h_{\theta}(x_i))^{1 - y_i}

3. Taking the Log-Likelihood

Since likelihood calculations involve products, it's more convenient to work with the log-likelihood function:

\log L(\theta) = \sum_{i=1}^{m} \left[ y_i \log h_{\theta}(x_i) + (1 - y_i) \log (1 - h_{\theta}(x_i)) \right]

This is the function that logistic regression aims to maximize to find the best parameters θ. However, since optimization is typically done by minimization, we use the negative log-likelihood as the cost function.

4. Cost Function for Logistic Regression

By negating the log-likelihood, we obtain the log loss function, which is the cost function for logistic regression:

J(\theta) = - \frac{1}{m} \sum_{i=1}^{m} \left[ y_i \log h_{\theta}(x_i) + (1 - y_i) \log (1 - h_{\theta}(x_i)) \right]

This function is:
  • Convex, ensuring a single global minimum.
  • Differentiable, allowing optimization using gradient descent.
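
A minimal NumPy sketch of this cost function (the names log_loss and sigmoid and the toy data below are my own, not from the course):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def log_loss(theta, X, y):
    # J(theta): negative average log-likelihood over the m examples.
    h = sigmoid(X @ theta)                 # h_theta(x_i) for every row of X
    h = np.clip(h, 1e-15, 1 - 1e-15)       # guard against log(0)
    return -np.mean(y * np.log(h) + (1 - y) * np.log(1 - h))

# Tiny usage example; the first column of X is the intercept term.
X = np.array([[1.0, 0.5, 1.2], [1.0, -1.0, 0.3], [1.0, 2.0, -0.7]])
y = np.array([1.0, 0.0, 1.0])
print(log_loss(np.zeros(3), X, y))         # log(2) ≈ 0.693 at theta = 0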


Gradient descent for logistic regression

Taking the derivative of J(θ) gives the update rule, applied simultaneously for all j:

\theta_j := \theta_j - \alpha \frac{1}{m} \sum_{i=1}^{m} \left( h_{\theta}(x_i) - y_i \right) x_{ij}

where α is the learning rate and x_{ij} is the j-th feature of example i. This update has the same form as the one for linear regression; the difference is that h_θ(x) is now the sigmoid of θ^T x rather than θ^T x itself.
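The update above translates directly into a vectorized loop. Here is a sketch under my own naming (alpha and num_iters are assumptions, not course values):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gradient_descent(X, y, alpha=0.1, num_iters=1000):
    # Batch gradient descent for logistic regression.
    m, n = X.shape
    theta = np.zeros(n)
    for _ in range(num_iters):
        h = sigmoid(X @ theta)             # predictions for all m examples
        grad = X.T @ (h - y) / m           # gradient of the log loss
        theta -= alpha * grad              # simultaneous update of all theta_j
    return theta

# Usage: X includes a leading column of ones for the intercept.
X = np.array([[1.0, 0.5], [1.0, -1.5], [1.0, 2.0], [1.0, -0.3]])
y = np.array([1.0, 0.0, 1.0, 0.0])
print(sigmoid(X @ gradient_descent(X, y)))  # fitted probabilities
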
The problem of overfitting

Regularization to reduce overfitting

What does overfitting refer to?

1. The model does not fit the training set well: underfit / high bias.
2. The model fits the training set pretty well.
3. The model fits the training set extremely well: overfit / high variance. If the training set were just a little bit different, the fitted function could end up totally different.





Addressing overfitting

Option 1: Collect more training data.

Option 2: Select features; using a smaller set of the most relevant features reduces the chance of fitting noise.

Option 3: Regularization, which shrinks the parameters w_j without forcing them to exactly 0. Regularization is applied to the weights w_j; there is no need to apply it to the bias b.



Cost function with regularization:

J(w, b) = \frac{1}{2m} \sum_{i=1}^{m} \left( f_{w,b}(x^{(i)}) - y^{(i)} \right)^2 + \frac{\lambda}{2m} \sum_{j=1}^{n} w_j^2

  • If λ = 0, the penalty term disappears, and minimizing the cost function can still cause overfitting.
  • If λ = 10^10, minimizing the cost function forces every w_j to be close to 0, which causes underfitting.
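
A small sketch (mine, not the course's) of this regularized cost makes the two extremes visible: with lam = 0 a perfect fit costs nothing even if w is large, while lam = 1e10 makes the penalty dwarf everything unless each w_j is near 0.

import numpy as np

def regularized_cost(w, b, X, y, lam):
    # Squared-error cost plus the L2 penalty on w; b is not penalized.
    m = X.shape[0]
    err = X @ w + b - y
    return err @ err / (2 * m) + lam / (2 * m) * (w @ w)

X = np.array([[1.0], [2.0], [3.0]])
y = np.array([1.0, 2.0, 3.0])
w, b = np.array([1.0]), 0.0                     # a perfect fit: y = 1*x + 0
print(regularized_cost(w, b, X, y, lam=0.0))    # 0.0: no penalty at all
print(regularized_cost(w, b, X, y, lam=1e10))   # huge: only w ≈ 0 avoids it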



Regularized linear regression:

Gradient descent now updates the parameters as follows (simultaneously for all j):

w_j := w_j - \alpha \left[ \frac{1}{m} \sum_{i=1}^{m} \left( f_{w,b}(x^{(i)}) - y^{(i)} \right) x_j^{(i)} + \frac{\lambda}{m} w_j \right]

b := b - \alpha \frac{1}{m} \sum_{i=1}^{m} \left( f_{w,b}(x^{(i)}) - y^{(i)} \right)

Rearranging the w_j update as w_j := w_j (1 - \alpha \frac{\lambda}{m}) - \alpha \frac{1}{m} \sum_{i=1}^{m} ( f_{w,b}(x^{(i)}) - y^{(i)} ) x_j^{(i)} shows that regularization shrinks w_j a little bit on every iteration compared to the usual update.
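
One update step written out in code (my own sketch) makes the shrink factor explicit: w is multiplied by (1 - alpha * lam / m) before the usual gradient term is subtracted.

import numpy as np

def regularized_linear_step(w, b, X, y, alpha, lam):
    # One gradient descent step for regularized linear regression.
    m = X.shape[0]
    err = X @ w + b - y                        # f_wb(x_i) - y_i for all i
    w = w * (1 - alpha * lam / m) - alpha * X.T @ err / m
    b = b - alpha * np.mean(err)               # b is updated without a penalty
    return w, b

# Brief usage with toy data:
w, b = regularized_linear_step(np.array([3.0]), 0.5,
                               np.array([[1.0], [2.0]]), np.array([2.0, 4.0]),
                               alpha=0.1, lam=1.0)
print(w, b)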




Regularized logistic regression:

The cost function adds the same L2 penalty to the log loss:

J(w, b) = -\frac{1}{m} \sum_{i=1}^{m} \left[ y^{(i)} \log f_{w,b}(x^{(i)}) + (1 - y^{(i)}) \log \left( 1 - f_{w,b}(x^{(i)}) \right) \right] + \frac{\lambda}{2m} \sum_{j=1}^{n} w_j^2

The gradient descent updates look exactly like those for regularized linear regression; the only difference is that f_{w,b}(x) is now the sigmoid of w · x + b.
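Putting the pieces together, a sketch (my own) of one update step for regularized logistic regression; compared with the linear case, only the prediction changes:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def regularized_logistic_step(w, b, X, y, alpha, lam):
    # One gradient descent step for regularized logistic regression.
    m = X.shape[0]
    err = sigmoid(X @ w + b) - y        # the only change vs. linear regression
    w = w * (1 - alpha * lam / m) - alpha * X.T @ err / m
    b = b - alpha * np.mean(err)        # b is not regularized
    return w, b
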