2.2.2 Logistic Regression

Information

Logistic Regression is a simple yet powerful algorithm for binary classification. It estimates the probability that an instance belongs to a particular class. Unlike linear regression, which predicts continuous values, logistic regression predicts categorical outcomes by passing a linear combination of the features through a logistic function, which yields a probability for the target class.

Logistic Regression

What is Logistic Regression?

Logistic Regression is a statistical method for modeling a dataset in which one or more independent variables determine an outcome. The outcome is a dichotomous variable, meaning it has only two possible values. The method predicts the probability of a binary response from one or more predictor variables (features), using the logistic (sigmoid) function to model the probability of the positive class (conventionally labeled 1).

The logistic function is defined as:

σ(x) = 1 / (1 + e^(-x))

where e is the base of the natural logarithm, and x is the input to the function.
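
As a quick check of the formula, here is a minimal sketch of the logistic function in plain Python. The function name `sigmoid` and the sample inputs are chosen for illustration only. The two branches are a common way to avoid overflow for large negative inputs; they compute the same value as the formula above.

```python
import math

def sigmoid(x: float) -> float:
    """Logistic function: maps any real number to a value between 0 and 1."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    # For negative x, use the equivalent form e^x / (1 + e^x) to avoid overflow
    z = math.exp(x)
    return z / (1.0 + z)

# Evaluate the function at a few illustrative inputs
for x in (-4, -2, 0, 2, 4):
    print(f"sigmoid({x:+d}) = {sigmoid(x):.3f}")
```

An input of 0 gives exactly 0.5, large positive inputs approach 1, and large negative inputs approach 0, which is why the output can be read as a probability.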

Imagine you want to predict whether a person will buy a product based on their age and estimated salary. Here, buying a product (yes or no) is the dependent variable, and age and estimated salary are the independent variables.
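
For this example, the model can be written out as follows. The coefficient symbols β₀, β₁ and β₂ are notation introduced here for illustration: β₀ is an intercept and β₁, β₂ are the weights learned for age and salary.

```latex
P(\text{buy} = 1 \mid \text{age}, \text{salary})
  = \sigma\!\left(\beta_0 + \beta_1 \cdot \text{age} + \beta_2 \cdot \text{salary}\right)
  = \frac{1}{1 + e^{-(\beta_0 + \beta_1 \cdot \text{age} + \beta_2 \cdot \text{salary})}}
```

A common convention is to predict "yes" when this probability is at least 0.5, which happens exactly when the weighted sum inside σ is non-negative.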

To understand logistic regression better, let's visualize it with a graph. Suppose we have a scatter plot of age vs. purchasing decision. Logistic regression fits a curve (an S-shaped logistic curve) to this data, representing the probability of purchasing the product.
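
The curve described above can be sketched with a few lines of Scikit-learn. The ages and purchase labels below are made up purely for illustration, and `max_iter=1000` is just a generous cap for the solver.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Made-up illustration data: ages and whether each person bought (1) or not (0)
ages = np.array([18, 22, 25, 28, 32, 35, 40, 45, 50, 55, 60, 65]).reshape(-1, 1)
bought = np.array([0, 0, 0, 0, 1, 0, 1, 1, 1, 1, 1, 1])

model = LogisticRegression(max_iter=1000)
model.fit(ages, bought)

# Read the purchase probability off the fitted S-shaped curve at a few ages
grid = np.array([[20], [30], [40], [50], [60]])
probs = model.predict_proba(grid)[:, 1]  # column 1 holds P(bought = 1)
for age, p in zip(grid.ravel(), probs):
    print(f"age {age}: P(buy) = {p:.2f}")
```

With this data the predicted probability rises steadily with age, from below 0.5 at the youngest ages to above 0.5 at the oldest, tracing the S-shaped curve.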

Video

https://www.youtube.com/embed/M_mGjdzatSo?si=hEJiA4KqvxxqfJXP

Practical exercise

Logistic Regression for Email Spam Classification

In this practical exercise, we will use Logistic Regression to classify whether an email is spam or not. We will load an external dataset with email features and apply machine learning techniques to build a classification model.

Tools and Frameworks Used

For this implementation, we utilize the following tools and frameworks:

Pandas: For handling and manipulating data efficiently.
NumPy: Used for numerical operations and handling arrays.
Scikit-learn: Provides essential tools for machine learning, including logistic regression, train-test splitting, and evaluation metrics.
CountVectorizer: A Scikit-learn class that converts text into numerical word-count features, which is required before a model can be trained on text.

Step 1: Import Necessary Libraries

First, we import the necessary libraries for data manipulation, model building, and evaluation.


import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report

Step 2: Load and Explore the Dataset

Instead of creating a synthetic dataset, we will load an external dataset from a CSV file. This dataset contains email text and a binary label indicating whether the email is spam (1) or not (0).


# Load dataset
df = pd.read_csv("spam_email_dataset.csv")

# Display first few rows
df.head()
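
Beyond the first rows, it is usually worth checking the dataset's size, class balance, and missing values before training. The sketch below uses a small hypothetical stand-in DataFrame so that it runs without the CSV file. It assumes the same `Email` and `Spam` column names that the later steps use.

```python
import pandas as pd

# Hypothetical stand-in rows; the real data comes from spam_email_dataset.csv
df = pd.DataFrame({
    "Email": [
        "Win a free prize now",
        "Meeting moved to 3pm",
        "Claim your free reward today",
        "Lunch tomorrow?",
    ],
    "Spam": [1, 0, 1, 0],
})

print(df.shape)                   # (number of rows, number of columns)
print(df["Spam"].value_counts())  # class balance: spam vs. not spam
print(df.isna().sum())            # missing values per column
```

A heavily imbalanced label column or missing email text would be worth handling before moving on to feature extraction.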

Step 3: Convert Text Data into Numerical Features

We will use a CountVectorizer to transform the email text into a numerical format suitable for machine learning models.


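# Split the raw emails first so the vocabulary is learned from training data only
emails_train, emails_test, y_train, y_test = train_test_split(
    df['Email'], df['Spam'], test_size=0.2, random_state=42
)

# Learn the vocabulary on the training emails, then reuse it for the test emails
vectorizer = CountVectorizer()
X_train = vectorizer.fit_transform(emails_train)
X_test = vectorizer.transform(emails_test)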

Step 4: Train a Logistic Regression Model

Now, we train a Logistic Regression model using the training dataset.


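# A higher iteration cap than the default helps the solver converge on word-count features
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)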

Step 5: Evaluate the Model

We will make predictions on the test set and evaluate the performance using accuracy and a classification report.


y_pred = model.predict(X_test)

# Print evaluation metrics
print(f"Accuracy: {accuracy_score(y_test, y_pred):.2f}")
print(classification_report(y_test, y_pred))
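
The numbers in the classification report come from counting correct and incorrect predictions per class. As a small worked example, the sketch below uses ten hypothetical labels (not results from the course dataset) to show how accuracy, precision, and recall are derived from the confusion matrix.

```python
from sklearn.metrics import (
    accuracy_score,
    confusion_matrix,
    precision_score,
    recall_score,
)

# Hypothetical labels for ten emails (1 = spam, 0 = not spam)
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 0, 1, 0]

# For binary labels, ravel() returns the counts in the order TN, FP, FN, TP
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(f"TP={tp} FP={fp} FN={fn} TN={tn}")

print(f"accuracy  = {accuracy_score(y_true, y_pred):.2f}")   # (TP + TN) / total
print(f"precision = {precision_score(y_true, y_pred):.2f}")  # TP / (TP + FP)
print(f"recall    = {recall_score(y_true, y_pred):.2f}")     # TP / (TP + FN)
```

Here three spam emails are caught, one is missed, and one legitimate email is wrongly flagged, giving an accuracy of 8/10 = 0.80 and a precision and recall of 3/4 = 0.75 each.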

Conclusion

In this exercise, we implemented a logistic regression model for spam classification. By converting email text into numerical word-count features, we trained a machine learning model to distinguish spam from legitimate emails, using an external dataset rather than synthetic data.

Jupyter Notebook

You can run this exercise in this Notebook

Licensed under the Creative Commons Attribution-ShareAlike 4.0 License