Logistic Regression for Email Spam Classification
In this practical exercise, we will use Logistic Regression to classify whether an email is spam or not. We will load an external dataset with email features and apply machine learning techniques to build a classification model.
Tools and Frameworks Used
For this implementation, we utilize the following tools and frameworks:
- Pandas: For handling and manipulating data efficiently.
- NumPy: Used for numerical operations and handling arrays.
- Scikit-learn: Provides essential tools for machine learning, including logistic regression, train-test splitting, and evaluation metrics.
- CountVectorizer (part of Scikit-learn): Converts text into a matrix of word counts, the numerical representation the model is trained on.
Step 1: Import Necessary Libraries
First, we import the necessary libraries for data manipulation, model building, and evaluation.
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report
Step 2: Load and Explore the Dataset
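Instead of creating a synthetic dataset, we will load an external dataset from a CSV file. This dataset contains email text and a binary label indicating whether the email is spam (1) or not (0). The code in the following steps expects the text in a column named Email and the label in a column named Spam.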
# Load dataset
df = pd.read_csv("spam_email_dataset.csv")
# Display first few rows
df.head()
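Before vectorizing, it can help to check for missing text and to look at the class balance. The following is a minimal sketch, not part of the original exercise: the small in-memory DataFrame is a hypothetical stand-in for the CSV file, and it assumes the Email and Spam column names used in the later steps.

```python
import pandas as pd

# Hypothetical stand-in for spam_email_dataset.csv (illustration only);
# the column names 'Email' and 'Spam' follow the later steps of this exercise.
sample_df = pd.DataFrame({
    "Email": [
        "Win a free prize now",
        "Meeting moved to 3pm",
        "Claim your free reward",
        None,  # an email with missing text
    ],
    "Spam": [1, 0, 1, 0],
})

# Count missing values per column
missing_counts = sample_df.isna().sum()
print(missing_counts)

# Drop rows without email text, since missing text cannot be vectorized
clean_df = sample_df.dropna(subset=["Email"])

# Class balance: number of spam (1) and non-spam (0) emails that remain
class_counts = clean_df["Spam"].value_counts()
print(class_counts)
```

A strongly imbalanced label column is worth knowing about early, because accuracy alone can then look high even for a model that rarely predicts spam.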
Step 3: Convert Text Data into Numerical Features
We will use a CountVectorizer to transform the email text into a numerical format suitable for machine learning models.
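# Split the raw text and labels into training and testing sets
emails_train, emails_test, y_train, y_test = train_test_split(df['Email'], df['Spam'], test_size=0.2, random_state=42)
# Learn the vocabulary from the training emails only, then reuse it for the test emails
vectorizer = CountVectorizer()
X_train = vectorizer.fit_transform(emails_train)
X_test = vectorizer.transform(emails_test)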
Step 4: Train a Logistic Regression Model
Now, we train a Logistic Regression model using the training dataset.
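# max_iter is raised from the default of 100 to give the solver more room to converge on high-dimensional word counts
model = LogisticRegression(max_iter=1000)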
model.fit(X_train, y_train)
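Under the hood, a binary logistic regression model computes a linear score for each email and passes it through the sigmoid function to obtain a spam probability. The sketch below checks this on a hypothetical four-email toy set (the data and the toy_ names are illustrative assumptions, not the exercise dataset).

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

# Hypothetical toy data, for illustration only (1 = spam, 0 = not spam)
toy_emails = [
    "win free prize",
    "free cash offer",
    "team meeting agenda",
    "meeting notes attached",
]
toy_labels = [1, 1, 0, 0]

toy_vectorizer = CountVectorizer()
X_toy = toy_vectorizer.fit_transform(toy_emails)

toy_model = LogisticRegression(max_iter=1000)
toy_model.fit(X_toy, toy_labels)

# Linear score (weights times word counts, plus intercept) for each email
scores = toy_model.decision_function(X_toy)

# The sigmoid maps each score to a probability between 0 and 1
manual_proba = 1.0 / (1.0 + np.exp(-scores))

# Probability of the spam class as reported by scikit-learn
sklearn_proba = toy_model.predict_proba(X_toy)[:, 1]

print(np.allclose(manual_proba, sklearn_proba))
```

The predict method then labels an email as spam when this probability exceeds 0.5, which is the same as the linear score being positive.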
Step 5: Evaluate the Model
We will make predictions on the test set and evaluate the performance using accuracy and a classification report.
y_pred = model.predict(X_test)
# Print evaluation metrics
print(f"Accuracy: {accuracy_score(y_test, y_pred):.2f}")
print(classification_report(y_test, y_pred))
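The precision and recall values in the classification report come from the counts of correct and incorrect predictions per class. A short sketch with hypothetical labels and predictions (made up for illustration, not results from the exercise dataset) shows how they are derived from a confusion matrix.

```python
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Hypothetical true labels and predictions, for illustration only
y_true_demo = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred_demo = [1, 0, 0, 1, 0, 1, 1, 0]

# Rows are true classes, columns are predicted classes: [[TN, FP], [FN, TP]]
cm = confusion_matrix(y_true_demo, y_pred_demo)
tn, fp, fn, tp = cm.ravel()
print(cm)

# Precision: of the emails flagged as spam, how many really are spam
precision = tp / (tp + fp)

# Recall: of the real spam emails, how many were flagged
recall = tp / (tp + fn)

print(f"Precision: {precision:.2f}, Recall: {recall:.2f}")
```

For a spam filter, a false positive (a legitimate email flagged as spam) is often more costly than a missed spam message, so precision on the spam class deserves attention alongside overall accuracy.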
Conclusion
In this exercise, we implemented a logistic regression model for spam classification. By converting email text into word-count features with CountVectorizer, we trained a model to distinguish spam from legitimate emails and evaluated it on a held-out test set, using an externally loaded dataset rather than a synthetic one.
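As a closing sketch, the vectorizing and training steps can also be chained into a single scikit-learn pipeline, which makes it straightforward to classify new, unseen emails. This is one possible arrangement rather than the method used in the steps above, and the toy emails and the name spam_pipeline are hypothetical.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Hypothetical toy training data, for illustration only (1 = spam, 0 = not spam)
toy_emails = [
    "win free prize",
    "free cash offer",
    "team meeting agenda",
    "meeting notes attached",
]
toy_labels = [1, 1, 0, 0]

# The pipeline fits the vectorizer and the classifier together
spam_pipeline = make_pipeline(CountVectorizer(), LogisticRegression(max_iter=1000))
spam_pipeline.fit(toy_emails, toy_labels)

# New emails are passed as raw text; the pipeline reuses the learned vocabulary
new_emails = ["free prize", "meeting notes"]
predictions = spam_pipeline.predict(new_emails).tolist()
print(predictions)
```

Because the pipeline applies the already fitted vocabulary to new text, words that never appeared in training are simply ignored at prediction time.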