Welcome to the k-Nearest Neighbors (k-NN) section of the Machine Learning Fundamentals module. In this section, we will explore the k-NN algorithm, its implementation in Python, and how it can be used to solve classification problems.
Introduction to k-Nearest Neighbors (k-NN)
The k-Nearest Neighbors algorithm is a simple yet powerful classification algorithm. It is a type of instance-based learning in which the function is approximated locally and all computation is deferred until evaluation. k-NN is non-parametric, meaning it makes no assumptions about the underlying data distribution.
k-NN works by finding the k training samples closest in distance to a new sample and predicting the label from these. The distance metric is typically Euclidean or Manhattan, though other metrics can be used. The predicted class for the new sample is the one most common among its k nearest neighbors.
Imagine you want to classify whether a fruit is an apple or an orange based on its weight and color. k-NN will look at the k nearest fruits in the training data and assign the class that is most common among them.
To better understand k-NN, let's visualize it with a graph. Suppose we have a scatter plot of fruit data with weight and color as axes. k-NN will classify a new fruit by looking at the nearest neighbors in this plot.
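Before turning to scikit-learn, here is a minimal from-scratch sketch of this idea, using Euclidean distance and a simple majority vote. The fruit weights and color scores below are made up purely for illustration:
import numpy as np

# Toy data: each row is (weight in grams, color score where 0 = green/red, 1 = orange)
X_fruit = np.array([[150, 0.1], [170, 0.2], [140, 0.9], [130, 0.8], [160, 0.15]])
y_fruit = np.array(['apple', 'apple', 'orange', 'orange', 'apple'])

def knn_predict(X, y, query, k=3):
    # Euclidean distance from the query point to every training sample
    distances = np.linalg.norm(X - query, axis=1)
    # Indices of the k closest training samples
    nearest = np.argsort(distances)[:k]
    # Majority vote among the labels of those neighbors
    labels, counts = np.unique(y[nearest], return_counts=True)
    return labels[np.argmax(counts)]

print(knn_predict(X_fruit, y_fruit, np.array([145, 0.85])))  # -> 'orange'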
Let's build a k-NN model to classify whether a person will buy a product based on their age and estimated salary.
In this step, we will import the necessary libraries for data manipulation, model building, and visualization. We will use pandas for data manipulation, scikit-learn for building the k-NN model, and matplotlib for plotting.
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score
import matplotlib.pyplot as plt
import numpy as np
We will load the dataset containing information about customers, their age, estimated salary, and whether they purchased a product. We will then explore the dataset to understand its structure and content.
# Load the dataset
data = pd.read_csv('purchase_data.csv')
# Display the first few rows of the dataset
print(data.head())
# Basic statistics of the dataset
print(data.describe())
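It also helps to check how many customers actually purchased. Assuming the purchased column holds 0/1 labels, value_counts shows the class balance:
# Check the class balance of the target (assumes 'purchased' holds 0/1 labels)
print(data['purchased'].value_counts())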
Data visualization helps us understand the relationship between different variables. Here, we will create a scatter plot to visualize the relationship between age and purchasing decision.
# Visualize the relationship between age and purchasing decision
plt.scatter(data['age'], data['purchased'], c='blue')
plt.xlabel('Age')
plt.ylabel('Purchased')
plt.title('Age vs Purchased')
plt.show()
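Since the model will use both features, a scatter plot of age against estimated salary colored by the purchase label can be more revealing (this sketch assumes the same column names used throughout this section):
# Plot age vs estimated salary, colored by whether the customer purchased
plt.scatter(data['age'], data['estimated_salary'], c=data['purchased'], cmap='coolwarm', edgecolors='k')
plt.xlabel('Age')
plt.ylabel('Estimated Salary')
plt.title('Age vs Estimated Salary by Purchase')
plt.show()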
In this step, we will prepare the data for modeling. We will split the dataset into features (age and estimated salary) and the target variable (purchased). Then, we will divide the data into training and testing sets to evaluate our model's performance.
# Split the dataset into features and target variable
X = data[['age', 'estimated_salary']]
y = data['purchased']
# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
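If the purchase classes are imbalanced, it can help to preserve their proportions in both splits; train_test_split supports this through its stratify argument (a small variation on the call above):
# Preserve the purchased/not-purchased ratio in both splits
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)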
We will now create an instance of the k-NN classifier and train it using the training data. This step involves fitting the model to the training data so it can learn the relationship between the features and the target variable.
# Create a k-NN classifier
knn = KNeighborsClassifier(n_neighbors=5)
# Train the classifier
knn.fit(X_train, y_train)
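One caveat: k-NN is distance-based, so a feature measured in tens of thousands (estimated salary) will dominate one measured in tens (age) when distances are computed. Standardizing the features usually improves results; below is a sketch using scikit-learn's StandardScaler in a Pipeline, while the rest of this section keeps the unscaled model for simplicity:
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Scale both features to zero mean and unit variance before computing distances
scaled_knn = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5))
scaled_knn.fit(X_train, y_train)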
After training the classifier, we will use it to make predictions on the testing data. This will help us evaluate how well the model performs on new, unseen data.
# Make predictions
predictions = knn.predict(X_test)
We will evaluate the performance of our k-NN classifier by calculating the accuracy of its predictions. Accuracy is a common metric used to assess classification models.
# Evaluate the model
accuracy = accuracy_score(y_test, predictions)
print(f'Accuracy: {accuracy}')
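Accuracy alone can be misleading when the classes are imbalanced. A confusion matrix gives a fuller picture of where the model goes wrong; here is a quick check using scikit-learn's confusion_matrix on the same predictions:
from sklearn.metrics import confusion_matrix

# Rows are true classes, columns are predicted classes
print(confusion_matrix(y_test, predictions))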
The value of k in k-NN is a crucial hyperparameter that affects the performance of the model. A smaller k means that noise has a higher influence, while a larger k makes the decision boundary smoother. You can choose the best value of k by evaluating the model's performance on held-out data; for simplicity, the loop below scores each k on the test set, although in practice a separate validation set or cross-validation is preferable.
# Loop through different k values to find the best k
k_values = range(1, 26)
accuracy_scores = []
for k in k_values:
    knn = KNeighborsClassifier(n_neighbors=k)
    knn.fit(X_train, y_train)
    predictions = knn.predict(X_test)
    accuracy_scores.append(accuracy_score(y_test, predictions))
# Plot the accuracy scores for different k values
plt.plot(k_values, accuracy_scores)
plt.xlabel('Value of k')
plt.ylabel('Accuracy')
plt.title('Accuracy for different k values')
plt.show()
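As noted above, scoring each k on the test set leaks information into the choice of k. A safer variant, sketched here, uses 5-fold cross-validation on the training set and keeps the test set for the final evaluation only:
from sklearn.model_selection import cross_val_score

# Mean 5-fold cross-validation accuracy on the training set for each k
cv_scores = [
    cross_val_score(KNeighborsClassifier(n_neighbors=k), X_train, y_train, cv=5).mean()
    for k in k_values
]
best_k = k_values[int(np.argmax(cv_scores))]
print(f'Best k by cross-validation: {best_k}')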
Visualizing the decision boundary can help you understand how the k-NN classifier separates the classes. We will create a mesh grid and plot the decision boundary of our model.
# Re-fit the classifier with k=5 (the loop above left knn fitted with the last k value)
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train, y_train)
# Create a mesh to plot the decision boundary (linspace keeps the grid a manageable size)
x_min, x_max = X_train['age'].min() - 1, X_train['age'].max() + 1
y_min, y_max = X_train['estimated_salary'].min() - 1, X_train['estimated_salary'].max() + 1
xx, yy = np.meshgrid(np.linspace(x_min, x_max, 200), np.linspace(y_min, y_max, 200))
# Predict the class label for each point in the mesh
Z = knn.predict(np.c_[xx.ravel(), yy.ravel()])
Z = Z.reshape(xx.shape)
# Plot the decision boundary
plt.contourf(xx, yy, Z, alpha=0.8)
plt.scatter(X_train['age'], X_train['estimated_salary'], c=y_train, edgecolors='k', marker='o')
plt.xlabel('Age')
plt.ylabel('Estimated Salary')
plt.title('k-NN Decision Boundary')
plt.show()
In this section, we explored k-Nearest Neighbors, a fundamental classification algorithm. We discussed how the algorithm works, visualized data and decision boundaries, and implemented k-NN classifiers in Python. You now have a solid understanding of k-NN and can apply it to a variety of classification problems.
Stay tuned for the next section, where we will explore Decision Trees and their applications.
Let's build a k-NN classifier using a different dataset to classify whether an email is spam or not based on features like email length and the number of exclamation marks.
# Load a new dataset
email_data = pd.read_csv('email_data.csv')
# Split the dataset into features and target variable
X_email = email_data[['email_length', 'num_exclamation_marks']]
y_email = email_data['is_spam']
# Split the data into training and testing sets
X_train_email, X_test_email, y_train_email, y_test_email = train_test_split(X_email, y_email, test_size=0.2, random_state=42)
# Create a k-NN classifier
email_knn = KNeighborsClassifier(n_neighbors=5)
# Train the classifier
email_knn.fit(X_train_email, y_train_email)
# Make predictions
email_predictions = email_knn.predict(X_test_email)
# Evaluate the model
email_accuracy = accuracy_score(y_test_email, email_predictions)
print(f'Email Model Accuracy: {email_accuracy}')
In this exercise, you will implement k-NN on a different dataset, understand how to prepare and split data, train the classifier, and evaluate its performance.
You can work through this exercise in the accompanying notebook.
[Link to dataset]
Licensed under the Creative Commons Attribution-ShareAlike 4.0 License