2.2.4 Decision Trees

Information

Welcome to the Decision Trees section of the Machine Learning Fundamentals course. In this section, we will explore the Decision Trees algorithm, its implementation in Python, and how it can be used to solve classification problems.

Decision Trees

Introduction to Decision Trees

Decision Trees are a type of supervised learning algorithm used for both classification and regression tasks. They work by splitting the data into subsets based on the values of the input features. This process is repeated recursively, creating a tree-like structure in which each internal node represents a decision based on a feature and each branch represents an outcome of that decision.
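To make the splitting step concrete, here is a minimal sketch of how one split on one feature could be chosen. It assumes Gini impurity as the quality measure (the default criterion of scikit-learn's DecisionTreeClassifier; other criteria exist). The helper names gini and best_split and the toy data are hypothetical and only serve the illustration.

```python
from collections import Counter

def gini(labels):
    """Gini impurity of a list of class labels (0.0 means perfectly pure)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def best_split(values, labels):
    """Return (threshold, weighted_impurity) for the best 'value <= threshold' split."""
    best = (None, gini(labels))  # start from the impurity of the unsplit data
    distinct = sorted(set(values))
    # Candidate thresholds are midpoints between consecutive distinct values.
    for lo, hi in zip(distinct, distinct[1:]):
        threshold = (lo + hi) / 2
        left = [y for x, y in zip(values, labels) if x <= threshold]
        right = [y for x, y in zip(values, labels) if x > threshold]
        weighted = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
        if weighted < best[1]:
            best = (threshold, weighted)
    return best

# Hypothetical toy data: ages and whether each person purchased (1) or not (0).
ages = [22, 25, 28, 35, 41, 47, 52, 58]
purchased = [0, 0, 0, 1, 1, 1, 1, 1]

threshold, impurity = best_split(ages, purchased)
print(threshold, impurity)  # 31.5 0.0
```

A full tree builder would apply the same search to every feature, keep the best split overall, and then repeat the procedure on each resulting subset until a stopping condition is met.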

What are Decision Trees?

A Decision Tree is a flowchart-like structure in which each internal node represents a test on an attribute, each branch represents an outcome of that test, and each leaf node represents a class label (the decision reached after evaluating every test along the path). The paths from root to leaf represent classification rules.
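The idea that each root-to-leaf path is a classification rule can be sketched with a small hand-written tree. The nested dictionary layout, the extract_rules helper, and the thresholds below are hypothetical choices for illustration, not the structure scikit-learn uses internally.

```python
# Hypothetical tree: internal nodes hold a test and two children; leaves hold a class label.
toy_tree = {
    "test": "age < 30",
    "yes": {"label": "Not Purchased"},
    "no": {
        "test": "estimated_salary > 50000",
        "yes": {"label": "Purchased"},
        "no": {"label": "Not Purchased"},
    },
}

def extract_rules(node, conditions=()):
    """Return one 'IF ... THEN ...' rule per root-to-leaf path."""
    if "label" in node:  # leaf: the accumulated conditions form the rule's premise
        premise = " AND ".join(conditions) if conditions else "TRUE"
        return [f"IF {premise} THEN {node['label']}"]
    test = node["test"]
    return (
        extract_rules(node["yes"], conditions + (test,))
        + extract_rules(node["no"], conditions + (f"NOT ({test})",))
    )

rules = extract_rules(toy_tree)
for rule in rules:
    print(rule)
```

The tree has three leaves, so the traversal yields three rules, one per path.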

Imagine you want to classify whether a person will buy a product based on their age and estimated salary. A Decision Tree will split the data based on these features and create a tree where each node represents a decision (e.g., age < 30, estimated salary > 50k) leading to a prediction.

To better understand Decision Trees, picture the dataset as a scatter plot of age against estimated salary. Each split places a threshold on one of the two features, so the tree divides the plot into rectangular regions, and every point that falls in a region receives that region's prediction of whether the person will buy the product.

Practical Exercise

Let's build a Decision Tree model to classify whether a person will buy a product based on their age and estimated salary.

Tools We Are Using:

Pandas: A library for data manipulation and analysis.
Scikit-learn: A machine learning library for Python that provides simple and efficient tools for data mining and data analysis.
Matplotlib: A plotting library for creating static, animated, and interactive visualizations in Python.

Step 1: Import Necessary Libraries

In this step, we will import the necessary libraries for data manipulation, model building, and visualization. We will use pandas for data manipulation, scikit-learn for building the Decision Tree model, and matplotlib for plotting.

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score
import matplotlib.pyplot as plt
from sklearn import tree

Step 2: Load and Explore the Dataset

We will load the dataset containing information about customers, their age, estimated salary, and whether they purchased a product. We will then explore the dataset to understand its structure and content.

# Load the dataset
data = pd.read_csv('purchase_data.csv')

# Display the first few rows of the dataset
print(data.head())

# Basic statistics of the dataset
print(data.describe())

Step 3: Data Visualization

Data visualization helps us understand the relationship between different variables. Here, we will create a scatter plot to visualize the relationship between age and purchasing decision.

# Visualize the relationship between age and purchasing decision
plt.scatter(data['age'], data['purchased'], c='blue')
plt.xlabel('Age')
plt.ylabel('Purchased')
plt.title('Age vs Purchased')
plt.show()

Step 4: Prepare the Data

In this step, we will prepare the data for modeling. We will split the dataset into features (age and estimated salary) and the target variable (purchased). Then, we will divide the data into training and testing sets to evaluate our model's performance.

# Split the dataset into features and target variable
X = data[['age', 'estimated_salary']]
y = data['purchased']

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
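As a quick sanity check of the split, the sketch below runs train_test_split on a small hypothetical stand-in for purchase_data.csv (the ten rows are invented; only the column names follow the lesson). It also passes stratify, an optional argument not used in the lesson's code, which asks scikit-learn to keep the class proportions the same in both subsets.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for purchase_data.csv: 10 invented rows, same column names.
toy = pd.DataFrame({
    "age": [22, 25, 28, 31, 35, 41, 47, 52, 58, 63],
    "estimated_salary": [20000, 32000, 41000, 58000, 60000,
                         72000, 45000, 83000, 90000, 39000],
    "purchased": [0, 0, 0, 1, 1, 1, 0, 1, 1, 0],
})

X_toy = toy[["age", "estimated_salary"]]
y_toy = toy["purchased"]

# test_size=0.2 holds out 20% of the rows; stratify keeps the 50/50 class balance.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42, stratify=y_toy
)

print(X_tr.shape, X_te.shape)  # (8, 2) (2, 2)
print(sorted(y_te.tolist()))   # [0, 1]
```

With imbalanced classes, an unstratified split can leave the test set with very few examples of the rarer class, which makes the accuracy estimate less reliable.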

Step 5: Build and Train the Model

We will now create an instance of the Decision Tree classifier and train it using the training data. This step involves fitting the model to the training data so it can learn the relationship between the features and the target variable.

# Create a Decision Tree classifier
# (random_state makes the choice between equally good splits reproducible)
clf = DecisionTreeClassifier(random_state=42)

# Train the classifier
clf.fit(X_train, y_train)

Step 6: Make Predictions

After training the classifier, we will use it to make predictions on the testing data. This will help us evaluate how well the model performs on new, unseen data.

# Make predictions
predictions = clf.predict(X_test)

Step 7: Evaluate the Model

We will evaluate the performance of our Decision Tree classifier by calculating the accuracy of its predictions. Accuracy is the fraction of test samples whose predicted label matches the true label, and it is a common metric for assessing classification models.

# Evaluate the model
accuracy = accuracy_score(y_test, predictions)
print(f'Accuracy: {accuracy}')
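To see what the accuracy figure summarizes, the sketch below computes it by hand and also counts the four possible outcomes of a binary classifier. The label lists are hypothetical and are not results from the purchase dataset.

```python
# Hypothetical true labels and predictions for ten test samples (1 = purchased).
y_true = [1, 0, 1, 1, 0, 0, 1, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0, 1, 0]

pairs = list(zip(y_true, y_pred))

# Accuracy: correct predictions divided by the total number of predictions.
correct = sum(t == p for t, p in pairs)
manual_accuracy = correct / len(pairs)

# Break the predictions down by outcome type.
tp = sum(t == 1 and p == 1 for t, p in pairs)  # true positives
tn = sum(t == 0 and p == 0 for t, p in pairs)  # true negatives
fp = sum(t == 0 and p == 1 for t, p in pairs)  # false positives
fn = sum(t == 1 and p == 0 for t, p in pairs)  # false negatives

print(f"Accuracy: {manual_accuracy}")          # Accuracy: 0.8
print(f"TP={tp} TN={tn} FP={fp} FN={fn}")      # TP=4 TN=4 FP=1 FN=1
```

The breakdown matters because accuracy alone does not show which kind of mistake the model makes; two models with the same accuracy can differ in how many buyers they miss.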

Advanced Concepts and Practical Exercises:

1. Visualizing the Decision Tree

Visualizing the Decision Tree can help you understand how the classifier makes decisions. We will use the plot_tree function from scikit-learn to visualize the tree structure.

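# Plot the decision tree
# class_names must be listed in ascending order of the label values
# (this assumes 0 = not purchased and 1 = purchased)
plt.figure(figsize=(20, 10))
tree.plot_tree(
    clf,
    filled=True,
    feature_names=['age', 'estimated_salary'],
    class_names=['Not Purchased', 'Purchased'],
)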
plt.show()
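When a plot is hard to read, or no display is available, scikit-learn's export_text function prints the same tree as indented text rules. The sketch below fits a small tree on hypothetical toy data shaped like the course dataset; the values and the max_depth=2 setting are assumptions made to keep the output short.

```python
from sklearn.tree import DecisionTreeClassifier, export_text

# Hypothetical toy data in the same layout as the lesson: [age, estimated_salary].
X_toy = [
    [22, 20000], [25, 32000], [28, 41000], [31, 58000], [35, 60000],
    [41, 72000], [47, 45000], [52, 83000], [58, 90000], [63, 39000],
]
y_toy = [0, 0, 0, 1, 1, 1, 0, 1, 1, 0]

# A shallow tree keeps the printed rules short.
toy_clf = DecisionTreeClassifier(max_depth=2, random_state=42)
toy_clf.fit(X_toy, y_toy)

# Each indented line is a test; lines starting with "class:" are leaves.
rules_text = export_text(toy_clf, feature_names=["age", "estimated_salary"])
print(rules_text)
```

Reading the output from top to bottom along one branch gives the sequence of tests that leads to a prediction.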

2. Feature Importance

Understanding the importance of each feature can provide insights into the data and the model. We will examine the feature importance scores from the trained Decision Tree model.

# Feature importance
print(f'Feature Importance: {clf.feature_importances_}')
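The raw array is easier to interpret when each score is paired with its feature name. The sketch below does this on hypothetical toy data (invented values, same column layout as the lesson). In scikit-learn these scores are impurity-based and normalized, so for a tree with at least one split they sum to 1.

```python
from sklearn.tree import DecisionTreeClassifier

# Hypothetical toy data in the same layout as the lesson: [age, estimated_salary].
feature_names = ["age", "estimated_salary"]
X_toy = [
    [22, 20000], [25, 32000], [28, 41000], [31, 58000], [35, 60000],
    [41, 72000], [47, 45000], [52, 83000], [58, 90000], [63, 39000],
]
y_toy = [0, 0, 0, 1, 1, 1, 0, 1, 1, 0]

toy_clf = DecisionTreeClassifier(random_state=42).fit(X_toy, y_toy)

# Pair each score with its feature name and sort from most to least important.
ranked = sorted(
    zip(feature_names, toy_clf.feature_importances_),
    key=lambda pair: pair[1],
    reverse=True,
)
for name, score in ranked:
    print(f"{name}: {score:.3f}")
```

In this toy data a single salary threshold separates the two classes, so the tree never needs to split on age and salary receives all of the importance. A score of zero only means the fitted tree did not use the feature, not that the feature is unrelated to the target.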

Summary

In this section, we explored Decision Trees, a fundamental classification algorithm. We discussed how they work, visualized the data and the fitted tree, and implemented Decision Tree classifiers in Python. You now have a working understanding of Decision Trees and can apply them to a range of classification problems.

Stay tuned for the next section, where we will explore Ensemble Methods and their applications.

Practice

Let's build a Decision Tree classifier using a different dataset to classify whether an email is spam or not based on features like email length and the number of exclamation marks.

Example Code:

# Load a new dataset
email_data = pd.read_csv('email_data.csv')

# Split the dataset into features and target variable
X_email = email_data[['email_length', 'num_exclamation_marks']]
y_email = email_data['is_spam']

# Split the data into training and testing sets
X_train_email, X_test_email, y_train_email, y_test_email = train_test_split(X_email, y_email, test_size=0.2, random_state=42)

# Create a Decision Tree classifier
# (random_state makes the choice between equally good splits reproducible)
email_clf = DecisionTreeClassifier(random_state=42)

# Train the classifier
email_clf.fit(X_train_email, y_train_email)

# Make predictions
email_predictions = email_clf.predict(X_test_email)

# Evaluate the model
email_accuracy = accuracy_score(y_test_email, email_predictions)
print(f'Email Model Accuracy: {email_accuracy}')

In this exercise, you will implement a Decision Tree classifier on a different dataset, understand how to prepare and split data, train the classifier, and evaluate its performance.

You can practice the Decision Trees algorithm in this Notebook.

[Link to dataset]

Licensed under the Creative Commons Attribution-ShareAlike 4.0 License