Welcome to the Decision Trees section of the Machine Learning Fundamentals course. In this section, we will explore the Decision Trees algorithm, its implementation in Python, and how it can be used to solve classification problems.
Decision Trees are a supervised learning algorithm used for both classification and regression tasks. They work by splitting the data into subsets based on the values of the input features. This process is repeated recursively, creating a tree-like structure where each node represents a decision based on a feature and each branch represents an outcome of that decision.
A Decision Tree is a flowchart-like structure in which each internal node represents a test on an attribute, each branch represents an outcome of that test, and each leaf node represents a class label, i.e. the decision reached after evaluating the attributes along the path. The paths from the root to the leaves therefore represent classification rules.
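At each internal node, the algorithm chooses the split that makes the resulting subsets as pure as possible. Scikit-learn's DecisionTreeClassifier scores candidate splits with Gini impurity by default; the short sketch below shows that arithmetic on a set of made-up labels, purely for illustration.
# A minimal illustration of the Gini impurity criterion (the default in
# scikit-learn's DecisionTreeClassifier). The labels are made up purely
# to show how one candidate split is scored.
def gini(labels):
    # Gini impurity = 1 - sum of squared class proportions
    n = len(labels)
    return 1 - sum((labels.count(c) / n) ** 2 for c in set(labels))

left = [0, 0, 0, 1]    # hypothetical labels falling on one side of a split
right = [1, 1, 1, 0]   # hypothetical labels on the other side
n_total = len(left) + len(right)
# Weighted impurity of the split; the tree prefers splits that minimize this
weighted = (len(left) / n_total) * gini(left) + (len(right) / n_total) * gini(right)
print(f'Left impurity: {gini(left):.3f}, right impurity: {gini(right):.3f}, weighted: {weighted:.3f}')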
Imagine you want to classify whether a person will buy a product based on their age and estimated salary. A Decision Tree will split the data based on these features and create a tree where each node represents a decision (e.g., age < 30, estimated salary > 50k) leading to a prediction.
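To make that concrete, a trained tree can be read as a nested set of if/else rules. The sketch below is hypothetical: the thresholds (age < 30, estimated salary > 50k) simply mirror the example above and are not learned from any real dataset.
# Hypothetical decision rules a small tree might learn for this example.
# The thresholds are illustrative only, not taken from a trained model.
def predict_purchase(age, estimated_salary):
    if age < 30:
        # Younger customers: the split on salary decides the leaf
        if estimated_salary > 50_000:
            return 'Purchased'
        return 'Not Purchased'
    # Older customers fall into a different branch of the tree
    return 'Purchased'

print(predict_purchase(25, 60_000))   # -> 'Purchased'
print(predict_purchase(25, 40_000))   # -> 'Not Purchased'
print(predict_purchase(45, 40_000))   # -> 'Purchased'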
To better understand Decision Trees, let's visualize one with a graph. Suppose we have a dataset with age and estimated salary. A Decision Tree will split the data at different points to create a tree structure that helps classify whether a person will buy a product.
Let's build a Decision Tree model to classify whether a person will buy a product based on their age and estimated salary.
In this step, we will import the necessary libraries for data manipulation, model building, and visualization. We will use pandas for data manipulation, scikit-learn for building the Decision Tree model, and matplotlib for plotting.
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score
import matplotlib.pyplot as plt
from sklearn import tree
We will load the dataset containing information about customers, their age, estimated salary, and whether they purchased a product. We will then explore the dataset to understand its structure and content.
# Load the dataset
data = pd.read_csv('purchase_data.csv')
# Display the first few rows of the dataset
print(data.head())
# Basic statistics of the dataset
print(data.describe())
Data visualization helps us understand the relationship between different variables. Here, we will create a scatter plot to visualize the relationship between age and purchasing decision.
# Visualize the relationship between age and purchasing decision
plt.scatter(data['age'], data['purchased'], c='blue')
plt.xlabel('Age')
plt.ylabel('Purchased')
plt.title('Age vs Purchased')
plt.show()
In this step, we will prepare the data for modeling. We will split the dataset into features (age and estimated salary) and the target variable (purchased). Then, we will divide the data into training and testing sets to evaluate our model's performance.
# Split the dataset into features and target variable
X = data[['age', 'estimated_salary']]
y = data['purchased']
# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
We will now create an instance of the Decision Tree classifier and train it using the training data. This step involves fitting the model to the training data so it can learn the relationship between the features and the target variable.
# Create a Decision Tree classifier
clf = DecisionTreeClassifier()
# Train the classifier
clf.fit(X_train, y_train)
After training the classifier, we will use it to make predictions on the testing data. This will help us evaluate how well the model performs on new, unseen data.
# Make predictions
predictions = clf.predict(X_test)
We will evaluate the performance of our Decision Tree classifier by calculating the accuracy of its predictions. Accuracy is a common metric used to assess classification models.
# Evaluate the model
accuracy = accuracy_score(y_test, predictions)
print(f'Accuracy: {accuracy}')
Visualizing the Decision Tree can help you understand how the classifier makes decisions. We will use the plot_tree function from scikit-learn to visualize the tree structure.
# Plot the decision tree
plt.figure(figsize=(20,10))
tree.plot_tree(clf, filled=True, feature_names=['age', 'estimated_salary'], class_names=['Not Purchased', 'Purchased'])
plt.show()
Understanding the importance of each feature can provide insights into the data and the model. We will examine the feature importance scores from the trained Decision Tree model.
# Feature importance
print(f'Feature Importance: {clf.feature_importances_}')
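The scores are returned in the same order as the feature columns, so pairing them with the column names makes the output easier to read. A small optional snippet, assuming the same two features used above:
# Optional: pair each importance score with its feature name for readability
for name, importance in zip(['age', 'estimated_salary'], clf.feature_importances_):
    print(f'{name}: {importance:.3f}')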
In this section, we explored Decision Trees, a fundamental classification algorithm. We discussed how the algorithm recursively splits data into a tree of decisions, visualized the data and the trained tree, and implemented Decision Tree classifiers in Python. You now have a solid understanding of Decision Trees and can apply them to a variety of classification problems.
Stay tuned for the next section, where we will explore Ensemble Methods and their applications.
Let's build a Decision Tree classifier using a different dataset to classify whether an email is spam or not based on features like email length and the number of exclamation marks.
# Load a new dataset
email_data = pd.read_csv('email_data.csv')
# Split the dataset into features and target variable
X_email = email_data[['email_length', 'num_exclamation_marks']]
y_email = email_data['is_spam']
# Split the data into training and testing sets
X_train_email, X_test_email, y_train_email, y_test_email = train_test_split(X_email, y_email, test_size=0.2, random_state=42)
# Create a Decision Tree classifier
email_clf = DecisionTreeClassifier()
# Train the classifier
email_clf.fit(X_train_email, y_train_email)
# Make predictions
email_predictions = email_clf.predict(X_test_email)
# Evaluate the model
email_accuracy = accuracy_score(y_test_email, email_predictions)
print(f'Email Model Accuracy: {email_accuracy}')
In this exercise, you will implement a Decision Tree classifier on a different dataset, understand how to prepare and split data, train the classifier, and evaluate its performance.
You can practice the Decision Trees algorithm in this Notebook.
[Link to dataset]
Licensed under the Creative Commons Attribution-ShareAlike 4.0 License