2.3.1 Linear Regression

Introduction to Linear Regression

Linear regression is one of the simplest and most widely used techniques in **Machine Learning**. It is a **supervised learning algorithm** used for predicting a continuous value based on one or more input variables. In simple terms, it finds a straight-line relationship between input and output data.

What is Linear Regression?

Linear regression is a mathematical approach that helps us understand how one variable (the dependent variable) changes based on another variable (the independent variable). For example, we can use linear regression to predict:

The price of a house based on its size.
The sales of a product based on advertising budget.
The temperature of a city based on past weather data.

The key idea is that we fit a **straight line** to our data that best represents the relationship between the variables.

How Does It Work?

Linear regression finds the **best-fitting line** through the data points by minimizing the sum of the squared differences between the actual values and the predicted values (the least-squares criterion). The equation of a simple linear regression model is:

Y = mX + b

Where:

Y is the value we want to predict (dependent variable).
X is the input variable (independent variable).
m is the slope of the line (shows how much Y changes for each unit of X).
b is the intercept (the value of Y when X = 0).
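To make the equation concrete, here is a minimal sketch that computes m and b by hand with the standard least-squares formulas for a single input variable. The toy data is made up for illustration (it follows Y = 2X + 1 exactly) and is not part of the house-price example later in this section.

```python
import numpy as np

# Toy data, invented for this illustration: y is exactly 2*x + 1
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([3.0, 5.0, 7.0, 9.0, 11.0])

# Least-squares estimates for one input variable:
#   m = sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)
#   b = mean(y) - m * mean(x)
x_mean, y_mean = x.mean(), y.mean()
m = np.sum((x - x_mean) * (y - y_mean)) / np.sum((x - x_mean) ** 2)
b = y_mean - m * x_mean

print(f"m = {m:.2f}, b = {b:.2f}")

# Use the fitted line to predict Y for a new X
x_new = 6
y_new = m * x_new + b
print(f"Prediction for X = {x_new}: {y_new:.2f}")
```

Because the toy data lies exactly on a line, the recovered slope and intercept match the values used to generate it. With noisy data, the same formulas give the line that minimizes the squared error.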

Why Use Linear Regression?

Linear regression is useful because it is:

**Easy to understand** – It is one of the most basic predictive models.
**Quick to implement** – It requires minimal computing power compared to other machine learning models.
**Interpretable** – The equation provides a clear mathematical relationship between variables.

How Is It Different from Other Machine Learning Algorithms?

Unlike more complex machine learning models, **linear regression assumes a linear relationship between the input and output variables**. This makes it different from algorithms such as:

**Decision Trees** – Which split the data into regions using a sequence of threshold rules rather than fitting a single straight line.
**Neural Networks** – Which use multiple layers of interconnected nodes for more complex pattern recognition.
**Clustering Algorithms** – Which group similar data points instead of predicting values.

While linear regression is useful for many simple predictions, it may not work well for data with **non-linear relationships**. In such cases, other machine learning techniques might be more effective.
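The limitation with non-linear data can be seen in a small experiment. The sketch below fits a straight line to two synthetic datasets (both invented for this illustration): one that is truly linear and one that follows a curve (y = x²). It compares the R² score returned by scikit-learn's `score` method, where 1.0 means a perfect fit.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic inputs from -5 to 5 (illustrative only)
X = np.arange(-5, 6).reshape(-1, 1)

y_linear = 3 * X.ravel() + 2   # straight-line relationship
y_curved = X.ravel() ** 2      # curved (quadratic) relationship

# Fit a straight line to each dataset and measure R^2 on the same data
r2_linear = LinearRegression().fit(X, y_linear).score(X, y_linear)
r2_curved = LinearRegression().fit(X, y_curved).score(X, y_curved)

print(f"R^2 on linear data: {r2_linear:.3f}")
print(f"R^2 on curved data: {r2_curved:.3f}")
```

On the curved data the best straight line is essentially flat, so it explains almost none of the variation. This is the kind of situation where another model, or transformed input features, would be a better choice.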

What Will You Learn in This Section?

In this section, you will:

Understand how linear regression works.
Train a simple regression model using a small simulated dataset.
Make predictions using the model.
Visualize and interpret the results.

By the end of this section, you will be able to use linear regression to solve basic prediction problems. Let’s get started!

Video

https://www.youtube.com/embed/M_mGjdzatSo?si=hEJiA4KqvxxqfJXP

Practice: Predicting House Prices

Linear Regression Example: Predicting House Prices

In this example, we will build a **linear regression** model to predict house prices based on their size in square meters. Each step in the process will be explained in detail, showing how the data is prepared, the model is trained, and the predictions are made.

Step 1: Importing Required Libraries

First, we import the necessary Python libraries:

NumPy: Used for handling numerical arrays.
Matplotlib: Used for data visualization.
scikit-learn (sklearn): Used for creating and training the regression model.
train_test_split: A scikit-learn function used to split the dataset into training and test sets.


import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

These libraries will help us create, train, and visualize the linear regression model.

Step 2: Creating the Dataset

We create a small dataset containing **house sizes (in square meters)** and their corresponding **prices (in thousands of dollars)**. The data is simulated, but it mimics a common real estate trend: larger houses tend to have higher prices.


# Simulated data: house size (square meters) and price (thousands of dollars)
X = np.array([50, 60, 75, 80, 100, 120, 150, 170, 200, 220]).reshape(-1, 1)
y = np.array([150, 180, 210, 230, 280, 300, 350, 370, 400, 450])  # Price in thousands of dollars

The variable X represents house sizes, and y represents prices. We reshape X into a two-dimensional array with one column because scikit-learn expects the input features in the form (number of samples, number of features).
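To see what the reshape does, this short sketch prints the array shape before and after. The name `sizes` is introduced here only for illustration.

```python
import numpy as np

# Same house sizes as above, first as a flat 1D array
sizes = np.array([50, 60, 75, 80, 100, 120, 150, 170, 200, 220])
print(sizes.shape)  # (10,)  -> 10 values, no column dimension

# reshape(-1, 1): one column, and as many rows as needed
X = sizes.reshape(-1, 1)
print(X.shape)      # (10, 1) -> 10 samples, 1 feature
```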

Step 3: Splitting the Data into Training and Test Sets

To evaluate the model, we split the data into **training (80%)** and **test (20%)** sets. This ensures that the model is trained on one part of the data and tested on another, so we can measure how well it predicts data it has not seen.


# Split data into training (80%) and test (20%) sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

After running this cell, X_train and y_train will contain the training data, while X_test and y_test will contain the test data.
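A quick way to check the split is to print the size of each part. The sketch below repeats the dataset so it can run on its own. With 10 samples and `test_size=0.2`, 8 samples go to training and 2 to testing.

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.array([50, 60, 75, 80, 100, 120, 150, 170, 200, 220]).reshape(-1, 1)
y = np.array([150, 180, 210, 230, 280, 300, 350, 370, 400, 450])

# random_state fixes the shuffle so the split is the same on every run
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

print(f"Training samples: {len(X_train)}")
print(f"Test samples: {len(X_test)}")
print(f"Test sizes (m²): {X_test.ravel()}")
```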

Step 4: Training the Linear Regression Model

Now, we create an instance of the **LinearRegression** model and train it using the training data. The model will learn the relationship between house size and price.


# Create and train the linear regression model
model = LinearRegression()
model.fit(X_train, y_train)

After this step, the model has learned the trend in the data and is ready to make predictions.
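The test set from Step 3 can now be used to check the trained model. The sketch below is one possible check, not part of the original notebook: it uses the mean absolute error from `sklearn.metrics`, which reports the average prediction error in the same unit as the prices (thousands of dollars). With only two test houses, the result is a rough indication at best.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

X = np.array([50, 60, 75, 80, 100, 120, 150, 170, 200, 220]).reshape(-1, 1)
y = np.array([150, 180, 210, 230, 280, 300, 350, 370, 400, 450])

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

model = LinearRegression()
model.fit(X_train, y_train)

# Predict prices for the houses the model has not seen during training
y_pred = model.predict(X_test)

# Average absolute difference between real and predicted prices
mae = mean_absolute_error(y_test, y_pred)
print(f"Mean absolute error on the test set: {mae:.2f} thousand dollars")
```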

Step 5: Displaying the Regression Equation

Once trained, we extract the **equation of the regression line**, which helps us understand the relationship between house size and price.


# Display the regression equation
print(f"Model equation: Price = {model.coef_[0]:.2f} * Size + {model.intercept_:.2f}")

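In this equation, the coefficient (the slope) shows how much the predicted price, in thousands of dollars, changes for each additional square meter. The intercept is the predicted price when the size is 0.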
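To confirm that the printed equation is what the model actually uses, the sketch below computes a prediction by hand from `coef_` and `intercept_` and compares it with `model.predict`. The house size of 100 m² is an arbitrary example value.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

X = np.array([50, 60, 75, 80, 100, 120, 150, 170, 200, 220]).reshape(-1, 1)
y = np.array([150, 180, 210, 230, 280, 300, 350, 370, 400, 450])

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
model = LinearRegression().fit(X_train, y_train)

size = 100  # arbitrary example size in square meters

# Price = slope * Size + intercept, computed by hand
manual = model.coef_[0] * size + model.intercept_

# The same prediction made by the model itself
predicted = model.predict(np.array([[size]]))[0]

print(f"By hand: {manual:.2f}")
print(f"model.predict: {predicted:.2f}")
```

Both values agree, because `predict` applies the same slope and intercept.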

Step 6: Visualizing the Regression Model

To better understand how the model fits the data, we create a **scatter plot** of the real data and overlay the **regression line**.


# Visualization of the model
plt.scatter(X, y, color="blue", label="Real Data")
plt.plot(X, model.predict(X), color="red", linewidth=2, label="Linear Regression")
plt.xlabel("Size (m²)")
plt.ylabel("Price (thousands of $)")
plt.title("Linear Regression: House Price Prediction")
plt.legend()
plt.show()

The blue points represent actual house prices.
The red line represents the predictions made by the model.

A well-fitted model should have the red line closely following the trend of the blue points.

Step 7: Predicting House Prices with User Input

Now, we let the user enter a house size, and the model predicts its price. This step uses ipywidgets, so it needs to run in a Jupyter notebook.


# Interactive input for user to predict house price (requires Jupyter and ipywidgets)
import ipywidgets as widgets
from IPython.display import display

def predict_price(size):
    # The model expects a 2D array: one row, one feature
    price = model.predict(np.array([[size]]))[0]
    print(f"Predicted price for a {size}m² house: ${price:.2f}k")

size_input = widgets.IntText(description="Size (m²):", value=100)
predict_button = widgets.Button(description="Predict Price")
output = widgets.Output()  # Captures the printed result so it appears below the button

def on_button_click(b):
    with output:
        output.clear_output()  # Remove the previous prediction
        predict_price(size_input.value)

predict_button.on_click(on_button_click)

display(size_input, predict_button, output)

The user enters a house size in square meters.
The model predicts the price based on the trained regression equation.

This makes the model interactive and easy to experiment with.
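If ipywidgets is not available, the same predictions can be made without any interactive controls. The sketch below is an alternative, not part of the original notebook: it predicts prices for a few example sizes (chosen arbitrarily) in a single call to `predict`.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

X = np.array([50, 60, 75, 80, 100, 120, 150, 170, 200, 220]).reshape(-1, 1)
y = np.array([150, 180, 210, 230, 280, 300, 350, 370, 400, 450])

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
model = LinearRegression().fit(X_train, y_train)

# Example sizes to predict, one row per house (arbitrary values)
new_sizes = np.array([[90], [130], [180]])
predicted_prices = model.predict(new_sizes)

for size, price in zip(new_sizes.ravel(), predicted_prices):
    print(f"Predicted price for a {size}m² house: ${price:.2f}k")
```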

Conclusion

In this example, we successfully implemented **linear regression** to predict house prices. We learned:

How to create and split a dataset.
How to train a linear regression model.
How to visualize the regression line.
How to make predictions based on user input.

This is a simple but powerful technique that can be expanded with more features like **location, number of rooms, and house age** to improve predictions.
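As a rough sketch of what such an extension might look like, the example below adds a second feature, the number of rooms, to the same house sizes and prices. The room counts are invented for this illustration and do not come from the original dataset. The only change for scikit-learn is that X now has two columns.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Columns: size (m²), number of rooms. Room counts are hypothetical.
X_multi = np.array([
    [50, 1], [60, 2], [75, 2], [80, 3], [100, 3],
    [120, 4], [150, 4], [170, 5], [200, 5], [220, 6],
])
y = np.array([150, 180, 210, 230, 280, 300, 350, 370, 400, 450])

model_multi = LinearRegression()
model_multi.fit(X_multi, y)

# One coefficient per feature, plus a single intercept
print(f"Coefficients (size, rooms): {model_multi.coef_}")
print(f"Intercept: {model_multi.intercept_:.2f}")

# Predict the price of a hypothetical 110 m² house with 3 rooms
new_house = np.array([[110, 3]])
price = model_multi.predict(new_house)[0]
print(f"Predicted price: ${price:.2f}k")
```

The model is still linear: the predicted price is a weighted sum of the features plus the intercept.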

Jupyter Notebook

Here is the code to run the practice: Notebook

Licensed under the Creative Commons Attribution-ShareAlike 4.0 License