## Regression

### Introduction

Most of the other chapters of our machine learning tutorial with Python deal with classification problems. Classification is the task of predicting a discrete class label, whereas regression is the task of predicting a continuous quantity. Some algorithms can be used for both classification and regression with small modifications, for example decision trees and artificial neural networks.

The topic of this chapter is regression, but what are typical regression problems?

Typical regression problems are, for example, the prediction of

- house prices
- car prices
- exchange rates
- the price of shares

This chapter of our regression tutorial will start with the `LinearRegression` class of `sklearn`.

Yet, the bulk of this chapter will deal with the `MLPRegressor` model from `sklearn.neural_network`. It is a neural network model for regression problems. The name stands for multi-layer perceptron regressor. An MLP or multi-layer perceptron is an artificial neural network (ANN), which consists of a minimum of three layers:

- an input layer,
- one or more hidden layers and
- an output layer.
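The data flow through such a minimal network can be sketched as two matrix multiplications with a nonlinearity in between. The layer sizes and the random weights below are placeholders for illustration, not trained values:

```python
import numpy as np

rng = np.random.default_rng(0)
x_in = rng.normal(size=4)          # input layer: one sample with 4 features
W1 = rng.normal(size=(4, 5))       # weights: input -> hidden layer (5 neurons)
W2 = rng.normal(size=(5, 1))       # weights: hidden layer -> output

hidden = np.maximum(0, x_in @ W1)  # ReLU activation in the hidden layer
output = hidden @ W2               # regression output: one continuous value
print(output.shape)                # (1,)
```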

Yet, before we start with the `MLPRegressor`, we will have a look at simpler models.

### Simple Example with Linear Regression

`LinearRegression` from the `sklearn.linear_model` module is an ordinary least squares linear regression.

`LinearRegression` fits a linear model with coefficients $w = (w_1, …, w_p)$ to minimize the residual sum of squares between the observed targets in the dataset and the targets predicted by the linear approximation.
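The residual sum of squares mentioned here is easy to compute by hand; a minimal sketch with made-up numbers (not taken from the dataset used below):

```python
import numpy as np

y_true = np.array([3.0, 5.0, 7.0])  # hypothetical observed targets
y_pred = np.array([2.8, 5.1, 7.3])  # hypothetical predictions of a linear model

# residual sum of squares: sum of the squared differences
rss = np.sum((y_true - y_pred) ** 2)
print(rss)
```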

Linear regression is about finding a line of the form

$$y = a \cdot x + b$$

where $x$ is the explanatory variable and $y$ is the dependent variable. The slope of the line is $a$, and $b$ is the intercept, which corresponds to the value of $y$ if $x$ is equal to 0.

Our first example uses the data of the file `regr_example_data1.csv`. We will first load the data and separate it into the X and y values:

```
import numpy as np
data = np.loadtxt("data/regr_example_data1.csv", delimiter=",")
data[:4]
```

```
X = data[:, 0]
y = data[:, 1]
```

The data was artificially created by adding noise to the values of $4.5 \cdot x + 2.8$. This has been the "ideal" data, and we name it in the following plot as `y_opt`:
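The file itself is not shown here, but data of this kind can be produced by adding random noise to the ideal line. The noise level and the seed below are assumptions for illustration, not the values actually used to build the file:

```python
import numpy as np

rng = np.random.default_rng(42)
X_demo = np.sort(rng.uniform(0, 10, 50))  # 50 x values between 0 and 10
noise = rng.normal(0, 2, 50)              # Gaussian noise, assumed std of 2
y_demo = 4.5 * X_demo + 2.8 + noise       # ideal line plus noise
```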

```
import matplotlib.pyplot as plt
fig, ax = plt.subplots()
y_opt = 4.5 * X + 2.8
ax.scatter(X, y)
ax.plot(X, y_opt)
```

We will now use `LinearRegression` from `sklearn` to calculate a linear approximation. Of course, we pretend not to know the equation $4.5 \cdot x + 2.8$ with which we created the test data. We fit the values `X` to `y` by using the `fit` method:

```
from sklearn.linear_model import LinearRegression
X = X.reshape(X.shape[0], 1)
reg = LinearRegression().fit(X, y)
```

We can use the method `predict`, which uses the linear model to predict results. We can apply it to the X values to see if it works well:

```
y_predict = reg.predict(X)
y_predict[-10:]
```

```
help(reg.predict)
```

We also get the values for the intercept and the coefficient:

```
reg.intercept_, reg.coef_
```

This means we can write down the straight-line equation calculated by the regression and plot it together with the ideal line:

```
import matplotlib.pyplot as plt
fig, ax = plt.subplots()
y_opt = X * 4.5 + 2.8
y_reg = X * reg.coef_[0] + reg.intercept_
ax.scatter(X, y)
ax.scatter(X, y_predict, color="orange")
ax.plot(X, y_opt)
ax.plot(X, y_reg, color="green")
```

```
# reshape X from two-dimensional back to one-dimensional,
# because scipy's linregress expects 1-D arrays
X = X.reshape(-1)
```

```
from scipy.stats import linregress
slope, intercept, r_value, p_value, std_err = linregress(X, y)
```

We can visualize the data and the results:

```
import matplotlib.pyplot as plt
fig, ax = plt.subplots()
ax.scatter(X, y)
ax.plot(X, X*slope + intercept)
```

```
from scipy import stats
import numpy as np
# a second example: car prices depending on the year
data = np.loadtxt("data/car_prices_linear.txt")
years = data[:,0]
prices = data[:,1]
slope, intercept, r_value, p_value, std_err = stats.linregress(years, prices)
```

```
import matplotlib.pyplot as plt
fig, ax = plt.subplots()
ax.scatter(years, prices)
ax.plot(years, years*slope + intercept)
```

```
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn import datasets, linear_model, metrics
data_sets = train_test_split(years.reshape(-1, 1),
                             prices.reshape(-1, 1),
                             train_size=0.8,
                             test_size=0.2,
                             random_state=42)
train_data, test_data, train_targets, test_targets = data_sets
train_data[-5:], train_targets[-5:]
```

```
regr = linear_model.LinearRegression()
regr.fit(train_data, train_targets)
regr.predict(train_data)
```

```
train_targets[:20]
```

```
from scipy import stats
import numpy as np
data = np.loadtxt("data/car_prices.txt")
years = data[:, 0]
prices = data[:, 1]
data_sets = train_test_split(years.reshape(-1, 1),
                             prices,
                             train_size=0.8,
                             test_size=0.2,
                             random_state=42)
print(years.shape)
train_data, test_data, train_targets, test_targets = data_sets
```

```
import matplotlib.pyplot as plt
fig, ax = plt.subplots()
ax.scatter(years, prices)
```

```
from sklearn.neural_network import MLPRegressor
clf = MLPRegressor(solver='lbfgs',
                   alpha=1e-5,  # regularization, avoiding overfitting by penalizing large weights
                   hidden_layer_sizes=(5, 2),
                   random_state=24)
clf.fit(train_data, train_targets)
res = clf.predict(train_data)
res
```

An explanation of the parameters of `MLPRegressor` follows further down in this chapter.

```
predictions = clf.predict(train_data)
predictions[:10]
```

```
train_targets[:10]
```

```
import matplotlib.pyplot as plt
from sklearn.datasets import make_regression
data, targets = make_regression(n_samples=100,
                                n_features=1,
                                noise=0.1)
data[:5]
```

```
plt.scatter(data, targets)
plt.show()
```

```
import matplotlib.pyplot as plt
from sklearn.datasets import make_regression
data, targets = make_regression(n_samples=100,
                                n_features=3,
                                # shuffle=True,
                                noise=0.1)
data[:5]
```

```
import pandas as pd
data_df = pd.DataFrame(data)
data_df.insert(len(data_df.columns),
               column="result",
               value=targets)
data_df.columns = "blue", "green", "red", "result"
data_df.head(5)
```

```
from sklearn.model_selection import train_test_split
data_sets = train_test_split(data,
                             targets,
                             test_size=0.30,
                             random_state=42)
data_train, data_test, targets_train, targets_test = data_sets
clf = MLPRegressor(solver='lbfgs',         # 'lbfgs', 'sgd', 'adam' (default)
                   alpha=1e-5,             # regularization, avoiding overfitting by penalizing large weights
                   hidden_layer_sizes=(3, 1),
                   activation='logistic',  # 'identity', 'logistic', 'tanh', 'relu' (default)
                   max_iter=10000,
                   random_state=42)
clf.fit(data_train, targets_train)
clf.predict(data_train)
```

### The California Housing Dataset

Now, we will use a dataset derived from the 1990 U.S. census. The data is solely from California. The data is organized as one row per census block group. A block group is the smallest geographical unit for which the U.S. Census Bureau publishes sample data (a block group typically has a population of 600 to 3,000 people).

The dataset contains 20640 samples. Each sample consists of 8 numeric, predictive attributes and the target value.

The attributes are:

- MedInc: median income in block
- HouseAge: median house age in block
- AveRooms: average number of rooms
- AveBedrms: average number of bedrooms
- Population: block population
- AveOccup: average house occupancy
- Latitude: house block latitude
- Longitude: house block longitude

This dataset was obtained from the StatLib repository: http://lib.stat.cmu.edu/datasets/

It can be downloaded via a sklearn function:

`sklearn.datasets.fetch_california_housing`

We download the California housing dataset in the following Python code lines:

```
from sklearn.datasets import fetch_california_housing
dataset = fetch_california_housing()
```

Let us look at the feature names:

```
feature_names = dataset['feature_names']
print("Feature names: {}\n".format(feature_names))
```

As usual, `data` contains the data for the samples. Each line contains eight float values, which correspond to the features:

```
print("number of samples in the file (number of rows): ", dataset.data.shape[0])
print("number of features per row (columns): ", dataset.data.shape[1])
dataset.data[:4]
```

### A Closer Look at the California Housing Data

As we have already mentioned, each sample (row) of the data corresponds to a block (district). A block contains an unspecified number of houses. The second column (`index == 1`) contains the median house age of the block. We can filter the data by looking at the blocks with a median age of less than 10 years:

```
n = 10 # median house age in a block
# data where the houses in the block are less than 10 years old:
dataset.data[dataset.data[:,1]<n]
```

The target variable is the median house value for California districts:

```
dataset.target
```

```
import pandas as pd
data_df = pd.DataFrame(dataset.data)
data_df.columns = ["MedInc", "HouseAge", "AveRooms",
"AveBedrms", "Population", "AveOccup",
"Latitude", "Longitude"]
data_df.head(5)
```

We will now insert the house values (the target) into our DataFrame:

```
data_df.insert(loc=len(data_df.columns),
               column="AvePropVal",
               value=dataset.target.reshape(-1, 1))
```

```
data_df[:5]
```

There are blocks where the average number of rooms is more than a hundred, which is not very likely. At least the average number of bedrooms correlates:

```
data_df[data_df['AveRooms']>100]
```

We can assume that the samples (districts) where the average number of rooms is 12 or larger must be rubbish. So we keep only the data where the number is less than 12:

```
no_of_districts_before_cleansing = len(data_df)
data_df = data_df[data_df['AveRooms']<12]
```

```
print(no_of_districts_before_cleansing)
print("number of removed districts: ", no_of_districts_before_cleansing - len(data_df))
```

Now, we can have a look at the histogram of the cleansed data:

```
data_df['AveRooms'].hist()
```

We can apply similar reasoning to the number of people living in a house. Let's check the number of districts where `AveOccup` is greater than 13:

```
data_df[data_df['AveOccup']>13]
```

```
data_df['AveOccup'][data_df['AveOccup']<=13].hist()
```

Let's have a look at all the feature histograms:

```
data_df.hist(bins=50, figsize=(15,15))
plt.show()
```

```
from pandas.plotting import scatter_matrix
attributes = ['HouseAge', 'MedInc',
              'AveRooms', 'AveOccup']
scatter_matrix(data_df[attributes], figsize=(12,8));
```

```
# data_df has been filtered above, so align the target values via the remaining index
data_df.insert(len(data_df.columns),
               column="AveHouseValue",
               value=dataset.target[data_df.index])
```

```
data_df.plot(kind='scatter', x='Population', y='AveHouseValue',
             alpha=0.1, figsize=(8, 5))
```

```
from sklearn.model_selection import train_test_split
data_sets = train_test_split(dataset.data,
                             dataset.target,
                             test_size=0.30,
                             random_state=42)
data_train, data_test, targets_train, targets_test = data_sets
```

```
clf = MLPRegressor(solver='lbfgs',         # 'lbfgs', 'sgd', 'adam' (default)
                   alpha=1e-5,             # regularization, avoiding overfitting by penalizing large weights
                   hidden_layer_sizes=(10, 2),
                   activation='logistic',  # 'identity', 'logistic', 'tanh', 'relu' (default)
                   max_iter=10000,
                   random_state=42)
clf.fit(data_train, targets_train)
clf.predict(data_train)
```

Parameters:

`activation`: {'identity', 'logistic', 'tanh', 'relu'}, default='relu'. Activation function for the hidden layer.

- ‘identity’, no-op activation, useful to implement linear bottleneck, returns f(x) = x
- ‘logistic’, the logistic sigmoid function, returns f(x) = 1 / (1 + exp(-x)).
- ‘tanh’, the hyperbolic tan function, returns f(x) = tanh(x).
- ‘relu’, the rectified linear unit function, returns f(x) = max(0, x)
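These four functions are simple enough to implement directly with NumPy; a quick sketch to see their behavior on a few values:

```python
import numpy as np

def identity(x):
    return x

def logistic(x):
    return 1 / (1 + np.exp(-x))

def tanh(x):
    return np.tanh(x)

def relu(x):
    return np.maximum(0, x)

values = np.array([-2.0, 0.0, 2.0])
print(identity(values))  # [-2.  0.  2.]
print(logistic(values))
print(tanh(values))
print(relu(values))      # [0. 0. 2.]
```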

`solver`: {'lbfgs', 'sgd', 'adam'}, default='adam'. The solver for weight optimization.

- ‘lbfgs’ is an optimizer in the family of quasi-Newton methods.
- ‘sgd’ refers to stochastic gradient descent.
- ‘adam’ refers to a stochastic gradient-based optimizer proposed by Kingma, Diederik, and Jimmy Ba

**Note:** The default solver 'adam' works pretty well on relatively large datasets (with thousands of training samples or more) in terms of both training time and validation score. For small datasets, however, 'lbfgs' can converge faster and perform better.
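This difference between the solvers can be checked on a small synthetic dataset. The network size, the seeds, and the data below are arbitrary choices for this sketch, not recommendations:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.neural_network import MLPRegressor

# a small synthetic regression problem
X_syn, y_syn = make_regression(n_samples=200, n_features=3,
                               noise=0.1, random_state=0)

scores = {}
for solver in ('lbfgs', 'adam'):
    reg = MLPRegressor(solver=solver,
                       hidden_layer_sizes=(10,),
                       max_iter=10000,
                       random_state=0)
    reg.fit(X_syn, y_syn)
    scores[solver] = reg.score(X_syn, y_syn)  # R^2 on the training data
print(scores)
```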

```
max_ave_rooms = 12
# column with index 2 corresponds to average number of rooms
shape = dataset.data.shape
cleansed_shape = dataset.data[dataset.data[:,2] <= max_ave_rooms].shape
print(shape, cleansed_shape)
n_outliers = shape[0]-cleansed_shape[0]
print(f"Number of outliers, more than {max_ave_rooms} rooms: {n_outliers}")
```

Let us remove all data with an average number of rooms greater than `max_ave_rooms`:

```
x = dataset.data[:,2] <= max_ave_rooms # Boolean array
data = dataset.data[x]
targets = dataset.target[x]
data.shape, targets.shape
```

Before we go on like this, let us have a look at the statistics for each feature:

```
data_df.describe()
```

```
# "MedInc", "HouseAge", "AveRooms", "AveBedrms", "Population", "AveOccup", "Latitude", "Longitude"
x = data[:,5] <= 10 # AveOccup
data = data[x]
targets = targets[x]
data.shape, targets.shape
```

```
avebedrms_index = 3
np.min(data[:, avebedrms_index]), np.max(data[:, avebedrms_index])
```

```
np.max(dataset.data[:, 3])
```

The outliers for this feature have already disappeared due to the previous cleaning actions.

```
from sklearn.model_selection import train_test_split
data_sets = train_test_split(data,
                             targets,
                             test_size=0.30,
                             random_state=42)
data_train, data_test, targets_train, targets_test = data_sets
```

```
from sklearn.model_selection import train_test_split
data_sets = train_test_split(dataset.data,
                             dataset.target,
                             test_size=0.30,
                             random_state=42)
data_train2, data_test2, targets_train2, targets_test2 = data_sets
```

```
data_train.shape, data_train2.shape
```

```
clf = MLPRegressor(solver='lbfgs',         # 'lbfgs', 'sgd', 'adam' (default)
                   alpha=1e-5,             # regularization, avoiding overfitting by penalizing large weights
                   hidden_layer_sizes=(10, 2),
                   activation='logistic',  # 'identity', 'logistic', 'tanh', 'relu' (default)
                   max_iter=10000,
                   random_state=42)
clf.fit(data_train, targets_train)
print(clf.score(data_train, targets_train))
print(clf.score(data_test, targets_test))
```

We can see that evaluating on the uncleansed data gives us slightly worse results:

```
clf = MLPRegressor(solver='lbfgs',         # 'lbfgs', 'sgd', 'adam' (default)
                   alpha=1e-5,             # regularization, avoiding overfitting by penalizing large weights
                   hidden_layer_sizes=(10, 2),
                   activation='logistic',  # 'identity', 'logistic', 'tanh', 'relu' (default)
                   max_iter=10000,
                   random_state=42)
clf.fit(data_train, targets_train)
print(clf.score(data_train2, targets_train2))
print(clf.score(data_test2, targets_test2))
```

```
from sklearn import preprocessing
data_scaled = preprocessing.scale(data)
data_scaled.shape
```

```
data_scaled[:5]
```

```
data[:5]
```

```
from sklearn.preprocessing import PolynomialFeatures
```

Polynomial Features:

`PolynomialFeatures` generates a new feature matrix consisting of all polynomial combinations of the features with degree less than or equal to the specified degree. For example, if an input sample is two-dimensional and of the form $[a, b]$, the degree-2 polynomial features are $[1, a, b, a^2, ab, b^2]$.

```
import numpy as np
from sklearn.preprocessing import PolynomialFeatures
X = np.arange(6).reshape(3, 2)
X
```

```
poly = PolynomialFeatures(degree=2)
poly.fit_transform(X)
```

If the `interaction_only` parameter is set to `True`, only interaction features are produced: features that are products of at most `degree` distinct input features (so not `x[1] ** 2`, `x[0] * x[2] ** 3`, etc.).

```
poly = PolynomialFeatures(interaction_only=True)
poly.fit_transform(X)
```

Let's get back to our housing data:

```
pft = PolynomialFeatures(degree=2)
data_poly = pft.fit_transform(data_scaled)
data_poly
```

```
from sklearn.model_selection import train_test_split
data_sets = train_test_split(data_poly,
                             targets,
                             test_size=0.30,
                             random_state=42)
data_train, data_test, targets_train, targets_test = data_sets
```

```
clf = MLPRegressor(solver='lbfgs',     # 'lbfgs', 'sgd', 'adam' (default)
                   alpha=1e-5,         # regularization, avoiding overfitting by penalizing large weights
                   hidden_layer_sizes=(5, 2),
                   activation='relu',  # 'identity', 'logistic', 'tanh', 'relu' (default)
                   max_iter=10000,
                   early_stopping=True,
                   random_state=42)
clf.fit(data_train, targets_train)
print(clf.score(data_train, targets_train))
print(clf.score(data_test, targets_test))
```

```
clf.predict(data_train)
```

```
targets_train
```