📈 Ensemble Learning

This semester the time had finally come: for the first time, I had the opportunity to work with a machine learning algorithm. We were assigned a lab with various tasks around decision trees and AdaBoost.

I have completed the Jupyter Notebook and uploaded it to Paperspace:

Paperspace
Cloud Machine Learning, AI, and effortless GPU infrastructure

We had to implement the algorithms from scratch and compare them with the sklearn implementations. Coding this was a lot of fun; however, I was wondering how AdaBoost would perform on real data.

AdaBoost

AdaBoost is a machine learning algorithm in the category of ensemble learning. Ensembles combine multiple complementary classifiers to increase their predictive performance: many weak learners, each only slightly better than guessing, are combined into one strong classifier. AdaBoost is also easy to implement, even without libraries.

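The algorithm generates its hypotheses by successively reweighting the training examples: after each round, the examples that the current weak learner misclassified get a larger weight, so the next weak learner concentrates on them. The final prediction is a weighted vote of all weak learners.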
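The reweighting idea can be sketched in a few lines of NumPy. This is a minimal illustration of discrete AdaBoost with decision stumps, assuming labels in {-1, +1}; the helper names (`fit_stump`, `stump_predict`, `adaboost_fit`, `adaboost_predict`) and the toy data are made up for this example and are not the code from my lab notebook.

```python
import numpy as np

def stump_predict(X, j, thr, pol):
    # A stump predicts +1 on one side of the threshold and -1 on the other.
    return np.where(pol * (X[:, j] - thr) >= 0, 1, -1)

def fit_stump(X, y, w):
    # Exhaustively pick the stump with the lowest weighted error.
    best = (0, 0.0, 1, np.inf)
    for j in range(X.shape[1]):
        for thr in np.unique(X[:, j]):
            for pol in (1, -1):
                err = np.sum(w[stump_predict(X, j, thr, pol) != y])
                if err < best[3]:
                    best = (j, thr, pol, err)
    return best

def adaboost_fit(X, y, n_rounds):
    w = np.full(len(y), 1.0 / len(y))  # start with uniform example weights
    ensemble = []
    for _ in range(n_rounds):
        j, thr, pol, err = fit_stump(X, y, w)
        err = np.clip(err, 1e-10, 1 - 1e-10)   # avoid division by zero
        alpha = 0.5 * np.log((1 - err) / err)  # vote weight of this stump
        pred = stump_predict(X, j, thr, pol)
        w = w * np.exp(-alpha * y * pred)      # misclassified examples gain weight
        w = w / w.sum()
        ensemble.append((alpha, j, thr, pol))
    return ensemble

def adaboost_predict(X, ensemble):
    # Weighted vote of all stumps.
    score = sum(a * stump_predict(X, j, thr, pol) for a, j, thr, pol in ensemble)
    return np.where(score >= 0, 1, -1)

# Toy data (made up): no single threshold separates the two classes.
X = np.array([[0.0], [1.0], [2.0], [3.0], [4.0], [5.0]])
y = np.array([1, 1, -1, -1, 1, 1])

model = adaboost_fit(X, y, n_rounds=3)
single_acc = np.mean(adaboost_predict(X, model[:1]) == y)
train_acc = np.mean(adaboost_predict(X, model) == y)
print(single_acc, train_acc)
```

On this toy set, the first stump alone gets two of the six points wrong, while the weighted vote of three stumps classifies all six correctly.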

Playing around

Python

As in the assignment, I used scikit-learn (sklearn), a free machine learning library for the Python programming language. pandas and NumPy are also needed.

import pandas as pd
import numpy as np
from collections import defaultdict
from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.preprocessing import LabelEncoder
from sklearn.metrics import accuracy_score

Data

As mentioned before, I wanted to apply the AdaBoost algorithm to real data. The European Social Survey publishes a huge collection of survey data.

European Social Survey | European Social Survey (ESS)

Participants may not answer all questions or may be indifferent, and some columns contain various data types. This is why the data set needs to be cleaned up. However, since I am too lazy to read the documentation for all 3000 columns, I just loaded the CSV file as a data frame, dropped the rows where the target column is missing, and removed all columns with mixed or non-numeric data types.

target_col = 'gndr'
df = pd.read_csv('ESS1-8e01.csv', low_memory=False)
df = df.dropna(subset=[target_col])
df = df.select_dtypes(exclude=['object'])
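To see what these two cleaning steps do, here is a small sketch on a made-up frame. The column names `age` and `answer` and all values are hypothetical stand-ins; only `gndr` comes from the code above.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for the survey export.
toy = pd.DataFrame({
    "gndr": [1, 2, np.nan, 2],          # target column with one missing value
    "age": [34, 51, 28, 45],            # purely numeric column
    "answer": [1, "refused", 3, 4],     # mixed int/str values -> dtype object
})

# Drop rows without a target, then drop every object-typed column.
cleaned = toy.dropna(subset=["gndr"]).select_dtypes(exclude=["object"])

print(cleaned.columns.tolist())
print(len(cleaned))
```

The row without a target value and the mixed-type column disappear; everything numeric stays, including any survey-specific "no answer" codes, which this quick cleanup does not touch.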

By splitting the data frame randomly, I can use one part (80 %) to train the AdaBoost algorithm and the other part (20 %) to evaluate the accuracy of my algorithm.

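np.random.seed(42)
df = df.sample(frac=1)  # shuffle all rows
split = int(df.shape[0] / 100 * 20)  # the first 20 % of the shuffled rows form the test set
dat_test = df.iloc[:split]
dat_train = df.iloc[split:]  # start at `split` (not `split + 1`) so no row is silently dropped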
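As an alternative to slicing by hand, scikit-learn's `train_test_split` does the shuffle and the split in one call. The sketch below uses a made-up toy frame instead of the survey data; the `stratify` argument is an optional extra I did not use above, and it keeps the class proportions equal in both parts.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy frame: 100 rows, two balanced classes.
toy = pd.DataFrame({"gndr": np.tile([1, 2], 50), "x": np.arange(100)})

dat_train, dat_test = train_test_split(
    toy,
    test_size=0.2,          # 20 % of the rows go to the test set
    random_state=42,        # reproducible shuffle
    stratify=toy["gndr"],   # same class ratio in train and test
)

print(len(dat_train), len(dat_test))
```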

Finally, I prepare my data frames for sklearn and let it crunch through those numbers.

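d = defaultdict(LabelEncoder)
# Fit one encoder per column on the full frame, so a value gets the same code in train and test
for col in df.columns:
    d[col].fit(df[col])
dat_train_encoded = dat_train.apply(lambda x: d[x.name].transform(x))
dat_test_encoded = dat_test.apply(lambda x: d[x.name].transform(x))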

x_train = dat_train_encoded[dat_train_encoded.columns.difference([target_col])]
y_train = dat_train_encoded[target_col]

x_test = dat_test_encoded[dat_test_encoded.columns.difference([target_col])]
y_test = dat_test_encoded[target_col]

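# Decision stumps (depth-1 trees) serve as weak learners.
# Note: scikit-learn 1.2 renamed `base_estimator` to `estimator`; recent versions only accept the new name.
abc = AdaBoostClassifier(
    base_estimator=DecisionTreeClassifier(criterion="gini", max_depth=1),
    n_estimators=1000,
)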
model = abc.fit(x_train, y_train)
predictions = model.predict(x_test)

print(accuracy_score(y_test, predictions))

After a while the Python script completes and outputs an accuracy score of 0.84.
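To put a score like this into context, it can help to compare it with a majority-class baseline, i.e. a "model" that always predicts the most frequent label. The sketch below uses scikit-learn's `DummyClassifier` on made-up labels, so the number it prints says nothing about the survey data; with the real data one would pass `x_train`, `y_train`, `x_test` and `y_test` from above instead.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score

# Hypothetical labels: class 1 is slightly more frequent than class 0.
y_train_toy = np.array([0] * 46 + [1] * 54)
y_test_toy = np.array([0] * 9 + [1] * 11)

# The dummy model ignores the features, so placeholder columns are enough.
x_train_toy = np.zeros((len(y_train_toy), 1))
x_test_toy = np.zeros((len(y_test_toy), 1))

baseline = DummyClassifier(strategy="most_frequent").fit(x_train_toy, y_train_toy)
baseline_acc = accuracy_score(y_test_toy, baseline.predict(x_test_toy))
print(baseline_acc)
```

If the boosted model does not clearly beat this baseline on the same test set, the ensemble has not learned much beyond the class frequencies.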