This semester, the time had finally come: for the first time, I had the opportunity to work with a machine learning algorithm. We were assigned a lab with various tasks around decision trees and AdaBoost.
I have completed the Jupyter Notebook and uploaded it to Paperspace:
We had to implement the algorithms from scratch and compare them to the sklearn implementations. Coding this was a lot of fun; however, I was wondering how AdaBoost would perform on real data.
AdaBoost is a machine learning algorithm from the family of ensemble methods. An ensemble combines multiple complementary classifiers to increase predictive performance: many weak learners are combined into one strong classifier. AdaBoost is also easy to implement, even without libraries.
The algorithm generates hypotheses by successively reweighting the training examples: after each round, the examples that the current weak learner misclassified receive more weight, so the next learner concentrates on them.
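The reweighting loop can be sketched in a few lines. This is only a minimal sketch and not the notebook from the lab: it assumes labels in {-1, +1}, borrows sklearn's depth-1 tree as the weak learner, and the names `adaboost_fit` and `adaboost_predict` are made up for illustration.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier


def adaboost_fit(X, y, n_rounds=20):
    """Discrete AdaBoost for labels in {-1, +1}; returns (stumps, alphas)."""
    n = len(y)
    w = np.full(n, 1.0 / n)  # start with uniform example weights
    stumps, alphas = [], []
    for _ in range(n_rounds):
        stump = DecisionTreeClassifier(max_depth=1)
        stump.fit(X, y, sample_weight=w)
        pred = stump.predict(X)
        err = np.sum(w[pred != y])  # weighted training error of this stump
        err = float(np.clip(err, 1e-10, 1 - 1e-10))  # guard against log(0)
        alpha = 0.5 * np.log((1 - err) / err)  # vote strength of this stump
        w = w * np.exp(-alpha * y * pred)  # misclassified examples gain weight
        w = w / w.sum()  # renormalize to a distribution
        stumps.append(stump)
        alphas.append(alpha)
    return stumps, alphas


def adaboost_predict(X, stumps, alphas):
    """Weighted majority vote of all stumps."""
    votes = sum(a * s.predict(X) for s, a in zip(stumps, alphas))
    return np.where(votes >= 0, 1, -1)


# Toy data that no single stump can separate: the -1 class sits in the middle
X_toy = np.array([[0], [1], [2], [3], [4], [5]], dtype=float)
y_toy = np.array([1, 1, -1, -1, 1, 1])
stumps, alphas = adaboost_fit(X_toy, y_toy, n_rounds=20)
print((adaboost_predict(X_toy, stumps, alphas) == y_toy).mean())
```

Each stump on its own gets at least two of the six toy points wrong; only the weighted vote of several stumps can carve out the middle interval.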
As in the assignment, I used scikit-learn (sklearn), a free machine learning library for the Python programming language. pandas and NumPy are also needed.
import pandas as pd
import numpy as np
from collections import defaultdict
from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.preprocessing import LabelEncoder
from sklearn.metrics import accuracy_score
As mentioned before, I wanted to apply the AdaBoost algorithm to real data. The European Social Survey offers a huge collection of survey data.
Participants may skip questions or be indifferent, and some columns contain mixed data types, so the data set needs to be cleaned up. Since I am too lazy to read the documentation for all 3000 columns, I just loaded the CSV file as a data frame, dropped the rows without a value in the target column, and removed all columns with mixed (non-numeric) data types.
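target_col = 'gndr'
df = pd.read_csv('ESS1-8e01.csv', low_memory=False)
# Drop rows where the target value is missing
df = df.dropna(subset=[target_col])
# Keep only numeric columns; mixed-type columns are loaded as dtype object
df = df.select_dtypes(exclude=['object'])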
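To see what these two cleaning calls do, here is a small sketch on a made-up frame. The columns `agea` and `mixed` and all values are hypothetical stand-ins, not taken from the survey file.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for the survey data
toy = pd.DataFrame({
    "gndr": [1, 2, np.nan, 2],                          # target with one missing answer
    "agea": [34, 51, 29, np.nan],                       # numeric column with a gap
    "mixed": pd.Series([1, "a", 2, "b"], dtype=object), # mixed types, stored as object
})

toy = toy.dropna(subset=["gndr"])            # removes only rows with a missing target
toy = toy.select_dtypes(exclude=["object"])  # removes the object-typed column

print(toy.columns.tolist())
print(len(toy))
```

Note that the gap in `agea` survives: `dropna(subset=...)` only looks at the listed column, so missing values in the features are still present afterwards.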
By shuffling the data frame and splitting it, I can use 80% of the rows to train the AdaBoost algorithm and the remaining 20% to evaluate the accuracy of the model.
np.random.seed(42)
df = df.sample(frac=1)  # shuffle all rows
split = int(df.shape[0] / 100 * 20)  # the first 20% of the rows form the test set
dat_test = df.iloc[0:split]
dat_train = df.iloc[split:]  # the remaining 80% are used for training
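The same kind of split can also be done with sklearn's `train_test_split`, which shuffles and slices in one call. The sketch below runs on a made-up frame (the column `x` and the values are assumptions); `stratify` is optional and keeps the class proportions of the target equal in both parts.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy frame: ten rows, two balanced target classes
toy = pd.DataFrame({"x": np.arange(10), "gndr": [1, 2] * 5})

toy_train, toy_test = train_test_split(
    toy,
    test_size=0.2,         # 20% of the rows go to the test set
    random_state=42,       # fixed seed for a reproducible shuffle
    stratify=toy["gndr"],  # keep the class balance in both parts
)

print(len(toy_train), len(toy_test))
```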
Finally, I prepare my data frames for sklearn and let it crunch through those numbers.
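# One LabelEncoder per column, created on first access
d = defaultdict(LabelEncoder)

# Fit every encoder on the train and test values together,
# so that both sets are encoded with the same mapping
dat_all = pd.concat([dat_train, dat_test])
for col in dat_all.columns:
    d[col].fit(dat_all[col])
dat_train_encoded = dat_train.apply(lambda x: d[x.name].transform(x))
dat_test_encoded = dat_test.apply(lambda x: d[x.name].transform(x))

# Separate the features from the target
x_train = dat_train_encoded[dat_train_encoded.columns.difference([target_col])]
y_train = dat_train_encoded[target_col]
x_test = dat_test_encoded[dat_test_encoded.columns.difference([target_col])]
y_test = dat_test_encoded[target_col]

# Decision stumps (depth-1 trees) serve as weak learners.
# scikit-learn >= 1.2 calls this parameter `estimator`; older versions use `base_estimator`.
abc = AdaBoostClassifier(estimator=DecisionTreeClassifier(criterion="gini", max_depth=1), n_estimators=1000)
model = abc.fit(x_train, y_train)
predictions = model.predict(x_test)
print(accuracy_score(y_test, predictions))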
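One detail worth keeping in mind when label-encoding training and test data: a `LabelEncoder` numbers the sorted distinct values it was fitted on, so an encoder fitted separately on each set can assign different codes to the same value. A small sketch with made-up values:

```python
from sklearn.preprocessing import LabelEncoder

train_vals = [10, 20, 30]  # hypothetical answers seen in the training set
test_vals = [20, 30]       # the test set happens to lack the value 10

# Encoder fitted once on the training values, then reused
shared = LabelEncoder().fit(train_vals).transform(test_vals)

# Encoder fitted again on the test values alone
refit = LabelEncoder().fit_transform(test_vals)

print(shared.tolist())  # codes under the training mapping
print(refit.tolist())   # codes under the refitted mapping
```

Here the value 20 is encoded as 1 under the shared mapping but as 0 after refitting, which would silently shift a feature between training and evaluation.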
After a while the Python script completes and outputs an accuracy score of 0.84.
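An accuracy score is easier to judge next to a trivial baseline, such as always predicting the most frequent class. The sketch below shows how such a baseline could be computed with sklearn's `DummyClassifier`; the labels are made up, so the printed number says nothing about the survey data.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score

# Hypothetical labels; the features are ignored by the dummy model
y_train_toy = np.array([1, 1, 1, 2, 2])
y_test_toy = np.array([1, 2, 1, 1])
X_train_toy = np.zeros((len(y_train_toy), 1))
X_test_toy = np.zeros((len(y_test_toy), 1))

# Always predicts the most frequent training label
baseline = DummyClassifier(strategy="most_frequent").fit(X_train_toy, y_train_toy)
baseline_acc = accuracy_score(y_test_toy, baseline.predict(X_test_toy))
print(baseline_acc)
```

On the real data, the same two lines with `x_train`, `y_train`, `x_test` and `y_test` would give the number to compare the AdaBoost score against.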