Continuing on from the discussion about needing beefier computation, sometimes
throwing more money and power at the problem isn't the answer. It rarely is,
and even though I prefer to take the time to find more optimal and efficient
solutions, in the real world I only have so much time and so many resources to
get a problem solved.

Following a procedure #

Coming from a more classical statistics background, I am always a bit behind
on the newer advances in machine learning, especially ones that require a more
technical background to approach. It is always best to start off with more
basic (and fast) algorithms to establish a baseline before diving deep into
models that take hours to train. Following a routine of basic linear
regression first, then basic tree models, then more complicated ensembles
makes for easy-to-follow analysis. That way, if you have a quick-to-train
linear model that performs better than the more advanced models, you won't
have wasted time.

The other benefit of trying out the older, better-understood models is that it
helps you develop an intuition for your data. The dark art of model building
and choosing what fits best comes from practice, from doing it over and over
again. Which brings me to the main topic: LightGBM.

A new challenger #

XGBoost came onto the scene a while back and drew many followers by winning
Kaggle competitions. Without going too deep into the technical bits of
gradient boosted frameworks, XGBoost builds an ensemble of decision trees
sequentially, parallelising the construction of each tree to speed up
training.

I used XGBoost within the h2o machine learning engine, which makes it easy to
set up and test multiple models. But I quickly found that it became very
memory intensive, so I looked at other algorithms.

LightGBM offers better memory management, and a faster algorithm thanks to its
leaf-wise growth strategy, which controls the number of leaves and the depth
of the trees that are grown.
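As a sketch, that leaf-wise growth is steered through a handful of parameters
(`num_leaves`, `max_depth`, and `min_data_in_leaf` are the documented LightGBM
names; the values below are purely illustrative, not tuned):

```r
# Illustrative LightGBM parameter list for controlling leaf-wise growth
# (values are made up for demonstration, not recommendations)
params <- list(
  objective        = "regression",
  metric           = "rmse",
  num_leaves       = 50,  # caps the number of leaves per tree
  max_depth        = -1,  # -1 leaves depth unconstrained; set a value to limit it
  min_data_in_leaf = 20   # stops leaves being grown on too few rows
)
```

Since the trees grow best-leaf-first rather than level by level, `num_leaves`
does most of the work that `max_depth` does in level-wise frameworks.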

R and LightGBM #

Compiler set up #

    # for linux
    sudo apt-get install cmake
    # for os x
    brew install cmake
    brew install gcc --without-multilib

Then to install LightGBM as an R package:

    git clone --recursive https://github.com/microsoft/LightGBM
    cd LightGBM/R-package

    # for OSX
    export CXX=g++-7 CC=gcc-7

    R CMD INSTALL --build . --no-multiarch

Organising the dataset #

As usual, each algorithm wants specific formats for the input data, usually in
the form of a matrix (sparse or not).

    categoricals = c('V1', 'V2', 'V3')
    dtrain <- lgb.Dataset(data.matrix(train[, x_features, with = FALSE]),
                          categorical_feature = categoricals,
                          label = train$target)

With your dataset partitioned into train and test dataframes/data.tables, the
simple way is to convert them into a matrix. This creates an lgb.Dataset
object holding the training data.
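If you want early stopping or per-round monitoring, a validation set can be
built the same way; `lgb.Dataset.create.valid` from the lightgbm R package
ties it to the training dataset so the same feature binning is reused. Here
`valid` is a hypothetical hold-out data.table, and `x_features` mirrors the
name used above:

```r
# Sketch: build a validation lgb.Dataset tied to dtrain so the binning and
# categorical encodings match the training data ('valid' is a hypothetical
# hold-out split, not something defined earlier in this post)
dvalid <- lgb.Dataset.create.valid(dtrain,
                                   data  = data.matrix(valid[, x_features, with = FALSE]),
                                   label = valid$target)
```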

Building the model #

    # params holds the objective/metric settings, e.g.
    # params <- list(objective = "regression", metric = "rmse")
    system.time(
      model <- lgb.train(params,
                         data = dtrain,
                         nrounds = 5500,
                         num_leaves = 50,
                         learning_rate = 0.1,
                         max_bin = 550,
                         verbose = 0)
    )
    #      user    system   elapsed
    # 24761.032     5.844  2107.373

    preds <- predict(model, data.matrix(test[, x_features, with = FALSE]))

I have included the computation numbers here as a comparison point for other
models. The training data is 33m rows, which ends up being the largest
dataset I've ever worked with. The size itself was the issue for me, which
led me to use LightGBM.

Comparison #

A semi-tuned gbm model gave me RMSE results of around 4, which is the number
I've been working towards. The issue arose when I wanted to tune xgb models
with h2o, which took around 2 hours to train a model. Of course, part of that
is me not really knowing what I'm doing. Being limited in computational power,
I didn't know how to properly set up a distributed xgb model with h2o, and I'm
not sure that is even possible with the current versions. This left me unable
to properly figure out the optimal parameters for the model.
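For reference, the RMSE figure I keep comparing against is just a couple of
lines of base R (the numbers in this toy check are made up, not actual
predictions):

```r
# RMSE helper used to compare models (plain base R, no packages needed)
rmse <- function(preds, actual) sqrt(mean((preds - actual)^2))

# toy check with made-up numbers: errors are 0, 0, 2
rmse(c(1, 2, 3), c(1, 2, 5))  # sqrt(4/3), about 1.155
```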

To me, LightGBM straight out of the box is easier to set up and iterate on,
which makes tuning the model easier too. I can train models on my laptop with
8GB of RAM, with 2000 iteration rounds, something I could never have done with
h2o and xgb.

Of course, there is still a lot of other tuning and investigation that needs
to be done, but that's for next time.
