Billy Fung


newcomer to the scene

Billy Fung / 2017-09-09

Continuing on from the discussion about needing beefier computation, sometimes throwing more money and power at the problem isn’t the answer. Well it rarely is, and even though I prefer to take time to find more optimal and efficient solutions, in the real world I only have so much time and resources to get a problem solved.

Following a procedure

Coming from a more classical statistics background, my view of the more recent advances in machine learning is that I am always a bit behind in the newer advancements, especially ones which require a more technical background to approach. It is always best to start off with more basic (and fast) algorithms to establish a baseline before diving deep into models that take hours to train. Following a routine of basic linear regression first, and then basic tree models, and then more complicated mixtures allows for easy to follow analysis. This way, if you have a linear model that is quick to train that performs better than more advanced models, you won’t be wasting time.

The other benefit of trying out the older and more understood models is that it helps to develop an intuition into your data. The dark art of model building and choosing what fits best comes from practice and doing it over and over again. Which brings me to the main topic of LightGBM.

A new challenger

XGBoost came into the scene awhile back and drew many followers through winning Kaggle competitions. Without going too deep into the technical bits of gradient boosted frameworks, XGboost makes usage of trees that run in parallel to perform decisions.

With XGBoost, I used it within the h2o machine learning engine, which make it easy to set up and test multiple models. But I quickly found that it became very memory intensive, so I looked to other algorithms.

LightGBM offers better memory management, and a faster algorithm due to the “pruning of leaves” to manage the number and depth of trees that are grown.

R and LightGBM

Compiler set up

    # for linux
    sudo apt-get install cmake
    # for os x
    brew install cmake
    brew install gcc --without-multilib

Then to install LightGBM as a R-package “` git clone -recursive cd LightGBM/R-package

export CXX=g++-7 CC=gcc-7 # for OSX

R CMD INSTALL -build . -no-multiarch “`

Organising the dataset

As usual, each algorithm wants specific formats for the input data, usually in the form of a matrix (sparse or not).

    categoricals = c('V1', 'V2', 'V3')
    dtrain <- lgb.Dataset((data.matrix(train[,x_features, with=FALSE])),
                         categorical_feature = categoricals,
                         label = train$target)

With your dataset partitioned into train and test dataframes/datatables, the simple way is to convert into a matrix. This will create a lgb dataset object with the training data.

Building the model

    model <- lgb.train(params,
                       nrounds = 5500,
                       num_leaves = 50,
                       learning_rate = 0.1,
                       max_bin = 550,
                       verbose = 0
         user    system   elapsed
    24761.032     5.844  2107.373
    preds <- predict(model, data.matrix(test[,x_features, with=FALSE]))

I have included the computation numbers here as a comparison for other models. The training data is 33m rows, which ends up being the largest dataset I’ve ever worked out. The size itself was the issue for me, which lead me to use LightGBM.


A semi-tuned gbm model gave me RMSE results of around 4, which is the number I’ve been working towards. The issue arose when I wanted to tune xgb models with h2o, which took around 2 hours to train a model. Of course, there will be a factor of me not knowing what I’m really doing. Being limited in computational power, I didn’t know how to properly set up a distributed xgb model with h2o, which I’m not sure is even possible with the current versions. This lead me to not be able to properly figure out what the optimal parameters for the model are.

To me, LightGBM straight out of the box is easier to set up, and iterate. Which makes it easier do tuning and iteration of the model. I can train models on my laptop with 8gb of RAM, with 2000 iteration rounds. This I could’ve never done with h2o and xgb.

Of course, there is still a lot of other tuning and investigation that needs to be done, but that’s for next time.