Random forest models for the prediction of biome types and climate variables

Tip
The paper is available here.

Reflection

This project taught me a lot about selecting training and testing datasets appropriately in statistical learning: a model can only ever be as good as the data it is trained on. It was also one of my first times working with geographical, grid-based data, which was interesting in its own right. Since all of the data came directly from a model, it is also important to know the limits of the original model that provided it. Sometimes the problem is not the classifier or regression model, but simply that we did not have enough, or the right, information to distinguish the classes in the first place.

This project strengthened my skills in building basic random forest pipelines, from data partitioning and preprocessing to hyperparameter tuning, performance evaluation, and model interpretation, all within the context of environmental and climate data.

Summary

In this project, I developed random forest models to predict biome classes (both binary and multi-class) as well as two continuous climate variables: vegetation carbon pool (VegC) and net primary productivity (NPP). To address class imbalance, I used SMOTE up-sampling, which notably improved recall for underrepresented classes. I tuned the models' hyperparameters with grid-search cross-validation. Performance was evaluated using accuracy, precision, recall, and F₁ score for the classifiers, and RMSE for the regressors. Finally, I analyzed feature importance to better understand which climate variables, such as seasonal precipitation or extreme temperatures, were driving the model predictions.
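To make the SMOTE step more concrete, here is a minimal, standard-library-only sketch of the core idea: new minority-class samples are placed at random positions on the line between an existing minority sample and one of its nearest minority-class neighbours. This is a simplified illustration, not the code used in the project. The function name `smote_oversample`, its parameters, and the toy feature values are all assumptions made for the example, and a library implementation would normally be used in practice.

```python
import math
import random


def smote_oversample(minority, n_new, k=5, seed=0):
    """Create n_new synthetic points by interpolating between minority-class
    samples and their k nearest minority-class neighbours (simplified SMOTE)."""
    if len(minority) < 2:
        raise ValueError("need at least two minority samples")
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)
    synthetic = []
    for _ in range(n_new):
        # Pick a random minority sample as the base point.
        i = rng.randrange(len(minority))
        base = minority[i]
        # Rank the other minority samples by Euclidean distance to the base.
        others = [p for j, p in enumerate(minority) if j != i]
        others.sort(key=lambda p: math.dist(base, p))
        neighbour = rng.choice(others[:k])
        # Place the synthetic point somewhere on the segment base -> neighbour.
        gap = rng.random()
        synthetic.append(tuple(b + gap * (n - b) for b, n in zip(base, neighbour)))
    return synthetic


# Hypothetical toy data: (winter temperature in °C, autumn precipitation in mm).
minority = [(2.0, 180.0), (3.5, 210.0), (1.0, 160.0), (2.8, 195.0)]
new_points = smote_oversample(minority, n_new=6, k=2)
for point in new_points:
    print(point)
```

Because the synthetic points only interpolate between existing samples, they should be generated from the training split alone, so that no information from the test data leaks into training.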

Results

The binary biome classifier achieved up to 85.7% accuracy after SMOTE resampling, effectively distinguishing between temperate deciduous and mixed forests, with winter temperature and autumn precipitation emerging as key predictors. The multi-class classifier reached a weighted F₁ score of around 0.65, although it struggled to separate closely related biomes, which suggests that the continuous transitions between biomes are hard for a classifier to resolve. The regression models performed well overall, but revealed spatial biases around coastal and desert areas, suggesting the need to account for additional local processes like soil variability or ocean influences.
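For readers unfamiliar with the weighted F₁ score, the sketch below shows how it can be computed by hand: each class gets its own F₁ score, and the scores are averaged with weights proportional to how often each class occurs in the true labels. The function name `weighted_f1` and the labels are hypothetical and are not results from the project. The calculation is intended to mirror what scikit-learn's `f1_score(..., average="weighted")` reports.

```python
from collections import Counter


def weighted_f1(y_true, y_pred):
    """Support-weighted mean of the per-class F1 scores."""
    support = Counter(y_true)
    total = len(y_true)
    score = 0.0
    for label, n_true in support.items():
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = n_true - tp
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        # Weight each class by its share of the true labels.
        score += (n_true / total) * f1
    return score


# Hypothetical labels for illustration only.
y_true = ["forest", "forest", "forest", "forest", "tundra", "tundra", "desert", "desert"]
y_pred = ["forest", "forest", "forest", "tundra", "tundra", "forest", "desert", "desert"]
print(round(weighted_f1(y_true, y_pred), 3))
```

Because the weights follow class frequency, a high weighted F₁ can still hide poor performance on rare biomes, which is one reason to inspect per-class precision and recall as well.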