How to build your first ensemble

If you work in AI, data science, research, or a similar field (or are simply interested in those areas), one of the best things you can do to improve your results is to use ensembles in your process. That has been well established for a long time. This blog post will show you how to build your first ensemble.

Our finished result will look like this. Here are the error rates (root mean squared error, RMSE) for the three models we will use; smaller is better:

> results
$Linear
[1] 6.108005

$Trees
[1] 5.478017

$Ensembles_Linear
[1] 3.776555

As you can see, the Ensembles_Linear model (the last of the three results) has the lowest error rate by a wide margin. How was that done?

1. It all starts with the data set. In this case we will use the Boston Housing data set.

library(MASS) # Needed for the Boston Housing data set
library(tree) # Needed to make tree models
# The Metrics package must also be installed; it is called below as Metrics::rmse

head(MASS::Boston, n = 10) # Look at the first ten (out of 506) rows of the Boston Housing data set

Let’s look at the first ten rows of the data set:

[Screenshot: the first ten rows of the Boston Housing data set]

We are modeling the median house value, called medv in the data set.

2. Next we are going to break the data set into two groups: train (about 80%) and test (about 20%). There is nothing special about the 80/20 split.

df <- MASS::Boston
train <- df[1:400, ] # the first 400 rows
test <- df[401:505, ] # rows 401 to 505 (105 rows; the data set has 506 rows, so the final row is left out)

Let’s look at the train and test sets:

[Screenshots: the train and test sets]
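The split above takes rows in their original order. A common alternative is to shuffle the rows first, so that train and test are both drawn from the whole data set. Here is a minimal sketch of that idea. The seed value and the names train_random and test_random are assumptions chosen for illustration; they are not used anywhere else in this post.

```r
library(MASS) # provides the Boston data set

set.seed(42) # arbitrary seed, used only to make the split reproducible
df <- MASS::Boston
n <- nrow(df)

# Pick 80% of the row numbers at random for the training set
train_idx <- sample(seq_len(n), size = floor(0.8 * n))

train_random <- df[train_idx, ]  # rows that were sampled
test_random <- df[-train_idx, ]  # every row that was not sampled

dim(train_random)
dim(test_random)
```

The rest of this post keeps the ordered split, so the numbers shown at the top stay reproducible.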

3. We’ll start with the linear model. We’ll follow three steps, in this order:

A. Fit the model on the training data
B. Make predictions on the testing data
C. Calculate error rate of the predictions

# Linear model
Boston_lm <- lm(medv ~ ., data = train)

# Predictions for the linear model using the test data (required for the ensemble)
Boston_lm_predictions <- predict(object = Boston_lm, newdata = test)

# Error rate for the linear model using actual vs predicted results
Boston_lm_RMSE <- Metrics::rmse(actual = test$medv, predicted = Boston_lm_predictions)
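If you are curious what the error calculation is doing, root mean squared error is the square root of the average squared difference between the actual and predicted values. Here is a small base R sketch of the same calculation. The function name rmse_manual and the three made-up numbers are only for illustration.

```r
# Root mean squared error written out by hand
rmse_manual <- function(actual, predicted) {
  sqrt(mean((actual - predicted)^2))
}

# Made-up values: the errors are 1, -1 and 2
actual <- c(10, 20, 30)
predicted <- c(9, 21, 28)

# Squared errors are 1, 1 and 4, so the mean is 2 and the RMSE is sqrt(2)
rmse_manual(actual, predicted)
```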

4. We’ll do exactly the same steps with a different modeling method. We’ll use trees, but there are many other options available. We will follow the same three steps we used for the linear model:

A. Fit the model on the training data
B. Make predictions on the testing data
C. Calculate error rate of the predictions

# Tree model
Boston_tree <- tree(medv ~ ., data = train)

# Predictions for the tree model using the test data (required for the ensemble)
Boston_tree_predictions <- predict(object = Boston_tree, newdata = test)

# Error rate for the tree model using actual and predicted results
Boston_tree_RMSE <- Metrics::rmse(actual = test$medv, predicted = Boston_tree_predictions)

5. The ensemble is built using the predictions of the two methods (linear and trees). It also needs the true values, and those are the medv values in the testing data set.

# Create the ensemble
ensemble <- data.frame(
'linear' = Boston_lm_predictions,
'tree' = Boston_tree_predictions,
'y_ensemble' = test$medv
)

The ensemble looks like this:

[Screenshot: the ensemble data frame]

6. Next we break the ensemble into train (the first 80 of its 105 rows, about 76%) and test (the last 25 rows, about 24%).

ensemble_train <- ensemble[1:80, ]
ensemble_test <- ensemble[81:105, ]

Here is what each looks like:

The ensemble training set:
[Screenshot: the ensemble training set]

The ensemble testing set:
[Screenshot: the ensemble testing set]

7. Now we’ll follow the exact same steps we used above, but with the ensemble data set.

A. Fit the model on the training data
B. Make predictions using the testing data
C. Calculate the error rate of the predictions.

# Ensemble linear modeling
ensemble_lm <- lm(y_ensemble ~ ., data = ensemble_train)

# Predictions for the ensemble linear model
ensemble_prediction <- predict(ensemble_lm, newdata = ensemble_test)

# Root mean squared error for the ensemble linear model
ensemble_lm_RMSE <- Metrics::rmse(actual = ensemble_test$y_ensemble, predicted = ensemble_prediction)
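Steps 5 through 7 follow a pattern that works for any two sets of predictions, so they can be wrapped in a small helper. The sketch below uses only base R and made-up data so it runs on its own. The function name stack_linear, its arguments, and the simulated numbers are all hypothetical and are not part of the code in this post.

```r
# Hypothetical helper: combine two sets of predictions with a linear model
stack_linear <- function(pred_a, pred_b, actual, n_train) {
  # Step 5: put the predictions and the true values side by side
  ensemble <- data.frame(a = pred_a, b = pred_b, y_ensemble = actual)

  # Step 6: split the ensemble by row order
  ens_train <- ensemble[seq_len(n_train), ]
  ens_test <- ensemble[(n_train + 1):nrow(ensemble), ]

  # Step 7: fit, predict, and score on the held-out ensemble rows
  fit <- lm(y_ensemble ~ ., data = ens_train)
  pred <- predict(fit, newdata = ens_test)

  list(
    weights = coef(fit), # intercept plus one weight per base model
    rmse = sqrt(mean((ens_test$y_ensemble - pred)^2))
  )
}

# Made-up data: two noisy versions of the same true values
set.seed(1)
truth <- rnorm(100, mean = 20, sd = 5)
pred_a <- truth + rnorm(100, sd = 3)
pred_b <- truth + rnorm(100, sd = 3)

out <- stack_linear(pred_a, pred_b, truth, n_train = 80)
out$weights
out$rmse
```

Looking at the weights is a quick way to see how much the ensemble leans on each base model.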

8. Let’s put it all together. These are error rates, so the smaller the better:

results <- list(
'Linear' = Boston_lm_RMSE,
'Trees' = Boston_tree_RMSE,
'Ensembles_Linear' = ensemble_lm_RMSE
)

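9. Let’s look at the results. Keep in mind these are error rates, so lower is better. Also keep in mind that the ensemble’s error is measured on the 25 rows of the ensemble testing set, while the other two are measured on all 105 rows of the original testing set.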

[Screenshot: the results list]
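A list is fine for three numbers, but a sorted table is easier to read as you add more models. Here is a small sketch of one way to do that. To keep it self-contained, it types in the three values shown at the top of this post, assuming they appear in the same order as the list (Linear, Trees, Ensembles_Linear). The names results_df and results_sorted are only for illustration.

```r
# Values copied from the output at the top of the post (order assumed to match the list)
results <- list(
  Linear = 6.108005,
  Trees = 5.478017,
  Ensembles_Linear = 3.776555
)

# Turn the named list into a two-column data frame
results_df <- data.frame(
  model = names(results),
  RMSE = unlist(results),
  row.names = NULL
)

# Sort so the model with the lowest error comes first
results_sorted <- results_df[order(results_df$RMSE), ]
results_sorted
```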

10. Last, let’s check for any warnings:

warnings() # No warnings are returned.

See if you can build your own ensemble. You may use the Boston Housing data set, or any other data set you wish. Future blog posts will highlight more examples and ways to make ensembles, and show how ensembles produce excellent results.

Complete code for this blog post:
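The script below is assembled from the numbered steps above, so treat it as a convenience sketch and not as a separately tested release. It assumes the MASS, tree, and Metrics packages are installed.

```r
library(MASS) # Boston Housing data set
library(tree) # tree models

# Step 2: split the data by row order
df <- MASS::Boston
train <- df[1:400, ]
test <- df[401:505, ]

# Step 3: linear model, predictions, and error
Boston_lm <- lm(medv ~ ., data = train)
Boston_lm_predictions <- predict(object = Boston_lm, newdata = test)
Boston_lm_RMSE <- Metrics::rmse(actual = test$medv, predicted = Boston_lm_predictions)

# Step 4: tree model, predictions, and error
Boston_tree <- tree(medv ~ ., data = train)
Boston_tree_predictions <- predict(object = Boston_tree, newdata = test)
Boston_tree_RMSE <- Metrics::rmse(actual = test$medv, predicted = Boston_tree_predictions)

# Step 5: build the ensemble from the two sets of predictions and the true values
ensemble <- data.frame(
  linear = Boston_lm_predictions,
  tree = Boston_tree_predictions,
  y_ensemble = test$medv
)

# Step 6: split the ensemble by row order
ensemble_train <- ensemble[1:80, ]
ensemble_test <- ensemble[81:105, ]

# Step 7: linear model on the ensemble, predictions, and error
ensemble_lm <- lm(y_ensemble ~ ., data = ensemble_train)
ensemble_prediction <- predict(ensemble_lm, newdata = ensemble_test)
ensemble_lm_RMSE <- Metrics::rmse(actual = ensemble_test$y_ensemble, predicted = ensemble_prediction)

# Steps 8 and 9: collect and print the results
results <- list(
  Linear = Boston_lm_RMSE,
  Trees = Boston_tree_RMSE,
  Ensembles_Linear = ensemble_lm_RMSE
)
results

# Step 10: check for warnings
warnings()
```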

The fierce pursuit of truth in data


Welcome to DataAIP. This is a totally new type of AI, not like any that has come before. We are very happy to have you here!

We are on a mission: The fierce pursuit of truth in data, no matter where that pursuit takes us. We are making tools for people to use to get the most accurate results from their data. The tools use R and Python working together, in ways that produce results that are much better than either one could do alone.

The models built by DataAIP are all built automatically. They do not produce any errors or warnings. There are no typos, no missing commas, no missing or lost periods. All of the models run without any errors.

The system only requires one line of code from you. Tell the system where the data is located, and which column you are looking at. Answer a few easy questions, and then send the system off to the races. It will do everything for you, and return the results as fast as possible.

DataAIP works with four types of data: Numerical, Logistic, Classification, and Time Series. The actual package has example data sets for each of those four types, and future blog posts will demonstrate how they work.

DataAIP is totally new and different from any other AI system out there. It is not like any of the large language models, because it is not a large language model, and does not aggregate previous results posted online. It starts fresh with the data, builds the models in real time, and then presents the results to the user. No aggregation of large models is ever used, because it is not needed.

Future blog posts will explain how to use the tools and get the best possible results.