Case study:
Taming complexity by stacking models
Summary
We improved a land valuation model by 30% by using another model to transform data for the original model.
We built a second model to predict the value of flats for that land at that time, and included those predictions as variables in the first model, which was enough to reduce test error by the aforementioned 30%.
Introduction
The Singapore government periodically sells some of its land, which is of interest to developers building residential projects.
The opportunity for better valuation can be seen in the sometimes huge difference between the winning bid and the next highest. Some buyers left hundreds of millions of dollars on the table.
We were thus attempting to predict the top bid for Singapore state land sales, and wanted to draw on a number of variables beyond the site information itself, such as social housing resale prices, public transport, or schools.
This necessitated a sophisticated model to tame the complexity of a situation with a lot of important, collinear variables and a relatively small sample (as there are few state land sales per year).
Problem: building indices
We had initially thought of including relative average price movements for each location over time as a form of local price index. This was simply computed as the average resale value of a unit per town, divided by the average resale value of all units in the country.
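To make the construction concrete, here is a minimal sketch of that simple index, assuming a hypothetical DataFrame `resales` with 'town', 'month' and 'resale_price' columns (illustrative names, not the actual dataset schema):

```python
import pandas as pd

def simple_local_index(resales: pd.DataFrame) -> pd.DataFrame:
    # Average resale value per town and month.
    town_avg = resales.groupby(["town", "month"])["resale_price"].mean()
    # Average resale value across the whole country, per month.
    national_avg = resales.groupby("month")["resale_price"].mean()
    # Relative price level: town average divided by the national average.
    index = town_avg.div(national_avg, level="month")
    return index.rename("local_price_index").reset_index()
```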
Unfortunately, that information was not sufficient to reduce the variance in the land valuation model, and so a more sophisticated model was called for.
First, we clustered ~100,000 units by 35 variables, using unsupervised learning.
K-means with 4 clusters best classified the units into four groups: the affordable, the average-priced, the nearby expensive ones and the faraway expensive ones.
However, further clustering seemed to prefer three groups, as suggested by the fourth (red) cluster in the hierarchical clustering visualisation. We decided to drop distance and classify by (expected or realised) value only.
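A rough sketch of this clustering step, with stand-in data in place of the real ~100,000 units and 35 variables (the actual variable list is not shown here), might look like this:

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import dendrogram, linkage
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Stand-in data: the real exercise used ~100,000 units described by 35 variables.
rng = np.random.default_rng(0)
X = rng.normal(size=(5000, 35))

# Scale the variables so no single one dominates the distance metric.
X_scaled = StandardScaler().fit_transform(X)

# K-means with 4 clusters, as in the text.
labels = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(X_scaled)

# Hierarchical clustering on a subsample, to check how many groups the
# dendrogram really supports before settling on a final classification.
sample_idx = rng.choice(len(X_scaled), size=1000, replace=False)
Z = linkage(X_scaled[sample_idx], method="ward")
dendrogram(Z, truncate_mode="lastp", p=12)
plt.show()
```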
Solution: a model for the model
We built a new model to predict the value of an affordable unit, an average unit and an expensive unit for a specific location at a specific time, including a number of variables such as premium commercial real estate rents, secondary languages taught in nearby schools or the number of bus lines stopping in the area.
We could thus capture the information contained in these variables and transform it into a format more ingestible by the land valuation model: a mere three variables that could be thought of as “what was the top, middle and bottom of the market doing at that time”, effectively indices for each market slice.
Classifying and valuing apartments
The first step was to build a model for predicting the value of any unit based on the historical data. For example:
- The unit: a 3-room New Generation flat in Ang Mo Kio, with 2 bedrooms, 2 bathrooms, storey 1 to 3, sold in March 2018. During that time there were 24,296 documents lodged, and the retail rent in Orchard was SGD 9.79 psf with a vacancy rate of 6.1%. Ang Mo Kio schools have an average of 1.2 MRT stations and 8.4 bus lines nearby. There are 0.000053357 secondary schools per resident…
- The sale: realised SGD 394 per square foot (psf).
- The model: predicted SGD 385 psf, an error of -2.4%.
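The article does not name the algorithm behind this unit model, so the sketch below uses a gradient-boosted regressor purely as a stand-in; the feature names mirror the example above but are hypothetical.

```python
import pandas as pd
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import train_test_split

# Hypothetical feature list: unit attributes plus market-wide conditions
# at the time of sale.
FEATURES = [
    "rooms", "bathrooms", "storey_mid",            # unit attributes
    "documents_lodged", "orchard_rent_psf",        # market conditions
    "orchard_vacancy_rate", "mrt_per_school",
    "bus_lines_per_school", "schools_per_resident",
]

def train_unit_model(units: pd.DataFrame) -> GradientBoostingRegressor:
    X = units[FEATURES]
    y = units["resale_psf"]  # realised price per square foot
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=0
    )
    model = GradientBoostingRegressor(random_state=0).fit(X_train, y_train)
    print("held-out R^2:", model.score(X_test, y_test))
    return model
```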
The next step was defining what counts as an affordable, average and premium unit. Aside from the clustering, examining historical data showed some clear price bands, with the premium band behaving independently of the other two, which was encouraging: some of the variance was, after all, explained by the unit mix!
We picked the median apartment and its variables for each price band: the median type, number of rooms, storey, and so on. These were our index units, effectively fictional apartments for which we would predict a value for each land sale, this value serving as a price index for the low, medium and high ends of the residential market at that time and place.
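Continuing the hypothetical column names from the previous sketch, the index units and their per-sale valuations could be assembled roughly like this (band labels and schema are illustrative):

```python
import pandas as pd

UNIT_COLS = ["rooms", "bathrooms", "storey_mid"]
MARKET_COLS = ["documents_lodged", "orchard_rent_psf", "orchard_vacancy_rate",
               "mrt_per_school", "bus_lines_per_school", "schools_per_resident"]

def build_index_units(units: pd.DataFrame) -> pd.DataFrame:
    # Median attributes per price band (affordable / average / premium):
    # three fictional apartments that stand in for each market slice.
    return units.groupby("price_band")[UNIT_COLS].median()

def index_values(index_units: pd.DataFrame, land_sale: pd.Series,
                 unit_model) -> pd.Series:
    # Attach the land sale's time-and-place market variables to each index
    # unit, then ask the unit model what those fictional apartments would
    # sell for there and then.
    rows = index_units.assign(**{c: land_sale[c] for c in MARKET_COLS})
    preds = unit_model.predict(rows[UNIT_COLS + MARKET_COLS])
    return pd.Series(preds, index=index_units.index)
```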
Results and conclusion
We fed the predictions as variables to the land valuation model and, without any other changes, the test error improved by a full 30%.
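A sketch of that final stacking step, reusing the hypothetical helpers above: each land sale gains three new columns, one per index unit, before the land valuation model is refit.

```python
import pandas as pd

def add_index_features(land_sales: pd.DataFrame, index_units: pd.DataFrame,
                       unit_model) -> pd.DataFrame:
    # Predict the value of the affordable, average and premium index units
    # for each land sale's time and place, and append them as three columns.
    idx = land_sales.apply(
        lambda sale: index_values(index_units, sale, unit_model), axis=1
    )
    idx.columns = [f"index_{band}" for band in idx.columns]
    return pd.concat([land_sales, idx], axis=1)

# The enriched table is then used to refit the land valuation model and
# compare its test error against the baseline without the index columns.
```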
Variables such as the properties of nearby schools or the number of bus lines did matter, quite a bit indeed!
Choosing a different model gave us a further 30% improvement, but that is another story for another time.