Residential buildings conceptual cost estimates with the use of support vector regression

Cost analyses, and conceptual cost estimates among them, are of key importance for the success of construction projects. The implementation of neural networks or machine learning methods provides broad possibilities for this specific type of cost analysis. The aim of the paper is to present selected results of studies on the use of support vector regression as a machine learning tool for conceptual cost estimates of residential buildings. Results for three models based on support vector regression with radial basis kernel functions are presented.


Introduction
Cost estimation is one of the key processes in the course of a construction project. Cost estimates serve the parties involved in a project as a basis for various decisions. Conceptual cost estimates, prepared in the early design stage, are specific for two reasons. Firstly, the estimates have to rely on imprecise and incomplete information about the building. Secondly, the ability to influence the building's characteristics and cost is greater in the early design stage than later in the course of the project. Both of these facts stimulate research on the development of models, based on various approaches and methods, capable of supporting conceptual cost estimates.
The aim of this paper is to present the results of investigations into the applicability of the support vector regression (SVR) method for conceptual cost estimates of residential buildings. The paper contributes to the author's broader research on the development of cost estimation models based on artificial intelligence and machine learning methods (some earlier works can be mentioned here [1-3]).
Support vector regression (SVR) originates from support vector machines (SVM), a machine learning method. The theory and fundamentals of SVM can be found in [4-6]. Applications of SVM are reported for a variety of problems, including those belonging to the field of construction management, such as: development of an evolutionary support inference system for construction management [7], automating document classification [8], supporting legal decisions [9], risk prediction [10], estimates of time and cost at completion for construction projects [11], predicting project success [12, 13], and cost estimates of school buildings [14].
The paper covers: the assumptions made for the application of SVR to conceptual cost estimates of residential buildings together with some theoretical background, a presentation of the results, a brief discussion of the results, and conclusions in the summary section.

Assumptions for implementing support vector regression in conceptual cost estimates of residential buildings
The aim of the study was to investigate the applicability of SVR to the problem of conceptual estimating of the cost of residential and residential-commercial buildings. The model based on SVR was developed under the following assumptions.
The dependent variable of the model, that is the cost of a building, was represented by y. The cost covered construction works including: site preparation, shell and core, internal and external finishing, installations and services, as well as all the primary equipment. (The costs of land improvement and landscaping were not taken into account.) The independent variables included 13 features (cost descriptors) related to the information about buildings available in the early stage of design, adequate for conceptual cost estimates. The independent variables formed a vector, later referred to as x. (The analysis and selection of the independent variables is presented in previous work [17].)
In general, the problem can be defined as a regression analysis whose aim is to find the mapping x → y. Table 1 presents the variables taken into account for the purpose of model development.
Table 1. Variables of the model.

Symbol | Description | Type and value
y | cost of a building | numerical (net cost, excluding value added tax, discounted for a base year) [mln PLN]
x_1 | footprint | numerical (building's footprint area, m²)
x_2 | cubature | numerical (building's cubic capacity, m³)
x_3 | number of storeys | numerical (number of building's storeys)
x_4 | type of foundation | descriptive (RC continuous footing, RC foundation slab, special foundation)
x_5 | building structure | descriptive (traditional, RC skeletal structure, RC column-wall structure, RC solid-wall structure, RC monolith structure)
x_6 | roof structure | descriptive (RC: flat roof; wooden structure: steep roof)
x_7 | number of structural segments of a building | numerical (number of segments)
x_8 | number of elevator shafts | numerical (number of shafts)
x_9 | ground conditions | descriptive (simple, complex, complicated) †
x_10 | usable area of dwellings | numerical (total usable area of dwellings, m²)
x_11 | usable area of commercial premises | numerical (total usable area of commercial premises, m²)
x_12 | usable area of underground garages | numerical (total usable area of underground garages, m²)
x_13 | building finishing standard | descriptive (developer's standard, turn-key standard)

According to the scope of the research, the mapping x → y was to be approximated by a regression function f:

y = f(x) (1)

The function f was to be found with the use of the SVR method. (The theoretical background of the SVR method, given in the forthcoming part of the paper, is compiled on the basis of [4-6] and [15, 16].) SVR machine learning was applied to approximate the regression function as a linear hyperplane, after mapping the input examples into a higher-dimensional feature space H:

x → Φ(x) ∈ H (2)

The mapping Φ(x) is supposed to increase the expressive power of the representation and allows the approximation function to be computed in the mapped space H. Under the aforementioned assumptions, the sought-for linear hyperplane, with x replaced by Φ(x), can be given as:

f(x) = w^T Φ(x) + w_0 (3)

The ε-insensitive loss function is assumed to measure the errors in the course of the training process:

L_ε(y, f(x)) = |y − f(x)|_ε (4)

where:

|y − f(x)|_ε = max(0, |y − f(x)| − ε) (5)

In the given loss function, ε defines a tube of insensitiveness around the true values y, which results in the toleration of deviations smaller than ε and allows function complexity to be traded off. The optimization problem to be solved is:

min (1/2)‖w‖² + C Σ_{i=1..N} (ξ_i + ξ_i*) (6)

subject to the constraints for both sides of the ε-tube:

y_i − w^T Φ(x_i) − w_0 ≤ ε + ξ_i (7)
w^T Φ(x_i) + w_0 − y_i ≤ ε + ξ_i* (8)
ξ_i ≥ 0 (9)
ξ_i* ≥ 0 (10)

where C represents the complexity of the model, and ξ_i and ξ_i* are slack variables that penalize predictions outside the ε-tube.

The optimization problem can be solved by introducing Lagrange multipliers, which leads to the approximation function in the form:

f(x) = Σ_{i=1..N} (α_i − α_i*) Φ(x_i)^T Φ(x) + w_0 (11)

where α_i and α_i* are the Lagrange multipliers of the optimal solution. Explicit calculation of the dot products Φ(x_i)^T Φ(x_j), as well as the choice of an appropriate mapping, is difficult and computationally complex. Instead, the dot products are calculated with the use of kernel functions:

K(x_i, x_j) = Φ(x_i)^T Φ(x_j) (12)

In turn, the approximation function can finally be written as:

f(x) = Σ_{i=1..N} (α_i − α_i*) K(x_i, x) + w_0 (13)

where K(x_i, x) is the chosen kernel function.

MATEC Web of Conferences 196, 04090 (2018). https://doi.org/10.1051/matecconf/201819604090
XXVII R-S-P Seminar 2018, Theoretical Foundation of Civil Engineering
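As a minimal numerical sketch of the formulas above (the ε-insensitive loss and the kernel form of the approximation function), the following NumPy fragment may help; the support vectors, the multiplier differences α_i − α_i*, the bias w_0 and the kernel parameter γ are made-up illustrative values, not taken from the study:

```python
import numpy as np

def eps_insensitive_loss(y, f, eps):
    """Epsilon-insensitive loss: deviations inside the eps-tube cost nothing."""
    return np.maximum(np.abs(y - f) - eps, 0.0)

def rbf_kernel(x_i, x_j, gamma):
    """Radial basis kernel K(x_i, x_j) = exp(-gamma * ||x_i - x_j||^2)."""
    return np.exp(-gamma * np.sum((x_i - x_j) ** 2))

def svr_predict(x, support_vectors, alpha_diff, w_0, gamma):
    """Kernel form of the approximation function:
    f(x) = sum_i (alpha_i - alpha_i*) * K(x_i, x) + w_0."""
    return sum(a * rbf_kernel(sv, x, gamma)
               for a, sv in zip(alpha_diff, support_vectors)) + w_0
```

For instance, a query point coinciding with its single support vector contributes exactly its multiplier difference, since K(x, x) = 1 for the radial basis kernel.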
Before the start of machine learning, all the values of the models' variables were scaled to the range [0, 1]. (In the case of numerical values, linear scaling was applied, whereas the descriptive values were transformed with the use of pseudo-fuzzy scaling.) For the purposes of machine learning, the whole set of training data, comprising 151 samples, was divided into a learning subset and a testing subset in the ratio of 70% to 30%, respectively.
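A hypothetical sketch of this preprocessing follows; the random data stand in for the real 151-sample set, and the paper's pseudo-fuzzy scaling of descriptive variables is simplified here to plain min-max scaling:

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up raw data: 151 samples, 13 features (stand-ins for x_1..x_13).
X = rng.uniform(10.0, 1000.0, size=(151, 13))
y = rng.uniform(5.0, 55.0, size=151)       # cost in mln PLN

def minmax_scale(a):
    """Linear min-max scaling of every variable to [0, 1]."""
    lo, hi = a.min(axis=0), a.max(axis=0)
    return (a - lo) / (hi - lo)

X_s, y_s = minmax_scale(X), minmax_scale(y)

# Random 70% / 30% split into learning and testing subsets.
idx = rng.permutation(len(X_s))
cut = int(0.7 * len(X_s))
learn_idx, test_idx = idx[:cut], idx[cut:]
```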
As the kernel function K(x_i, x_j), the radial basis function was chosen:

K(x_i, x_j) = exp(−γ‖x_i − x_j‖²) (14)

The parameters of the model were selected by means of 10-fold cross-validation. The mesh of investigated parameters C and ε was as follows: C varied between 5 and 30, and ε varied between 0.1 and 0.3. The cross-validation allowed the optimal node of the mesh to be selected. The process of SVR-based machine learning was repeated several times for different variants of learning and testing subset sampling. All of the computations were made with the use of STATISTICA® software.
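The study performed this parameter search in STATISTICA; purely as an illustration, an equivalent mesh search with 10-fold cross-validation could be sketched with scikit-learn (the data below are random stand-ins for the scaled samples, and the RBF width γ is left at the library default):

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVR

rng = np.random.default_rng(1)
X = rng.uniform(0.0, 1.0, size=(151, 13))  # scaled stand-ins for x_1..x_13
y = rng.uniform(0.0, 1.0, size=151)        # scaled cost

# Mesh of C and epsilon values matching the ranges used in the study.
param_grid = {"C": np.arange(5, 31, 5), "epsilon": np.arange(0.1, 0.31, 0.05)}
search = GridSearchCV(SVR(kernel="rbf"), param_grid, cv=10,
                      scoring="neg_mean_squared_error")
search.fit(X, y)
print(search.best_params_)                  # optimal node of the mesh
```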

Results of the study
Table 2 presents the characteristics of three models selected from the set of investigated models in terms of the number of support vectors and bounded vectors, as well as the parameters C and ε. Additionally, table 2 includes the correlation coefficients between the real-life values and the predicted values of the dependent variable y.
Table 3 includes measures of the cost prediction errors computed for models A, B and C. The measures include the mean squared error (MSE) and the mean absolute percentage error (MAPE).
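The reported measures, together with the per-sample absolute percentage error (APE) used later in the analysis, can be written down as follows (a generic sketch with made-up costs, not the study's actual computation):

```python
import numpy as np

def mse(y_true, y_pred):
    """Mean squared error."""
    return float(np.mean((y_true - y_pred) ** 2))

def ape(y_true, y_pred):
    """Per-sample absolute percentage error, in %."""
    return 100.0 * np.abs(y_true - y_pred) / y_true

def mape(y_true, y_pred):
    """Mean absolute percentage error, in %."""
    return float(np.mean(ape(y_true, y_pred)))

# Hypothetical real-life and predicted costs in mln PLN.
y_true = np.array([10.0, 20.0, 40.0])
y_pred = np.array([11.0, 18.0, 42.0])
print(mse(y_true, y_pred), mape(y_true, y_pred))
```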
(To be distinguished, the three models are labelled A, B and C respectively. In the case of the values presented in tables 2 and 3, the subscripts indicate that the values were calculated for the learning subset (subscript L), the testing subset (subscript T), or the learning and testing subsets together (subscript L&T).) Neither the number of support vectors nor the number of bounded support vectors differs significantly between models A, B and C. The complexity of the models also varies very little: C ranges between 27 and 30, whereas the value of ε is the same for all three models. The correlation coefficients R are almost the same for the three selected models. The overall measures of the models' errors, as presented in table 3, do not vary considerably between models A, B and C. However, one can see that for model A the MSE values computed for the learning and testing subsets were almost the same. In the case of model B the MSE was smaller for the learning subset, while in the case of model C the MSE was smaller for the testing subset. For all models the MAPE values were smaller for the testing subsets.
Figure 1 depicts a scatter plot of the real-life values of the construction cost of residential buildings (denoted y) and the costs predicted by the models (denoted ŷ) for the samples belonging to the testing subset. Figure 2 depicts a scatter plot of the real-life values of the construction cost (y) and the corresponding absolute percentage errors (APE) calculated for the samples belonging to the testing subset. In figure 1 one can see that the points are distributed along the line y = ŷ, which indicates a perfect fit. The points are distributed evenly on both sides of the line, so the cost predictions made for the testing subset comprise both overestimates and underestimates.
In figure 2 the distribution of APE shows that the predictions made with the use of the models are most reliable for residential buildings whose real-life values y fall in the range of 30 to 55 million PLN (APE values in the range of 0 to 10% for the testing subset). Less reliable predictions can be expected for buildings whose real-life construction cost ranges between 20 and 30 million PLN (APE values in the range of 0 to 15% for the testing subset). Finally, the least reliable predictions can be expected for buildings whose real-life construction cost falls in the range of 5 to 20 million PLN (most APE values in the range of 0 to 25%, some in the range of 25 to 35%, for the testing subset).

Summary
The research investigated the use of support vector regression based machine learning for the purposes of conceptual cost estimates of residential buildings. The paper presents results for three models selected from a number of examined ones. In the case of all of the selected models the radial basis kernel function was applied, and the parameters C and ε were found with the use of cross-validation.
Parameters of the selected models were similar. The correlation coefficients between the real-life values and the models' predictions were almost the same for all the models. The overall error measures (MSE as well as MAPE) do not differ significantly. Analysis of the APE errors allowed the reliability of the models' predictions to be specified for clusters of buildings distinguished with respect to construction cost.
In the light of the analyses of the results, the quality of the developed models can be considered satisfactory. The obtained results justify the statement that the application of the support vector machine learning approach can bring benefits to conceptual cost estimates.

Fig. 1. Scatter plot of the real-life values of cost and the cost predictions made by models A, B and C for the testing subset.

Fig. 2. Scatter plot of the real-life values of cost and the absolute percentage errors of the cost predictions made by models A, B and C for the testing subset.

Table 2. Characteristics of the selected models.

Table 3. Cost prediction errors computed for the selected models.