Partial Least Square with Savitzky Golay Derivative in Predicting Blood Hemoglobin Using Near Infrared Spectrum

Near infrared spectroscopy (NIRS) is a reliable technique that is widely used in medical fields. In this work, partial least squares (PLS) regression was developed to predict blood hemoglobin concentration from NIRS data. The aims of this paper are (i) to develop a predictive model for near infrared spectroscopic analysis in blood hemoglobin prediction, (ii) to establish the relationship between blood hemoglobin and the near infrared spectrum using the predictive model, and (iii) to evaluate the predictive accuracy of the model based on root mean squared error (RMSE) and coefficient of determination ($r_p^2$). Partial least squares with first order Savitzky Golay (SG) derivative preprocessing (PLS-SGd1) showed the highest prediction performance, with RMSE = 0.7965 and $r_p^2$ = 0.9206 in K-fold cross validation. The optimum number of latent variables (LV) and frame length (ƒ) were 32 and 27 nm, respectively. These findings suggest that the relationship between blood hemoglobin and the near infrared spectrum is strong, and that partial least squares with the first order SG derivative is able to predict blood hemoglobin from near infrared spectral data.


Introduction
Hemoglobin concentration can be used to diagnose anemia [1]. Hemoglobin (Hb) is a protein in red blood cells that contains iron and carries oxygen from the lungs to the rest of the body [2]. Generally, a hemoglobin level of 12.0 g/dL or higher is defined as non-anemia, 10 to 11.9 g/dL as mild anemia, 7.0 to 9.9 g/dL as moderate anemia, and lower than 7.0 g/dL as severe anemia [1]. Normally, blood hemoglobin is measured using the cyanmethemoglobin method, in which blood drawn from the patient is mixed with reagent chemicals for analysis [3]. However, this method is time consuming, requires chemical reagents, and is invasive. In contrast, NIRS is a promising fast and noninvasive technique for measuring blood hemoglobin.
NIRS is a simple and reliable technique widely used in various fields such as medicine [4][5][6][7][8][9][10], food [11][12][13][14][15], agrochemicals [12,13,15], and fuel [16]. NIRS measures overtones and combination tones of the fundamental molecular vibrations, especially the asymmetric vibrations, and these properties make NIR useful for analyzing biological systems [17]. However, different reviews have identified several factors that limit the use of NIRS: interference resulting in a poor S/N ratio, calibration issues, baseline drift, thermal noise, and the need for proper wavelength selection [18]. These problems can degrade the accuracy of the predictions. Because of these issues and the nonlinearity of NIRS spectral data, multivariate calibration models need to be developed for quantitative analysis of the target component in complex samples. Developing a multivariate calibration model commonly relies on three steps: preprocessing, calibration and validation [19]. Different types of multivariate calibration methods have been applied to NIRS spectral data to extract the relevant information from a large dataset and predict the concentration in samples [16,20]. The major concern for these multivariate calibration methods with spectral data is data nonlinearity [20,21]. An appropriate calibration model therefore needs to be investigated to give optimum prediction performance from the spectral sample data.
Partial least squares (PLS) is a well-known predictive modelling method in NIRS because of its rapidity, simplicity and practicability [21,22]. As a linear method that combines principal component analysis (PCA) and multiple linear regression (MLR), PLS is able to handle data with strong collinearity and noise, as well as situations in which the number of variables exceeds the number of samples [22]. PLS has been shown to be more effective, in terms of the number of components required, than MLR and principal component regression (PCR) [20]. Moreover, PLS has shown much better performance than artificial neural networks (ANN) in terms of RMSEP [24]. However, a conventional PLS model needs a prior preprocessing step to cope with changes in the interferent structure of the test set and to reduce the prediction error [8].
The SG preprocessing method has been used successfully to remove unwanted signal from spectral data and to overcome the most common issues in raw NIR spectral data [5,10,14,15]. However, few studies have investigated the effect of preprocessing on the predictive accuracy of predictive models [25]. Thus, this study investigates PLS combined with different SG preprocessing techniques (i.e. smoothing, first order derivative and second order derivative) for predicting blood hemoglobin concentration.

Research methodology
Figure 1 describes the general idea of the research methodology. Raw spectral data from near infrared spectroscopy (NIRS) are first preprocessed with the SG derivative. PLS multivariate calibration is then used to model the spectral data and generate the predicted value of blood hemoglobin.

Spectral data
The spectral dataset was adopted from the IDRC ShootOut 2010 and was provided by Karl Norris [26]. Blood samples were analyzed during the period from 1990 to 1992 with an NIRSystems 6500 spectrometer with a transmission amplifier mounted in the sample transport. All spectra have 700 variables, from 1100 to 2498 nm at a 2 nm interval, as shown in Fig. 2. The data set contains n = 231 samples for calibration and n = 194 samples for the blind test, used to measure the concentration of blood hemoglobin among the blood constituents. The reference blood hemoglobin values were measured with a Coulter STKS monitor, made by the Coulter Corporation of Hialeah, FL [27]. Descriptive statistics of the samples and reference blood hemoglobin, namely the number of samples (n), minimum (Min), maximum (Max), mean and standard deviation (Std), are shown in Table 1.

Savitzky Golay preprocessing
The preprocessing stage performs data loading followed by zero order SG smoothing and first and second order SG derivatives, to remove unwanted signal before the spectral data enter the modelling process. The 231 calibration samples were processed at three levels of SG preprocessing, producing smoothed (SG0), first order SG derivative (SG1) and second order SG derivative (SG2) data. The SG coefficients (C0, C1 and C2) were generated using built-in matrix routines in MATLAB (MATLAB® Version 8.4 (R2014b)). The value at the middle of each frame for the desired derivative order is estimated as the dot product of the corresponding differentiation filter coefficients (C0, C1 or C2) with the spectral data, using the following equation:

$$Y_j^{*} = \sum_{i=-(f-1)/2}^{(f-1)/2} C_i \, Y_{j+i} \qquad (1)$$

where $C_i$ is the set of SG coefficients, $Y_{j+i}$ is the related set of data before treatment, and $Y_j^{*}$ is the observed value after treatment. The spectral data in each frame are treated between $Y_{j-(f-1)/2}$ and $Y_{j+(f-1)/2}$, where $i$ is the index of the measured point within the frame and $f$ is the total number of points in the frame (the frame length).
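The dot-product step described above can be sketched in Python with NumPy. This is a minimal illustration and not the MATLAB routine used in the study: the function names `sg_coefficients` and `sg_filter` are hypothetical, the polynomial order of 2 is an assumption (the text does not state it), and edge points without a full frame are simply dropped.

```python
import math
import numpy as np


def sg_coefficients(frame_length, poly_order=2, deriv=0, delta=1.0):
    """Weights giving the deriv-th derivative at the centre of one frame."""
    if frame_length % 2 == 0 or frame_length <= poly_order:
        raise ValueError("frame_length must be odd and greater than poly_order")
    if deriv > poly_order:
        raise ValueError("deriv cannot exceed poly_order")
    m = (frame_length - 1) // 2
    offsets = np.arange(-m, m + 1, dtype=float)
    # Column k of the design matrix holds offsets**k (local polynomial basis).
    design = np.vander(offsets, poly_order + 1, increasing=True)
    # Row k of the pseudo-inverse maps a frame to the fitted coefficient a_k.
    fit = np.linalg.pinv(design)
    # d-th derivative at the centre is d! * a_d, rescaled by the point spacing.
    return fit[deriv] * math.factorial(deriv) / delta ** deriv


def sg_filter(spectrum, frame_length, poly_order=2, deriv=0, delta=1.0):
    """Apply the SG weights to every full frame; edge points are dropped."""
    y = np.asarray(spectrum, dtype=float)
    c = sg_coefficients(frame_length, poly_order, deriv, delta)
    m = (frame_length - 1) // 2
    # Dot product of the coefficients with each frame centred on point j.
    return np.array([c @ y[j - m:j + m + 1] for j in range(m, len(y) - m)])
```

Under these assumptions, a call such as `sg_filter(spectrum, 27, deriv=1, delta=2.0)` on a 700-point spectrum sampled every 2 nm would return 674 first-derivative values, because 13 points at each edge have no complete frame.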

Partial least square
Partial least squares regression was carried out in MATLAB (MATLAB® Version 8.4 (R2014b)). The general idea behind PLS modelling is to decompose both the predictor matrix and the response matrix as follows:

$$X = T P^{T} + E \qquad (2)$$

$$Y = U Q^{T} + F \qquad (3)$$

where $X$ is an $n \times m$ matrix of predictors and $Y$ is an $n \times p$ matrix of responses. $T$ and $U$ are $n \times l$ matrices that are the projections (scores) of $X$ and $Y$, respectively; $P$ and $Q$ are the $m \times l$ and $p \times l$ orthogonal loading matrices, respectively; and $E$ and $F$ are the residual terms. After estimating the factor and loading matrices $T$, $U$, $P$ and $Q$, the algorithm yields the PLS regression estimates $B$ and $B_0$ for the linear regression:

$$\hat{Y} = X B + B_0 \qquad (4)$$

where $B$ and $B_0$ are the PLS regression coefficients and $\hat{Y}$ is the predicted value of blood hemoglobin. In this research, the PLS regression coefficients were generated using MATLAB built-in matrix routines. PLS with SG preprocessing is a four-stage process: preprocessing, training, validation and testing.
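To make the decomposition and the final linear regression concrete, the following is a minimal NIPALS-style PLS sketch for a single response (often called PLS1), written in Python with NumPy. It is an illustration under stated assumptions, not the MATLAB built-in routine used in the study: the function name `pls1_fit` and the variable names `B` and `B0` (regression coefficients and intercept) are choices made here, and the data are mean-centred before fitting.

```python
import numpy as np


def pls1_fit(X, y, n_components):
    """Fit PLS1 with NIPALS-style deflation; return coefficients B and intercept B0."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float).ravel()
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xk, yk = X - x_mean, y - y_mean           # centred working copies
    n_vars = X.shape[1]
    W = np.zeros((n_vars, n_components))      # weight vectors
    P = np.zeros((n_vars, n_components))      # X loadings
    q = np.zeros(n_components)                # y loadings
    for a in range(n_components):
        w = Xk.T @ yk                         # direction of maximum covariance
        norm = np.linalg.norm(w)
        if norm < 1e-12:                      # nothing left to explain
            W, P, q = W[:, :a], P[:, :a], q[:a]
            break
        w /= norm
        t = Xk @ w                            # score vector (one column of T)
        tt = t @ t
        p = Xk.T @ t / tt
        q[a] = yk @ t / tt
        Xk = Xk - np.outer(t, p)              # deflate X
        yk = yk - q[a] * t                    # deflate y
        W[:, a], P[:, a] = w, p
    B = W @ np.linalg.solve(P.T @ W, q)       # coefficients in original X space
    B0 = y_mean - x_mean @ B                  # intercept restores the centring
    return B, B0


def pls1_predict(X, B, B0):
    """Predicted response, y_hat = X B + B0."""
    return np.asarray(X, dtype=float) @ B + B0
```

In this sketch the number of latent variables corresponds to `n_components`; choosing it by cross validation, as the study does, keeps the model from fitting noise in the spectra.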

Validations
K-fold cross validation was used to evaluate the performance of the PLS model [27,28]. There are three steps in K-fold cross validation. First, the data set is randomly divided into k = 5 disjoint folds of approximately equal size. Second, each fold in turn is used to test the model induced from the other k − 1 folds. Third, the root mean square error of cross validation (RMSECV), the root mean square error of prediction (RMSEP) and the coefficient of determination of prediction ($r_p^2$) are determined to characterize the prediction accuracy of the model. The RMSECV is calculated as

$$\mathrm{RMSECV} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(\hat{y}_i - y_i\right)^2} \qquad (5)$$

where $n$ is the total number of samples, and $\hat{y}_i$ and $y_i$ denote the predicted blood Hb and reference blood Hb from the calibration data set, respectively. The RMSEP, which measures the accuracy of the calibration model's predictions on a new, unseen data set, is computed as

$$\mathrm{RMSEP} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(\hat{y}_i - y_i\right)^2} \qquad (6)$$

where $n$ is the total number of samples, and $\hat{y}_i$ and $y_i$ denote the predicted blood Hb and reference blood Hb from the new, unseen data set, respectively. The coefficient of determination of prediction, interpreted as the proportion of the variance in the reference values that is explained by the predictions, is defined as

$$r_p^2 = 1 - \frac{SSE}{SST} = 1 - \frac{\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2}{\sum_{i=1}^{n}\left(y_i - \bar{y}\right)^2} \qquad (7)$$

where $\bar{y}$ is the mean of the reference data, and $\hat{y}_i$ and $y_i$ denote the predicted blood Hb and reference blood Hb from the new, unseen data set, respectively. SSE and SST denote the residual sum of squares and the total sum of squares, respectively.
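The validation procedure and error measures described above can be sketched in plain Python. This is an illustrative outline, not the code used in the study: the helper names (`rmse`, `r_squared`, `kfold_indices`, `cross_validated_rmse`) and the fixed random seed are assumptions, and `fit` and `predict` stand for any calibration model supplied by the caller.

```python
import math
import random


def rmse(predicted, reference):
    """Root mean squared error between predicted and reference values."""
    n = len(reference)
    return math.sqrt(sum((p - r) ** 2 for p, r in zip(predicted, reference)) / n)


def r_squared(predicted, reference):
    """Coefficient of determination, 1 - SSE / SST."""
    mean_ref = sum(reference) / len(reference)
    sse = sum((r - p) ** 2 for p, r in zip(predicted, reference))
    sst = sum((r - mean_ref) ** 2 for r in reference)
    return 1.0 - sse / sst


def kfold_indices(n_samples, k=5, seed=0):
    """Randomly split sample indices into k disjoint folds of near-equal size."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]


def cross_validated_rmse(fit, predict, X, y, k=5, seed=0):
    """Each fold is predicted by a model trained on the other k - 1 folds."""
    predictions = [None] * len(y)
    for test in kfold_indices(len(y), k, seed):
        held_out = set(test)
        train = [i for i in range(len(y)) if i not in held_out]
        model = fit([X[i] for i in train], [y[i] for i in train])
        for i, value in zip(test, predict(model, [X[i] for i in test])):
            predictions[i] = value
    return rmse(predictions, y)
```

Applying `rmse` to the pooled held-out predictions gives an RMSECV-style figure, while applying `rmse` and `r_squared` to predictions on a separate test set gives RMSEP and the coefficient of determination of prediction.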

Savitzky Golay preprocessing
Fig. 3 shows the output after SG smoothing was applied to the raw near infrared spectra. The result indicates that the spectra were smoothed and the signal to noise ratio (SNR) increased without greatly distorting the original signal.
The raw spectral data were treated with a frame length of 37 nm to produce the corresponding convolution coefficients. As a result, a frame of 38 nm at the starting point (1100 to 1176 nm) and at the end point (2422 to 2498 nm) was removed, because the convolution coefficients are applied to each data subset to estimate the smoothed signal at the central point of that subset only. This means that 76 of the 700 variables in the spectral data were lost during preprocessing. Therefore, the frame length should be optimized to avoid eliminating more information during preprocessing. The spectra after first order SG derivative preprocessing (SGd1) are shown in Fig. 4. The result indicates that the baseline shift effect was removed by the SGd1 process. The useful information of the original spectra that remains available for modelling lies in the ranges 1306 to 1630 nm, 1824 to 2150 nm and 2232 to 2460 nm.
Nevertheless, all spectral data from 1136 to 2460 nm can be used in the modelling process. Fig. 5 indicates that more information from the spectral data is lost after second order SG derivative preprocessing is applied.
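The loss of variables at the spectrum edges can be checked with simple arithmetic. The sketch below assumes that the frame length counts points on the 2 nm grid and that points without a complete frame are discarded; the function name `retained_range` is hypothetical. Under these assumptions, a 77-point frame drops 38 points at each edge (76 in total) and leaves 1176 to 2422 nm, while a 37-point frame drops 18 points at each edge and leaves 1136 to 2462 nm.

```python
def retained_range(first_nm, last_nm, step_nm, frame_length):
    """Count the variables lost at the edges for an odd SG frame length (in points)."""
    if frame_length % 2 == 0:
        raise ValueError("frame_length must be odd")
    half = (frame_length - 1) // 2                 # points lost at each edge
    n_total = int(round((last_nm - first_nm) / step_nm)) + 1
    return {
        "total": n_total,
        "lost": 2 * half,
        "kept": n_total - 2 * half,
        "start_nm": first_nm + half * step_nm,     # first wavelength with a full frame
        "end_nm": last_nm - half * step_nm,        # last wavelength with a full frame
    }


# Spectra in this study: 1100 to 2498 nm at a 2 nm interval (700 variables).
for frame in (37, 77):
    print(frame, retained_range(1100, 2498, 2, frame))
```

The trade-off is visible directly: each increase of two points in the frame length removes one more variable from each end of the spectrum.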

Performance calibration of SG-PLS
Table 2 shows the RMSECV of PLS with smoothing, first order and second order SG derivative preprocessed spectral data with the optimal filter and different LVs. The five best RMSECV values for each type of preprocessing are presented, representing the best of the 25 models that were developed. The minimum RMSECV is 0.2163 g/dL, followed by 0.2178 g/dL, while the highest is 0.2238 g/dL. In general, the RMSECV values of the 25 selected models do not differ much, with a difference of 2.18%. Optimizing the frame length for PLS-SGd0 pretreatment resulted in large values, between 93 nm and 99 nm. This may be because the data still contain baseline shift and slope effects after the smoothing pretreatment.
For PLS-SGd1, the RMSECV values were obtained with a minimum frame length of 37 nm and a number of LVs between 21 and 37. This slight improvement in performance might have resulted from removing the slope effect. When a longer filter is used to optimize model performance, more of the original signal is lost, because each new value after preprocessing is the central value obtained from the dot product of the convolution coefficients with the spectral data. The optimal frame lengths for PLS-SGd2 were between 79 nm and 99 nm, with LVs between 31 and 33, which indicates that much information was lost during preprocessing. The order of SG preprocessing therefore influences the frame length and the number of latent variables needed to reach the optimum RMSECV.

Performance prediction of SG-PLS
A total of 192 new samples, unseen during calibration, were used to measure the prediction performance of PLS. The results indicate that optimizing the frame length and the number of LVs can improve modelling performance: PLS-SGd1 showed the highest prediction performance (ƒ = 27, LV = 32, $r_p^2$ = 0.9206), followed by PLS-SGd0 (ƒ = 77, LV = 35, $r_p^2$ = 0.9190), as shown in Table 3. This could be due to the smaller frame length and lower number of LVs used. The results indicate that PLS-SGd1 is more robust when tested on new, unseen samples, where the samples were frozen for storage and thawed before being used for analysis.

Conclusions
In conclusion, partial least squares modelling is promising for predicting blood hemoglobin from near infrared spectral data. With the optimal frame length and number of latent variables, partial least squares with first order Savitzky Golay derivative preprocessing (ƒ = 27, LV = 32) showed the highest prediction performance, i.e. RMSEP = 0.7965 g/dL and $r_p^2$ = 0.9206 in K-fold cross validation. Thus, optimizing the frame length and the number of latent variables is crucial for improving prediction performance. The findings also show that a smaller frame length and a lower number of LVs give better predictions.

Fig. 3. The preprocessed spectral data after SG smoothing.

Acknowledgement
The authors would like to acknowledge the Research and Innovation Fund provided by the Office for Research, Innovation, Commercialization and Consultancy Management (ORICC), RMC, Universiti Tun Hussein Onn Malaysia (UTHM) for providing financial support, and the Faculty of Electrical and Electronic Engineering, UTHM for providing facilities for this study.

Table 1. Descriptive statistics of the blood hemoglobin.

Table 2. RMSECV of PLS with smoothing, 1st and 2nd order SG derivative preprocessed spectral data with optimal filter and different LVs.

Table 3. The performance of PLS with different SG preprocessed spectral data with optimal filter and latent variable (LV) using cross validation.