Research on dairy products detection based on machine learning algorithm

In this study, an electronic nose composed of seven kinds of metal oxide semiconductor sensors was developed to distinguish the milk source (the dairy farm to which the milk belongs) and to estimate the fat and protein content of milk, in order to identify the authenticity and evaluate the quality of milk. The developed electronic nose is a low-cost, non-destructive testing device. (1) For milk source identification, this paper combines the electronic nose odor characteristics of milk with its component characteristics to distinguish different milk sources, uses Principal Component Analysis (PCA) and Linear Discriminant Analysis (LDA) for dimensionality reduction, and finally uses three machine learning algorithms, Logistic Regression (LR), Support Vector Machine (SVM) and Random Forest (RF), to build milk source (dairy farm) identification models and to evaluate and compare their classification performance. The experimental results show that, among the single-feature models, the SVM-LDA model based on the electronic nose odor characteristics performs best, with a test set accuracy of 91.5%. The RF-LDA and SVM-LDA models based on the fusion of the two feature types perform best overall, with a test set accuracy of 96%. (2) Three algorithms, Gradient Boosting Decision Tree (GBDT), Extreme Gradient Boosting (XGBoost) and Random Forest (RF), are used to build models that estimate milk fat rate and protein rate from the electronic nose odor data. The results show that the RF model has the best estimation performance (R² = 0.9399 for milk fat; R² = 0.9301 for milk protein). This indicates that the proposed method can improve the estimation accuracy of milk fat and protein, which provides a technical basis for predicting the quality of dairy products.


Introduction
In addition to water, fat, phospholipids, protein, lactose and inorganic salts, milk contains at least 100 other chemical components, and its composition is very complex [1]. A mixture of lower fatty acids, acetone, acetaldehyde, carbonic acid and other volatile substances affects the flavor of milk, and sulfides are the main components of fresh milk flavor [2]. Milk from different farms differs in flavor because the cows' feed and growth environment differ [3]. Milk protein, milk fat and lactose are the key indicators for evaluating milk quality [4]. The degradation of these components, or the interaction between their derivatives, affects the flavor compounds of milk [5] [6]. Therefore, establishing a milk detection model is of great significance for identifying dairy farms and improving milk quality.
The traditional way to trace the origin of milk is physical tracking, such as manual recording. In recent years, many chemical methods have been used for milk origin identification, such as stable isotope ratio analysis, trace element content analysis and nuclear magnetic resonance. Milk quality detection mainly covers two aspects: the detection of milk freshness and the identification of milk components. Sensory evaluation is the most direct way to judge the freshness of milk: whether the milk has deteriorated is judged by observing physical information such as its color, smell and coagulation state. However, the accuracy of this method is low. To further improve detection accuracy, physical and chemical analysis methods have been used for milk quality detection.
At present, near infrared spectroscopy [10], microbiological physicochemical analysis [11], DHI (dairy production performance measurement) laboratory detection [12] and other methods are used in domestic and foreign research for the quantitative detection of milk components, and they have achieved good results. However, these methods are expensive, have low detection efficiency, are vulnerable to damage, and cannot achieve real-time detection of dairy products. Therefore, it is very important to find a fast and efficient nondestructive testing method.
As a new gas detection and analysis instrument, the electronic nose is highly portable and simple to operate, which makes nondestructive food testing easier [13] [14] [15]. The electronic nose is an electronic instrument that simulates the human sense of smell and can quickly evaluate complex mixtures of volatile gases. It has been widely used in milk recognition [16], discrimination [17] and detection [18]. An array composed of multiple sensors makes up for the limitations of a single sensor and can detect different components of the gas at the same time. Although the electronic nose has produced some research results in dairy detection, using it to detect dairy products remains a systematic and complex task; most current research focuses on a single feature of dairy products and lacks a systematic analysis of the electronic nose [19]. Therefore, this paper proposes a rapid detection method based on electronic nose technology and machine learning for milk source (cattle farm) recognition and for the estimation of milk fat rate and protein rate. Three classification algorithms, logistic regression (LR), support vector machine (SVM) and random forest (RF), are used to build milk source recognition models, and their classification performance is evaluated and compared. Gradient boosting decision tree (GBDT), extreme gradient boosting (XGBoost) and random forest (RF) are used to build estimation models of milk fat rate and protein rate, in order to improve the accuracy of the estimation, verify the effectiveness of the electronic nose detection method and achieve detection equivalent to the laboratory measurement.

Independently developed electronic nose model
An electronic nose is an electronic system that simulates the olfactory organs of animals and uses the response pattern of a sensor array to identify odors. The electronic nose used in this paper (30 cm long, 20 cm wide and 20 cm high) is composed of a gas sensor array, a signal acquisition module, a data acquisition module, and a signal processing and pattern recognition module (Fig. 1). Because each sensor in the array has a different sensitivity to the gas to be measured, the sensors respond differently, and the electronic nose system uses their response resistance values to identify the odor [20]. There are seven metal oxide sensors in the electronic nose. Table 1 lists the names of the gas sensors and the corresponding sensitive substances. In the developed system, the Arduino software module has two main functions: (1) obtaining the response values of the sensors; (2) processing the data and communicating with the computer. The microcontroller on the development board is programmed in the Arduino programming language; the program is compiled into a binary file and uploaded to the microcontroller. The response values of each sensor in the array to different volatile substances are digitized by a multiplexed analog-to-digital converter (ADC), and the resulting data are stored for subsequent computer analysis and identification, as well as for the extraction of related features. The processed digital signal is transmitted to the host computer through the serial port and finally displayed in the serial monitor.
MATEC Web of Conferences 355, 03008 (2022) ICPCM2021 https://doi.org/10.1051/matecconf/202235503008
The flow control unit in the electronic nose is responsible for gas capture and cleaning. The cleaning time was 60 s, the gas capture time was 90 s, and the gas flow rate was 1.1 L/min.

Sample collection and data acquisition
Milk samples from 10 farms were selected. First, the original samples were screened to remove those with a low liquid level or an unqualified temperature. The DHI instrument occasionally returns zero values, so these interfering values were removed before the experiment. Finally, 100 groups of milk samples were retained from each dairy farm, giving 1000 groups of samples from 10 dairy farms for which both DHI and electronic nose features were measured. During the experiment, each value was taken as the average of three measurements to reduce error.
The composition data of the dairy products were measured with the imported biochemical detection equipment of the DHI laboratory and include milk fat percentage (%), protein percentage (%), lactose percentage (%), total solids percentage (%), somatic cell count (×10⁴/mL) and urea nitrogen (mg/dL). Milk fat contains linolenic acid, arachidonic acid, various fat-soluble vitamins and phospholipids [21]. The content of fat and protein is an important indicator of milk quality, and a low ratio of milk fat to protein indicates that rumen acidosis is very likely in dairy cows [22]. The lactose content of milk is usually between 4.5% and 5%; it not only affects milk yield but also relates to rumen function. Somatic cells is the general name for the macrophages, lymphocytes and polymorphonuclear neutrophils in milk. The somatic cell count is an indicator of the degree of mastitis infection in dairy cows and reflects the health status and quality of the milk [23]. Milk urea nitrogen comes from blood urea nitrogen, and a high urea nitrogen content indicates that cows are more likely to suffer from acidosis [24] [25].
The electronic nose detection experiment was carried out at 22 °C and 19% humidity. From each milk sample, 20 mL was extracted and stored in a sealed test tube, which was left standing for 10 minutes so that the volatile matter of the milk sample filled the whole test tube. Before the volatile gas was captured, the airway and gas chamber of the electronic nose were cleaned with fresh air to eliminate interfering gases. During detection, the electronic nose probe and the pressure-balancing tube were inserted together into the headspace of the test tube. After the capture process, the gas was fully absorbed by the sensors for 2 minutes, and the voltage response value increased and then stabilized. During the cleaning process, as the volatile gas was gradually removed, the response value decreased and stabilized at a constant value, which completed the measurement of one sample.

Data analysis
In this experiment, milk samples from 10 different farms were selected, volatile gases were collected from the milk samples with the electronic nose, and the odor data were stored on a computer. The model analysis was conducted on 1000 groups of data after standardization: 800 groups were used as training data and 200 groups as test data.
For the cattle farm classification model, principal component analysis (PCA) and linear discriminant analysis (LDA) are used to reduce the dimension of the data and retain the [...] For the fitting model of electronic nose and DHI data, three regression algorithms are used: gradient boosting decision tree (GBDT), extreme gradient boosting (XGBoost) and random forest (RF). The fitting performance of the models is evaluated and compared using evaluation indexes.

SVM
SVM is a supervised learning model that can be used for pattern recognition, classification and regression. The principle of SVM is to find the separating hyperplane that correctly divides the classes in the training data set and has the maximum geometric margin. For nonlinear classification problems, the kernel function of SVM implicitly maps the samples from the original space to a high-dimensional space, so that the samples become linearly separable in the new space. The main kernel functions are the linear kernel, the polynomial kernel and the Gaussian radial basis function kernel.
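As a minimal sketch of the kernel idea described above (not the implementation used in this study), the following Python example checks that a degree-2 polynomial kernel equals an inner product in an explicitly mapped space, and that two concentric circles become linearly separable after the mapping. The helper names `phi`, `dot`, `poly_kernel`, `rbf_kernel` and `side`, the toy points and the value of `gamma` are illustrative assumptions.

```python
import math

def dot(a, b):
    """Inner product of two equal-length tuples."""
    return sum(ai * bi for ai, bi in zip(a, b))

def phi(x):
    """Explicit degree-2 feature map for a 2-D point (illustrative helper)."""
    x1, x2 = x
    return (x1 * x1, math.sqrt(2) * x1 * x2, x2 * x2)

def poly_kernel(x, z):
    """Homogeneous polynomial kernel of degree 2: k(x, z) = (x . z)^2."""
    return dot(x, z) ** 2

def rbf_kernel(x, z, gamma=0.5):
    """Gaussian radial basis function kernel (gamma is an assumed value)."""
    sq_dist = sum((xi - zi) ** 2 for xi, zi in zip(x, z))
    return math.exp(-gamma * sq_dist)

# Two toy classes on concentric circles: not linearly separable in 2-D.
angles = (0.0, 1.5, 3.0, 4.5)
inner = [(math.cos(t), math.sin(t)) for t in angles]          # radius 1
outer = [(3 * math.cos(t), 3 * math.sin(t)) for t in angles]  # radius 3

def side(x):
    """Signed position relative to the plane z1 + z3 = 4 in the mapped space."""
    z = phi(x)
    # z1 + z3 equals the squared radius of x, so the plane splits the circles.
    return z[0] + z[2] - 4.0

inner_sides = [side(p) for p in inner]   # all negative
outer_sides = [side(p) for p in outer]   # all positive
```

The kernel evaluates the inner product without ever building `phi(x)`, which is why an SVM can work in the mapped space at the cost of the original one.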

RF
Random forest is an important ensemble learning method based on bagging. It consists of many decision trees (CART). It can be used to solve classification and regression problems, is robust to noise, and is resistant to overfitting. The process of building an RF model is as follows: first, m sample points are drawn with replacement from the training sample set S to form a new training subset; second, a classification or regression tree is built on each training subset, with K features randomly selected from all features as candidates at each split node. The output of the model is the category with the most votes (classification) or the average output of the decision trees (regression).
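The building process above can be sketched in plain Python. This is a simplified illustration under stated assumptions, not the model used in this study: each tree is a depth-1 stump (so the random choice of K features per split node coincides with one choice per tree), the data are synthetic, and the names `fit_stump`, `predict_stump`, `fit_forest` and `predict_forest` are hypothetical.

```python
import random
from collections import Counter

def fit_stump(rows, labels, feature_ids):
    """Pick the (feature, threshold) split with the fewest misclassifications."""
    best = None
    for f in feature_ids:
        for threshold in sorted({r[f] for r in rows}):
            left = [y for r, y in zip(rows, labels) if r[f] <= threshold]
            right = [y for r, y in zip(rows, labels) if r[f] > threshold]
            if not left or not right:
                continue
            left_label = Counter(left).most_common(1)[0][0]
            right_label = Counter(right).most_common(1)[0][0]
            errors = (sum(y != left_label for y in left)
                      + sum(y != right_label for y in right))
            if best is None or errors < best[0]:
                best = (errors, f, threshold, left_label, right_label)
    if best is None:  # no valid split: always predict the majority class
        majority = Counter(labels).most_common(1)[0][0]
        return (feature_ids[0], 0.0, majority, majority)
    return best[1:]

def predict_stump(stump, row):
    f, threshold, left_label, right_label = stump
    return left_label if row[f] <= threshold else right_label

def fit_forest(rows, labels, n_trees=25, k_features=1, seed=0):
    rng = random.Random(seed)
    n, n_features = len(rows), len(rows[0])
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(n) for _ in range(n)]          # bootstrap sample
        feats = rng.sample(range(n_features), k_features)   # K random features
        forest.append(fit_stump([rows[i] for i in idx],
                                [labels[i] for i in idx], feats))
    return forest

def predict_forest(forest, row):
    """Majority vote over all trees (classification)."""
    votes = Counter(predict_stump(s, row) for s in forest)
    return votes.most_common(1)[0][0]

# Synthetic two-class data with two features (made-up values).
data_rng = random.Random(1)
rows = ([(data_rng.gauss(0, 0.5), data_rng.gauss(0, 0.5)) for _ in range(50)]
        + [(data_rng.gauss(3, 0.5), data_rng.gauss(3, 0.5)) for _ in range(50)])
labels = [0] * 50 + [1] * 50

forest = fit_forest(rows, labels)
# Accuracy on the training data, used here only as a sanity check.
accuracy = sum(predict_forest(forest, r) == y for r, y in zip(rows, labels)) / len(rows)
```

For regression, the vote in `predict_forest` would be replaced by the mean of the tree outputs.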

LR
Logistic regression is a supervised machine learning algorithm used to solve classification problems. Its principle is to minimize a loss function so that the prediction function becomes more accurate and the samples can be classified. The penalty term is an important hyperparameter of the LR model, and the solver parameter selects the algorithm used to minimize the loss function.
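To make the loss minimization and the penalty term concrete, here is a minimal sketch of binary logistic regression trained by plain gradient descent with an L2 penalty. It is an illustration under assumptions, not the solver used in this study: the penalty weight `lam`, the learning rate `lr`, the number of steps and the toy data are all made up, and the function names are hypothetical.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def score(w, b, x):
    """Linear score w . x + b."""
    return sum(wj * xj for wj, xj in zip(w, x)) + b

def loss(w, b, X, y, lam):
    """Mean cross-entropy plus the L2 penalty lam * ||w||^2 / 2."""
    eps = 1e-12  # guards against log(0)
    total = 0.0
    for xi, yi in zip(X, y):
        p = sigmoid(score(w, b, xi))
        total -= yi * math.log(p + eps) + (1 - yi) * math.log(1 - p + eps)
    return total / len(X) + 0.5 * lam * sum(wj * wj for wj in w)

def fit(X, y, lam=0.01, lr=0.5, steps=300):
    """Gradient descent on the penalized loss; returns weights and loss history."""
    w, b = [0.0] * len(X[0]), 0.0
    history = []
    for _ in range(steps):
        grad_w, grad_b = [0.0] * len(w), 0.0
        for xi, yi in zip(X, y):
            err = sigmoid(score(w, b, xi)) - yi
            grad_w = [g + err * xj for g, xj in zip(grad_w, xi)]
            grad_b += err
        w = [wj - lr * (g / len(X) + lam * wj) for wj, g in zip(w, grad_w)]
        b -= lr * grad_b / len(X)
        history.append(loss(w, b, X, y, lam))
    return w, b, history

def predict(w, b, x):
    return 1 if sigmoid(score(w, b, x)) >= 0.5 else 0

# Toy one-feature data, centered around zero (made-up values).
X = [(-2.5,), (-1.5,), (-0.5,), (0.5,), (1.5,), (2.5,)]
y = [0, 0, 0, 1, 1, 1]

w, b, history = fit(X, y)
predictions = [predict(w, b, xi) for xi in X]
```

A larger `lam` shrinks the weights more strongly, which trades training fit for smoother decision boundaries.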

Analysis of response curve and radar chart of electronic nose
From the acquired electronic nose data, the 90 s of continuous sampling values of one randomly selected group of samples are plotted as the electronic nose response curve (Fig. 1). G/G0 is the ratio of the sensor response resistance during gas acquisition (G) to the sensor response resistance in purified air (G0). As sampling proceeds, the G/G0 value of each sensor changes, and the sensor responses stabilize at about 60 s. The response values of sensors 1, 2, 3 and 6 vary greatly, while those of sensors 4, 5 and 7 change little or not at all. For each cattle farm, the steady-state response values at 90 s of one group of samples are used to draw the electronic nose response radar chart (Fig. 2). Each radial axis represents a sensor. The response values of sensors 1, 2 and 4 differ markedly among cattle farms. The response curve and the radar chart show that different cattle farms can be separated, which indicates that recognizing cattle farms with the electronic nose is feasible. More precise analysis is needed to further demonstrate the effectiveness of this method.
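The G/G0 normalization and the extraction of the 90 s steady-state value can be sketched as follows. The baseline resistance, the response gains, the exponential curve shape and the 1 Hz sampling rate are all assumptions made for illustration; none of them are taken from the measured data, and the function names are hypothetical.

```python
import math

def response_ratio(resistance, baseline):
    """Return G/G0 for every sample of one sensor."""
    return [g / baseline for g in resistance]

def value_at(curve, at_s, sample_period_s=1.0):
    """Value of the curve at time at_s, assuming the first sample is at one period."""
    index = min(round(at_s / sample_period_s), len(curve)) - 1
    return curve[index]

G0 = 10.0                                          # assumed clean-air resistance
gains = [0.8, 0.9, 0.7, 0.05, 0.02, 0.5, 0.01]     # made-up gains for 7 sensors
times = range(1, 91)                               # 90 s at an assumed 1 Hz

radar = []  # one steady-state G/G0 value per sensor, i.e. one radar chart polygon
for gain in gains:
    # Synthetic response that rises and flattens out well before 90 s.
    resistance = [G0 * (1.0 + gain * (1.0 - math.exp(-t / 12.0))) for t in times]
    ratio = response_ratio(resistance, G0)
    radar.append(value_at(ratio, at_s=90.0))
```

Repeating this for one sample group per farm gives the set of seven-value polygons that a radar chart overlays.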

Data dimension reduction results
PCA was used to reduce the dimensions of the DHI features (6 dimensions), the electronic nose features (7 dimensions), and the fused DHI and electronic nose features (13 dimensions). After dimension reduction, the cumulative variance contribution rates of the first three principal components (PC), which contain sufficient effective information about the samples, were 99.909%, 99.09% and 98.19%, respectively. For the DHI features, the contribution rates of PC1, PC2 and PC3 were 99.9%, 0.008% and 0.001%; for the electronic nose features, 88.38%, 7.58% and 3.13%; and for the fused features, 55.72%, 39.09% and 3.38% (Figures 3-5). In Figure 3, the DHI features are scattered after dimensionality reduction, and farms cannot be distinguished from them. Compared with Figure 3, the electronic nose features in Figure 4 are more tightly clustered, but the cattle farms still cannot be clearly distinguished. In Figure 5, the fused DHI and electronic nose features are also poorly separated. PCA dimension reduction therefore performs poorly and does not support even a preliminary judgment. LDA was therefore applied to the same three feature sets; the cumulative variance contribution rates of the first three linear discriminants (LD) were 99.79%, 93.94% and 95.87%, respectively.
The contribution rates of LD1, LD2 and LD3 were 98.84%, 0.69% and 0.26% for the DHI features; 84.63%, 8.48% and 3.83% for the electronic nose features; and 51.93%, 39.57% and 4.37% for the fused features (Figures 6-8). Although PCA retains more of the variance of the original data, in all three cases after LDA dimensionality reduction the data distributions of different farms differ clearly, especially for the fused DHI and electronic nose features, which allow rapid differentiation. This indicates that the observed samples are sufficiently representative and that the LDA dimensionality reduction method can be applied to the milk sample data.
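The contribution rates reported above correspond to what scikit-learn exposes as `explained_variance_ratio_`. The sketch below computes them for PCA and LDA on synthetic data shaped like the fused feature set (10 farms, 100 samples each, 13 features). The data are randomly generated, so the printed numbers are unrelated to the results of this study, and the choice of scikit-learn is an assumption about tooling.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
n_farms, n_per_farm, n_features = 10, 100, 13   # 13 = 6 DHI + 7 e-nose features

# Synthetic farm means plus per-sample noise (made-up data).
centers = rng.normal(0.0, 2.0, size=(n_farms, n_features))
X = np.vstack([c + rng.normal(0.0, 1.0, size=(n_per_farm, n_features))
               for c in centers])
y = np.repeat(np.arange(n_farms), n_per_farm)

X_std = StandardScaler().fit_transform(X)       # standardize before reduction

# PCA ignores the farm labels; LDA uses them to maximize class separation.
pca = PCA(n_components=3).fit(X_std)
lda = LinearDiscriminantAnalysis(n_components=3).fit(X_std, y)

X_pca = pca.transform(X_std)
X_lda = lda.transform(X_std)
pca_ratio = pca.explained_variance_ratio_       # share of total variance per PC
lda_ratio = lda.explained_variance_ratio_       # share of between-class variance per LD

print("PCA contribution of PC1-PC3:", np.round(pca_ratio, 4))
print("LDA contribution of LD1-LD3:", np.round(lda_ratio, 4))
```

Note that the two ratios measure different things: PCA reports shares of total variance, while LDA reports shares of between-class variance, so a lower cumulative value for LDA does not by itself mean a worse separation.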

Model validation and analysis
The 100 groups of samples from each farm were randomly divided into 80 groups of training samples and 20 groups of test samples, giving a total of 800 training samples and 200 test samples from the 10 cattle farms. Support vector machine (SVM), random forest (RF) and logistic regression (LR) were used to build the cattle farm classification models. The test accuracy is shown in Table 4, where the input is the features after dimension reduction by PCA or LDA. The models built on PCA-reduced features classify worse than those built on LDA-reduced features, because PCA does not consider the categories during dimensionality reduction, whereas LDA is a supervised method in which every sample of the data set has a category label [26]. Compared with PCA, LDA dimension reduction is therefore more suitable for the milk samples, which is consistent with the analysis above.
When the input is the fused DHI and electronic nose features after LDA dimensionality reduction, the classification performance is best, and the accuracy of the support vector machine and random forest models reaches 96%. When the electronic nose features alone are used as input, the SVM-based classification model is best, with an accuracy of 91.5%. When the DHI features alone are used as input, the classification performance is worst. The experimental results show that the electronic nose can classify cattle farms accurately.
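The per-farm 80/20 division described in this section amounts to a stratified random split. A minimal sketch in plain Python follows; the function name `stratified_split`, the random seed and the integer farm labels are illustrative assumptions rather than details from the study.

```python
import random
from collections import Counter, defaultdict

def stratified_split(labels, test_fraction=0.2, seed=0):
    """Split sample indices so every class keeps the same train/test ratio."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)
    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)                       # random order within one farm
        n_test = round(len(indices) * test_fraction)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return train_idx, test_idx

def accuracy(y_true, y_pred):
    """Fraction of test samples whose predicted farm matches the true farm."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

# 10 farms with 100 sample groups each, labeled 0..9 (placeholder labels).
labels = [farm for farm in range(10) for _ in range(100)]
train_idx, test_idx = stratified_split(labels)
test_counts = Counter(labels[i] for i in test_idx)
```

Splitting within each farm, instead of over the pooled 1000 samples, guarantees that every farm contributes exactly 20 test samples, so the reported accuracy is not dominated by any one farm.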

Fitting of electronic nose features and DHI features
In this paper, the odor characteristics from the electronic nose are fitted to the corresponding DHI characteristics, fitting models based on different algorithms are established, and the fitting performance is analyzed. If the content of milk fat and protein in dairy products is too low, it can be inferred that the rumen function of the dairy cows is poor and acidosis is suspected. Therefore, from the six indicators in the DHI data, fat and protein, the indicators related to milk quality and dairy cow health, are selected as the targets of the fitting models. To explore how well the electronic nose features fit the DHI features, five evaluation indexes, mean absolute error (MAE), root mean square error (RMSE), coefficient of determination (R²), mean absolute percentage error (MAPE) and symmetric mean absolute percentage error (SMAPE), were used to evaluate the fitting performance of the different machine learning algorithms.
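The five evaluation indexes can be written out in a few lines of Python. The paper does not state its exact formulas, so the definitions below are the common textbook ones and should be read as assumptions; in particular, SMAPE has several variants, and the one used here divides by the mean of the absolute true and predicted values. The sample values are made up.

```python
import math

def mae(y_true, y_pred):
    """Mean absolute error."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

def rmse(y_true, y_pred):
    """Root mean square error."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))

def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_t = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_t) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot

def mape(y_true, y_pred):
    """Mean absolute percentage error, in percent (requires nonzero targets)."""
    return 100.0 * sum(abs(t - p) / abs(t) for t, p in zip(y_true, y_pred)) / len(y_true)

def smape(y_true, y_pred):
    """Symmetric MAPE, in percent (one common variant)."""
    return 100.0 * sum(abs(t - p) / ((abs(t) + abs(p)) / 2.0)
                       for t, p in zip(y_true, y_pred)) / len(y_true)

# Made-up fat percentages and predictions, for illustration only.
y_true = [3.0, 4.0, 5.0]
y_pred = [2.5, 4.0, 5.5]

metrics = {
    "MAE": mae(y_true, y_pred),
    "RMSE": rmse(y_true, y_pred),
    "R2": r2(y_true, y_pred),
    "MAPE": mape(y_true, y_pred),
    "SMAPE": smape(y_true, y_pred),
}
```

Lower is better for MAE, RMSE, MAPE and SMAPE, while R² is better the closer it is to 1.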
Using the above five evaluation indexes, the fitting performance of three models based on gradient boosting decision tree (GBDT), extreme gradient boosting (XGBoost) and random forest (RF) is evaluated, and the best model is selected. Milk fat rate and protein rate were selected as the outputs of the models, and the electronic nose odor data were used as the input. The evaluation indexes of the fitting models of milk fat rate and protein rate based on the different algorithms are shown in Tables 5 and 6, and the fitting results and error curves are shown in Figures 9 and 10. In Tables 5 and 6, the RF model fits best: its MAE, RMSE, MAPE and SMAPE are smaller than those of the other two models, and its R² is the largest and close to 1. Therefore, the RF model is shown to be effective for fitting the electronic nose data to the DHI data.
The fitting error curves of milk fat rate and protein rate show the fitting performance of the three models intuitively. The RF model has the smallest prediction error and the best fit, followed by the XGBoost and GBDT models. When the data change abruptly, linear regression and support vector machine models cannot make accurate predictions, whereas RF can still do so.
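The comparison workflow in this section can be sketched with scikit-learn as follows. Everything here is an assumption made for illustration: the seven input features and the fat target are synthetic, the hyperparameters are library defaults, and XGBoost is left out to keep the example free of extra dependencies (if the `xgboost` package is installed, an `xgboost.XGBRegressor` could be added to the `models` dictionary in the same way). The scores it produces say nothing about the results of this study.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.metrics import mean_absolute_error, r2_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Seven synthetic sensor features standing in for G/G0 values (made up).
X = rng.uniform(0.5, 2.0, size=(1000, 7))
# Made-up nonlinear target standing in for milk fat rate (%).
fat = 3.0 + 0.8 * X[:, 0] - 0.5 * X[:, 1] * X[:, 2] + rng.normal(0.0, 0.05, 1000)

# 800 training and 200 test samples, as in the split used by the paper.
X_train, X_test, y_train, y_test = train_test_split(
    X, fat, test_size=0.2, random_state=0)

models = {
    "GBDT": GradientBoostingRegressor(random_state=0),
    "RF": RandomForestRegressor(n_estimators=100, random_state=0),
}

scores = {}
for name, model in models.items():
    pred = model.fit(X_train, y_train).predict(X_test)
    scores[name] = {
        "MAE": mean_absolute_error(y_test, pred),
        "R2": r2_score(y_test, pred),
    }

for name, result in scores.items():
    print(name, {k: round(float(v), 4) for k, v in result.items()})
```

The same loop would be run once with fat and once with protein as the target, so that each algorithm is judged on identical train and test samples.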

Conclusion
In this study, an electronic nose based on seven kinds of gas sensors, an Arduino development board and a flow control unit is proposed to distinguish milk from different farms and to fit electronic nose data to DHI data.
For cattle farm classification, LR, SVM and RF machine learning algorithms are used to build the models, with DHI data alone, electronic nose data alone and the combination of the two as model inputs. The accuracy of the models is evaluated on 200 test samples. The results are as follows:
(1) For data dimension reduction, LDA performs better than PCA, and the classification accuracy obtained with LDA is also higher than that obtained with PCA.
(2) When the model input is the combination of DHI and electronic nose data after LDA dimensionality reduction, the classification performance is best, and the accuracy of the SVM and RF models reaches 96%. When the electronic nose data alone after LDA dimension reduction are used as input, the SVM model has the highest classification accuracy, 91.5%. The results show that the SVM model can effectively distinguish farms from electronic nose data.
For the fitting of electronic nose data to DHI data, three algorithms, GBDT, XGBoost and RF, are used to build the fitting models, with electronic nose data as input and milk fat rate and protein rate as outputs. The results are as follows:
(1) The RF model fits best: its RMSE, MAE, MAPE and SMAPE are lower than those of the other two algorithms, and its R² values are the highest, 0.9399 for milk fat and 0.9301 for milk protein. In particular, it can still make accurate predictions when the variables change abruptly.
(2) The RF fitting model can effectively fit the electronic nose data to the DHI data, but the fit for each individual feature still needs to be improved.