Research on Tourism Network Index Model Based on Baidu Index --- A Case Study of Sanya

The tourism index is a "barometer" that reflects the overall development level of tourism. As network events increasingly influence tourism in the Internet era, a tourism index compiled from historical data alone cannot accurately reflect the real situation. This study collects time series data on tourism-related web searches with the Baidu Index tool, tests and standardizes the data using data mining and principal component analysis, and then constructs a tourism network index model using SPSS and a weighted analysis method. Finally, the model is validated against actual tourism data for Sanya. This study is an important supplement to the existing tourism index.


Introduction
The tourism network index is a collection of the emotions, opinions, attitudes and viewpoints that Internet users express about tourism emergencies and hot topics. CNNIC research shows that more than 80% of tourism users obtain tourism information through the Internet. Internet big data is more indicative of trends, more immediate and more predictive than historical data, which lags behind events. This study collected more than three years of time series data for Sanya tourism keywords with the Baidu Index tool, tested and standardized the data using data mining and principal component analysis, constructed the tourism network index model using SPSS and a weighted analysis method, and finally validated the model against actual tourism data for Sanya. This study is an important supplement to the existing tourism index.

Keyword selection
Different keywords have different search frequencies, so the keyword set must be rich and comprehensive.
(1) The basic keywords of Sanya tourism were selected through online and offline questionnaires and from the <meta> tags in the HTML of top-ranked travel websites in recent years: Tianya Haijiao, Weizhizhou Island, Yalong Bay, Nanshan Temple, Luhuitou, Haitang Bay, Coconut Dream Corridor, Penang Valley, Dongtian, Dadong Sea, West Island.
(2) Using the "Station Master's House" tool, 12 basic keywords were extracted and 124 extended keywords were obtained.
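Collecting candidate keywords from the <meta> tags of travel websites can be sketched as follows. This is only an illustration using Python's standard html.parser module; the class name MetaKeywordParser, the helper extract_meta_keywords and the sample page fragment are hypothetical and do not come from the study.

```python
from html.parser import HTMLParser


class MetaKeywordParser(HTMLParser):
    """Collect the comma-separated content of <meta name="keywords"> tags."""

    def __init__(self):
        super().__init__()
        self.keywords = []

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = dict(attrs)
        if (attr.get("name") or "").lower() != "keywords":
            return
        content = attr.get("content") or ""
        # Accept both ASCII and full-width commas, which are common on Chinese sites.
        for part in content.replace("，", ",").split(","):
            part = part.strip()
            if part and part not in self.keywords:
                self.keywords.append(part)


def extract_meta_keywords(html):
    """Return the de-duplicated keywords found in one HTML document."""
    parser = MetaKeywordParser()
    parser.feed(html)
    return parser.keywords


# Hypothetical page fragment, not taken from any real site.
sample_html = (
    '<html><head><meta name="keywords" '
    'content="Sanya, Yalong Bay，Tianya Haijiao, Sanya"></head></html>'
)
print(extract_meta_keywords(sample_html))
```

Keywords gathered this way from several sites would still need to be merged with the questionnaire results by hand.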

Data detection
The time series data for 25 keywords from 2014 to 2016 were entered into SPSS and subjected to the KMO and Bartlett tests. The result was KMO = 0.761, which lies between 0.7 and 0.8, so the data are suitable for principal component analysis and the weights can be calculated by that method.
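For readers without SPSS, the two suitability checks can be approximated with NumPy. The sketch below computes the overall KMO measure from the correlation and partial correlation matrices, and the chi-square statistic of Bartlett's test of sphericity. The function name kmo_and_bartlett and the synthetic data are assumptions made for illustration; the study's own keyword data are not reproduced here.

```python
import numpy as np


def kmo_and_bartlett(data):
    """Return (KMO, Bartlett chi-square, degrees of freedom) for an n x p array."""
    n, p = data.shape
    corr = np.corrcoef(data, rowvar=False)
    inv = np.linalg.inv(corr)

    # Partial correlations follow from the inverse correlation matrix.
    scale = np.sqrt(np.outer(np.diag(inv), np.diag(inv)))
    partial = -inv / scale

    # KMO compares squared correlations with squared partial correlations,
    # using off-diagonal entries only.
    off_diag = ~np.eye(p, dtype=bool)
    r2 = np.sum(corr[off_diag] ** 2)
    q2 = np.sum(partial[off_diag] ** 2)
    kmo = r2 / (r2 + q2)

    # Bartlett's test of sphericity: H0 says the correlation matrix is identity.
    chi2 = -(n - 1 - (2 * p + 5) / 6) * np.log(np.linalg.det(corr))
    dof = p * (p - 1) // 2
    return kmo, chi2, dof


# Synthetic stand-in: 36 monthly observations of 5 series sharing one factor.
rng = np.random.default_rng(0)
common = rng.normal(size=(36, 1))
data = common + 0.5 * rng.normal(size=(36, 5))
kmo, chi2, dof = kmo_and_bartlett(data)
print(f"KMO = {kmo:.3f}, chi-square = {chi2:.1f}, dof = {dof}")
```

The p-value of Bartlett's test is obtained by comparing the chi-square statistic with a chi-square distribution on the returned degrees of freedom.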

Principal component analysis
Principal component analysis showed that the eigenvalues of four principal components, "Sanya Tourism Strategy", "Weizhizhou Island", "Tianya Haijiao" and "Yalong Bay", were greater than 1.
The cumulative variance contribution rate of the first four principal components was 84.522%, which exceeds 80%. Therefore, the first four principal components can basically reflect the information of all the indices, and can replace the original 25 indices ("Sanya Tourism Strategy", "Weizhizhou Island", "Tianya Haijiao", "Yalong Bay", "Sanya Tourism", "Yalong Bay Seabed World"...).
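The selection rule used here (keep components with eigenvalue greater than 1, then check the cumulative variance contribution rate) can be sketched with NumPy as follows. The function pca_summary and the synthetic six-variable data set are assumptions for illustration only; the figures it prints are unrelated to the 84.522% reported above.

```python
import numpy as np


def pca_summary(data):
    """Eigen-decompose the correlation matrix of an n x p data array."""
    corr = np.corrcoef(data, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(corr)

    # eigh returns ascending eigenvalues; reorder so the largest comes first.
    order = np.argsort(eigvals)[::-1]
    eigvals = np.clip(eigvals[order], 0.0, None)
    eigvecs = eigvecs[:, order]

    contribution = eigvals / eigvals.sum()   # variance contribution rate
    cumulative = np.cumsum(contribution)     # cumulative contribution rate
    n_keep = int(np.sum(eigvals > 1.0))      # components with eigenvalue > 1
    loadings = eigvecs * np.sqrt(eigvals)    # factor loadings, one column per component
    return eigvals, contribution, cumulative, n_keep, loadings


# Synthetic stand-in: 36 observations of 6 series driven by two latent factors.
rng = np.random.default_rng(42)
f1 = rng.normal(size=(36, 1))
f2 = rng.normal(size=(36, 1))
noise = 0.3 * rng.normal(size=(36, 6))
data = np.hstack([f1, f1, f1, f2, f2, f2]) + noise

eigvals, contribution, cumulative, n_keep, loadings = pca_summary(data)
print("components kept:", n_keep)
print("cumulative contribution:", np.round(cumulative, 3))
```

The loadings returned here are the quantities from which the index weights are later derived.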

Correlation coefficient
The factor loading a_ij represents the loading of the i-th variable on the j-th common factor and reflects the relative importance of the i-th variable on the j-th common factor.

Determine weight
Principal component analysis is used to determine the weights: the weight of each index equals the average of its coefficients in the linear combinations of the principal components, weighted by the variance contribution rates of those components.
(1) Coefficients of each index in the linear combinations of the principal components.
Each of the four principal components is expressed as a linear combination of the indices.
(2) Coefficients of the comprehensive model, obtained from the variance contribution rates of the principal components.
The "% of Variance" column under "Initial Eigenvalues" gives the variance contribution rate of each principal component; the greater the variance contribution rate, the more important the principal component.
Variance contribution rate of the four principal components
The comprehensive coefficient of an index is computed from its coefficients in the linear combinations of the four principal components. The comprehensive coefficient of "Sanya Tourism Strategy" is calculated first, and the coefficients of all other indicators are calculated in the same way; together they form the comprehensive score model.
(3) Normalization of data.
Because the weights of all indices must sum to 1, the index weights are obtained by normalizing the index coefficients in the comprehensive model.
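The three steps can be illustrated with a small pure-Python sketch. It assumes the common convention that the coefficient of an index in a principal component equals its factor loading divided by the square root of that component's eigenvalue; the text does not state this explicitly, so treat it as an assumption. The function name index_weights and all numbers below are invented for illustration and are not the study's values.

```python
import math


def index_weights(loadings, eigenvalues, contribution):
    """Turn factor loadings into normalized index weights.

    loadings[i][j] is the loading of index i on principal component j,
    eigenvalues[j] and contribution[j] describe component j.
    """
    total_contribution = sum(contribution)
    composite = []
    for row in loadings:
        # Step (1): coefficient of the index in each linear combination
        # (assumed convention: loading / sqrt(eigenvalue)).
        coeffs = [a / math.sqrt(lam) for a, lam in zip(row, eigenvalues)]
        # Step (2): average of the coefficients, weighted by the
        # variance contribution rate of each component.
        weighted = sum(c * v for c, v in zip(coeffs, contribution))
        composite.append(weighted / total_contribution)
    # Step (3): normalize so that the weights sum to 1.
    total = sum(composite)
    return [c / total for c in composite]


# Hypothetical example: 3 indices, 2 retained components.
loadings = [
    [0.90, 0.10],
    [0.80, 0.30],
    [0.40, 0.85],
]
eigenvalues = [1.8, 1.0]
contribution = [0.60, 0.30]

weights = index_weights(loadings, eigenvalues, contribution)
print([round(w, 3) for w in weights])
```

The composite index for a given month would then be the weighted sum of the standardized keyword search volumes, using these weights.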

Actual tourism data of Sanya
According to the statistics provided by Hainan, the monthly number of visitors to Sanya from 2014 to 2016 is shown in Table 2. Correlation analysis between the composite index and the actual number of visitors gave a correlation coefficient of 0.820 with significance P = 0.000 < 0.01, which is statistically significant and shows that the two are correlated. This indicates that the actual tourist volume of Sanya can be reflected by Baidu Index data.
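The correlation analysis can be reproduced outside SPSS with a few lines of Python. The sketch below computes the Pearson correlation coefficient and the associated t statistic for testing whether the correlation is zero. The function names and the toy series are assumptions for illustration; the only figures taken from the text are r = 0.820 and the 36 monthly observations implied by the 2014 to 2016 period.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient of two equally long series."""
    n = len(x)
    if n != len(y) or n < 3:
        raise ValueError("need two series of equal length with at least 3 points")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def t_statistic(r, n):
    """t statistic for H0: correlation = 0, with n - 2 degrees of freedom."""
    return r * math.sqrt((n - 2) / (1 - r * r))


# Toy series, invented for illustration only.
index_toy = [3.1, 2.8, 3.5, 4.0, 4.4, 3.9]
visitors_toy = [120, 115, 130, 150, 160, 140]
r_toy = pearson_r(index_toy, visitors_toy)
print(f"toy r = {r_toy:.3f}")

# Reported coefficient with 36 monthly observations.
t_reported = t_statistic(0.820, 36)
print(f"t = {t_reported:.2f} on {36 - 2} degrees of freedom")
```

For the reported coefficient the t statistic is roughly 8.35 on 34 degrees of freedom, which is consistent with the stated significance at the 0.01 level.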

T test
To further explore the characteristics of the Baidu Index and Sanya tourist volume, this paper tests three groups of data from 2014 to 2016. The results show that from 2014 to 2016 the mean and standard deviation of both the Sanya composite index and the actual number of visitors increased. The contrast between Sanya's off-season and peak season is growing. The Sig. value is below the 0.05 significance level and becomes smaller over time. The correlation coefficient also increased, so the correlation strengthened year by year.

Predictive analysis
Excel was used to process the actual visitor numbers and the composite index, and the resulting curves were transformed and shifted. After the composite index is shifted forward by one unit along the coordinate axis, the graph shows that the two curves, SJRS (actual number of visitors) and ZHZS (composite index), agree more closely, and their peaks and turning points are better aligned. Shifting ZHZS forward by one unit corresponds to a lead of one month, so ZHZS anticipates the number of visitors by about one month.
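The visual comparison of shifted curves has a simple numerical counterpart: correlate the composite index in month t with the visitor count in month t + k for several lags k and see which lag fits best. The sketch below is an illustration under that assumption, not the procedure used in the paper; the function names and the toy data (constructed so that visitors follow the index by exactly one month) are hypothetical.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient of two equally long series."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def lagged_correlation(index, visitors, lag):
    """Correlate index[t] with visitors[t + lag] for a non-negative lag."""
    if len(index) != len(visitors):
        raise ValueError("series must have equal length")
    if lag < 0 or lag > len(index) - 3:
        raise ValueError("lag out of range")
    return pearson_r(index[: len(index) - lag], visitors[lag:])


def best_lag(index, visitors, max_lag):
    """Lag in 0..max_lag (in months) with the highest correlation."""
    return max(range(max_lag + 1),
               key=lambda k: lagged_correlation(index, visitors, k))


# Toy data: visitors in month t equal 100 times the index in month t - 1.
zhzs = [10, 12, 15, 11, 9, 14, 18, 13, 10, 16, 20, 12]
sjrs = [1100] + [100 * v for v in zhzs[:-1]]

for k in range(4):
    print(k, round(lagged_correlation(zhzs, sjrs, k), 3))
print("best lag:", best_lag(zhzs, sjrs, 3))
```

On real data the best lag would rarely be this clear-cut, so the correlations at neighbouring lags should be inspected as well.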

Conclusions
(1) The combined search volume of keywords for Sanya scenic spots is correlated with the monthly tourist flow in Sanya, and as time passes, society progresses and information technology develops, this correlation grows stronger.
(2) Of all the relevant keywords, the search volume of "Sanya Tourism Strategy" is far ahead, which suggests that individual tourists outnumber group tourists and that the proportion of individual tourists is increasing year by year.
(3) The dispersion of the actual number of tourists in Sanya is increasing year by year, and the contrast between the off-season and the peak season is growing, which is very unfavorable to resource utilization. In the peak season the scenic spots are under enormous pressure, which poses great challenges for Sanya's traffic and other services, while business in the off-season is bleak.
(4) Analysis of the composite index and the actual number of visitors in Sanya shows that Baidu Index data can predict the actual number of visitors to the scenic spots with a certain lead time. According to the statistical results of this paper, the lead time is about one month.