An Evolving Hypernetwork Model to Quantify the Progress Potential of Emerging Research Topics

There is considerable and growing interest in the emergence of research topics. However, current methods for detecting emergence remain problematic, mainly because of information loss and the aging effect. In this study, we identify three intrinsic mechanisms: preferential attachment, exponential growth, and heterogeneous fitness values that decay with time. Under these assumptions, all topics tend to follow a universal temporal pattern according to our model, which is sufficient to quantify their progress potential.


Introduction
There is considerable and growing interest in the emergence of research topics, from both the policymaking and the academic perspective. Numerous studies and projects have proposed definitions and models to detect, track, and predict emerging topics; examples include the Comprehensive Strategy on Science, Technology and Innovation of Japan in 2017 (STI 2017) and the Annual Report on the Top 10 Emerging Technologies of the World Economic Forum [1]. Evaluating innovation potential is essential, not only for promoting the efficiency of scientific research but also for detecting emerging research topics of strategic significance [2]. The intrinsic improvability of a new research area allows decision makers to track its evolutionary trajectory, or progress potential. There are two main reasons for this. First, studies always attempt to deal with a practical or scientific problem. Second, current work necessarily builds on existing work, and tries to advance the level of science and technology through cooperation.
We contrast this problem, quantifying the innovation potential of an emerging research topic from its intrinsic improvability, with prior work that generally focuses on exogenous, market-driven factors or expert judgment, for which bibliometrics and the Delphi method have played a key role. Many researchers use various relations among publications to establish network-based bibliometric models, such as bibliographic coupling, co-citation analysis, and others [3]. Different citation relations can all be used to aggregate publications. In the case of text-based approaches [4], some studies have established clusters based on the co-occurrence of terms [5], using a large-scale database [6]. However, the effectiveness of any identification approach is extremely difficult to verify, and such approaches tend to focus on measuring novelty and fast growth while ignoring other potentially important attributes, such as impact [7]. After the small-world model proposed by Watts and Strogatz, Barabási and Albert (BA) proposed the scale-free network [8], and the academic community set off an upsurge of research on complex networks. Despite its many applications, this framework still struggles to depict some real-life systems.
In summary, traditional homogeneous complex networks have the following limitations. There is no consensus on the concept of emergence, especially regarding the impact of cooperation. Moreover, bibliometric indicators such as the number of papers or citations are poor predictors of a topic's future [9] (Fig. 1). The number of scientific publications on each research point depends on the item's age [10]; as a result, the older a topic is, the more likely it is to be favored, which contradicts the intuition that people prefer emerging things [11]. In our experiments, we regarded each Medical Subject Headings (MeSH) term as a research topic. The PubMed databases of the NCBI (National Center for Biotechnology Information) provide free data in a well-formatted structure (publicly available at http://www.ncbi.nlm.nih.gov/pubmed/advanced) [12]. Such indicators also lack predictability: a subset of research points that each collected nearly 500 papers in 1966 exhibit widely different evolutionary potential (Fig. 1, inset). Avoiding information loss and the aging effect naturally leads to hypernetworks. The concept of a hypernetwork is a natural multidimensional generalization of a network and represents n-dimensional relations. The evolving hypernetworks in the existing literature are almost all uniform, except for Guo and Zhu [13], who considered non-uniformity and weighted hyperedges. In the non-uniform model, at each time step, both the number of new nodes and the number of randomly selected existing nodes in a hyperedge are random variables.

The evolving hypernetwork model

Fundamental mechanisms
We start by teasing out the intrinsic mechanisms that drive the evolution of a new research topic. Intuitively, the life curves of different MeSH terms vary widely (Fig. 1). Some topics are highly valued as soon as they are first proposed. Others lie dormant for a long time before they catch people's attention, so their curve rises rapidly only after a delay. Still others develop steadily, so their curve flattens out. In addition, in terms of total quantity, some hot topics appear in dozens or even hundreds of times as many publications as an average topic.
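The quantity these life curves track is what the hypernetwork formalism calls the hyperdegree: each paper is one hyperedge over the MeSH terms it mentions, and a term's paper count is the number of hyperedges containing it. A minimal sketch of this representation (class and field names are our own illustration, not from the paper):

```python
from collections import defaultdict

class Hypernetwork:
    """Nodes are research topics (e.g. MeSH terms); each paper is one
    hyperedge connecting all the topics it mentions."""

    def __init__(self):
        self.hyperedges = []                 # each edge is a frozenset of topics
        self.hyperdegree = defaultdict(int)  # topic -> number of papers using it

    def add_paper(self, topics):
        edge = frozenset(topics)
        self.hyperedges.append(edge)
        for topic in edge:
            self.hyperdegree[topic] += 1

hn = Hypernetwork()
hn.add_paper({"Epidermal Growth Factor", "Receptors, Cell Surface"})
hn.add_paper({"Epidermal Growth Factor", "Mice"})
print(hn.hyperdegree["Epidermal Growth Factor"])  # 2
```

Because a hyperedge keeps all of a paper's topics together instead of projecting them onto pairwise links, no co-occurrence information is lost.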
Preferential attachment captures the fat-tailed nature of the publication distribution, in which a few hotspots are much more likely to be used, finally resulting in 'the rich get richer' [14]. Despite its success in matching empirical data to theoretical models, the first-mover advantage of preferential attachment introduces a strong time bias into the system.
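A sketch of how preferential attachment produces the rich-get-richer effect: existing topics are sampled with probability proportional to their current hyperdegree (the topic names and weights below are invented for illustration):

```python
import random

def preferential_pick(hyperdegree, rng=random):
    """Pick an existing topic with probability proportional to its hyperdegree."""
    topics = list(hyperdegree)
    weights = [hyperdegree[t] for t in topics]
    return rng.choices(topics, weights=weights, k=1)[0]

# topics with higher hyperdegree are chosen far more often
degrees = {"A": 90, "B": 9, "C": 1}
counts = {t: 0 for t in degrees}
rng = random.Random(0)
for _ in range(10_000):
    counts[preferential_pick(degrees, rng)] += 1
print(counts)  # "A" dominates
```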
Second, unlike in traditional uniform networks, the total number of publications is not the same each year but grows exponentially year by year [15]:

N_t = N_0 e^{βt},

where N_t is the number of documents at time t, N_0 is the initial count, and β is the speed of literature growth. It follows that the proportion of the literature devoted to any fixed topic must shrink rapidly. Furthermore, the evolving network in the existing publications is non-uniform: the number of research points per paper is not a constant but is drawn from a given distribution function, such as a Poisson distribution. For example, each publication in 'CELL' carries approximately 20 MeSH terms, and the number of new nodes entering the network and the number of previously existing nodes selected into a hyperedge need not be the same. Old research points are eventually replaced by new ones. Therefore, in the long run, a new research subject slowly dies out and the total amount of relevant literature stabilizes, so that the annual trend resembles the shape of a normal distribution. But it is difficult to determine where this distribution peaks, because the innovative characteristics of new research points differ: is it a standard normal distribution, skewed left, or skewed right?
In the beginning, research pioneers creatively propose a new research idea, and related researchers derive the early theories. Once the theory has taken shape and received the attention of experts in the field, which usually takes a long time, many more researchers build on this basis and start a large number of extension and application studies, until the hotspot is replaced by another. For instance, a new MeSH term, 'latent autoimmune diabetes of adults', was introduced in 2015.

Theoretical analysis
Now we proceed to an evolving hypernetwork model based on the empirical observations reported above. The hyperdegree h_i^t of node i is the number of hyperedges connected to it, i.e., the number of related papers in the real world; initially, h_i^{t_0} = 1. At each time step t, new papers are introduced, and the number of research points k in each paper is determined by a probability p(k). The expected hyperdegree then evolves as

dh_i^t / dt ∝ η_i h_i^t P_i(t) / Σ_j η_j h_j^t P_j(t),

where η_i is the fitness of node i and P_i(t) is an aging function in which μ (written u) indicates the time for a paper to reach its citation peak, σ is the longevity, capturing the decay rate, and ω is the degree of skewness. The equation expresses that the evolution of each node's hyperdegree depends not only on the node itself but also on the current hyperdegrees and relevance of all other nodes. Two simplifications can be introduced to solve this equation. First, the number of MeSH terms each new paper contains follows an extremely narrow Poisson distribution.
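A numerical sketch of a rate equation of this shape, using an assumed log-skew-normal aging kernel; the functional form and every parameter value below are our illustrative assumptions, not the paper's fitted model. Nodes with higher fitness η and favorable aging parameters absorb a larger share of each new batch of papers:

```python
import math

def phi(x):  # standard normal pdf
    return math.exp(-x * x / 2) / math.sqrt(2 * math.pi)

def Phi(x):  # standard normal cdf
    return 0.5 * (1 + math.erf(x / math.sqrt(2)))

def aging(t, mu, sigma, omega):
    """Log-skew-normal aging kernel: peaks near exp(mu), decays at a rate
    set by sigma, with skewness omega (assumed functional form)."""
    z = (math.log(t) - mu) / sigma
    return (2 / (sigma * t)) * phi(z) * Phi(omega * z)

# three topics with different fitness eta and aging parameters (made up)
topics = [
    {"h": 1.0, "eta": 1.5, "mu": 1.0, "sigma": 0.8, "omega": 2.0},
    {"h": 1.0, "eta": 1.0, "mu": 2.0, "sigma": 0.8, "omega": 0.5},
    {"h": 1.0, "eta": 0.5, "mu": 1.5, "sigma": 1.2, "omega": 1.0},
]

dt, n_new = 0.1, 10.0  # time step; papers added per unit time (held constant here)
t = dt
while t < 20:
    weights = [tp["eta"] * tp["h"] * aging(t, tp["mu"], tp["sigma"], tp["omega"])
               for tp in topics]
    total = sum(weights)
    if total > 0:
        for tp, w in zip(topics, weights):
            tp["h"] += n_new * dt * w / total  # normalized competition for new papers
    t += dt

print([round(tp["h"], 1) for tp in topics])
```

Because the attachment term is normalized across all nodes, the topics compete for a fixed inflow of papers, which is what couples each node's growth to the hyperdegrees of all the others.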

Results
For this study, we selected scientific data in the medical field from 1946 to 2017; the PubMed databases of the National Center for Biotechnology Information (NCBI) provide free publication data. To illustrate the universality of the model, we chose four datasets with completely different characteristics (Fig. 4). The red line, corresponding to the keyword 'Pyrimidine Nucleotides', grew rapidly in the early period and declined significantly later. The blue line, corresponding to 'Hexosyltransferases', rose steadily, only to decline sharply at the end. The purple line, corresponding to 'Leukemia Virus, Murine', has a relatively stable life curve, maintaining a small increase early on and then gradually decaying. The green line, corresponding to 'Epidermal Growth Factor', differs from the other three: a long incubation period was followed by a rapid increase and then a slow decline. Together they summarize the developmental characteristics of the various data, and the large volume of these four datasets further strengthens their representativeness. The Fig. 4 inset shows the cumulative counts of the four life curves. The fitting results of our model are shown in Fig. 5: the solid lines correspond to the model fits and the dashed lines to the life curves of the four real datasets. The agreement between the solid and dashed lines is high, especially for the blue line (corresponding to 'Epidermal Growth Factor') and the green line (corresponding to 'Leukemia Virus, Murine'), which are fitted almost perfectly, followed by the red line (corresponding to 'Hexosyltransferases').
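How such a life-curve fit can be obtained is sketched below on synthetic data, using a Wang-style cumulative form c(t) = m(e^{Φ((ln t − μ)/σ)} − 1) as a stand-in; the functional form, parameter grid, and values are illustrative assumptions, not the fits behind Fig. 5:

```python
import math
from itertools import product

def Phi(x):  # standard normal cdf
    return 0.5 * (1 + math.erf(x / math.sqrt(2)))

def life_curve(t, m, mu, sigma):
    """Cumulative paper count, Wang-style form (illustrative stand-in)."""
    return m * (math.exp(Phi((math.log(t) - mu) / sigma)) - 1)

# synthetic 'observed' cumulative counts generated from known parameters
true = (30.0, 2.0, 0.7)
years = range(1, 41)
observed = [life_curve(t, *true) for t in years]

# brute-force grid search for the best-fitting parameters
best, best_err = None, float("inf")
for m, mu, sigma in product([20, 30, 40], [1.5, 2.0, 2.5], [0.5, 0.7, 0.9]):
    err = sum((life_curve(t, m, mu, sigma) - o) ** 2
              for t, o in zip(years, observed))
    if err < best_err:
        best, best_err = (m, mu, sigma), err

print(best)  # recovers (30, 2.0, 0.7) on this noise-free data
```

On real, noisy counts a proper nonlinear least-squares routine would replace the grid search, but the objective being minimized is the same.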
The  and  value of 'Hexosyltransferases' are highest for its continuously innovative ability, It has been rising from 1950 and taking a long time. 'Epidermal Growth Factor' has the smallest  value, and its influence is mainly concentrated in the early stage since its appearance.

Comparison and analysis
The observed accuracy prompts us to compare our model with others. We therefore identified several models that others have used in the past to fit such histories: the Model (Wang, 2013), the Logistic model, and the Gompertz model, as shown in Table 2. The correlation coefficient measures how strongly two series vary together beyond what chance alone would produce. In Fig. 6, it is computed between each model's fit and the actual trend of the data, reflecting how closely the model matches the actual situation: the stronger the correlation, the more accurate the model. From Fig. 6 we can clearly see that the correlations of our model, the Model (Wang), and the Gompertz model are relatively close, but our model is still better than the other two, while the Logistic model performs unsatisfactorily. That is, our model clearly outperforms the other three; the second-best model is Wang's, and the effects of the remaining two are slightly worse.
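The comparison metric is the Pearson correlation between each model's fitted series and the observed counts; a self-contained sketch (the series below are made-up numbers for illustration):

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation between a fitted series and an observed series."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

observed = [5, 12, 30, 55, 70, 64, 41, 22]   # a hump-shaped life curve
fit_a = [6, 13, 28, 52, 71, 62, 43, 20]      # close fit -> r near 1
fit_b = [20, 25, 30, 35, 40, 45, 50, 55]     # monotone fit misses the peak

print(round(pearson_r(observed, fit_a), 3))
print(round(pearson_r(observed, fit_b), 3))
```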

Discussion
With the same number of parameters, the fitting performance of our model is generally superior to the others. This benefits from our understanding of the evolutionary mechanism of research topics and from preserving as much information as possible. The innovation α, duration σ, and interval u of a topic affect its ultimate development. Therefore, the best choice for new researchers is a research point with both a large α and a large σ, which means a large research scale and continuous innovation.
It is worth emphasizing that although ω does not affect the total number of final documents, it is still of great significance: ω measures the time efficiency with which a topic is recognized. The bigger ω is, the shorter this time; when ω is greater than 1, a number of researchers can be thought to have followed suit as soon as the topic was proposed. The smaller ω is, the more gradual the initial increase. For α = 0.2 and μ = 0, the effect of ω is shown in Fig. 8. A comprehensive analysis of literature volumes in the spirit of Moore's law shows that the influence of time is neither an exponential decline nor a left-skewed normal distribution. Citation-based measurement, long used to gauge impact, including analyses of impact factors based on short-term citations, lacks predictability. In this paper, we propose a citation model with long-term predictability. Because little was known about the mechanism of the time evolution of individual papers in past research, we derive a mechanistic model for the citation dynamics of individual papers, allowing us to collapse the citation histories of papers from different journals and disciplines onto a single curve, indicating that all papers tend to follow the same universal temporal model.
The observed patterns help us to uncover the basic mechanisms of scientific impact. We can use this universal model to quantify the progress potential of emerging research topics and to predict the emergence of new technologies within a discipline. Combined with expert experience, it can help select the emerging technologies that will have the greatest future impact; with vigorous government policy support and social resources, their emergence can be brought forward. This would doubtless give a tremendous boost to the development of the discipline, and even to the entire history of science and technology.
The model proposed in this paper fits data with different characteristics well. In particular, it solves the 'Sleeping Beauties' problem mentioned by Wang, which arises because the Model (Wang) limits attention to the early stages of publication through the log function. The fundamental reason logistic regression fails to fit well is its characteristic of 'fast growth and fast disappearance'. Although our model is highly applicable, it is not completely universal; for example, neural networks experienced several rounds of decline and revival. We also observed that early prediction is not good enough when a topic has not been followed for long, so it is better to analyze data that has accumulated for some time. This paper covers only the validation and analysis of the model, using a single type of data. As a next step, the model will be extended to other data types, and we will focus on the predictability of the model to explore future research.