MATEC Web of Conferences
Volume 22, 2015
International Conference on Engineering Technology and Application (ICETA 2015)
Number of page(s): 5
Section: Information and Communication Technology
Published online: 09 July 2015
Microblog Hot Spot Mining Based on PAM Probabilistic Topic Model
Chongqing University, Chongqing, China
Microblogs are short texts that carry limited information, which makes topic mining more difficult. This paper proposes using the PAM (Pachinko Allocation Model) probabilistic topic model to extract a generative model of the implicit themes of text for microblog hot spot mining. First, three categories of microblog and the main contribution of this paper are presented. Second, four topic models are explained in turn, and the PAM model is introduced in detail in terms of how it generates a document, the accuracy of document classification, and topic correlation in PAM. Finally, MapReduce is described. The number of microblogs and the number of contactors are both huge, whereas the total number of words is relatively small. With MapReduce, the microblog data are split by contactor, so the document-topic count matrix and the contactor-topic count matrix can be stored locally, while the word-topic count matrix must be stored globally. In this way, hot spot mining can be achieved on the basis of the PAM probabilistic topic model.
Key words: microblog / hot spot / PAM probabilistic topic model / MapReduce
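The storage split described in the abstract can be illustrated with a minimal single-process sketch. It is not the authors' implementation and does not use Hadoop. The function names (`split_by_contactor`, `map_partition`, `reduce_word_topic`) and the tuple layout of a post are hypothetical. The topic assignment is a uniform random placeholder, not PAM sampling. The sketch only shows which count matrices stay inside a contactor shard and which must be merged across shards.

```python
import random
from collections import Counter, defaultdict


def split_by_contactor(posts):
    """Group posts (contactor, doc_id, words) into one shard per contactor."""
    shards = defaultdict(list)
    for post in posts:
        shards[post[0]].append(post)
    return shards


def map_partition(shard, num_topics, seed=0):
    """Process one contactor shard.

    Returns the two local count matrices and this shard's contribution
    to the word-topic counts, which has to be merged globally.
    """
    rng = random.Random(seed)
    doc_topic = defaultdict(Counter)        # local: doc_id -> topic counts
    contactor_topic = defaultdict(Counter)  # local: contactor -> topic counts
    word_topic_delta = Counter()            # global after merge: (word, topic)
    for contactor, doc_id, words in shard:
        for word in words:
            # Placeholder: a real sampler would draw the topic from the
            # PAM conditional distribution instead of uniformly.
            topic = rng.randrange(num_topics)
            doc_topic[doc_id][topic] += 1
            contactor_topic[contactor][topic] += 1
            word_topic_delta[(word, topic)] += 1
    return doc_topic, contactor_topic, word_topic_delta


def reduce_word_topic(deltas):
    """Sum the per-shard word-topic contributions into the global matrix."""
    total = Counter()
    for delta in deltas:
        total.update(delta)
    return total


if __name__ == "__main__":
    posts = [
        ("alice", "d1", ["rain", "flood"]),
        ("bob", "d2", ["rain"]),
        ("alice", "d3", ["match", "match", "rain"]),
    ]
    shards = split_by_contactor(posts)
    results = [map_partition(s, 3, seed=i) for i, s in enumerate(shards.values())]
    global_word_topic = reduce_word_topic(r[2] for r in results)
    print(sum(global_word_topic.values()), "word tokens counted globally")
```

Because a word such as "rain" can appear in every shard, its topic counts cannot be kept in any single shard, which is why only the word-topic matrix needs the global merge step. The relatively small vocabulary keeps that shared matrix manageable.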
© Owned by the authors, published by EDP Sciences, 2015
This is an Open Access article distributed under the terms of the Creative Commons Attribution License 4.0, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.