Website Clickstream Data Visualization Using Improved Markov Chain Modelling In Apache Flume

Clickstream data analysis is the process of collecting, analysing and reporting aggregate data about the web pages a visitor clicks. Visualizing clickstream data has gained significant importance in many applications such as web marketing, customer prediction and product management. Most existing works employ different visualization tools along with techniques such as Markov chain modelling. However, the accuracy of these methods can be improved once their shortcomings are resolved. Markov chain modelling suffers from occlusion and is unable to provide a clear display of the visualized data. These issues can be resolved by improving the Markov chain model with a heuristic method based on the Kolmogorov-Smirnov distance and the maximum likelihood estimator. These concepts are employed between the underlying distribution states to minimize the Markov distribution. The proposed model, named WebClickviz, is implemented in Hadoop Apache Flume. The clickstream data visualization accuracy can be improved when Apache Flume tools are used. The performance evaluation is carried out on clickstream data from a specific website and shows that the proposed visualization model performs better than existing models such as VizClick.
Keywords: Clickstream data, VizClick, WebClickviz, Apache Flume, Markov chain, Kolmogorov-Smirnov distance.


Introduction
Clickstream analytics tools [1] that mine websites, social media and online transactions are helping companies maximize customer interactions. A clickstream is a series of page requests; every page requested generates a signal [2]. These signals can be graphically represented for clickstream reporting. The principal purpose of clickstream tracking is to give webmasters insight into what visitors on their site are doing. There are two levels of clickstream analysis: traffic analytics and e-commerce analytics. Traffic analytics [3] operates at the server level and tracks how many pages are served to the user, how long each page takes to load [4], how often the user hits the browser's back or stop button and how much data is transmitted before the user moves on [5]. E-commerce-based analysis [6] uses clickstream data to determine the effectiveness of the site as a channel to market. It is concerned with what pages the shopper lingers on, what the shopper puts in or takes out of a shopping cart, what items the shopper purchases, whether or not the shopper belongs to a loyalty program and uses a coupon code, and the shopper's preferred method of payment [7].
Because an extremely large volume of data can be gathered through clickstream analysis, many e-businesses rely on big data analytics and related tools [8], for example Hadoop [9], to help interpret the data and generate reports for specific areas of interest. Clickstream analysis is considered most effective when used in conjunction with other, more traditional, market evaluation resources. Initially, clickstream or click path data had to be gleaned from server log files. Because human and machine traffic were not differentiated, the analysis of human clicks required considerable effort. Subsequently, JavaScript technologies [10] were developed which use a tracking cookie to generate a series of signals from browsers.
Analysing the information of customers who visit an organization's website can be imperative in order to remain competitive [11]. This analysis can generate two kinds of findings for the organization, the first being an analysis of a user's clickstream while using a website to reveal usage patterns, which in turn gives a heightened understanding of customer behaviour [12]. This use of the analysis creates a user profile that helps in understanding the types of people that visit an organization's website [13]. Clickstream analysis can be used to predict whether a customer is likely to purchase from an e-commerce website. It can also be used to improve customer satisfaction with the website and with the organization itself [14]. This can generate a business advantage and can be used to assess the effectiveness of advertising on a web page or site. Clickstreams can likewise be used to let users see where they have been and easily return to a page they have already visited, a capability that is already incorporated in many browsers.
Unauthorized clickstream data collection is considered to be spyware. However, authorized clickstream data collection comes from organizations that use opt-in panels to generate market research, using panelists who agree to share their clickstream data with other companies by downloading and installing specialized clickstream collection agents. VizClick [16] attempted to visualize website clickstream data using a systematic approach, which was performed on www.adobe.com to analyse the market behaviour of customers. However, this model provides only limited clarity in clickstream data visualization. Hence this paper develops an improved Markov chain based clickstream data visualization model named WebClickviz, which is explained in the following sections. The proposed visualization model applies a heuristic determination method to the general Markov chain to overcome the issues of display clarity and occlusion. The remainder of the article is organized as follows: Section 2 discusses the most closely related research works. The improved Markov chain modelling is discussed in Section 3. Section 4 focuses on the WebClickviz visualization methodology, while Section 5 presents the visualization performance and evaluation results. Finally, Section 6 concludes the proposed work.

Related Works
Website clickstream data visualization is a step-by-step procedure by which user navigation is tracked from the server log files and clickstream files. In [17], an extensive survey of clickstream data analysis is presented. This work discussed how scientific visualization and information visualization create graphical models in the KDD process. Beyond offering resources for interactive visual exploration of databases, visual mapping techniques are now being used to enhance user interpretation of mining tasks and also as an integrated part of DM algorithms. Many mining techniques require user intervention at different stages, and visualization is beginning to be used to support the decision processes involved in making such interventions.
In [18], Moe proposed and applied an empirical two-stage choice model to Internet clickstream data that captures observed choices for two stages: items viewed and items purchased. The model takes into account interdependences between choices within a stage and the use of varying decision rules in each stage. The author accommodates heterogeneity in preferences and in decision rules. The proposed model uses observed choices to infer both attribute preference estimates and criterion attributes.
In [19], the authors proposed a practical methodology for predicting demographic website visitor profiles that can be used for web advertising targeting purposes. The methodology involves transforming website visitors' clickstream patterns into a set of features and training Random Forest classifiers that generate predictions for gender, age, educational level and occupation category. These demographic predictions can support online advertisement targeting (i) as an additional input to personalized advertising or behavioural targeting, in order to restrict ad targeting to demographically defined target groups, or (ii) as an input for aggregated demographic website visitor profiles that support marketing managers in selecting websites and achieving an optimal match between target groups and website audience composition.
In [20], the authors employed a big data approach to discover user interests in e-commerce. The authors of [21] employed a similar approach to extract customer shopping types from online sites. In [22], the authors introduced VisMOOC, a visual analytic system that helps analyse user learning behaviours using video clickstream data from Massive Open Online Course (MOOC) platforms. They worked closely with the instructors of two Coursera courses to understand the data and collect task analysis requirements. In [23], the authors applied standard algorithms to CFA prediction in this setting and showed how one type of behavioural data collected about students (video-watching clickstream events) can be used as learning features to improve prediction quality. This can be taken as motivation for future research on clickstream data. Although various techniques have been used successfully for data analysis, most of them rely on the standard Markov chain. As stated earlier, the drawbacks of the standard Markov chain reduce visualization quality, and hence this research model focuses on eliminating them.

Improved Markov Chain Modelling
The shortcomings of the standard Markov chain [24] for website clickstream data visualization led to the development of the improved Markov chain. This improved version overcomes the occlusion and display problems by heuristic determination of the grid spacing distributions. The Kolmogorov-Smirnov distance and the maximum likelihood estimator are used between the underlying distribution states to minimize the Markov distribution. Consider the probability space $(\Omega, \mathcal{F}, \mathbb{P})$, equipped with a filtration $\mathbb{F} = \{\mathcal{F}(t) : t \ge 0\}$. Let the continuous stochastic process $X(t) = \{X_t, t \ge 0\}$ be the solution of the univariate jump-diffusion process with an initial value $X_0 = x_0$, where $\vartheta$ denotes the unknown parameter set; $\mu(\cdot)$ and $\sigma(\cdot)$ define the drift and diffusion functions; $W_t$ is the Wiener process; and $P(\cdot)$ represents a Poisson random measure with intensity $\mu(X_t; \vartheta)$. Given a mark set $\zeta$, the jump coefficient $\eta$ has a mark density $\phi_\zeta(v, X_t)$.
For a continuous-time Markov chain with finite support, the grid elements are assumed to be monotonically increasing. Let $h$ denote the grid spacing between two adjacent grid elements of an $n$-point Markov chain grid, and let $I$ be the unit matrix. Define an $n \times n$ rate generator matrix $Q = (q_{i,j})$ whose rate elements are subject to the conditions $q_{i,i} \le 0$, $q_{i,j} \ge 0$ for $i \ne j$, and $\sum_j q_{i,j} = 0$. The transition probability from state $x_i^h$ to $x_j^h$ in time $t$, for a homogeneous continuous-time Markov chain, is obtained from the matrix exponential $P(t) = e^{Qt}$. For the jump-diffusion in Eqn. (1), because the continuous part is independent of the jump part, the corresponding rate generator matrix can be written as $Q = Q_c + Q_j$, where $Q_c$ approximates the continuous part and $Q_j$ the jump part. In the rate elements, the superscript $\pm$ denotes the respective absolute value. However, when the grid spacing is too coarse, the proposed rate matrix formula exhibits an approximation error of $h\,|\mu(x_i^h)|$ in matching the second moment.
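The relation between a generator matrix and transition probabilities can be sketched as follows. This is a minimal illustration, assuming the standard matrix-exponential relation $P(t) = e^{Qt}$ for a homogeneous continuous-time Markov chain; the 3-state generator `Q` and the helper names `mat_mul` and `transition_matrix` are hypothetical and not taken from the paper. The exponential is approximated in pure Python with a truncated Taylor series combined with scaling and squaring.

```python
def mat_mul(a, b):
    """Multiply two square matrices given as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def transition_matrix(q, t, terms=30, squarings=10):
    """Approximate P(t) = exp(Q t) by scaling and squaring.

    The scaled matrix Q * t / 2**squarings is exponentiated with a truncated
    Taylor series, and the result is squared `squarings` times.
    """
    n = len(q)
    scale = t / (2 ** squarings)
    a = [[q[i][j] * scale for j in range(n)] for i in range(n)]
    identity = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    result = [row[:] for row in identity]
    term = [row[:] for row in identity]
    for k in range(1, terms + 1):
        # term becomes a**k / k!
        term = [[x / k for x in row] for row in mat_mul(term, a)]
        result = [[result[i][j] + term[i][j] for j in range(n)]
                  for i in range(n)]
    for _ in range(squarings):
        result = mat_mul(result, result)
    return result


# Hypothetical 3-state generator: non-negative off-diagonal rates, rows sum to 0.
Q = [[-0.5, 0.5, 0.0],
     [0.3, -0.7, 0.4],
     [0.0, 0.6, -0.6]]

P = transition_matrix(Q, t=1.0)
for row in P:
    print([round(x, 4) for x in row])
```

Each row of `P` is a probability distribution over the next state after time `t`. In practice a library routine for the matrix exponential would normally replace the hand-written series.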
Hence a corrected formula is presented to address this second-moment error, subject to a necessary condition. Considering the empirical distribution of the data, a generalized $Q_c$ formula is needed to accommodate a non-equidistant grid setting while satisfying the local consistency condition. Such a $Q_c$ is defined for an $n$-state non-equidistant Markov chain with $n - 1$ associated grid spacings $h$, and a further condition must be satisfied to guarantee a well-defined probability matrix.
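The corrected rate formula itself is not reproduced in the text, so the following is only a sketch under stated assumptions. It uses a standard locally consistent tridiagonal construction on an equidistant grid, with a correction term chosen so that the second moment is matched exactly and the $h\,|\mu|$ error mentioned above disappears. The function name `diffusion_generator`, the mean-reverting drift, the constant volatility and the absorbing boundary rows are all hypothetical choices made for illustration.

```python
def diffusion_generator(grid, mu, sigma):
    """Build a tridiagonal generator Q_c for a diffusion on an equidistant grid.

    Assumed construction (not quoted from the paper):
        q[i][i-1] = mu_minus / h + (sigma^2 - h * |mu|) / (2 h^2)
        q[i][i+1] = mu_plus  / h + (sigma^2 - h * |mu|) / (2 h^2)
        q[i][i]   = -(q[i][i-1] + q[i][i+1])
    where mu_plus = max(mu, 0) and mu_minus = max(-mu, 0).
    Boundary rows are left at zero, i.e. the end states are absorbing.
    """
    n = len(grid)
    h = grid[1] - grid[0]
    q = [[0.0] * n for _ in range(n)]
    for i in range(1, n - 1):
        m = mu(grid[i])
        s2 = sigma(grid[i]) ** 2
        if s2 < h * abs(m):
            # Without this condition the rates could become negative.
            raise ValueError("grid too coarse: need sigma^2 >= h * |mu|")
        common = (s2 - h * abs(m)) / (2.0 * h * h)
        q[i][i - 1] = max(-m, 0.0) / h + common
        q[i][i + 1] = max(m, 0.0) / h + common
        q[i][i] = -(q[i][i - 1] + q[i][i + 1])
    return q


def mu(x):
    """Hypothetical mean-reverting drift towards 1.0."""
    return 0.5 * (1.0 - x)


def sigma(x):
    """Hypothetical constant volatility."""
    return 0.4


grid = [0.1 * i for i in range(21)]  # 21 equidistant points on [0, 2]
Qc = diffusion_generator(grid, mu, sigma)

# Check the first and second instantaneous moments at one interior state.
h = grid[1] - grid[0]
i = 5
print(h * (Qc[i][i + 1] - Qc[i][i - 1]), mu(grid[i]))
print(h * h * (Qc[i][i + 1] + Qc[i][i - 1]), sigma(grid[i]) ** 2)
```

The two printed pairs compare the chain's instantaneous mean and variance with the drift and squared volatility at the same state, which is the matching property a well-defined rate matrix is meant to have.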
Then the jump part is approximated through the matrix elements of the jump-diffusion generator matrix. This setting can have a state-dependent jump intensity and jump distribution, which allows a behavioural interpretation but is hard to incorporate into conventional numerical methods. The performance of a model will be sensitive to the grid spacing and to the lower and upper bounds of the grid. The benefits of a non-equidistant (non-uniform) grid have been documented in the research areas of finite difference methods (FDM) and partial differential equations (PDE).
The improved model introduces a heuristic approach to selecting the grid elements of an $n$-state Markov chain, such that the Kolmogorov-Smirnov distance between the original distribution function $G(X)$ and the Markov distribution function $\tilde{G}(X^h)$ is minimized. In such a case, we show that a non-equidistant Markov model can achieve a higher level of accuracy than an equidistant Markov model. The resulting grid $x^h$ with respect to the Kolmogorov-Smirnov distance is given by $x^h = \arg\min_{x^h} \sup_X |G(X) - \tilde{G}(X^h)|$. A consequence of the Markov chain transition probability matrix is the semi-analytical log-likelihood function, which can be used to calibrate the parameters of a jump-diffusion. The maximum likelihood estimator (MLE) of the improved Markov chain is defined by $\hat{\vartheta} = \arg\max_{\vartheta} \mathcal{L}(\vartheta)$, where $\mathcal{L}(\vartheta)$ is the log-likelihood. Given $m$ discretely observed time series data $x_{t_1}, x_{t_2}, \ldots, x_{t_m}$, the log-likelihood value produced by a time-homogeneous transition probability matrix is given by $\mathcal{L}(\vartheta) = \sum_{k=1}^{m-1} \ln P(x_{t_{k+1}} \mid x_{t_k}; \vartheta)$ (11). This principle of the improved Markov chain model can significantly enhance the visualization performance.
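The two quantities used above can be illustrated with a small sketch. It is not the paper's calibration procedure: the samples, the two candidate grids, the nearest-point rule for assigning probability mass to grid points and the short state sequence are all made-up assumptions, and the maximum likelihood step is shown only for the unrestricted discrete-time case, where the maximiser is the row-normalised transition count matrix.

```python
import math


def step_cdf(points, weights):
    """Right-continuous step CDF that puts weights[i] on points[i]."""
    pairs = sorted(zip(points, weights))
    return lambda x: sum(w for p, w in pairs if p <= x)


def ks_distance(samples, grid, probs):
    """sup_x |G(x) - G~(x)| for the empirical CDF G and the grid CDF G~.

    Both are step functions, so it suffices to check every jump point.
    """
    g = step_cdf(samples, [1.0 / len(samples)] * len(samples))
    g_tilde = step_cdf(grid, probs)
    return max(abs(g(x) - g_tilde(x)) for x in list(samples) + list(grid))


def nearest_point_mass(samples, grid):
    """Assumed discretisation: each sample's mass goes to its nearest grid point."""
    probs = [0.0] * len(grid)
    for s in samples:
        j = min(range(len(grid)), key=lambda k: abs(grid[k] - s))
        probs[j] += 1.0 / len(samples)
    return probs


def log_likelihood(states, p):
    """Sum of log transition probabilities along an observed state sequence."""
    return sum(math.log(p[a][b]) for a, b in zip(states, states[1:]))


def count_mle(states, n):
    """Row-normalised transition counts (discrete-time, unrestricted MLE)."""
    counts = [[0] * n for _ in range(n)]
    for a, b in zip(states, states[1:]):
        counts[a][b] += 1
    return [[c / max(sum(row), 1) for c in row] for row in counts]


# Made-up skewed samples and two 3-point grids.
samples = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 2.0, 4.0]
equidistant = [0.0, 2.0, 4.0]
non_equidistant = [0.3, 0.6, 4.0]
ks_equi = ks_distance(samples, equidistant, nearest_point_mass(samples, equidistant))
ks_non = ks_distance(samples, non_equidistant, nearest_point_mass(samples, non_equidistant))
print(ks_equi, ks_non)

# Made-up state sequence over 3 states.
states = [0, 1, 1, 2, 0, 1, 2, 2, 0, 0, 1]
p_hat = count_mle(states, 3)
p_uniform = [[1.0 / 3.0] * 3 for _ in range(3)]
print(log_likelihood(states, p_hat), log_likelihood(states, p_uniform))
```

For these made-up samples the non-equidistant grid gives the smaller Kolmogorov-Smirnov distance, and the count-based estimate gives a higher log-likelihood than the uniform transition matrix. For a parametric jump-diffusion, the transition matrix would instead be derived from the generator and the log-likelihood maximised over $\vartheta$.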

Tool and Data
Website clickstream data has been analysed worldwide using many tools. Google Analytics is one of the best-known tools that offers basic clickstream data visualization functionality. In this paper, Apache Flume is used to load, analyse and visualize the clickstream data. Apache Flume is a distributed, reliable, and available service for efficiently collecting, aggregating, and moving large amounts of streaming data into the Hadoop Distributed File System (HDFS). It has a simple and flexible architecture based on streaming data flows, and is robust and fault tolerant with tunable reliability mechanisms for failover and recovery. Apache Flume ingests streaming data from multiple sources into Hadoop for storage and analysis, and insulates the storage layer by buffering the incoming data.
Flume uses channel-based transactions to ensure reliable message delivery. When a message moves from one agent to the next, two transactions are started, one on the agent that delivers the event and the other on the agent that receives it. This ensures guaranteed delivery semantics. The data loaded into Apache Flume describes the page visits of users who visited msnbc.com [25]. Visits are recorded at the level of URL category and in time order. The data comes from Internet Information Server (IIS) logs for msnbc.com and news-related portions of msn.com for an entire day. The categories are "frontpage", "news", "tech", "local", "opinion", "on-air", "misc", "weather", "health", "living", "business", "sports", "summary", "bbs" (bulletin board service), "travel", "msn-news", and "msn-sports". A total of 989818 users were recorded, with an average of 5.7 visits per user. Fig. 1 shows a sample view of the input data collected from msnbc.com. Fig. 3 shows the loaded data, while Fig. 4 shows the aggregated data. The data is loaded by means of the loaddata() command, which asks for the folder location of the data. Once the location is given, the data is loaded into the tool and can be viewed.
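The aggregation step described above can be sketched in a few lines. This is a hedged illustration rather than the tool's actual loaddata() logic: it assumes each input line holds one user's visits as space-separated, 1-based category indices in the order the categories are listed above, and the sample lines, `parse_sessions` and `aggregate` are hypothetical.

```python
from collections import Counter

# Category names in the order given in the text (index 1 = "frontpage").
CATEGORIES = [
    "frontpage", "news", "tech", "local", "opinion", "on-air", "misc",
    "weather", "health", "living", "business", "sports", "summary",
    "bbs", "travel", "msn-news", "msn-sports",
]


def parse_sessions(lines):
    """Turn lines of category indices into per-user lists of category names.

    Lines that are empty or contain anything other than valid indices
    (for example header lines) are skipped.
    """
    sessions = []
    for line in lines:
        tokens = line.split()
        if tokens and all(
            t.isdigit() and 1 <= int(t) <= len(CATEGORIES) for t in tokens
        ):
            sessions.append([CATEGORIES[int(t) - 1] for t in tokens])
    return sessions


def aggregate(sessions):
    """Count visits per category, page-to-page transitions and mean visits per user."""
    visits = Counter(page for s in sessions for page in s)
    transitions = Counter(
        (a, b) for s in sessions for a, b in zip(s, s[1:])
    )
    average = sum(len(s) for s in sessions) / len(sessions)
    return visits, transitions, average


# Made-up sample input, one user per line.
sample_lines = ["% header line", "1 1", "2", "3 2 2 4 2 2 2 3 3", "5", "1"]
sessions = parse_sessions(sample_lines)
visits, transitions, average = aggregate(sessions)
print(visits.most_common(3))
print(transitions[("news", "news")], average)
```

The transition counts are the raw material for a first-order Markov chain over URL categories: dividing each count by its row total gives an estimated transition probability between categories.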

Fig. 4. Aggregation of CRM data
Any page requests served via a caching mechanism were not recorded in the server logs and, hence, not present in the data.Fig. 5 shows the visualized categories.
The clickstream process is executed once the data are loaded.This includes the aggregation and categorization view.
An implementation of Flume's RpcClient interface encapsulates the RPC mechanism supported by Flume.The user's application can simply call the Flume Client SDK's append(Event) or appendBatch(List<Event>) to send data and not worry about the underlying message exchange details.

Geographic Representation
The visualization is complete only when the data are presented in either a graphical or an association representation. Fig. 7a shows the global representation of the clickstream data, while Fig. 7b shows the graphical representation for the USA specifically.

Visualization Performances
The performance of WebClickviz is visualized in the charts given below. The charts are generated for the sample set of the msnbc.com website clickstream data. The accuracy of the proposed model is higher at all URL counts. The main purpose was to examine statistical approaches to clickstream data, since the accumulated sequence of site visit requests executed by a particular user, together with other user navigation elements, can give insight into the user's intentions, particularly with respect to purchase engagement and real-time purchase likelihood prediction. This can enhance web analytics techniques by employing different strategies.

Conclusions
As stated in this article, these results are very encouraging, as new methods of targeting customers could be derived from this solution. The proposed model, the improved Markov chain based visualization (WebClickviz), improves web analytics by providing accurate visualization of website clickstream data. This article suggested interactive visualization as a way to use these results in the analysis of data for different applications. Research in the field of clickstream data is still in its earliest stages, and much work remains to be done. With the revolution in new and faster technology, the idea of big data is very popular right now, particularly because companies can, more than ever, translate customer data into higher revenue. Future research will analyse how to use these results for different applications. Likewise the use of new learning algorithms to fit clickstream data, namely, by introducing other models such as neural

Fig. 5. Categories of msnbc.com data
The user can provide the required Event argument by directly implementing the Event interface, by using a convenience implementation such as the SimpleEvent class, or by using EventBuilder's overloaded withBody() static helper methods. Data visualization helps to optimize the website and improve business sales and value. Figs. 6a and 6b show the visualized results. It can be seen that the categorization is accurately completed and the visits of the users are recorded as shown.
$Q_c$ and $Q_j$ denote the generator matrices that approximate the continuous part $\mu(\cdot)\,dt + \sigma(\cdot)\,dW_t$ and the jump part $\int \eta(\cdot)\,P(\cdot)$ respectively. Since in continuous time a stochastic differential equation is completely characterized by its mean and variance, a well-defined Q-rate matrix will match the chain's first and second instantaneous moments to those of the underlying process. The approximation for the $Q_c$ matrix for a univariate diffusion and the rate elements are given by