The Algorithm of Habitat Discovery in Bird Migration

Bird migration has attracted increasing attention, and the study of habitats plays a vital role in migration research. Previous research, however, has encountered many problems, such as severely limited research methods, a low data utilization rate, and data processing and analysis methods that are statistics-focused and ineffective. In this paper, a habitat discovery algorithm is put forward that applies data-mining technology to the spatio-temporal characteristics of bird-watching data. First, the algorithm detects and eliminates duplicate data to guarantee data standardization. Then, a density-based clustering algorithm is used to identify the areas where birds gather. Finally, the habitats of migratory birds are discovered.


Introduction
Birds of all groups are special in that a large number of species migrate annually between their breeding and non-breeding areas [1]. The migration of birds has a great impact on the environment and on human production and life. Studying the migration of birds can help people prevent the spread of epidemics and maintain species diversity. The study of migration habitats is therefore crucial for protecting birds and the natural environment.
Domestic and foreign researchers have adopted and developed various methods to study migratory birds from different aspects in order to understand their migration patterns. Among these methods, fixed-point investigation is the earliest. The most common and popular method is bird-banding [2]. It can be easily implemented and widely applied, but its monitoring cycle is long and the data recovery is quite complicated [3]. Another method is satellite positioning. Its accurately collected data allow continuous tracking of an individual bird. However, it is not suitable for small birds, is costly and hard to popularize, and yields a limited amount of data. There are also other methods, including radar monitoring, sensitive geographical location and others, but they suffer from low precision, difficulty in popularization, limited data and other issues.
In addition, the use and analysis of collected bird migration data has also attracted the attention of researchers at home and abroad. In early analyses of bird data, migratory sites and migration routes could be acquired only through track points marked in GIS by biologists or distribution points obtained by manual statistics [4]. In 2004, the Japanese scientist Shimazaki proposed processing bird flight data with ISODATA clustering [5]. In this method, the migratory state of birds is determined from their flight speed, and the migratory locations are then obtained. Nonetheless, the migratory routes still need to be marked manually, and spatial location information cannot be processed with this method. In 2010, Zhou Yuanchun and others found bird gathering sites by clustering bird GPS data with a density-based hierarchical clustering algorithm, and then derived the association rules of the gathering sites, and further the migratory routes of birds, by means of the Apriori algorithm or the GSP algorithm [6]. However, because of the small amount of data, the small number of birds and the short time span of the data, changes in the habitats and migratory routes of birds could not be found. In 2012, Li Xueyan and others established a bird-watching database using China Birding Report and displayed the changes in bird distribution in recent years with GIS [7]. But that work relies only on manual statistical methods to mark bird discovery sites in the GIS; it neither uses "quantitative observation" to analyse the data deeply nor deals with the problems of repeated sampling and uneven distribution in the bird-watching data.
From the above, many problems remain to be solved in the existing research: 1) the collected data are incomplete, imprecise and limited in amount; 2) little work has been done on standardizing the source data; 3) the amount of data used for analysis and study is relatively small; 4) the hidden knowledge in the data has not been mined. In this paper, this problem of traditional biology is abstracted as a computational problem. A feasible, efficient and general method is sought to solve the above problems, to process and utilize bird-watching data effectively, and to find the habitats of migratory birds. This method can make up for the shortcomings of bird migration research in China.
As important supplementary information to traditional bird distribution records, Chinese bird-watching data is comprehensive and accurately reflects Chinese bird-watching achievements. The data come from three sources. The first is the web, such as Bird Report (www.birdreport.cn) [8] and China Bird Watching Network (www.chinabirdnet.org) [9]. The second is ornithological books and literature, such as China Bird Report 2003-2007 [10], China Coastal Waterbird Census Report 2005-2011 [11] and A Checklist and Distribution of the birds in Shandong [12]. The third is data provided by many ornithologists led by Prof. Sai Daojian. In total, 189350 bird-watching records have been verified by ornithologists, which ensures the authority of these data.

Data characteristics
Each Chinese bird-watching record holds detailed spatial-temporal information, including species, date, location, number and observer, as shown in Fig. 1. The "Number" field records the number of birds observed in each bird-watching event.

Data Statistics
The 239350 Chinese bird-watching records involve 24 Orders, 100 Families and 1230 Species, accounting for 85.7% of China's existing bird species [13]. The records cover 34 provinces, municipalities and autonomous regions, including Hong Kong, Macao and Taiwan, and span approximately 46 years from 1970 to the present. In China's bird-watching records, there are more records in the east than in the west, and more in the south than in the north. The proportion of records also varies by year: records from 2001 to 2016 account for more than half of the total. In addition, 142078 of the records are of migratory birds, accounting for 59.36% of the total. The remaining records are of non-migratory birds.
Although China's bird-watching records are unevenly distributed across species, time and space, the advantages of bird-watching data are obvious: low cost, bulk information, easy access, wide species coverage, substantial data, long time span, high accuracy and convenient spatial-temporal data analysis [5].

Overview of Algorithm
Bird migration is a relatively long and complex process that is reflected in spatial and temporal changes, and the study of habitats plays a vital role in migration research. In this paper, Chinese bird-watching data is used as the data source. First, based on the data characteristics of bird-watching records, the important attribute "Number" (Fig. 1) is introduced to solve the problems of repeated sampling and uneven sampling distribution, which improves the quality of the data. Then, according to the temporal and spatial structure of the migration trajectory, the time and space attributes of the bird-watching records are handled separately, and the latent information in the spatial-temporal bird data is discovered. As a consequence, the quality of data mining is also improved to some extent. This study will lead to more efficient utilization, processing and analysis of bird-watching data and provide new perspectives and ideas for the study of bird migration in China.
Migratory birds are the study objects. The Algorithm has two goals: 1) to solve the problems of duplicate sampling and uneven distribution of the sampling data; 2) to identify the habitats of migratory birds. The steps of the Algorithm are as follows: 1) the bird-watching data is digitized and stored in the database in the form of GPS track points; 2) the temporal and spatial distances between the points are calculated to discover implicit duplicate data, and special (feature) points are used in place of the duplicated data to produce a standardized data set; 3) the new, pre-processed set of trajectory points is clustered by a density-based clustering algorithm to obtain the high-density areas of migratory activity, which are taken as habitats during the migration. The relevant definitions and steps of the Algorithm are described in more detail below.

Digitizing the Text into GPS Trajectory Points
Each bird-watching record contains its own spatial-temporal information. To better show the distribution and migration of birds, each bird-watching record is abstracted into a coordinate point with a time stamp before the mining and analysis steps, so that each record corresponds to one coordinate point. This track point, which can stand for an individual or a group, is used to display bird distribution and migration. The textual "bird-watching site" information in each record needs to be converted into GPS latitude and longitude coordinates to facilitate comparison, calculation and presentation. The Baidu Map API, which has high accuracy, is applied to the textual "bird-watching site" information in each bird-watching record.
Definition 2. Migratory trajectory of a migratory bird: a sequence of spatial locations with time stamps is called a migratory trajectory of birds. A migratory trajectory can be expressed as P(e_j) = {p_1(e_j), p_2(e_j), …, p_i(e_j), …, p_n(e_j)}, where p_i(e_j) is a sampling point of the trajectory and n is the number of sampling points. e_j is the moving object (event), e_j ∈ E, where E = {e_1, e_2, ..., e_j, ..., e_J} is the set of moving objects (events), j ∈ [1, J], and J is the number of moving objects (events).
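One possible in-memory form of these trajectory points and trajectories is sketched below. It is only an illustration: the names `TrackPoint` and `build_trajectories`, the record keys and the `geocode` callable are assumptions introduced here, and the geocoding step is represented by a caller-supplied lookup rather than by any real map API.

```python
from collections import defaultdict
from dataclasses import dataclass
from datetime import date
from typing import Callable, Dict, Iterable, List, Optional, Tuple


@dataclass(frozen=True)
class TrackPoint:
    """One trajectory point <(x, y), t, e> plus the record's "Number" field."""
    lon: float     # x: longitude in degrees
    lat: float     # y: latitude in degrees
    t: date        # observation date (time stamp)
    species: str   # e: the moving object, e.g. "Hirundo rustica"
    number: int    # number of birds observed


def build_trajectories(
    records: Iterable[dict],
    geocode: Callable[[str], Optional[Tuple[float, float]]],
) -> Dict[str, List[TrackPoint]]:
    """Group records by species and sort each group by time.

    `records` are dicts with the hypothetical keys "species", "date",
    "site" and "number"; `geocode` maps a site name to (lon, lat) or None.
    """
    trajectories: Dict[str, List[TrackPoint]] = defaultdict(list)
    for rec in records:
        coords = geocode(rec["site"])
        if coords is None:
            continue  # skip sites that could not be converted to coordinates
        lon, lat = coords
        trajectories[rec["species"]].append(
            TrackPoint(lon, lat, rec["date"], rec["species"], rec["number"])
        )
    for points in trajectories.values():
        points.sort(key=lambda p: p.t)  # a trajectory is ordered by time stamp
    return dict(trajectories)
```

Passing `geocode` as a parameter keeps the sketch independent of whichever coordinate service is actually used.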

Selection of Feature Point
There is considerable duplication in the original bird-watching data. If these duplicates were not removed, they would affect the quality of data mining. Therefore the original trajectory points of migratory birds should be pre-processed and de-duplicated before the data is analysed. This also provides an initial solution to the uneven distribution of bird-watching records and lays a solid foundation for further analysis of the data.
There are two types of duplicate data in bird-watching data. The first type is explicit duplicate data, which has the same time and place, namely p_i(e_j) = p_{i+1}(e_j). Such duplicates only need to be merged. The second type is implicit duplicate data, which is not easy to find. When a bird species is sampled repeatedly within a continuous time interval and a small region, the samples can be considered repeated: they are multiple sampling points of one species with similar temporal and spatial characteristics. The steps of the selection of the Feature Point (FP) are as follows:
Step 1: Assign the values of θ_r and θ_t;
Step 2: Arbitrarily choose a point p_i(e_j) as a center from the set P(e_j) = {p_1(e_j), p_2(e_j), …, p_i(e_j), …, p_n(e_j)};
Step 3: Calculate the distances from the remaining points of P(e_j), other than the central point, to p_i(e_j);
Step 4: If Distance(p_other, p_i) ≤ θ_r and |t_other - t_i| < θ_t, add the point p_other to the Duplicate Points Set (DPS) P'(e_j) of p_i(e_j);
Step 5: Output P'(e_j);
Step 6: Apply the k-medoids algorithm (Han et al. 2012) to the DPS P'(e_j), with the number of clusters set to 1. The central point s_k(e_j) of the cluster is used as the feature point of p_i(e_j), s_k(e_j) = <(x_k, y_k), t_u, t_v, e_j>;
Step 7: Calculate the weighted average of the points in P'(e_j), where the weight of each point is its "Number" value; the result is used as the new weight of s_k(e_j);
Step 8: Repeat Step 2 to Step 7 for each point in P'(e_j) until all FPs are output;
Step 9: Rearrange the feature points together with the track points that are not in any DPS. This gives a new set of trajectory points for e_j, S(e_j) = {s_1(e_j), s_2(e_j), …, s_k(e_j), …, s_m(e_j)}, where m is the number of new trajectory points;
Step 10: Output S(e_j).
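The steps above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: distances are great-circle (haversine) distances, the "arbitrary" center is simply the earliest unprocessed point, and the new weight is taken as the mean "Number" of the set, which is only one reading of Step 7. The function names and the tuple layout are hypothetical.

```python
import math
from datetime import date

EARTH_RADIUS_KM = 6371.0


def haversine_km(a, b):
    """Great-circle distance in km between two (lon, lat) pairs in degrees."""
    lon1, lat1, lon2, lat2 = map(math.radians, (a[0], a[1], b[0], b[1]))
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(h))


def select_feature_points(points, theta_r_km=6.0, theta_t_days=10):
    """Replace each Duplicate Points Set (DPS) by one feature point.

    `points` is a list of (lon, lat, date, number) tuples for one species.
    Returns a list of (lon, lat, t_start, t_end, weight) tuples.
    """
    pts = sorted(points, key=lambda p: p[2])
    used = [False] * len(pts)
    feature_points = []
    for i, center in enumerate(pts):
        if used[i]:
            continue
        # Steps 3-4: collect unprocessed points close to the center in space and time.
        members = [
            k for k, p in enumerate(pts)
            if not used[k] and (
                k == i
                or (haversine_km(p[:2], center[:2]) <= theta_r_km
                    and abs((p[2] - center[2]).days) < theta_t_days)
            )
        ]
        for k in members:
            used[k] = True
        dps = [pts[k] for k in members]
        # Step 6: with one cluster, the medoid is the member with the
        # smallest total distance to all other members.
        medoid = min(dps, key=lambda c: sum(haversine_km(c[:2], p[:2]) for p in dps))
        # Step 7 (assumed reading): the new weight is the mean "Number" of the set.
        weight = sum(p[3] for p in dps) / len(dps)
        t_start = min(p[2] for p in dps)
        t_end = max(p[2] for p in dps)
        feature_points.append((medoid[0], medoid[1], t_start, t_end, weight))
    return feature_points
```

A point with no close neighbours forms a set of size one and is returned unchanged as its own feature point, which corresponds to keeping the track points that are not in any DPS.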

The Discovery of Habitats
Along the migration route of birds there are many stopover sites which, together with the wintering grounds and breeding grounds, make up the habitats [14][15]. Within a habitat, the bird population is usually larger than in other regions [16][17]. That is, a place with a larger number and a higher density of birds is more important to the birds and is more likely to be a potential habitat. Following this idea, the areas where birds are densely distributed must be found among the new trajectory points obtained from pre-processing.
Definition 5. Heat: the importance of a point or region is called its Heat Degree (HD). The "number of birds" is used as the weight of a point, which is called the Heat Degree Point (HDP). The higher the weight, the larger the heat range. The greater the sum of the HDPs in a region, the greater the HD of the region.
The steps of the discovery of habitats are as follows:
Step 1: Cluster the new trajectory points S(e_j) with the density-based clustering algorithm to obtain the clusters c_l(e_j) and the outlier points;
Step 2: Use the "number of birds" as the weight to calculate the HD of each cluster, that is, the weighted sum of all points in the cluster c_l(e_j). The HD of a cluster is called the HDC;
Step 3: Calculate the HDP of each outlier point;
Step 4: Arrange all HDPs and HDCs in ascending order. If HD > MinHeat (the value of MinHeat can be set by the user), the point or cluster is output as a habitat of the migratory bird e_j: D(e_j) = {d_1(e_j), d_2(e_j), …, d_l(e_j), …, d_L(e_j)}, l ∈ [1, L];
Step 5: Repeat Step 1 to Step 4 for E = {e_1, e_2, …, e_j, …, e_J} until all habitats are output.
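The heat-based screening can be illustrated with the following sketch. It uses a plain textbook DBSCAN rather than the improved variant referred to in the paper, takes the heat of a cluster to be the sum of the "number of birds" weights of its points, and introduces the parameters `eps_km` and `min_pts` purely for illustration; none of these choices is confirmed by the source.

```python
import math

EARTH_RADIUS_KM = 6371.0


def haversine_km(a, b):
    """Great-circle distance in km between two (lon, lat) pairs in degrees."""
    lon1, lat1, lon2, lat2 = map(math.radians, (a[0], a[1], b[0], b[1]))
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(h))


def dbscan(points, eps_km, min_pts):
    """Plain DBSCAN. Returns one label per point: a cluster id >= 0, or -1 for outliers."""
    n = len(points)
    labels = [None] * n

    def neighbours(i):
        return [k for k in range(n) if haversine_km(points[i], points[k]) <= eps_km]

    cluster = -1
    for i in range(n):
        if labels[i] is not None:
            continue
        seeds = neighbours(i)
        if len(seeds) < min_pts:
            labels[i] = -1          # provisional outlier; may become a border point
            continue
        cluster += 1
        labels[i] = cluster
        queue = [k for k in seeds if k != i]
        while queue:
            k = queue.pop()
            if labels[k] == -1:
                labels[k] = cluster  # border point: joins the cluster, not expanded
            if labels[k] is not None:
                continue
            labels[k] = cluster
            reachable = neighbours(k)
            if len(reachable) >= min_pts:
                queue.extend(reachable)  # core point: keep expanding
    return labels


def discover_habitats(points, weights, eps_km, min_pts, min_heat):
    """Return ([(key, heat), ...] with heat > min_heat in ascending order, labels)."""
    labels = dbscan(points, eps_km, min_pts)
    heat = {}
    for i, (label, w) in enumerate(zip(labels, weights)):
        # Clusters accumulate the weights of their points; outliers keep their own.
        key = ("cluster", label) if label >= 0 else ("point", i)
        heat[key] = heat.get(key, 0.0) + w
    ranked = sorted(heat.items(), key=lambda kv: kv[1])
    return [(key, hd) for key, hd in ranked if hd > min_heat], labels
```

Note that a single outlier with a large bird count can pass the MinHeat threshold on its own, while a dense cluster of small counts may not.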

Time complexity of Algorithm
There are two stages in this Algorithm. The time complexity of each stage is analysed separately.
The k-medoids algorithm is used when selecting Feature Points (FPs). The time complexity of the k-medoids algorithm is O(k(n-k)^2). In this paper k = 1. Let n be the total number of trajectory points, S the total number of points in the Duplicate Points Sets (DPSs), with S < n, and K the number of DPSs. One execution on a single DPS costs O((S/K)^2), and the algorithm needs to be executed K times, with K << n. Therefore the time complexity of this stage is O(S^2/K) < O(n^2).
Then, in the discovery of habitats, the improved DBSCAN algorithm is executed once. Hence its time complexity is O(m^2), where m is the number of input trajectory points and m < n (n is the total number of trajectory points).
In conclusion, the time complexity of the Algorithm is O(S^2/K) + O(m^2) < O(n^3). This shows that the Algorithm can be completed in a relatively short time.
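The bound can be traced step by step. The following is a worked restatement under the same simplifying assumption used above, namely that the K DPSs are of roughly equal size S/K; the labels T_FP and T_hab are introduced here only for illustration.

```latex
\begin{align*}
T_{\mathrm{FP}}  &= K \cdot O\!\left(\left(\frac{S}{K}\right)^{2}\right)
                  = O\!\left(\frac{S^{2}}{K}\right), \\
T_{\mathrm{hab}} &= O\!\left(m^{2}\right), \\
T                &= T_{\mathrm{FP}} + T_{\mathrm{hab}}
                  = O\!\left(\frac{S^{2}}{K}\right) + O\!\left(m^{2}\right)
                  = O\!\left(n^{2}\right),
\end{align*}
% the last step uses S^2/K <= S^2 < n^2 (since K >= 1 and S < n) and m^2 < n^2 (since m < n)
```

Under these assumptions the total cost is in fact quadratic in n, which is consistent with, and tighter than, the cubic bound.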

Experimental environment
Bird-watching data is used to study migratory birds in China, and the habitat discovery algorithm is proposed on this basis. In the experiment, the barn swallow (Hirundo rustica) is used as an example to explore its habitats. The feasibility and effectiveness of the algorithm are verified by comparing the results with the authoritative ornithological literature, A Field Guide to the Birds of China [18] and A Checklist on the Classification and Distribution of the Birds of China [19].
The experiment runs on the Windows 7 operating system, and the algorithm is written in C#. The software development environment is Microsoft Visual Studio 2010 and SQL Server 2010.

Social and scientific value
This Algorithm is general, practical and convenient. Ideally it can be applied to all bird-watching data on migratory birds in China. However, the algorithm depends on a certain scale of bird-watching data. The greater the quantity of data, the more precise the results of data mining may be. If the data sample is too small, the accuracy of the algorithm declines and the distribution and migration of migratory birds cannot be truly reflected. These results are confirmed by the experiments. Therefore it is important to collect bird-watching data continuously, and adequate data pre-processing is necessary before data analysis and mining. On the one hand, the algorithm solves the repeated sampling in the data sets and ensures the accuracy of data mining. On the other hand, massive and redundant data is compressed by this process, so the efficiency of data analysis is also improved.

Limitations and shortcomings
1) Bird-watching data cannot track individual birds. Therefore it is not enough to rely on bird-watching data alone to verify the accuracy of the results.
2) There are spatial and temporal discontinuities in bird-watching data. Hence it is difficult to rely solely on such data to analyse changes in migration habitats over the years. Moreover, this may have a negative impact on the predicted results of bird migration.
3) The uneven distribution of sampling still exists, which prevents deeper data analysis and mining.

Future directions
First, bird-watching data will be collected continuously. Additional migratory bird data, such as satellite-tracking data and bird-banding data, will be gradually introduced to supplement further mining and analysis. Second, meteorological data will be added to our study. The authors will further study the effects of climate and environment on the habitats of migratory birds.

Conclusion
Bird watching has been developing rapidly in China in recent years. This activity helps people to understand the distribution of birds, their population dynamics and so on. This paper analyses the problems of bird migration in depth from the perspective of data mining. Based on Chinese bird-watching records, the habitat discovery algorithm is proposed and used for the selection and discovery of habitats during the migratory process of birds. Taking Hirundo rustica as an example, maps and GIS demonstrate the feasibility of the algorithm. The time complexity of the algorithm is low, which makes it efficient. The migration routes and habitats of the birds derived in this work are compared with those in the authoritative ornithological literature; the comparison shows more accurate and realistic results and verifies the accuracy of the algorithm.

Fig. 1. Chinese bird-watching data record.
The coordinates of Baidu Map can be accurate to 10 decimal places. By this Algorithm, the corresponding coordinates of the migratory birds are retrieved.
Definition 1. Trajectory point of a moving object: a trajectory point describes one sample of a moving object and consists of latitude, longitude, a time stamp and the identifier of the moving object. It is expressed as p_i(e_j) = <(x_i, y_i), t_i, e_j>, p_i(e_j) ∈ P(e_j), e_j ∈ E, j ∈ [1, J], i ∈ [1, n], where (x_i, y_i) are the coordinates of the track point and t_i is its time stamp. For example, e_j = Hirundo rustica.
A density-based clustering algorithm is applied to the new trajectory points S(e_j) = {s_1(e_j), s_2(e_j), …, s_k(e_j), …, s_m(e_j)} to find the high-density areas of bird activity. The value of "Number" is introduced to calculate the values of each cluster and outlier. High-density and high-volume areas are then screened out as habitats of migratory birds. Fig. 3 shows various types of clusters.

Fig. 4. Partial data of a DPS (Hirundo rustica).
Based on the bird-watching data and the spherical distances computed from the latitude and longitude coordinates, the parameters are shown in Table 1. Experimental tests show that the clustering results are best when θ_r = 6 km and θ_t = 10 days.
Table 1. Parameters of DPS

Fig. 5. Data pre-processing of the trajectory points (Hirundo rustica). a Duplicate Points Set. b Selecting the Feature Point. c The original trajectory points. d New trajectory points after data pre-processing.

Fig. 6. The discovery of habitats (Hirundo rustica). a The clusters of trajectory points after DBSCAN. b The distribution diagram of the Heat Degree. c Habitats. d Trajectory points within the habitat. e Part of the habitat.