An Optimized Structure on FPGA of Key Point Detection in SIFT Algorithm

The SIFT algorithm is one of the most effective and widely used algorithms for describing image features, and it has been applied in many fields. In this paper, we propose an optimized method for implementing the SIFT algorithm in hardware. We mainly discuss the structure of Data Generation. A pipeline architecture is introduced to accelerate the optimized system. The paper focuses on how parameters are set and how approximation errors are controlled under different image qualities and hardware resources. The experimental results show that the structure is real-time and effective, and they provide guidance for different situations.


Introduction
Image matching is the process of identifying the features of pictures in order to confirm their similarity and establish the relevance between similar pictures. The most essential part of image matching is obtaining features. The definition of features and their detection are the key to the whole process. A picture contains many features, including edges and corners, and many algorithms have been proposed to detect a specific type of feature, such as the Canny operator and the Harris corner detector. In 1999, David G. Lowe proposed a new algorithm [1], known as the Scale-Invariant Feature Transform (SIFT), and revised it later in 2004. This algorithm is built on the invariance of features regardless of how the environment changes. These changes can be described as multiple scales, and only the points that remain invariant in each scale are defined as key points. When all key points have been detected, the corresponding descriptors are generated, and matching is the final step. With its distinctive process, the SIFT algorithm has shown advantages in many aspects and has been widely adopted.
Although the SIFT algorithm is now mature on software platforms, including an OpenCV implementation of the algorithm patented by Lowe, implementing it on a hardware platform is another matter. Time and resource consumption are the prime considerations; thus the huge computations need to be transformed and simplified. Consequently, many scholars have proposed designs for hardware implementation of the SIFT algorithm. All these designs introduce a pipeline structure, which makes the most efficient use of time. The designs differ mainly in several specific steps of the SIFT algorithm. The first step is the Generation of Gaussian Pyramid. ROI is introduced to generate suitable initial images in [2], and initial images are divided into smaller partitions in [3]. The size of the Gaussian window function varies in [3]-[6], as does the number of scales and octaves [7]. The value of $\sigma$ also differs in [3]. Some IP kernels, such as MAC and MUX, are introduced to compute the filtering in [2]. The second step is the Detection of Key Points. ADI and DER are introduced to obtain key points more accurately [7]. Some scholars also use other methods, such as the Harris corner detector, to detect features [4]. The third step is the Calculation of Gradient, and the designs do not differ in this part. Other steps include Generation of Descriptor and Matching. Some scholars complete these parts in combination with a DSP [7] or with other methods such as the BRIEF descriptor [8]; we do not discuss these steps in detail here.
After referring to these designs, we propose our own optimized SIFT algorithm and implement it on a hardware platform in Verilog. We mainly discuss the setting of parameters and the control of approximation errors under different conditions. The whole system is verified to be real-time and effective.
The rest of this paper is organized as follows. Section II briefly discusses the algorithm proposed by David G. Lowe and its optimization to suit the hardware environment. Section III presents the design of the whole structure on hardware. Section IV gives the experimental results. Section V concludes the paper.

Theory of SIFT Algorithm and Its Optimization
Implementing the SIFT algorithm on a hardware platform raises several problems, especially time and resource consumption. Therefore, we first need to understand its principle, and then consider how to adjust and optimize the algorithm to meet the hardware conditions without introducing a DSP. The algorithm can be divided into two main stages: Key Point Detection and Key Point Description. In this paper, we mainly discuss the first stage, and this section briefly introduces its theory and its optimization.

Theory of SIFT Algorithm
It has been verified that the only possible scale-space kernel is the Gaussian function [1]. The scale space of an image is defined in (1):

$$L(x, y, \sigma) = G(x, y, \sigma) * I(x, y) \tag{1}$$

The Gaussian filters are defined in (2):

$$G(x, y, \sigma) = \frac{1}{2\pi\sigma^2} e^{-(x^2 + y^2)/(2\sigma^2)} \tag{2}$$

Here $*$ denotes convolution and $I(x, y)$ is the input image [1]. The complete realization of the Gaussian scale space has several steps. First, each scale is generated from the previous scale, and $\sigma_i$ differs from scale to scale. Second, the scales generated in this way are grouped into an octave, and the third-to-last scale is down-sampled by a factor of 2 to produce the first scale of the next octave. Third, the generating process and the $\sigma_i$ series are the same in every octave. Fourth, the initial image is up-sampled by a factor of 2 to generate the zero octave and is then smoothed by a Gaussian filter to create the first scale of the zero octave.
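As an illustration of (1) and (2), the following Python sketch samples a Gaussian template and applies it to a small image. It assumes the standard forms of the two equations from [1]. The function names, the normalization of the truncated template, and the edge-replication border handling are assumptions made for this example and are not part of the proposed design.

```python
import math


def gaussian_kernel(sigma, width):
    """Sample G(x, y, sigma) on an odd width x width grid and normalize it."""
    if width % 2 == 0:
        raise ValueError("width must be odd so the template has a center")
    half = width // 2
    kernel = [
        [
            math.exp(-(x * x + y * y) / (2.0 * sigma * sigma))
            / (2.0 * math.pi * sigma * sigma)
            for x in range(-half, half + 1)
        ]
        for y in range(-half, half + 1)
    ]
    # Normalization compensates for the truncated tails (an assumption here).
    total = sum(sum(row) for row in kernel)
    return [[value / total for value in row] for row in kernel]


def scale_space_layer(image, sigma, width):
    """Compute L = G * I for one scale, replicating border pixels."""
    kernel = gaussian_kernel(sigma, width)
    half = width // 2
    rows, cols = len(image), len(image[0])
    out = [[0.0] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            acc = 0.0
            for dy in range(-half, half + 1):
                for dx in range(-half, half + 1):
                    rr = min(max(r + dy, 0), rows - 1)  # clamp to the image
                    cc = min(max(c + dx, 0), cols - 1)
                    acc += kernel[dy + half][dx + half] * image[rr][cc]
            out[r][c] = acc
    return out
```

Normalizing the truncated template keeps a flat image unchanged after filtering, which makes the effect of a limited window size easier to isolate.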
In this process, the $\sigma_i$ series is crucial for keeping the scale space continuous. In Lowe's work, the initial image is incrementally convolved with Gaussians whose scales are separated by a constant factor $k$ [1], and $\sigma_i$ is defined in (3). We can then detect the local maxima and minima of the middle scales by comparing each sample point in these scales to its eight neighbors in the current image and its nine neighbors in each of the scales above and below [1]. A point is an extremum only if it is larger than all of these neighbors or smaller than all of them.
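The neighbor comparison described above can be sketched in Python as follows. The function name and the representation of each scale as a list of rows are assumptions made for this example; the caller is assumed to pass an interior point so that all 26 neighbors exist.

```python
def is_extremum(below, current, above, row, col):
    """Return True if current[row][col] is strictly larger or strictly
    smaller than its 8 in-scale neighbors and the 9 + 9 neighbors in the
    adjacent scales."""
    center = current[row][col]
    neighbors = []
    for index, layer in enumerate((below, current, above)):
        for dr in (-1, 0, 1):
            for dc in (-1, 0, 1):
                if index == 1 and dr == 0 and dc == 0:
                    continue  # skip the sample point itself
                neighbors.append(layer[row + dr][col + dc])
    is_max = all(center > value for value in neighbors)
    is_min = all(center < value for value in neighbors)
    return is_max or is_min
```

Because the comparisons are strict, a point that merely ties with one of its neighbors is not reported as an extremum.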
However, not all of the extrema are key points. Some of them have low contrast or are poorly localized along an edge [1]. The most effective way to remove or revise these points is to use the Hessian matrix, which is defined in (4):

$$H = \begin{bmatrix} D_{xx} & D_{xy} \\ D_{xy} & D_{yy} \end{bmatrix} \tag{4}$$

Here the entries are the second-order partial derivatives at the candidate point, as in [1]. All valid key points should satisfy the restriction in (5), where $r = 10$:

$$\frac{\mathrm{Tr}(H)^2}{\mathrm{Det}(H)} < \frac{(r + 1)^2}{r} \tag{5}$$

The final step is to compute the gradient magnitude and orientation. When the calculation is finished, the results are stored in RAMs, from which the Generation of Descriptor step can read them easily.
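A minimal sketch of the Hessian-based rejection follows, assuming the restriction takes the form used in [1], $\mathrm{Tr}(H)^2 / \mathrm{Det}(H) < (r + 1)^2 / r$ with $r = 10$. The finite-difference estimates of the derivatives and the function name are assumptions made for this example. The inequality is cross-multiplied so that no division is needed, which is one hardware-friendly way to evaluate it and not necessarily the one used in the proposed design.

```python
def passes_edge_test(layer, row, col, ratio=10):
    """Keep a candidate only if it is not poorly localized along an edge."""
    center = layer[row][col]
    # Second-order finite differences around the candidate point.
    dxx = layer[row][col + 1] + layer[row][col - 1] - 2 * center
    dyy = layer[row + 1][col] + layer[row - 1][col] - 2 * center
    dxy = (
        layer[row + 1][col + 1]
        - layer[row + 1][col - 1]
        - layer[row - 1][col + 1]
        + layer[row - 1][col - 1]
    ) / 4.0
    trace = dxx + dyy
    det = dxx * dyy - dxy * dxy
    if det <= 0:
        return False  # curvatures of opposite sign, or a flat direction
    # Tr^2 / Det < (r + 1)^2 / r, rewritten without division.
    return ratio * trace * trace < (ratio + 1) ** 2 * det
```

A blob-like patch has similar curvature in both directions and passes, while a ridge has one near-zero curvature and is rejected.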

Optimization of SIFT Algorithm
The main optimization lies in the Generation of Gaussian Pyramid step. All implementations agree on several points: first, all scales of an octave should be generated at one time, with the new $\sigma_i$ defined in (6); second, the size of the Gaussian filter should be fixed and medium-sized; third, approximation errors exist in any computation; fourth, the generation of the zero octave should be abandoned. In this paper, we mainly discuss the realization of the second and third points.
Regarding the second point, on one hand, limiting the size of the Gaussian filter creates differences from the standard values. The differences become more severe if the template is much smaller than the standard size of $(6\sigma + 1) \times (6\sigma + 1)$. Besides, these differences accumulate when (6) is used repeatedly. On the other hand, the template should not be too large, considering the restriction of hardware resources. However, if we assume $\sigma_0 = 1.6$, the smallest window width that avoids the differences is 41, which is too difficult to achieve. Only [7] and [9] emphasize this issue and report that they adopt this size in their work. Therefore, it is critical to set a medium-sized template. Yet some scholars prefer a much smaller template, and it seems that a much smaller template can also obtain satisfying results. Whether the restriction on template size is meaningful will be examined and elaborated in Section IV.
Related experiments show that the larger the template size, the more accurate the final results. If we cannot enlarge the template, we can reduce the value of $\sigma_0$, although this may not be as effective as enlarging the template.
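The standard width $6\sigma + 1$ mentioned above can be checked with a short Python calculation. This sketch rests on assumptions that the text does not state explicitly: the scales of an octave follow $\sigma_i = \sigma_0 \cdot 2^{i/s}$ for $i = 0, \dots, s + 2$ as in Lowe's construction [1], $s = 2$, and the width is rounded up to the next odd integer. The function names are hypothetical.

```python
import math


def standard_window_width(sigma):
    """Smallest odd integer width that is at least 6 * sigma + 1."""
    width = math.ceil(6.0 * sigma + 1.0)
    return width if width % 2 == 1 else width + 1


def largest_sigma(sigma0, s):
    """Largest scale of an octave, assuming s + 3 scales per octave and
    sigma_i = sigma0 * 2 ** (i / s) for i = 0 .. s + 2."""
    return sigma0 * 2.0 ** ((s + 2) / s)


if __name__ == "__main__":
    sigma_max = largest_sigma(1.6, 2)
    print(sigma_max, standard_window_width(sigma_max))
```

Under these assumptions the largest scale is 6.4 and the required width is 41, which agrees with the figure quoted above, whereas $\sigma_0 = 1.6$ alone would need a width of only 11.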
Regarding the third point, approximation errors introduce some redundant key points. If the number of these extra key points is too large, the speed of the whole system is degraded. The only way to solve this problem is to scale up the coefficients first, rather than to remove the redundant points in Key Point Detection. Two enlargement factors are involved: one for each pixel and one for the Gaussian template. The calculation of gradient magnitude and orientation also needs to be revised. The formula for gradient magnitude is defined in (7). The orientation of the gradient can be divided into equal parts. In Lowe's work, there are 36 parts; here we consider 16 parts to be enough.
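The role of the template enlargement factor and of the 16-part orientation division can be illustrated with the Python sketch below. The rounding scheme, the one-dimensional template, and the function names are assumptions made for this example; the proposed design may quantize its coefficients differently, and a hardware implementation would not normally call an arctangent function.

```python
import math


def quantize_template(sigma, width, factor):
    """Scale a normalized 1-D Gaussian template by `factor` and round each
    coefficient to an integer, as integer-only hardware would store it."""
    half = width // 2
    raw = [math.exp(-(x * x) / (2.0 * sigma * sigma))
           for x in range(-half, half + 1)]
    total = sum(raw)
    return [round(factor * value / total) for value in raw]


def orientation_bin(dx, dy, bins=16):
    """Map a gradient direction to one of `bins` equal angular parts."""
    angle = math.atan2(dy, dx) % (2.0 * math.pi)
    return int(angle / (2.0 * math.pi / bins)) % bins


if __name__ == "__main__":
    for factor in (1, 128, 1024):
        coeffs = quantize_template(1.6, 15, factor)
        # The sum shows how much of the template weight survives rounding.
        print(factor, sum(coeffs), coeffs)
```

With a factor of 1 every coefficient of this template rounds to zero, because the largest normalized coefficient is about 0.25. Larger factors preserve more of the template shape at the cost of wider multipliers and accumulators.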

FPGA Implementation of Optimized Algorithm
With the preparation finished, we can implement the optimized algorithm. In this section, we briefly introduce the design of our pipeline architecture and the main structure of each step of Data Generation.

Pipeline Architecture
The pipeline architecture is a structure analogous to an assembly line. It has several basic aspects: first, the data are sent in one by one on each data line; second, while the first datum undergoes the second step of operation, the second datum undergoes its first step; third, the first octave of the Gaussian Pyramid is the first part of the pipeline; fourth, Extrema Detection, Hessian Matrix Rejection, and Gradient Calculation can operate at the same time. Apart from these common views, we make some adjustments in our system: first, the size of the initial image is $284 \times 284$; second, Gaussian filtering is separated into horizontal filtering and vertical filtering, with advantages analyzed in 2; third, the outputs of Extrema Detection, Hessian Matrix Rejection, and Gradient Calculation are sent out at the same time.

A rough description of the relations between the steps and the direction of the data flows is shown in Figure 1. Processing the first datum costs 2306 clocks. From this moment on, the results for the points in the first octave are sent out continuously, whereas the outputs of the second and third octaves are not continuous.
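The separation of Gaussian filtering into a horizontal and a vertical pass, mentioned above, can be checked with the following Python sketch. It compares the two-pass result with a direct two-dimensional filter whose template is the outer product of the same taps. The function names and the choice to compute only the region where the window fits entirely inside the image are assumptions made for this example.

```python
def horizontal_pass(image, taps):
    """Filter each row with the 1-D taps (valid region only)."""
    n = len(taps)
    return [
        [sum(taps[j] * row[c + j] for j in range(n))
         for c in range(len(row) - n + 1)]
        for row in image
    ]


def vertical_pass(image, taps):
    """Filter each column with the 1-D taps (valid region only)."""
    n = len(taps)
    return [
        [sum(taps[j] * image[r + j][c] for j in range(n))
         for c in range(len(image[0]))]
        for r in range(len(image) - n + 1)
    ]


def full_2d(image, taps):
    """Direct 2-D filtering with the outer-product template taps[i] * taps[j]."""
    n = len(taps)
    return [
        [sum(taps[i] * taps[j] * image[r + i][c + j]
             for i in range(n) for j in range(n))
         for c in range(len(image[0]) - n + 1)]
        for r in range(len(image) - n + 1)
    ]


def multiplications_per_pixel(n):
    """Multiplications per output pixel: (direct 2-D, two 1-D passes)."""
    return n * n, 2 * n
```

For a 15-tap template the direct form needs 225 multiplications per output pixel, while the two passes need 30 in total.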

Structure of Optimized System
Due to resource constraints, we set $s$ to 2, the size of the Gaussian filter to $15 \times 15$, the enlargement factor of each pixel to 1, and that of the Gaussian template to 1024. The whole diagram of Gaussian Pyramid, Key Point Detection and Gradient Calculation is shown in Figures 2-4 respectively. The most important parts in these structures are the shift registers and the enable signals. A shift register can store a whole row of image data, and the outputs of these registers are continuous in a line. These output data feed the following computations. The enable signals are the key to identifying the valid data.
Buffers can be introduced to reduce resource consumption, in which case the order of the filters should be reversed.
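The behavior of row-length shift registers and an enable signal can be modeled in software as below. This is only a behavioral Python sketch under assumptions of this example: the class name, the zero-initialized registers, and the rule for asserting the enable signal are hypothetical and do not describe the Verilog design itself.

```python
from collections import deque


class LineBuffer:
    """Chain of row-length shift registers that outputs, on every clock,
    the incoming pixel together with the pixels directly above it."""

    def __init__(self, row_length, rows):
        self.row_length = row_length
        # One shift register per extra row; each holds one full image row.
        self.lines = [deque([0] * row_length, maxlen=row_length)
                      for _ in range(rows - 1)]
        self.pushed = 0

    def push(self, pixel):
        """Shift in one pixel; return (column, enable)."""
        column = [pixel]
        value = pixel
        for line in self.lines:
            oldest = line[0]    # entered this register one row ago
            line.append(value)  # maxlen discards the oldest entry
            value = oldest
            column.append(value)
        self.pushed += 1
        # The column is valid only once every register holds real data.
        enable = self.pushed > self.row_length * len(self.lines)
        return column, enable
```

For example, a 15-row window over a 284-pixel-wide image would need 14 such registers of 284 entries each, and the enable signal stays low until the first 14 rows have been shifted in.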

Experimental Results
To evaluate the performance of the proposed architecture and examine the effects of parameter settings under different conditions, we divide the experiment into two parts. The first part has three steps: first, we test the processing time and observe the final results of Key Point Detection; second, we further optimize the system and observe the final matching result; third, we change the window size and observe the matching performance.
In the second part, noise is introduced and the effects of the parameters are analyzed systematically.
The whole hardware implementation is realized on a Xilinx ZC702, and the clock frequency is 100 MHz. The reference standards are the results in OpenCV. The test results are displayed with MATLAB.

Experimental Results without Noise
When $\sigma_0 = 1.6$ and the other parameters are the same as those in Part B, Section III, the result of the algorithm in OpenCV is shown in Figure 5(a), and the result of our hardware implementation is shown in Figure 5(b). Only the results of the second scale in the first octave are shown here. Too many key points are detected because of the unreasonable setting of the enlargement factors. However, if we examine the ideal result without approximation errors, shown in Figure 5(c), we could conclude that setting the window width to 15 seems inappropriate. Related data are presented in Table 1.
Yet the time of the whole process is 966990 clocks, less than 1ms. Suppose that Descriptor Generation and Matching cost 3ms per 1000 key points. The whole SIFT algorithm along with matching then costs no more than 5ms. This is a considerable processing speed and is enough for continuous real-time matching. However, the consumption of resources is much higher, because the shift registers use too many LUTs. In this case, buffers are necessary.
Therefore, we adjust and revise the structure of our system. The enlargement factors of the pixel and the template are changed to 64 and 128 respectively. The result of Key Point Detection is presented in Table 2. The result is much better, though still unsatisfying. However, when we observe the final matching results, shown in Figure 6, we find that the matching result is accurate. Here the size of the larger image is $800 \times 640$ and that of the smaller one is $284 \times 284$, with an angle of rotation between the two images. Furthermore, if we set the window width to 3 and make the enlargement factors adaptive, we obtain the results shown in Figure 7. This extreme case also achieves the same accurate result, although this would never be required on a software platform. Besides, all key points detected here are generated by the approximation errors introduced by simplifying the calculations.
Therefore, we can conclude that, on one hand, the SIFT algorithm is highly robust. On the other hand, the parameter settings do not need to be so strict in a hardware implementation. In addition, we should pay attention to the matching results rather than to the intermediate results.

Experimental Results with Noise
The results above were obtained without noise, and they could differ if noise is introduced. Thus, we introduce four kinds of noise, one at a time. The maximum noise levels at which the matching results remain correct are presented in Table 3. The smaller the window width, the poorer the noise immunity. In this case, the restriction on window width is significant. Taking the various factors into consideration, we conclude that a medium-sized Gaussian filter is the best choice for hardware realization, and $15 \times 15$ is one such size.

Conclusion
In this paper, we have proposed an optimized structure of Data Generation in the SIFT algorithm and its implementation on a hardware platform. The whole system is based on a pipeline structure and is verified to be real-time and effective. The experimental results show that the setting of parameters and the control of approximations are crucial in hardware implementation, and that this process is affected by different conditions.
Here we recommend using a medium-sized Gaussian window function.

Figure 1. Structure of Pipeline
template as 1024. The whole diagram of Gaussian Pyramid, Key Point Detection and Gradient Calculation is shown in Figures 2-4 respectively.

Table 1. Numbers of Detected Key Point
result is accurate. Here the size of the larger image is $800 \times 640$ and that of the smaller one is $284 \times 284$, with an angle of rotation existing between the two images.

Table 2. Numbers of Key Point after Adjustment

Table 3. Limited Extent of Noise