Virtual Ground Truth, and Pre-selection of 3D Interest Points for Improved Repeatability Evaluation of 2D Detectors

In Computer Vision, simple features are found using classifiers called interest point (IP) detectors, which are often utilised to track features as the scene changes. For 2D based classifiers it has been intuitive to measure repeated point reliability using 2D metrics, given the difficulty of establishing ground truth beyond 2D. The aim of this paper is to bridge the gap between 2D classifiers and 3D environments, and improve performance analysis of 2D IP classification on 3D objects. It builds on existing work with 3D scanned and artificial models to test conventional 2D feature detectors with the assistance of virtualised 3D scenes. Virtual space depth is leveraged in tests to perform pre-selection of closest repeatable points in both 2D and 3D contexts before repeatability is measured. This more reliable ground truth is used to analyse testing configurations with a single model and a 12 model dataset across affine transforms in x, y and z rotation, as well as x,y scaling, with 9 well known IP detectors. The virtual scene's ground truth demonstrates that 3D pre-selection eliminates a large portion of false positives that are normally considered repeated in 2D configurations. The results indicate that 3D virtual environments can assist in comparing the performance of conventional detectors when extending their applications to 3D environments, and can result in better classification of features when testing prospective classifiers' performance. A ROC based informedness measure also highlights tradeoffs in 2D/3D performance compared to conventional repeatability measures.


Introduction
In Computer Vision (CV), the establishment of ground truth so that new feature classification algorithms can be properly measured is an ongoing topic of research. With 3D scanning, printing, and realistic rendering, there are increasing opportunities for CV to be applied to virtual scenes, and a multitude of new approaches are exploiting this newly accessible niche [4]. In the field of 2D CV there are well accepted conventions for measuring interest point/key point based feature detection. The most well known are based on research by Schmid and Mikolajczyk [23] [14]; they are still regularly used [12], and have been used a great deal in other CV research relating to interest point/key point repeatability [17] [24] [16]. It remains challenging, however, to establish reliable ground truth for real world environments for the purposes of testing 2D based interest point detectors [15] [12] [5].
Schmid's metric for evaluation of a set of detectors K classifies points between two pixel arrays x_i as either repeated or not, and uses a ratio of true positives and true negatives to measure performance. A threshold based on a radial distance ε around each point in the reference scene x_1 determines classification. Equations 1, 2 and 3 describe this process, with x_1 representing the reference scene as a basis for comparison, and x_i the points of the scene image I_i, which is a member of the set of transforms J being compared. A homography H_1i of x_i enables threshold distances to be measured against x_1, and repeated points to be determined. The default threshold, ε=1.5, represents an error of 1 pixel distance, also known as the Moore neighborhood, and is considered by Schmid, and by researchers in general who apply this metric, to be the optimal tradeoff. Points that don't share the same view area are removed from the validation process, as they share no valid repeatable point candidates.
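As an illustration of the thresholding described above, the following is a minimal sketch rather than the implementation used in this paper. It assumes that reference points are mapped into image I_i through a 3x3 homography, that each reference point is paired greedily with its nearest unused candidate, and that the score is normalised by the smaller of the two point counts, as is usual for Schmid-style repeatability. The function names `apply_homography` and `repeatability` are hypothetical.

```python
import math


def apply_homography(H, point):
    """Map a 2D pixel position through a 3x3 homography given as nested lists."""
    x, y = point
    u = H[0][0] * x + H[0][1] * y + H[0][2]
    v = H[1][0] * x + H[1][1] * y + H[1][2]
    w = H[2][0] * x + H[2][1] * y + H[2][2]
    return (u / w, v / w)


def repeatability(ref_points, img_points, H_1i, eps=1.5):
    """Share of points repeated within eps pixels, over min(n_1, n_i).

    Assumption: greedy nearest-candidate pairing; a full implementation
    might resolve competing pairs differently.
    """
    if not ref_points or not img_points:
        return 0.0
    projected = [apply_homography(H_1i, p) for p in ref_points]
    unused = list(img_points)
    repeated = 0
    for p in projected:
        # Nearest candidate in image i that has not been paired yet.
        best = min(unused, key=lambda q: math.dist(p, q), default=None)
        if best is not None and math.dist(p, best) < eps:
            repeated += 1
            unused.remove(best)
    return repeated / min(len(ref_points), len(img_points))
```

With ε=1.5 a diagonal neighbour (distance √2 ≈ 1.41) counts as repeated, while a point two pixels away does not, which matches the Moore neighborhood reading of the default threshold.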

Repeatability in Virtualised Scenes
This paper builds on the work done by Lang et al. [11] [10], where they demonstrated that a virtualised space, whether it be of images, or 3D models, served as a viable testbed for measuring interest point (IP) performance of conventional 2D detectors. Other approaches to IP generation utilise the ground truth of the model directly [4], but classifiers that only utilise 2D data are not designed to utilise extra dimensions. This limitation means that they can be highly optimised for 2D scenes, but not 3D, and subsequently also means their performance can't be properly measured in real-world 3D scenarios. Additionally, the lack of ground truth available for optimisation means that 3D applications for 2D based classifiers are constrained.

Methodology
For the purposes of measuring the performance of IP detectors that utilise only 2D images, a rendering context is utilised to maintain consistency between 2D and 3D. This preserves 2D consistency of detected IP classification, while also allowing for the precision that the world space of the rendering context provides. Unlike a homography H_1i of the pixel positions of points within two scenes I_1 and I_i, the virtualised scene uses an inverse affine transform T^-1_1i, which enables the precise mapping of detected features to each location in world co-ordinates. Standard Schmid-based repeatability measures utilise pixel positions to determine whether a point is repeated or not, whereas the pre-selection of points operates on floating point coordinates (a world coordinate system for the 3D rendering context). The pre-selection step is described in equations 5 and 6, and replaces the algorithm to determine R_i(ε) shown in equation 3, while not interfering with the subsequent processing steps shown in equations 1 and 2. Additionally, all points now include the z worldspace information as described by equation 4.
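The role of the inverse transform can be sketched as follows. This is an illustrative example only: it assumes the scene transform is a rotation or scale with no translation component, so it can be held as a 3x3 matrix, and the helper names (`rotation_z`, `scale_xy`, `to_reference_frame`) are hypothetical rather than taken from the rendering context used in the paper.

```python
import math


def rotation_z(degrees):
    """3x3 rotation about the z axis."""
    c, s = math.cos(math.radians(degrees)), math.sin(math.radians(degrees))
    return [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]


def scale_xy(factor):
    """3x3 scale applied to the x and y axes only."""
    return [[factor, 0.0, 0.0], [0.0, factor, 0.0], [0.0, 0.0, 1.0]]


def mat_vec(M, p):
    """Multiply a 3x3 matrix by a 3-component point."""
    return tuple(sum(M[r][c] * p[c] for c in range(3)) for r in range(3))


def invert_3x3(M):
    """Invert a 3x3 matrix using the adjugate and determinant."""
    (a, b, c), (d, e, f), (g, h, i) = M
    det = a * (e * i - f * h) - b * (d * i - f * g) + c * (d * h - e * g)
    if abs(det) < 1e-12:
        raise ValueError("transform is singular")
    return [
        [(e * i - f * h) / det, (c * h - b * i) / det, (b * f - c * e) / det],
        [(f * g - d * i) / det, (a * i - c * g) / det, (c * d - a * f) / det],
        [(d * h - e * g) / det, (b * g - a * h) / det, (a * e - b * d) / det],
    ]


def to_reference_frame(world_point, T_1i):
    """Map a point from transformed scene i back to the reference scene."""
    return mat_vec(invert_3x3(T_1i), world_point)
```

Because the mapping works on floating point world coordinates, a point that is transformed and then mapped back lands on its original location up to rounding error, rather than being quantised to a pixel.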
To enable 2D/3D pre-selection, D represents the vector dimensions to be utilised when measuring distance, while the function dist determines the distance from the reference point in world space. Pre-selection happens after points that don't share the same viewport have been removed, but before the points are converted to their pixel positions and ε thresholding is applied. Statically pairing the closest point with its corresponding reference point in 3D space before it is measured in 2D enables the comparison of 2D and 3D pre-selection with minimal disruption, so that later analysis is simplified.
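A minimal sketch of this pairing step is given below, under the assumption that points are (x, y, z) tuples in world space and that D is expressed as a tuple of axis indices. The names `dist` and `preselect` follow the description above, but the signatures are illustrative and not the authors' code.

```python
import math


def dist(p, q, dims):
    """Euclidean distance between p and q over the selected dimensions D."""
    return math.sqrt(sum((p[d] - q[d]) ** 2 for d in dims))


def preselect(ref_points, candidates, dims=(0, 1, 2)):
    """Pair each reference point with its closest candidate over dims.

    dims=(0, 1) gives 2D pre-selection; dims=(0, 1, 2) gives 3D.
    Assumption: points outside the shared viewport were removed earlier.
    """
    pairs = []
    for ref in ref_points:
        if not candidates:
            break
        nearest = min(candidates, key=lambda c: dist(ref, c, dims))
        pairs.append((ref, nearest))
    return pairs
```

The two modes can disagree: a candidate that sits almost on top of the reference point in x,y but at a very different depth is chosen by 2D pre-selection, whereas 3D pre-selection prefers a candidate that is slightly further away in x,y but at the same depth.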
The testing configuration for 3D pre-selection of points follows the methodology of [11]. It uses a 300x300 image (I_i) and applies 47 transforms (J) to each model. The model is rotated in the x and y axes, relative to the viewport, from -50° to +50° in 10° increments (11). It is rotated in the z axis from 0° to 180° in 10° increments (19), and scaled in the x,y axes from 1.0 to 4.0 in 0.25 increments (17). This is applied in two different testing scenarios: the first consists of a single model, and the other a dataset of 12 models. Most of the models are 3D scanned, and sourced from commercial and research sites. The 12 models tested were titled "bowl", "owl", "plaque", "vase", "obelisk", "pot", "marbles", "apple", "Stanford bunny", "happy Buddha", "dragon" and "lucy". The "Stanford asian dragon" model is tested separately. The bowl, owl, plaque, vase, pot, apple and marbles are textured, and the rest use a generic white mesh.
The IP detectors tested (K) were Harris [6], KLT [8], FAST [22,2], SIFT [13] and SURF [7], as well as Rohr [21], Foerstner [3], Beaudet [1] and a different implementation of Harris [6], as implemented in the Vigra library [9]. The process for using pixel-based interest points in a pre-rendered image I_i, in conjunction with a world co-ordinate space to determine repeatability, is summarised in the following steps. Two different datasets are used to measure the effects of generalisation, one of which assesses repeatability at a more localised level per transform. The asian dragon model was chosen due to its highly non-homogeneous surface, its protrusions such as horns, and the potential for misclassification of repeated points because 2D preselection lacks depth. This affords an analysis of the effects of generalisation, as well as of the effects of preselection for a single model.

Analysis of 2D/3D datasets
When comparing the performance of detectors, the first obvious choice is to compare the repeatability at each ε threshold. In most cases ε=1.5 is the preferred threshold for discriminating between detectors. Intuitively, it would be expected that interest points that are able to utilise the depth of the scene would result in more reliable and boosted repeatability rates, given that false positives can be avoided and better candidates chosen. The results in figures 2 and 3 highlight that in most cases the 2D preselection of points provides improved repeatability performance, both across detectors and across most ε thresholds. In many instances more interest points are also detected.
At a superficial level, this could imply that 3D preselection is in fact harming performance, and there are indeed a few theoretical corner cases that could justify this. Namely, points could become occluded and then be picked up as false positives because they are closer in 2D space than other candidate points. This is difficult to justify, however, as only a small number of the 47 transforms could result in this type of occlusion (namely x and y rotation of the model), and it would also require a very low number of points for more unusual or abnormal point candidates to be preselected. Additionally, when examining the asian dragon model at each transform, as shown in figure 1, repeatability at the scene level shows the same increase for 2D preselection across all scenes. Though it is important to recognise this corner case, the effect (if any), and the criteria necessary to exploit it, require exceptional circumstances.
To compare the two datasets, the single asian dragon model and the 12 model dataset, the instances of repeated point pairs between 1xj and ixj for each test were analysed, represented as A and B respectively. To find the tpr we intersect D_A and D_B to find the true positives common to each testing configuration, and for the fpr, intersect and subtract the true positives. This is done at each ε, as described in equations 7 and 8. The intersection of repeated points D^3_ε, which represents the points that utilised 3D data, and D^2_ε, which only used 2D data, provides a ratio of the number of [...]. This provides data sufficient for ROC analysis and calculation of an AUC. However, given that the most common form of analysis in CV performance is comparison at the Moore neighborhood (ε=1.5), it is difficult to use the data in its current form for comparative analysis.
To normalise the tpr and fpr ratios for better comparison, the informedness at each ε threshold can be used, which also provides a performance evaluation that takes into account both true positive and false positive detections by each classifier. Informedness is determined by finding the difference between the tpr and fpr, and has been demonstrated to be a reliable metric that can determine to a greater extent the similarity of data sets (compared to randomness) [18] [19] [20]. The informedness of each detector at each ε threshold can be seen in figures 4 and 5.
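The calculation can be sketched as follows. Since equations 7 and 8 are not reproduced here, this example rests on two loudly stated assumptions: that the 3D preselected set of repeated pairs is treated as the reference labelling against which the 2D set is scored, and that the full set of candidate pairs at a given ε is available to count negatives. The function names are hypothetical.

```python
def rates(candidates, repeated_3d, repeated_2d):
    """Return (tpr, fpr) for the 2D set, scored against the 3D set.

    Assumption: all three arguments are Python sets of hashable pair
    identifiers, and repeated_3d acts as the ground-truth labelling.
    """
    tp = len(repeated_2d & repeated_3d)               # repeated in both
    fp = len(repeated_2d - repeated_3d)               # repeated in 2D only
    fn = len(repeated_3d - repeated_2d)               # repeated in 3D only
    tn = len(candidates - repeated_2d - repeated_3d)  # repeated in neither
    tpr = tp / (tp + fn) if tp + fn else 0.0
    fpr = fp / (fp + tn) if fp + tn else 0.0
    return tpr, fpr


def informedness(tpr, fpr):
    """Informedness is the difference between tpr and fpr."""
    return tpr - fpr
```

A value near 1 means the 2D selection agrees with the 3D selection almost everywhere, a value near 0 means it does no better than chance, and repeating the calculation at each ε gives one comparable figure per threshold.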

Informedness Optimisation
Based on the results shown in figures 4 and 5, there is a clear divergence in the positions of points between preselection that finds the closest points in a 2D environment and in a 3D environment, even though all other testing conditions are identical. It is clear that, unlike figures 2 and 3, which use only the true positives based on Schmid's approach, there is a substantial misclassification of points that is not apparent when only true positives are taken into account. This should not be taken as a slight towards true positive repeatability, however, as establishing ground truth, a necessary prerequisite for such an analysis, is notoriously hard to do reliably or accurately in real world environments. It does highlight that there are substantial benefits in the adoption of virtualised or, more ideally, 3D scanned real-world objects, so that a more objective ground truth exists that can make these performance analyses possible. Also of note is that for both the single model and the more generalised dataset, in figures 4 and 5, the convention of ε=1.5, or Moore neighborhood points, is not necessarily optimal, especially for detections that are not able to preselect points with the assistance of scene depth. In fact, the informedness data suggests that ε=2.0 is generally more favorable across the majority of detectors when tested with the 12 model dataset under the current testing conditions. This informedness of 2D detections indicates that 2.0 should be the preferred threshold when taking into consideration the tradeoff between true positive and false positive detections. Not only does it provide a more rigorous examination of 2D performance compared to 3D, it also indicates the threshold at which 2D performance is best, which would be ideal for optimisation when taking classifiers out of the lab and into the real world.
These tests demonstrate that the additional metric of informedness, in conjunction with better ground truth testing environments that can effortlessly switch between 2D and 3D, could provide new avenues of performance analysis beyond just concentrating on true positives.
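Choosing a threshold from informedness values can be sketched in a few lines. This is an illustrative helper only, assuming the informedness of a detector has already been computed at each ε and stored in a dictionary keyed by threshold; the name `best_threshold` and the tie-breaking rule are assumptions, not part of the method described above.

```python
def best_threshold(informedness_by_eps):
    """Return the (eps, informedness) pair with the highest informedness.

    Assumption: ties are resolved in favour of the smaller eps, since a
    tighter threshold is the more conservative choice.
    """
    if not informedness_by_eps:
        raise ValueError("no thresholds supplied")
    return max(informedness_by_eps.items(), key=lambda kv: (kv[1], -kv[0]))
```

Running this per detector and per dataset would show whether the preferred threshold is stable across detectors or specific to a few of them.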

Conclusion
This paper explores the topic of IP detectors and their repeatability across multiple scene transformations in virtualised 3D spaces with the assistance of 2D and 3D preselection. Though there is a clear move towards utilising 3D ground truth for classifiers that use it natively, 2D classifiers are not able to leverage this benefit. We have sought to formulate a performance analysis that is able to integrate 3D with the assistance of a virtualised ground truth, giving a more balanced analysis of performance compared to conventional repeatability. It does so by building on the proof of concept that virtualised 3D spaces can be used for testing 2D based IP classifiers, and expands on this by testing the differences between finding nearest neighbor points via 2D and 3D worldspace co-ordinates, preselecting the best candidates before applying traditional repeatability metrics. Testing configurations consisted of a single model and a 12 model dataset, to compare the effects of generalisation, and 9 conventional 2D detectors were tested across 47 transforms in x, y and z rotation, and x,y scaling. Though conventional 2D based repeatability showed slightly improved performance, more in depth analysis, made possible by a more reliable ground truth, highlighted that 2D preselection produced considerable false positives compared to those selected using 3D. This was determined via ROC analysis, and was further refined to a single performance metric using informedness to normalise results at each ε threshold. Normalisation via informedness also demonstrated that conventional thresholds, such as only including the Moore neighborhood points as repeatable (ε=1.5), are not necessarily optimal, and other thresholds should be considered for optimisation of classifiers when applied to 2D scenes, in the absence of 3D data.

Future Work
The results of our work show a substantial difference between 2D and 3D preselection in the repeatability, and by extension reliability, of detected points. We have already begun further research building on this, to measure IP detectors when rapidly prototyping and testing classifiers via Genetic Programming (GP). We aim to build on existing research in CV/GP to explore the effects of virtual ground truth with 3D preselection (and without), to determine its effects on classifier design and performance. Another avenue of research being considered is how effective GP based classifier design is when taken beyond virtualised 3D spaces into real world environments.
Another area that deserves further exploration is developing a means of preventing occluded points from potentially being preselected. It is common in conventional 3D graphics to use back-face culling and depth testing, where the depth of the scene is used to determine whether a face is rendered or not for each pixel. Developing a similar process would help to avoid corner cases. This would be an involved process, and would likely require a sophisticated solution at the shader level, but would be a valuable addition to virtual ground truth environments and IP repeatability, particularly where more complex transforms are involved and fewer points appear in the scene in question. We consider this an important next step in pursuing interest point evaluation for repeatability purposes.
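One possible direction, shown purely as a sketch and not as a proposed solution, is to compare each candidate point's depth against a depth buffer read back from the renderer. The example assumes a row-major depth buffer in which smaller values are closer to the camera, and points given as (column, row, depth) with integer pixel indices; `is_visible` and `remove_occluded` are hypothetical names.

```python
def is_visible(point, depth_buffer, tolerance=1e-3):
    """True if the point is not behind the nearest surface at its pixel.

    Assumption: depth_buffer[row][col] holds the depth of the closest
    rendered surface, with smaller values closer to the camera.
    """
    col, row, depth = point
    if not (0 <= row < len(depth_buffer) and 0 <= col < len(depth_buffer[row])):
        return False  # outside the viewport
    return depth <= depth_buffer[row][col] + tolerance


def remove_occluded(points, depth_buffer, tolerance=1e-3):
    """Keep only the points that pass the depth comparison."""
    return [p for p in points if is_visible(p, depth_buffer, tolerance)]
```

Applied before pre-selection, a filter of this kind would stop a hidden point from being paired simply because it lies close to the reference point in x,y. The tolerance is needed because a point on a visible surface rarely matches the stored depth exactly.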