Robust Estimation of Optimal Sample Size for CMM Measurements with Statistical Tolerance Limits

This paper proposes a kernel probability density function approach to estimate the distribution of measurements on a part inspected in a coordinate measuring machine (CMM). The study is based on experimental data derived from internal cylinder measurements. The distribution-free model suggested by Wilks was used as a reference for the selection of the sample size. Three cross sections of a cylinder were measured according to this reference. The work defines the minimum sample size required to capture at least a 0.95 proportion of the radius variation of the studied cylindrical part with a 95% confidence level.


Introduction
The main goal of Geometric Dimensioning and Tolerancing (GD&T) inspection of a part is to assess whether the geometry and dimensions of the part are inside the specified tolerance limits, so as to verify that an assembly fits or that the intended functionality of the part is guaranteed. This paper deals with verification of the radius size variation of an internal cylinder.
Coordinate measuring machines (CMMs) are universal and widely employed automated measuring systems in industry [1]. One of the most critical parameters of a CMM measuring strategy is the number of measuring points used to extract data from the part features. Obviously, a greater number of measuring points provides better accuracy, but it also leads to higher time consumption and costs. Since the accuracy requirement in a design specification is defined by the tolerance interval within which the part dimension or geometry may vary, there should evidently exist a certain number of points sufficient to confirm, with some given probability, whether the size is inside the tolerance limits or not.
The influence of sample size on the measurement result has been widely discussed, and several different approaches have been used to estimate its contribution to the measurement uncertainty. Approaches such as statistical methods [2,3] (for normal distributions), fuzzy logic [4], genetic algorithms [5], extended zone model optimization [6], adaptive sampling strategies using Kriging models [7], and analytical methods with uncertainty simulations [8,9] have been suggested. However, a standard guide or criterion for the sampling strategy in CMM GD&T inspection has not been established yet.
According to [10], "a statistical tolerance interval is an estimated interval, based on a sample, which can be asserted with confidence level 1 − α to contain at least a specified proportion p of the items in the population. The limits of a statistical tolerance interval are called statistical tolerance limits." Theoretically, the statistical tolerance interval for normally distributed data is the most well developed case [11,12]. The international standard ISO 16269-6 tabulates factors for calculating both one-sided and two-sided statistical tolerance intervals for sample sizes n ≥ 2, a population proportion of at least p, and confidence levels 100(1 − α)% of 90%, 95%, 99% and 99.9%. However, if other tolerance intervals (of the form x̄ ± k·s) need to be found that are not provided by the standard, for example another p and/or an atypical 1 − α value, or another sample size, then the K.factor function from the "tolerance" package in the R programming language may be employed to calculate the factor k.
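For readers without the ISO tables or the R package at hand, the two-sided normal-distribution factor k can be approximated by Howe's method. This is a sketch in Python of a standard approximation, not necessarily the exact algorithm used by K.factor; the function name is our own:

```python
import math
from scipy import stats

def k_factor_howe(n: int, p: float, conf: float) -> float:
    """Approximate two-sided normal tolerance factor k (Howe's method).

    The interval mean +/- k*s then covers at least proportion p of a
    normal population with the given confidence level.
    """
    nu = n - 1                              # degrees of freedom
    z = stats.norm.ppf((1 + p) / 2)         # normal quantile for content p
    chi2 = stats.chi2.ppf(1 - conf, nu)     # lower chi-square quantile
    return z * math.sqrt(nu * (1 + 1 / n) / chi2)

# Example: n = 30, p = 0.95, 95% confidence; ISO 16269-6 tabulates ~2.549
print(round(k_factor_howe(30, 0.95, 0.95), 3))
```

The approximation is accurate to about three decimal places for moderate n, which is sufficient for planning purposes.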
This paper provides an approach for estimating an optimal number of measuring points for a two-sided statistical tolerance interval based on a distribution-free model. A continuous probability density function (pdf) of measurements from a workpiece was approximated by a kernel density estimator (KDE). The estimated continuous pdf was then used to simulate different sampling strategies and to evaluate the confidence level for detecting at least 0.95 of the total radius variation range.
Data and experimental study

An internal cylindrical hole of an aluminum workpiece produced by a turning operation, with nominal diameter 60 mm and length 130 mm, was inspected in a CMM (Leitz PMM-C-600) with an analogue probe to detect the largest possible deviation of the radius variable. The cylinder axis was aligned with the vertical axis (z axis) of the CMM. Three cross sections of the cylinder were measured: the first close to the top (Section A), the second in the middle (Section B), and the third at the bottom (Section C). Points (x_i, y_i) were measured around the circle in each section, and the least squares circle (LSC) method was utilized to calculate the circle centre (x_c, y_c). The radius variable r_i for each measured point was calculated by:

r_i = sqrt((x_i − x_c)² + (y_i − y_c)²)    (1)

The uncertainty of the CMM itself is about 10 times smaller than the inspected radius variation range, and thus it is not considered in the analysis of sample size.
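The LSC fit and the radius computation in (1) can be sketched as follows. The linearized (Kåsa) fit shown here is one common way to obtain a least squares circle centre; the paper does not specify which LSC algorithm was used, and all function names and the synthetic data are our own:

```python
import numpy as np

def lsc_center(x, y):
    """Least squares circle centre via the linearized (Kasa) fit:
    solve  a*x + b*y + c = x^2 + y^2  with a = 2*x_c, b = 2*y_c."""
    A = np.column_stack([x, y, np.ones_like(x)])
    rhs = x**2 + y**2
    a, b, c = np.linalg.lstsq(A, rhs, rcond=None)[0]
    return a / 2, b / 2

def radii(x, y):
    """Radius variable r_i of equation (1) for each measured point."""
    xc, yc = lsc_center(x, y)
    return np.hypot(x - xc, y - yc)

# Synthetic check: 473 points on a circle of radius 30 mm centred at (1, 2)
t = np.linspace(0, 2 * np.pi, 473, endpoint=False)
x, y = 1 + 30 * np.cos(t), 2 + 30 * np.sin(t)
print(radii(x, y).round(4))   # all values close to 30.0
```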
Intuitively it is clear that a smaller number of points may provide lower measurement accuracy, due to the probability that extreme points on the feature are missing from the extracted data set. We used MATLAB for our simulation approach to investigate the degree of influence of the sample size on the inspection confidence level and the detected radius variation range.
It is always advisable to evaluate the normality of the distribution of the original data set at the very beginning. The Shapiro-Wilk normality test [13] was applied to the measured data sets using the shapiro.test function in the R programming language. The results are shown in Table 1. The extremely low p-values (especially for Section A and Section B) give us reason to reject the assumption of normally distributed measurements.
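The same normality check can be reproduced outside R, for example with SciPy's shapiro function. The data below are synthetic stand-ins for the 473 measured radii, used only to illustrate the call:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

# Hypothetical stand-ins for 473 radius values per section (in mm):
skewed = 30 + 0.01 * rng.exponential(size=473)   # clearly non-normal
gauss = 30 + 0.01 * rng.normal(size=473)         # normal reference

# A tiny p-value leads to rejecting the normality assumption:
print(stats.shapiro(skewed).pvalue)
print(stats.shapiro(gauss).pvalue)
```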

Distribution-free model
The distribution of the radius variable of the part is not known before we start the measurements. We therefore suggest using the Wilks criterion [14,15] to define the minimum sample size. The criterion is based on order statistics. It postulates the following: if an investigated random characteristic belongs to a population with any unknown continuous distribution function, then at least a content p of the population is included between the smallest observation r_min and the largest observation r_max of the data sample, with confidence level (1 − α) and a required minimum sample size n_min. For the two-sided tolerance interval under the conditions determined above, this can be expressed as follows [10]:

1 − n·p^(n−1) + (n − 1)·p^n ≥ 1 − α    (2)

Results computed by (2) for the minimal sample size of the two-sided statistical tolerance limits (between the first and n-th order statistic of the sample), with an unknown continuous distribution and predefined (1 − α) and p, are shown in Table 2. Since the number of measuring points must be a natural number (n_min ∈ ℕ), negative solutions were not considered, and all results in Table 2 were rounded up to the nearest integer. This fact, together with the distribution independence of (2), gives the method its robust property, which is further confirmed by the experimental data and a simulation model.
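Equation (2) has no closed-form solution for n, but the minimum sample size is easy to find by direct search. A minimal sketch (function names are our own) that reproduces the 473-point result used in this study:

```python
def wilks_confidence(n: int, p: float) -> float:
    """Confidence that (r_min, r_max) of an n-point sample covers at
    least proportion p of any continuous population, per equation (2)."""
    return 1 - n * p**(n - 1) + (n - 1) * p**n

def min_sample_size(p: float, conf: float) -> int:
    """Smallest n satisfying wilks_confidence(n, p) >= conf."""
    n = 2
    while wilks_confidence(n, p) < conf:
        n += 1
    return n

print(min_sample_size(0.99, 0.95))   # 473 (actual confidence ~95.02%)
print(min_sample_size(0.95, 0.95))   # 93
```

Note that the result depends only on p and 1 − α, never on the underlying distribution, which is exactly the robustness property exploited in the paper.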

For p = 0.99 and confidence level 1 − α = 0.95 (the actual confidence level is 95.02%), the minimum sample size is 473 (Table 2). That is the total number of measuring points we used in our inspection of the cylinder sections.

Kernel density estimation
In practice, the data distribution is often unknown and/or may contain outliers. Hence, it is reasonable to estimate the tolerance intervals under more general assumptions when it is impossible to describe the sample data with any known standard distribution function. A possible solution in such a case is to estimate the pdf directly from the measured data sample, using nonparametric statistics. One possibility is the well-known histogram technique. However, the histogram represents the distribution only in the form of bins and is less useful for further application due to its lack of continuity. Meanwhile, only a limited number of known pdfs f(r) are available to describe a continuous-valued random variable (logarithmic, exponential and so on). To avoid such restrictive assumptions about the form of f(r), the KDE may be applied [16]. Furthermore, using the kernel estimator based on the original measured data, it becomes possible to generate random data samples of any size that follow the initial data distribution.

Kernels and weighting function
Similar to the histogram, we need an estimator of f(r). The probability that the random variable R falls within the interval r ± b can be written as:

P(r − b < R < r + b) ≈ 2b·f(r),  and hence  f(r) ≈ (1/2b)·P(r − b < R < r + b).

Alternatively, the frequency in the given interval can be estimated from a sample of size n by:

f̂(r) = (1/(2bn)) · #{r_i : r − b < r_i < r + b},

where the estimator f̂(r) has the properties of a pdf, i.e. it is positive for any r and its integral is equal to 1. A weighting function w(G, b) can then be generalized as:

w(G, b) = (1/b)·K(G/b),

where b is the bandwidth or smoothing constant of the weighting function and K is the standardized weighting function (with b = 1), which is the kernel.
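Summing the weighting function over the data points gives the familiar KDE, f̂(r) = (1/(nb)) Σ K((r − r_i)/b). A minimal sketch with the Epanechnikov kernel used later in the paper; the bandwidth is fixed by hand here (not by MATLAB's default rule) and the data are synthetic placeholders:

```python
import numpy as np

def epanechnikov(u):
    """Epanechnikov kernel K(u) = 0.75*(1 - u^2) for |u| <= 1, else 0."""
    return np.where(np.abs(u) <= 1, 0.75 * (1 - u**2), 0.0)

def kde(r_grid, data, b):
    """Kernel density estimate f_hat(r) = (1/(n*b)) * sum_i K((r - r_i)/b)."""
    u = (r_grid[:, None] - data[None, :]) / b
    return epanechnikov(u).sum(axis=1) / (len(data) * b)

# Placeholder radii standing in for the 473 CMM measurements (in mm):
rng = np.random.default_rng(0)
data = rng.normal(30.0, 0.01, size=473)

grid = np.linspace(data.min() - 0.05, data.max() + 0.05, 2000)
dx = grid[1] - grid[0]
f_hat = kde(grid, data, b=0.005)

print((f_hat * dx).sum())   # ~1.0: the estimate integrates to one
```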

Kernels' parameters
The degree of smoothing of f̂(r) depends on the parameters of w(G, b), namely the kernel K and the bandwidth b, which determine the shape and the width of the weighting function respectively. The proper choice of K and b is the subject of an optimization problem.
The accuracy of the kernel density estimator can be evaluated by the mean squared error (MSE), the mean integrated squared error (MISE) and the asymptotic mean integrated squared error (AMISE). According to previous research [17], the Epanechnikov function is the optimal kernel with respect to these criteria. The bandwidth b depends on different factors, e.g. the unknown pdf f(r), the kernel type, the number of observations in the sample and so on. A number of methods are available to optimize the bandwidth parameter, such as biased cross-validation (BCV), unbiased cross-validation (UCV), the direct plug-in rule (DPI) and others. These methods can perform differently depending on the estimator f̂(r) used and the pdf f(r) being estimated. Thus, we use the Epanechnikov kernel and MATLAB's default bandwidth estimation in this study.
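MATLAB's default bandwidth comes from a normal-reference rule; a comparable and widely used choice is Silverman's rule of thumb, sketched here for completeness. This is our substitute for illustration, not MATLAB's exact formula, and the data are again synthetic:

```python
import numpy as np

def silverman_bandwidth(data):
    """Silverman's rule of thumb: b = 0.9 * min(s, IQR/1.34) * n^(-1/5),
    a normal-reference bandwidth that is robust to mild outliers."""
    n = len(data)
    s = np.std(data, ddof=1)
    iqr = np.subtract(*np.percentile(data, [75, 25]))
    return 0.9 * min(s, iqr / 1.34) * n ** (-0.2)

rng = np.random.default_rng(0)
data = rng.normal(30.0, 0.01, size=473)   # placeholder radii (mm)
print(silverman_bandwidth(data))          # small positive bandwidth
```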

Data estimation by kernel function
The radius variable r_i used in the simulation was computed by (1) under the assumption of a unique circle centre, obtained from the LSC fit based on all 473 measured points. Rounding the values to a resolution of 1·10^−4 mm, on the one hand, allows the cylinder form tendency and possible outliers to be captured, and on the other hand avoids imposing unnecessary accuracy requirements on the estimated circle centre coordinates.

Figure 2. Kernel estimates f̂_A(r), f̂_B(r), f̂_C(r) for the three sections, based on a 473-point sample size.
Estimates of the pdf f(r) for the three cylinder sections, based on the Epanechnikov kernel and a sample size of 473 measuring points, are shown in Fig. 2. Observing the curves, one can notice differences in distribution parameters such as the mean values, the variations and the data spread between the cross-sections, even though they belong to the same cylinder.
All these facts, together with the rounding of the sample values, give us reason to presume that the small centre coordinate offsets can be neglected. The robustness of the simulation model is discussed in the next sections.

Estimation of an optimal sample size
In order to discover an optimal sampling strategy for the inspection of the part, a statistical simulation was carried out in MATLAB using the kernel density estimates f̂_A(r), f̂_B(r), f̂_C(r) obtained from the CMM measurements of the workpiece, see Fig. 2.
Eight predefined sample sizes n ∈ {5, 10, 15, 30, 60, 90, 93, 95} were simulated with 10^5 iterations for each sample size n_i. The maximum r_max and the minimum r_min values were detected for every newly generated sample. The population content p for each iteration was evaluated as the difference of the cumulative distribution function (cdf) values of the maximum and minimum of the random variable r_i, based on the KDE. Then the condition of reaching or exceeding 0.95 of the total radius variation range was tested by:

F_R(r_max) − F_R(r_min) ≥ 0.95,

where F_R(r) is the cdf of the real-valued random variable r_i, calculated from the kernel pdf estimator f̂(r). Each successful iteration was assigned the value 1 (0 otherwise), and these values were summed up as Sum_N. The final probability P% for each sample size n_i was calculated as the ratio Sum_N / M, where M is the total number of iterations.
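The simulation loop can be sketched as follows. For simplicity we use SciPy's Gaussian-kernel KDE (which provides both resampling and cdf evaluation) instead of the Epanechnikov kernel, and synthetic placeholder radii, so the numbers are illustrative; by the Wilks result, however, the coverage probability is essentially distribution-free:

```python
import numpy as np
from scipy.stats import gaussian_kde

rng = np.random.default_rng(0)
measured = rng.normal(30.0, 0.01, size=473)   # placeholder for the 473 radii
f_hat = gaussian_kde(measured)                # continuous pdf estimate

def coverage_probability(n: int, M: int = 2000, p: float = 0.95) -> float:
    """Fraction of M simulated n-point samples whose (min, max) range
    contains at least content p, i.e. F(r_max) - F(r_min) >= p."""
    hits = 0
    for i in range(M):
        sample = f_hat.resample(n, seed=i).ravel()
        content = (f_hat.integrate_box_1d(-np.inf, sample.max())
                   - f_hat.integrate_box_1d(-np.inf, sample.min()))
        hits += content >= p
    return hits / M

print(coverage_probability(5))    # low, ~2%, matching the 5-point case
print(coverage_probability(93))   # ~0.95
```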
Despite the predefined initial sample size (473 points), taken from Table 2 for p = 0.99 and 100(1 − α)% = 95%, the total area under the estimated kernel function is equal to 1 (from the pdf properties). That gives us the opportunity to generate a data sample of any size, even larger than the initial sample size. This in turn makes the simulated results presented in Table 3 independent of the initial parameters of the model (2).

Discussion of results
Analysis of the measurements obtained from the cylinder sections shows that parts produced by a turning operation have an unknown, non-normal distribution of radius variables. The parameters (e.g. mean, variance, skewness) of the distributions in different sections of the cylinder differ from each other. However, the sampling strategy according to Table 3 is ultimately the same for all sections. Thus, the applied simulation model based on the experimental data confirms the robustness of the method proposed by (2). For example, Table 3 shows that the optimal sample size is about 93 points for all sections, which agrees with the data in Table 2. Thereby the simulation based on distributions estimated by the kernel function confirms the Wilks model given by equation (2). We can also notice that the probability of capturing at least a 0.95 fraction of the radius variation range is only about 2% in the case of a 5-point sample.
This gives us reason to expect that further research on cylindrical parts with larger radii, from other machining operations and with different workpiece materials, will most likely provide similar results.
In the simulation model, the circle centre is assumed to be the same for all data samples. For different sample points the centre point would vary, but the maximum possible range between r_min and r_max remains similar. In addition, the minimum number of points n_min (Table 2) was rounded up to the nearest integer, which makes the influence of the centre coordinates negligible. Again, the good agreement of the simulation results with the distribution-free model given by formula (2) demonstrates the insignificant influence of the assumptions about the centre coordinates. It also confirms that the simulation model itself employs a robust principle; namely, the identical optimal number of points (Table 3) for different sections with observably diverse distributions supports this statement.

Conclusion
The innovation of this work is to show the possibility of using the distribution-free model (2) to predict the sample size, its minimum population content and confidence level for GD&T inspection with a CMM before any measurements are performed.
In addition, a simulation model for robust estimation of the optimal sample size based on the experimental measurement data has been developed. The provided simulation procedure allows the evaluation of sample sizes and their confidence levels for real cylindrical components in industry, independently of their dimensions and machining process accuracy. Moreover, the finite sample sizes often used in industry were evaluated. The obtained results demonstrate the particularly low confidence levels for sample sizes from 5 to 30 measuring points.
The inspection sample size in production is often defined with cost and time consumption in mind, and may thereby be too small. The applied technique provides the demanded guidance criteria, based on the confidence level and the real data distribution, for choosing a proper measuring sample strategy for GD&T inspection with CMMs in manufacturing. When solving a practical problem, it is recommended to evaluate the distribution of the initial data at the very beginning. If the data distribution is close to the normal distribution, then the standardized procedure (ISO 16269-6) can be used to estimate the tolerance interval limits. Otherwise, based on the original distribution of the CMM measurements, the predefined confidence level 1 − α, the variation proportion p and the minimum sample size n_min should be estimated by (2). Then the smallest and the largest order statistics of the sample should be used as the tolerance limits.

Table 1 .
Shapiro-Wilk normality test for the 473-point sample

Table 2 .
Minimal sample size n_min for proportion p and confidence level 1 − α

Table 3 .
The statistical test simulation with 10^5 iterations for each sample size n_i