Real-Time Stereo Matching using Adaptive Window based Disparity Refinement

In this paper, we propose a real-time stereo matching method based on adaptive windows, aiming at the trade-off between accuracy and efficiency in current local stereo matching. Considering that the Census transform adapts well to image amplitude distortion but may introduce matching ambiguities in regions with noise or similar local structures, we combine the Census transform with AD (absolute differences) for matching cost initialization, and adopt an iterative cost aggregation method based on ESAW (exponential step adaptive weight) to improve parallelism and execution efficiency. Furthermore, in the disparity refinement stage, we build an adaptive window for each unreliable point based on pixel color similarity and Euclidean distance, and classify the unreliable points as 'occlusion' or 'mismatch', so that a different refinement strategy can be applied to each class. Finally, the proposed method is optimized with CUDA (compute unified device architecture) and evaluated on a graphics processor. The experimental results show that the proposed method is the most accurate among the real-time stereo matching methods listed on the Middlebury stereo benchmark.


Introduction
Stereo matching is the process of finding corresponding points that represent the same object in different views of the same scene. The distance between the coordinates of corresponding points in different views, namely the disparity, can be used to compute the depth of the scene. Stereo matching is widely applied in three-dimensional modeling, motion capture and intelligent vehicle navigation, and is a key technique in the field of stereo vision. Although its principle is simple, accurately computing a dense depth map remains highly challenging because of interference in realistic scenes, such as scene complexity, severe occlusion, lighting variation and image noise. Scharstein and Szeliski (2002) comprehensively summarize and categorize two-view stereo matching methods and their evaluation. According to the optimization method used, dense stereo matching can be classified into global methods and local methods.
Global stereo matching couples the depth estimates of adjacent pixels by introducing smoothness constraints, and estimates the disparity of all pixels by solving a global objective function. Classical methods include graph cut (Boykov et al. 2001) and belief propagation (Scharstein et al. 2003). Global optimization yields disparity maps of high precision, but because of its large number of parameters and high computational complexity, it cannot be applied to systems with real-time requirements.
Local stereo matching usually adopts a window-based approach and searches and matches each pixel individually. Because the correlation between adjacent pixels need not be taken into account, it is relatively easy to implement and efficient. However, it assumes that all pixels inside the window have the same depth, which often does not hold, especially in boundary areas where depth is discontinuous; choosing a proper window size is therefore the key point of this approach. Windows that are too small weaken the smoothing effect and amplify depth noise, whereas windows that are too large cause mismatches on fine structures and in discontinuous areas. Adaptive weight (Yoon and Kweon 2006) and variable window (Zhang et al. 2009) are two classical local methods that ensure relatively high precision while significantly reducing computational complexity. Recently, to improve stereo matching in occluded and featureless areas, global-local hybrid methods (Hirschmüller 2008) and segmentation-based methods (Bleyer et al. 2010) have been proposed, which further enhance the accuracy of the disparity map at the cost of a slightly increased amount of computation. However, because dense stereo matching requires a separate disparity calculation for each pixel in the image, sequential computing on a single processor can hardly meet real-time requirements.
Recent literature shows that, with the performance improvement of GPGPU (general-purpose graphics processing unit), using parallel programming techniques such as CUDA (compute unified device architecture) to improve and optimize stereo matching has become a research hotspot. Gong et al. (2007) implemented parallel acceleration for six mainstream local stereo matching methods and compared their execution performance. Building on the adaptive weight method, Yu et al. (2010) put forward an iterative aggregation method using exponential step-size adaptive weight (ESAW), and improved the algorithm's speed by nearly ten times through parallel optimization on a GTX8800 graphics processing unit. Zhang et al. (2011) parallelized the variable window method and proposed a fast bit-wise voting method to accelerate the post-processing procedure, enabling the algorithm to reach nearly 100 fps on images with a resolution of 320x240. However, because of data dependence during the cost aggregation stage, the above methods always concede some accuracy in order to achieve parallelization. To compensate for this accuracy loss, Kowalczuk et al. (2013) presented a refinement method that iterates horizontally and vertically, similarly to the aggregation stage, to refine the initial disparity image, which effectively improves the accuracy of the disparity map. But this method does not differentiate error points according to their types, so its disparity estimates in occluded areas are not quite satisfactory.
In this paper, targeting the balance between accuracy and execution time, we present a novel real-time stereo matching method. With an adaptive window based disparity refinement process, the proposed solution achieves accuracy comparable to that of global methods, and it supports parallel processing so that it can be used for real-time video applications. The rest of the paper is organized as follows. Section 2 describes the procedure of disparity estimation, including matching cost initialization and cost aggregation. Section 3 presents the details of disparity refinement, focusing on occlusion detection and adaptive window construction. Section 4 shows the experimental results. Finally, we conclude with a brief summary in Section 5.

1. Matching cost initialization
The core task of stereo matching is to find the corresponding points that represent the same object in the left and right images. The matching cost is the metric used to judge whether two pixels are similar, and it is the only basis for disparity estimation. A good matching cost function therefore plays a very important role in making the algorithm robust across different scenes. Hirschmüller and Scharstein (2009) studied and evaluated the commonly used matching cost functions, and after numerous experiments they found that the matching cost based on the Census non-parametric transform (Zabih and Woodfill 1994) has the most balanced performance over all test sets, with particularly good adaptability to the intensity distortion caused by lighting variation. The main idea of the Census transform is to build a window centered on an arbitrary pixel p, and to generate a bit-string with one bit for each pixel p' in the window other than p according to formula (1):
\[ \xi(p, p') = \begin{cases} 1, & I(p) > I(p') \\ 0, & \text{otherwise} \end{cases} \qquad p' \in N(p) \tag{1} \]
Here I(p) is the luminance value of pixel p, and N(p) is the transform window centered on p. The similarity between two pixels can thus be measured by the distance between their Census bit-strings, namely the Hamming distance: the smaller the Hamming distance, the better the two pixels match.
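The Census transform and Hamming distance described above can be illustrated with a minimal Python sketch. The function names, the border clamping and the comparison direction (a bit is 1 when the neighbor is darker than the center) are assumptions made for illustration, not details confirmed by the paper.

```python
def census_bits(img, x, y, half_w=4, half_h=3):
    """Census bit-string (packed into an int) for pixel (x, y).

    img is a 2D list of gray values; the default half sizes give a
    9x7 window (9 wide, 7 tall).
    """
    h, w = len(img), len(img[0])
    center = img[y][x]
    bits = 0
    for dy in range(-half_h, half_h + 1):
        for dx in range(-half_w, half_w + 1):
            if dx == 0 and dy == 0:
                continue  # the center pixel contributes no bit
            # Clamp at the image border (assumed border handling).
            yy = min(max(y + dy, 0), h - 1)
            xx = min(max(x + dx, 0), w - 1)
            # Assumed direction: 1 when the neighbor is darker than the center.
            bits = (bits << 1) | (1 if img[yy][xx] < center else 0)
    return bits


def hamming(a, b):
    """Number of differing bits between two Census bit-strings."""
    return bin(a ^ b).count("1")
```

Because only the ordering of intensities enters the bit-string, applying a gain and offset to the whole image (for example 2*I + 10) leaves every bit unchanged, which is the robustness to amplitude distortion the text refers to.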
The Census transform computes the matching cost from the relative ordering of pixel values rather than from the values themselves, so it adapts better to amplitude distortion. However, we find that using the Census transform alone as the matching cost may cause the following two problems: 1) The Census transform discriminates poorly between pixels in areas with simple textures and repeated structures. Because the specific pixel values are not considered, different pixels in these areas are likely to have the same Census transform values. 2) The Census transform has a low tolerance of image noise. Because it depends heavily on the central pixel, the matching result is unpredictable when the central pixel is corrupted by noise. Hirschmüller and Scharstein (2009) also mention that matching results based on the Census transform are not quite satisfactory under intense noise.
To address these problems, and considering that the commonly used AD cost function matches better in smoothly textured areas, we adopt a matching cost function that combines AD and the Census transform. For an arbitrary pixel p in the left view, we establish a 9x7 window centered on p and define the matching cost of p from the following quantities: d is the hypothesized disparity value; D_census and D_AD are the matching costs based on the Census transform and AD respectively; ρ(D,λ) is a normalization function; I_L(x,y) and I_R(x,y) are the gray values of the corresponding pixels in the left and right views respectively; C_census is calculated from formula (1); and λ_census, λ_AD and τ are preset parameters. The normalization keeps the matching cost from leaning too heavily toward either term, and the λ parameters make it easy to control the weights of the two costs, so that the algorithm combines the advantages of Census and AD and adapts better to both intensity distortion and noise. Figure 1 shows the stereo matching results using AD, Census and AD+Census as the matching cost. On Tsukuba and Venus, AD performs better than Census, whereas on Teddy and Cones, Census performs better than AD. AD+Census, combining the advantages of the two, performs best on all four sequences and has the lowest average error rate of the three methods.
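As a rough illustration of how two normalized costs can be combined, the sketch below assumes the normalization ρ(D,λ) = 1 − exp(−D/λ) and assumes that τ truncates the AD term. Neither form can be confirmed from the text here, and the parameter values are placeholders rather than the paper's settings.

```python
import math


def rho(cost, lam):
    """Assumed normalization: maps a non-negative cost into [0, 1)."""
    return 1.0 - math.exp(-cost / lam)


def ad_census_cost(d_census, d_ad, lam_census=30.0, lam_ad=10.0, tau=60.0):
    """Combined matching cost for one pixel and one disparity hypothesis.

    d_census: Hamming distance between the two Census bit-strings.
    d_ad:     absolute gray-value difference |I_L - I_R|.
    tau:      assumed truncation threshold for the AD term.
    All default parameter values are hypothetical.
    """
    d_ad = min(d_ad, tau)
    return rho(d_census, lam_census) + rho(d_ad, lam_ad)
```

With this choice each term is bounded by 1, so neither cost can dominate the sum regardless of its raw scale, and a smaller λ makes the corresponding term saturate faster.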

Iterative cost aggregation
After matching cost initialization, the initial matching cost of each pixel is aggregated, and the disparity of each pixel is estimated from the aggregated cost. Cost aggregation is the most time-consuming step in local stereo matching. In this paper, to balance execution time and accuracy, we trade a limited amount of accuracy for a significant reduction of execution time in the disparity estimation stage, and then take a XXX-4 refinement step in the post-processing stage to verify and update the initial disparity values, improving accuracy at little time cost. Inspired by ESAW (Yu et al. 2010), we adopt the exponential step-size iterative aggregation method to estimate disparity. Its basic idea is to construct a support window for each pixel and to aggregate the initial matching costs of the pixels in the window using adaptive weights. It simplifies the original 2D aggregation into two 1D aggregations, horizontal and vertical, and uses an exponential iterative process to further reduce the time complexity. After computing the final matching cost of the center pixel under every hypothesized disparity value, we use the common WTA (winner-takes-all) strategy and select the disparity with the smallest matching cost as the initial disparity of the center pixel.
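The following Python sketch illustrates one 1D pass of iterative aggregation with exponentially growing steps, followed by WTA selection. It is a simplified illustration, not the ESAW implementation: the step schedule (1, 3, 9, ...), the intensity-only weight exp(−|ΔI|/γ), the per-iteration normalization and all names are assumptions.

```python
import math


def aggregate_row(cost, intensity, iterations=2, gamma=10.0):
    """Aggregate the costs of one image row for one disparity hypothesis.

    In each iteration a pixel combines its own running cost with those of
    the two pixels at distance `step`, weighted by intensity similarity;
    the step is then tripled (assumed schedule).
    """
    n = len(cost)
    agg = list(cost)
    step = 1
    for _ in range(iterations):
        nxt = []
        for x in range(n):
            num, den = agg[x], 1.0  # a pixel supports itself with weight 1
            for q in (x - step, x + step):
                if 0 <= q < n:
                    w = math.exp(-abs(intensity[x] - intensity[q]) / gamma)
                    num += w * agg[q]
                    den += w
            nxt.append(num / den)
        agg = nxt
        step *= 3
    return agg


def wta(costs_per_disparity):
    """Winner-takes-all: index of the smallest aggregated cost."""
    return min(range(len(costs_per_disparity)),
               key=costs_per_disparity.__getitem__)
```

Every output pixel in an iteration depends only on the previous iteration's values, so all pixels can be updated independently, which is what makes this scheme attractive for a GPU. A vertical pass would apply the same routine to image columns.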
Compared to the traditional adaptive weight method, the iterative aggregation method is more suitable for parallel processing, but may lose some accuracy during computation, mainly for the following two reasons: 1) The traditional adaptive weight method computes the weights in both the target image and the reference image during matching cost aggregation, whereas the parallel method computes only the pixel weights of the target image and abandons the weights of the reference image, as shown in formula (7). This is more suitable for parallel computation, but the aggregated matching cost is less discriminative between different pixels.
\[ E(p, \bar{p}) = \frac{\sum_{q \in N(p)} w(p, q)\, w(\bar{p}, \bar{q})\, C(q, d)}{\sum_{q \in N(p)} w(p, q)\, w(\bar{p}, \bar{q})} \;\longrightarrow\; E(p, \bar{p}) = \frac{\sum_{q \in N(p)} w(p, q)\, C(q, d)}{\sum_{q \in N(p)} w(p, q)} \tag{7} \]
2) In terms of weight calculation, the exponential step-size iterative aggregation method decomposes the computation of a single weight into the product of the weights of several intermediate pixels, as shown in formula (8). This introduces an error when an intermediate pixel r_k differs greatly from the two target pixels p and q while the difference between the target pixels themselves is relatively small. Such errors lead to mismatches, which are particularly prominent in occluded areas and at image edges.
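The second source of error can be seen in a small numeric sketch. It assumes an intensity-only weight of the form exp(−|ΔI|/γ), with the spatial term omitted; the function names and the value of γ are hypothetical.

```python
import math


def weight(a, b, gamma=10.0):
    """Assumed adaptive weight between two gray values (spatial term omitted)."""
    return math.exp(-abs(a - b) / gamma)


def chained_weight(intensities, gamma=10.0):
    """Weight between the first and last pixel, built as a product of the
    weights of consecutive pixels along the path, as in the decomposition."""
    w = 1.0
    for a, b in zip(intensities, intensities[1:]):
        w *= weight(a, b, gamma)
    return w
```

For two target pixels with the same gray value 100, the direct weight is 1. If the intermediate pixel has value 200, the chained weight is exp(−10)·exp(−10), which is practically zero, so q contributes almost nothing to the support of p although the two pixels are similar.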
However, the proposed method is able to achieve a high level of accuracy while maintaining real-time operation by adopting an adaptive window based disparity refinement method.

1. Occlusion and mismatch detection
In this paper, we propose a disparity refinement method based on adaptive windows to improve disparity accuracy at little time cost. First, we conduct a consistency check on the left and right disparity images through formula (9), where d_L and d_R represent the left and right disparity images respectively. Inspired by the work of Hirschmüller (2008), we further classify error points into occluded points and mismatched points. According to the geometry of stereo vision, if a pixel in the left view has no corresponding point in the right view, then its epipolar line in the right view has no intersection with the disparity values of the right view; conversely, if the scene point is visible in both views, there must be an intersection. We can therefore use this principle to judge whether an error point is caused by occlusion or by mismatch.
For an error point p in the left image, define d_L(x,y_p) and d_R(x,y_p) as the disparity values of the pixels in the same row as p in the left and right views respectively, and define the epipolar line e(p,d) corresponding to p as a point set parameterized by d, as given in formula (10). If d_R(x,y_p) and e(p,d) do not intersect, p is regarded as an occluded point; otherwise, it is regarded as a mismatched point. Figure 2 shows the result of using this method to detect occlusion on the left view of Tsukuba. From left to right, the figure shows the original image, the result of this paper and the ground truth. Because the occlusion test also needs the right disparity image, which contains some errors when generated by our method, there are some differences between the detected occlusion and the ground truth, as shown in figure 2.
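The consistency check and the occlusion/mismatch classification can be sketched for a single image row as follows. The tolerance of the check, the convention that a left pixel at column x maps to column x − d in the right view, and all names are assumptions, since formulas (9) and (10) are not reproduced in the text.

```python
def classify_row(d_left, d_right, max_disp, tol=1):
    """Label each pixel of one left-view row as 'ok', 'occlusion' or 'mismatch'.

    d_left, d_right: integer disparities of the same row in both views.
    tol:             assumed tolerance of the left-right consistency check.
    """
    n = len(d_left)
    labels = []
    for x in range(n):
        d = d_left[x]
        xr = x - d  # matching column in the right view (assumed convention)
        if 0 <= xr < n and abs(d - d_right[xr]) <= tol:
            labels.append("ok")
            continue
        # Walk along the epipolar line e(p, d): does any hypothesis dd
        # coincide with the right disparity at the column it points to?
        hit = any(
            0 <= x - dd < n and d_right[x - dd] == dd
            for dd in range(max_disp + 1)
        )
        labels.append("mismatch" if hit else "occlusion")
    return labels
```

For example, with a constant disparity of 1 in both views, the leftmost pixel of the left row points outside the right image and no hypothesis intersects the right disparities, so it is labeled as occluded.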

2. Disparity refinement based on adaptive window
After the error points are classified, different updating strategies can be adopted for the different error types. Occluded points usually come from background pixels, so they can be updated with the minimum disparity in their adjacent background areas. Mismatched points usually occur on object surfaces with complex textures, in either the foreground or the background, where the disparities in the adjacent areas do not vary much, so these points can be refined according to the statistics of the disparity values in adjacent areas. This paper proposes to build an adaptive window based on color similarity and Euclidean distance for each error point. Since areas with similar colors should, in theory, have similar disparity values, the values of error points can be refined on the basis of the disparity statistics inside the window. The construction of the adaptive window is similar to that of the original variable window (Zhang et al. 2009). For any error point p, define a four-element set as in formula (11). The elements of this set constitute two orthogonal line segments through p, one horizontal and one vertical, as illustrated in formula (12). The lengths of these two segments are determined under the color and distance constraints in formula (13), where L_1, L_2, τ_1 and τ_2 are preset parameters satisfying L_1 > L_2 and τ_1 > τ_2, q+ denotes the pixel following q, D_s(p,q) is the Euclidean distance between p and q, and D_c(p,q) is the color difference between p and q, calculated as in formula (14).
Unlike the original variable window (Zhang et al. 2009), this paper improves the constraints. Formula (13a) sets an upper limit on the size of the window. Formula (13b) requires that not only the color difference between any pixel q and p, but also the color difference between the pixel following q and p, be less than a threshold, so that part of the interference of image noise can be excluded. Formula (13c) requires that when the distance between q and p exceeds L_2, a smaller threshold be used to further constrain the color difference between q and p. This gives better control over the window size: a relatively large window can be obtained in areas where the color difference is small, while the window is kept from becoming too small in areas with rich textures.
After establishing the foregoing orthogonal segments for each error point, all horizontal segments of the pixels on the vertical segment where p is located are combined together to form the p-centered adaptive window, as shown in formula (15).
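The window construction described above can be sketched in Python as follows. The color difference D_c is assumed to be the largest per-channel absolute difference, the three tests mirror the verbal description of constraints (13a) to (13c), and the parameter values and function names are hypothetical.

```python
def color_diff(a, b):
    """Assumed D_c: largest per-channel absolute difference of two RGB triples."""
    return max(abs(ca - cb) for ca, cb in zip(a, b))


def arm_length(line, start, step, L1=34, L2=17, tau1=20, tau2=6):
    """Grow an arm from line[start] in direction step (+1 or -1)."""
    p = line[start]
    length = 0
    q = start + step
    while 0 <= q < len(line):
        dist = abs(q - start)
        if dist >= L1:                       # (13a) upper bound on the arm
            break
        if color_diff(line[q], p) >= tau1:   # (13b) q must resemble p ...
            break
        nxt = q + step
        if 0 <= nxt < len(line) and color_diff(line[nxt], p) >= tau1:
            break                            # (13b) ... and so must the pixel after q
        if dist > L2 and color_diff(line[q], p) >= tau2:
            break                            # (13c) stricter threshold far from p
        length = dist
        q += step
    return length


def adaptive_window(img, x, y, **params):
    """Union of the horizontal arms of all pixels on the vertical arm of (x, y)."""
    column = [row[x] for row in img]
    up = arm_length(column, y, -1, **params)
    down = arm_length(column, y, +1, **params)
    window = set()
    for yy in range(y - up, y + down + 1):
        left = arm_length(img[yy], x, -1, **params)
        right = arm_length(img[yy], x, +1, **params)
        window.update((xx, yy) for xx in range(x - left, x + right + 1))
    return window
```

In a uniformly colored region the arms grow until the L1 bound stops them, while a small but persistent color difference lets them grow only up to L2, which matches the behavior the constraints are meant to produce.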
Figure 3 shows the adaptive windows obtained for four pixels in the left view of Teddy. The shape of the window can change arbitrarily, and the boundary of the window is basically consistent with the color boundary, whereas in areas with small color differences the window can reach the preset maximum size. After the adaptive window is established, and following the previous analysis, an error point caused by occlusion is updated with the minimum disparity among the reliable points in the window, and an error point caused by mismatch is refined with the disparity value that occurs most often among the reliable points in the window. The concrete refinement method is given in formula (16), where N*(p) is the set of reliable disparity points in the window centered on p and Ψ(d_q) is the statistical histogram of N*(p).
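The two refinement rules can be expressed in a few lines of Python. The function name, the label strings and the fallback used when the window contains no reliable point are assumptions made for illustration.

```python
from collections import Counter


def refine(label, reliable_disparities, fallback):
    """Refined disparity for one error point.

    label:                'occlusion' or 'mismatch'.
    reliable_disparities: disparities of the reliable points inside the
                          adaptive window, i.e. the set N*(p).
    fallback:             value kept when no reliable point is available
                          (assumed behavior).
    """
    if not reliable_disparities:
        return fallback
    if label == "occlusion":
        # Occluded points belong to the background: take the smallest disparity.
        return min(reliable_disparities)
    # Mismatched points: take the peak of the disparity histogram.
    return Counter(reliable_disparities).most_common(1)[0][0]
```

For the reliable disparities [5, 3, 7, 3, 7, 7], an occluded point would receive 3 and a mismatched point would receive 7.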
Finally, a 3x3 median filter is applied to the disparity image to further remove disparity noise, which completes the algorithm.
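A 3x3 median filter on a disparity image stored as a 2D list can be sketched as below. Leaving the one-pixel border unchanged is an assumption, and the function name is hypothetical.

```python
def median3x3(disp):
    """Return a copy of disp with every interior pixel replaced by the
    median of its 3x3 neighborhood; border pixels are left unchanged."""
    h, w = len(disp), len(disp[0])
    out = [row[:] for row in disp]
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            window = sorted(
                disp[yy][xx]
                for yy in (y - 1, y, y + 1)
                for xx in (x - 1, x, x + 1)
            )
            out[y][x] = window[4]  # middle of the nine sorted values
    return out
```

An isolated outlier, such as a single value of 50 surrounded by 5s, is replaced by 5, while constant regions are left untouched.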

Experiment and Analysis
The proposed method has been parallelized, tested and evaluated on an NVIDIA GeForce GTS450 graphics card (192 CUDA cores and 1 GB of graphics memory). The parameter settings involved in the optimization process are provided in Table 1. Other relevant parameters are set to the default values given in the reference papers. To evaluate accuracy, this paper verifies the proposed method on the classic Middlebury benchmark: Tsukuba, Venus, Teddy and Cones. The results of disparity estimation are illustrated in figure 4, where the first column is the original image, the second is the ground truth, the third is the result of the proposed method and the fourth is the difference between the result of the proposed method and the ground truth, in which black areas mark errors in occluded regions and gray areas mark errors in non-occluded regions. The disparity estimates show that the proposed method is more accurate in occluded areas, especially on Venus and Teddy, and in the area with repeated textures (the net-like structure in the upper right) on Cones. For the object boundaries in Tsukuba and the bottom area of Teddy, however, the result of the proposed method needs further improvement.
Submitting the third column of images in figure 4 to the Middlebury evaluation website gives the quantitative results shown in table 2. Table 2 lists the proportion of error pixels, with an error threshold of 1: a pixel is marked as an error if the difference between its computed disparity and the ground-truth value is larger than 1. There are three error types: nonocc, the proportion of error pixels in non-occluded areas; all, the proportion of error pixels in all areas; and disc, the proportion of error pixels in discontinuous areas. The other algorithms involved in the comparison are the effective SemiGlob method (SemiGlob), the classic adaptive weight method (AW), the method based on iterative disparity refinement (IterRefine), the method based on adaptive windows (VariableWnd) and the exponential step-size adaptive weight method (ESAW). The latter three are real-time stereo matching methods based on CUDA, and IterRefine is the most recent real-time stereo matching method in the Middlebury database.
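The error measure described above, the percentage of pixels whose disparity differs from the ground truth by more than the threshold, can be computed as in the following sketch. The optional mask argument, which restricts the evaluation to a region such as the non-occluded or discontinuous pixels, and the function name are illustrative assumptions rather than the Middlebury evaluation code.

```python
def error_rate(disp, truth, mask=None, threshold=1.0):
    """Percentage of evaluated pixels with |disp - truth| > threshold.

    disp, truth: 2D lists of disparities with the same shape.
    mask:        optional 2D list; only pixels with a truthy entry are
                 evaluated (e.g. a nonocc or disc region).
    """
    bad = total = 0
    for y, row in enumerate(truth):
        for x, t in enumerate(row):
            if mask is not None and not mask[y][x]:
                continue
            total += 1
            if abs(disp[y][x] - t) > threshold:
                bad += 1
    return 100.0 * bad / total if total else 0.0
```

Note that a difference of exactly 1 does not count as an error, in line with the "larger than 1" criterion.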
Table 2 shows that ESAW, being an AW-based method with a simplified aggregation process, is less accurate than AW, which is consistent with the previous analysis. IterRefine applies disparity refinement on top of ESAW and performs even better than AW. The proposed method, however, is better than IterRefine and achieves the highest accuracy among all real-time stereo matching methods. Its average error rate is equivalent to that of SemiGlob, nearly 1 percentage point lower than that of AW, and a 30% improvement over ESAW. This is mainly attributable to its accurate results in occluded and complex texture regions.
In terms of execution time, the proposed method is mainly compared with ESAW, which is widely recognized as a good real-time algorithm and on which the proposed method is based. Table 3 provides the average execution time of the two methods on the four Middlebury evaluation sequences, measured as the total time for estimating the two disparity maps of the left and right views. The proposed method, which includes a disparity refinement procedure that ESAW does not have, is slower than ESAW. It processes Tsukuba at nearly 20 fps, but for images with higher resolution and more disparity hypothesis levels it cannot meet real-time requirements. It is worth mentioning, however, that the execution time is closely related to the performance of the GPU and the degree of parallel optimization. Given the good parallel potential of the proposed method, using a GPU with more CUDA cores and larger memory, or conducting more in-depth parallel optimization for the specific architecture, should markedly reduce the execution time.

Conclusion
This paper proposes a real-time stereo matching method based on adaptive window disparity refinement. The main features of this method are: 1) by combining the advantages of AD and the Census transform, the proposed method adapts better to intensity distortion in the images; 2) in the disparity refinement stage, based on the assumption that areas with similar colors should have similar disparity values, an adaptive window is built for each error point, and occlusion errors and mismatch errors are refined separately according to the statistics of the disparity values in the adaptive window, which enables the proposed method to give better estimates in areas with discontinuous depth; 3) the proposed method is well suited to parallelization, and after parallel optimization it reaches real-time processing for Tsukuba images. Compared to the existing real-time stereo matching methods listed on the Middlebury evaluation platform, the proposed method is the most accurate one. The next step is to conduct further parallel optimization of the algorithm to achieve real-time processing of images in mainstream formats. In addition, the proposed method has too many preset parameters, and an adaptive parameter mechanism is also worth researching.

Fig. 1. Results of stereo matching using different matching costs
Fig. 2. Result of occlusion detection on Tsukuba

Fig. 3. Examples of adaptive window construction
Fig. 4. Results of the proposed algorithm on the Middlebury datasets

Table 3. Execution time of the proposed algorithm and ESAW (ms)