Remote sensing detection enhancement

Big Data in the area of Remote Sensing has been growing rapidly. Remote sensors are used in surveillance, security, traffic, environmental monitoring, and autonomous sensing. Real-time detection of small moving targets using a remote sensor is an ongoing, challenging problem. Since the object is located far away from the sensor, the object often appears too small. The object’s signal-to-noise-ratio (SNR) is often very low. Occurrences such as camera motion, moving backgrounds (e.g., rustling leaves), low contrast and resolution of foreground objects makes it difficult to segment out the targeted moving objects of interest. Due to the limited appearance of the target, it is tough to obtain the target’s characteristics such as its shape and texture. Without these characteristics, filtering out false detections can be a difficult task. Detecting these targets, would often require the detector to operate under a low detection threshold. However, lowering the detection threshold could lead to an increase of false alarms. In this paper, the author will introduce a new method that improves the probability to detect low SNR objects, while decreasing the number of false alarms as compared to using the traditional baseline detection technique.

decreasing the number of false alarms as compared to using the traditional baseline detection technique. The paper is organized into the following sections: Section "Related works" provides a discussion on related research work published in open literature; Section "Experiment setup" focuses on the data used for our experimental results; Section "Methods" is a detailed discussion on our methods used; Section "Results and discussion" goes over our results; and Section "Conclusion" is a summary conclusion of the paper.

Related works
Recent artificial intelligence (AI) based object recognition methods uses a deep learning approach to segment objects in motion [2][3][4]. Deep learning algorithms can now achieve human-level performances on a variety of difficult tasks. Success on these methods generally require large training datasets [5] with quality features [5,6]. However, deep learning methods have several drawbacks; including understanding its reasoning in making decisions. These methods are unclear to a human observer [7,8] and requires large training data sets [5]. This can often lead to unexpected results when real data does not resemble those in the training datasets [9]. Moreover, studies have shown that these methods are vulnerable to adversarial attacks [10], fooling deep learning models to misclassify objects [10,11]. Resolving these challenges are especially important in time critical security surveillance application, where failure to detect a target could result in high consequences.
Traditional change detection methods rely on background subtraction [12]. The background is continuously updated as each new frame is received. For each new frame that comes in, the estimated background will be subtracted from the new frame to produce a "Difference Image". Thresholding can then be applied on the Difference Image to identify changes in the scene [13][14][15][16]. Though these methods do not require pre-trained labels, they generally have a higher false alarm rate. Traditional background estimation can typically be categorized into two categories: pixel-based approaches and dimension reduction approaches. The popular pixel-based approach-Gaussian Mixture Model (GMM) [17], is known for its simplicity in implementation. Dimensional reduction approaches such as Principle Components of Pursuit (PCP) [18] and other subspace estimation approaches [19] are known for their robustness on handling camera jitters and light illumination.
Despite research advances in background modeling and AI object recognition approaches, these methods alone are inadequate to detect low SNR targets. Low SNR targets with limited features and training examples are not ideal for machine learning problems. Lowering the threshold using traditional background subtraction detection, could cause an increased number of false alarms.
Thus far, Velocity Matched Filter (VMF) techniques have been introduced [20,21] and it appears to be extremely effective in improving the detection and tracking of low SNR targets. These methods enhance the target's signal and noise ratio (SNR) by integrating the target's energy over a period of time. It also matches the target's true velocity and direction of where it is traveling. Numerous Track-Before-Detect (TBD) algorithms [22][23][24][25][26] have evolved using VMF. TBD algorithms have been studied in a variety of remote sensors such as Radar [23,24,26,27], Sonar [28], and Infrared (IR) [29]. While earlier publications have demonstrated the effectiveness of using TBD algorithms in single target scenarios, it has also been used for tracking multiple targets, [25,30] generally requiring the number of targets to be known ahead of time. Dynamic programming algorithms for performing VMF without a known target velocity has also been developed [26,31]. While dynamic programming techniques seem to overcome run-time performance limitation, it tends to degrade on maneuvering targets due to the lack of motion modeling. More recently [32], motion models have been incorporated in TBD techniques to better handle target maneuver but the challenge remains on how to initialize tracking. Many of the published techniques are only demonstrated with simulated data.
Though VMF or TBD type techniques have shown to be theoretically appealing in low SNR target tracking scenarios, there are several challenges that make these approaches difficult to apply in the real world. First, the algorithm assumes a priori knowledge in which the number of targets are known or a priori knowledge to initialize the track. This knowledge is often not provided in real-time surveillance application. Recent advances made to [22,33] eliminate these problems are by inserting an additional detection process before the TBD processor. However, initiating a low SNR target track is a challenge for this framework. If the target is not detectable by the threshold set by the initial detector process, then TBD will not start. On the other hand, setting a low detection threshold to force TBD to initiate could lead to many false alarms. Secondly, to gain the maximum benefit of the TBD approach, it assumes the noise is white [21]. Nonstationary changes such as jitters or sensor noises which are not removed, can introduce correlated noises that can potentially degrade VMF's performance. Finally, VMF assumes that the objects can be seen and that finding the best matched hypothesis is always available. However, in a real-world situation, objects may not always be observable by the sensors. For example, a person's movement can be observed by the sensor, but not when the person is walking behind a big tree. In this situation, finding the best matched filter becomes difficult because all the match hypotheses could be invalid.

Experiment setup
To evaluate our technique in a real-world scenario, a video camera (Table 1) was placed at the top of Sandia Mountain (Table 2) to collect live traffic data. The distance of the camera to the target on the ground was approximately 4000 feet. Since the image is  large, (Fig. 1) only shows a cropped off portion of the image. The targets were barely visible, and they appeared like small dots. The size of the target(s) in the image ranged from 4 to 20 pixels. The camera video is subjected to jitter motion naturally induced by the wind. This setup allowed us to evaluate the challenges of remote target detection under a real-world natural setting.

Methods
To overcome the limitations of existing TBD methods, this paper provides the following key contributions: (1) An ideal "Normalized Difference Frame" calculation to perform VMF enhancement; and (2) A novel Constrained Velocity Matched Filter (CVMF) that combines known physical constraints with the target's dynamic motion constraints to enhance its SNR. Our processing workflow is summarized as shown in (Fig. 2).

Image stabilization
To eliminate motion jitters on the camera induced by wind, we used the first frame of the video as a reference frame and registered subsequent frames from the video onto  the reference frame. This was accomplished by using a frame-to-frame registration technique as described in [34] to create a stabilized frame.

Background estimation
The stabilized frame was then fed to a temporal background estimator so that the background subtraction could be performed. The process of background subtraction can be expressed mathematically using the following equation: where D(t) , corresponds to the Difference Frame at time t, F s (t) corresponds to the stabilized frame at time t , and B(t − 1) corresponds to the background computed in the previous time step. For simplicity in implementation, the popular Gaussian Mixture Modeling (GMM) background estimation method [17] was used in our processing. However, it is important to note that our method can also be applied to other temporal background estimation methods such as the Principal Component of Pursuit [18], and Subspace Tracking techniques [19].

Noise estimator
In general, a background cannot be perfectly estimated regardless of which background estimation method used. Hence, it is important to model a deviation of background models. To model the estimated background deviation, we estimated the temporal variance v of frame pixel location ( i, j ) at each time step t using an Infinite Impulse Response (IIR) filter with the following equation: where γ is the variance update rate [0,1]. The temporal standard deviation for pixel ( i, j ) at time t , is obtained using the following equation:

Difference frame normalization
Pixels in different parts of an image can have different temporal standard deviation, depending on factors such as the environment and the scene structure. For example, the temporal standard deviation of the pixels in the waterfall region with constant running water is much higher than the pixels of an empty field. Hence, it is important to normalize the Difference Frame with respect to its temporal noise estimation before any thresholding is applied. The Normalized Difference Frame N d for frame pixel location i, j in time t is expressed as follow: (1) While numerous existing methods attempt to detect objects on Difference Frames [13][14][15][16], our method attempts to find objects on the Normalized Difference Frame.

Constrained velocity matched filter
The Constrained Velocity Matched Filter (CVMF) uses a combination of physical constraint and motion estimation constraint to find, match, and integrate target signals along a motion path to enhance the target's SNR. Performing operation on the Normalized Difference Frame is more ideal because it reduces the risk of enhancing the noise on high noise region areas (e.g., high scene contrast region, waterfalls, etc.). For detecting vehicles in this video, a physical road constraint is imposed in the CVMF processing. However, for other applications, other constraints can be used, such as railroads for trains, pathways inside a building. A summary of the CVMF method is depicted in (Fig. 3).
Given the road constraints, we divided the path into different numbers of processing region (called "chips") along the road in the Normalized Difference Frame. An illustration is shown in (Fig. 4). The size of each "chip" used was 65 × 65 pixels. In general, the size of the "chip" should be selected based on the knowledge of the target's size and the path. For example, the region selected should be big enough to cover the width of the path with enough margin to account for path uncertainties. In addition, the region should be large enough to include non-targeted areas.  The continuous VMF process [20,21] can be implemented in a discrete form, by shiftand-add operation with different velocity hypotheses along the path region in both forward and backward direction. For instance, suppose an object's movement is within the camera's view over a sequence time step as illustrated in (Fig. 5). For each processing chip, we can perform a range of shift-and-add operation for a range of velocities in attempt to match the target's movement over a period of time (Fig. 6).
The "sum chip" is the summation of individual chips over the temporal window. Mathematically, this can be expressed as the following: where S , is the summation of the pixel i, j across multiple frames. ( i, j ) corresponds to the shift positions, and w , represents the frame window for the summation, and k corresponds to the index of the matched hypothesis. The total number of matched hypothesis K can be expressed as: where M is the number of directional hypotheses and N is the number of velocity hypotheses. Since the movement of the individual targets are constrained in a predetermined path, M is 2 in most cases (either forward or backward direction). M can be greater than 2 when the chip is at the intersection. The number of velocities depends on the target's speed. The units of the velocity in the target's movement can generally be described in fractions of pixels per frame. We started with an initial set of velocities and allowed for further refinement once a track has been established.
To find the detection in the sum chip S for a given hypothesis k , we first normalized the sum chip to form a Z-score chip. We can do this by computing the mean µ s and standard deviation σ s of the sum chip S . For dense target scenarios, it is recommended that a trim mean is used instead, to avoid high SNR targets inflating the mean estimates. Then, we compute the Z score of the sum chip Z s for each pixel i, j using the following equation: The following thresholding logic is applied to perform detection. If ( |Z s i, j | ≥ T ), then pixel i, j is a candidate detection. Pixel detected locations are generated from all hypotheses. They are consolidated to eliminate redundant detections from each chip. Adjacent pixel detections are clustered to represent a single target.
The centroid of the target's cluster is then fed to the Multiple Target Tracker (MTT) for association and tracking. To simplify, MTT is implemented using a simple 4-state constant velocity model [35]. An object's dynamic movement can be expressed mathematically using the following equations: where x corresponds to the state vector, y corresponds to the output vector, A corresponds to the system matrix, and H corresponds to the output matrix. The system includes additive process noise q and measurement noise r , which are modeled as white noise gaussian with zero mean. The constant velocity model can be expressed in the following form: where x 1 , x 2 represents the positions of the object, and x 3 , x 4 corresponds to the velocity state of each position component, and T , corresponds to delta time changes between the state update.
In matrix form, this can be expressed as: where Q , is the process noise matrix, and R , is the measurement noise matrix. Kalman Filtering can be used to predict and update the state estimates and its covariance estimate P at each time step. Prediction steps: Update steps: As the target(s) are being tracked, the state vectors x associated with covariance P (motion constraint) are fed back to the CVMF process to fine tune the pre-defined velocity bins and improve the accuracy of matching. Having feedback from a tracker to CVMF also adds robustness to maintain the tracking of moving objects in a temporary occlusion (e.g., a car temporarily obscured by a tree). The tracker's state is capable of propagating to the next time step, assuming the target is traveling in a similar speed without the need to re-initialize VMF filters. Different applications might require a more sophiscated modeling of dynamic behavior such as the target's acceleration [35].

Results and discussion
We compared our method with our baseline processing method as depicted in (Fig. 7). The baseline processing workflow is the same as the one in (Fig. 2) but without the additional CVMF component. The same video was used as an input to both processing methods. Both methods were evaluated over the same detection areas in the same regions of the image. Valid detections in the frames were manually labeled to provide assessment for the probability of detection and false alarm measures. (13) x(k|k − 1)=x(k − 1|k − 1) Fig. 7 Baseline detection processing Figure 8 shows a comparison of the Receiver Operating Characteristics (ROC) data curve as a baseline, along with different window of frames used in the CVMF calculation. The ROC curve improves as the window of frames increases; however, it reaches an asymptotic state at seven frames. A more sophisticated motion model is probably needed to integrate the target's motion over a longer framed window. An example of a qualitative comparison is depicted in (Fig. 9). The baseline Normalized Difference Frame shows a very low SNR target at around 4.0. After incorporating the CVMF enhancement, the target's SNR increased to 8.0.
To support our claim of performing VMF operations on Normalized Difference Frames instead of operating on Difference Frames, (Fig. 10) shows the ROC curve comparison of a baseline single Difference Frame thresholding versus using the CVMF method operating on Difference Frames. An example of a qualitative comparison is depicted in (Fig. 11). Though CVMF is also effective in boosting SNR on the Difference Frame domain, it does not perform as well on the Normalized Difference Frame. A direct ROC curve comparison of CVMF operating on Difference Frame versus CVMF operating Normalized Difference Frame is shown in (Fig. 12). As shown, operation on Normalized Difference Frames significantly outperforms operation on Difference Frames.

Conclusion
In this paper, we have provided a new method to enhance detection of low SNR targets. The innovation of this work is evident by the issuance of a U.S. patent [36]. Our CVMF method incorporates physical constraints from known roads and dynamic motion  constraints obtained from a Kalman Tracker to accurately find, match, and integrate target signals over multiple frames to improve the target's SNR. We have demonstrated our results using real data collected by an actual sensor. Our method has established a significant improvement over baseline traditional detection techniques. In addition, we have proven that our technique can achieve better performance if the CVMF operations were performed under the Normalized Difference Frames. Currently, our CVMF  performance converges to a steady state at seven frames. In the future, we plan to incorporate a more sophisticated motion model (e.g., constant acceleration model) to support a longer frame integration window. A more sophisticated motion model can potentially provide a more accurate estimation of the target's motion over a longer period of time.