To overcome the limitations of existing TBD methods, this paper provides the following key contributions: (1) a “Normalized Difference Frame” calculation that provides an ideal input for VMF enhancement; and (2) a novel Constrained Velocity Matched Filter (CVMF) that combines known physical constraints with the target’s dynamic motion constraints to enhance its SNR. Our processing workflow is summarized in Fig. 2.
Image stabilization
To eliminate wind-induced camera jitter, we used the first frame of the video as a reference frame and registered each subsequent frame onto it. This was accomplished using the frame-to-frame registration technique described in [34], producing a stabilized frame.
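The registration technique of [34] is not reproduced here; as an illustrative stand-in, and assuming the jitter is purely translational, frame-to-frame registration can be sketched with phase correlation:

```python
import numpy as np

def estimate_shift(ref, frame):
    """Estimate the translational offset of `frame` relative to `ref`
    via phase correlation (peak of the normalized cross-power spectrum)."""
    F1 = np.fft.fft2(ref)
    F2 = np.fft.fft2(frame)
    cross = np.conj(F1) * F2
    cross /= np.abs(cross) + 1e-12            # normalize to unit magnitude
    corr = np.real(np.fft.ifft2(cross))
    dy, dx = np.unravel_index(np.argmax(corr), corr.shape)
    # Map peak indices to signed shifts (peaks past the midpoint wrap around)
    if dy > ref.shape[0] // 2:
        dy -= ref.shape[0]
    if dx > ref.shape[1] // 2:
        dx -= ref.shape[1]
    return dy, dx

def stabilize(ref, frame):
    """Register `frame` onto `ref` by undoing the estimated shift."""
    dy, dx = estimate_shift(ref, frame)
    return np.roll(frame, (-dy, -dx), axis=(0, 1))
```

This sketch handles only circular translation; a practical stabilizer for rotation or perspective jitter needs the full registration method of [34].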
Background estimation
The stabilized frame was then fed to a temporal background estimator so that background subtraction could be performed. Background subtraction can be expressed mathematically as:
$$D\left( t \right) = F_{s} \left( t \right) - B\left( {t - 1} \right)$$
(1)
where \(D\left( t \right)\) corresponds to the Difference Frame at time \(t\), \(F_{s} \left( t \right)\) corresponds to the stabilized frame at time \(t\), and \(B\left( {t - 1} \right)\) corresponds to the background computed in the previous time step. For simplicity of implementation, the popular Gaussian Mixture Model (GMM) background estimation method [17] was used in our processing. However, it is important to note that our method can also be applied with other temporal background estimation methods such as Principal Component Pursuit [18] and subspace tracking techniques [19].
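To illustrate Eq. (1), the sketch below substitutes a simple exponential running average for the GMM estimator of [17]; `alpha` is an assumed background update rate, not a parameter from the paper:

```python
import numpy as np

class RunningAverageBackground:
    """Illustrative temporal background estimator: an exponential running
    average standing in for the GMM of [17]."""

    def __init__(self, alpha=0.05):
        self.alpha = alpha   # assumed background update rate
        self.B = None        # background estimate B(t-1)

    def difference_frame(self, Fs):
        """Return D(t) = Fs(t) - B(t-1) per Eq. (1), then update B."""
        if self.B is None:
            self.B = Fs.astype(float).copy()   # initialize from first frame
        D = Fs - self.B                        # subtraction uses B(t-1)
        self.B = (1 - self.alpha) * self.B + self.alpha * Fs
        return D
```

A moving target then appears in `D` as a signed residual against the slowly adapting background.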
Noise estimator
In general, a background cannot be perfectly estimated regardless of which background estimation method is used. Hence, it is important to model the deviation of the background estimate. To do so, we estimated the temporal variance \(v\) at frame pixel location (\(i,j\)) at each time step \(t\) using an Infinite Impulse Response (IIR) filter with the following equation:
$$v\left( {i,j,t} \right) = \left( {1 - \gamma } \right) D\left( {i,j,t} \right)^{2} + \gamma v\left( {i,j,t - 1} \right)$$
(2)
where \(\gamma \in \left[ {0,1} \right]\) is the variance update rate.
The temporal standard deviation for pixel (\(i,j\)) at time \(t\), is obtained using the following equation:
$$\sigma \left( {i,j,t} \right) = \sqrt {v\left( {i,j,t} \right)}$$
(3)
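Equations (2) and (3) amount to a one-line recursive update per pixel; a minimal sketch, with `gamma` as defined above:

```python
import numpy as np

def update_noise(D, v_prev, gamma=0.95):
    """IIR temporal variance update of Eq. (2) and its square root, Eq. (3).
    D      : current difference frame (per-pixel)
    v_prev : variance estimate from the previous time step
    Returns the updated variance v and standard deviation sigma."""
    v = (1 - gamma) * D**2 + gamma * v_prev
    return v, np.sqrt(v)
```

Because the filter is recursive, a pixel with persistently large differences (e.g., running water) accumulates a large variance, while quiet pixels keep a small one.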
Difference frame normalization
Pixels in different parts of an image can have different temporal standard deviations, depending on factors such as the environment and the scene structure. For example, the temporal standard deviation of pixels in a waterfall region with constantly running water is much higher than that of pixels in an empty field. Hence, it is important to normalize the Difference Frame with respect to its temporal noise estimate before any thresholding is applied. The Normalized Difference Frame \(N_{d}\) for frame pixel location \(\left( {i,j} \right)\) at time \(t\) is expressed as follows:
$$N_{d} = \frac{{D\left( {i,j,t} \right)}}{{\sigma \left( {i,j,t - 1} \right)}}$$
(4)
While numerous existing methods attempt to detect objects on Difference Frames [13,14,15,16], our method attempts to find objects on the Normalized Difference Frame.
Constrained velocity matched filter
The Constrained Velocity Matched Filter (CVMF) uses a combination of physical constraints and motion estimation constraints to find, match, and integrate target signals along a motion path to enhance the target’s SNR. Operating on the Normalized Difference Frame is preferable because it reduces the risk of amplifying noise in high-noise regions (e.g., high scene-contrast regions, waterfalls, etc.). For detecting vehicles in this video, a physical road constraint is imposed in the CVMF processing. However, for other applications, other constraints can be used, such as railroads for trains or pathways inside a building. A summary of the CVMF method is depicted in (Fig. 3).
Given the road constraint, we divided the path into a number of processing regions (called “chips”) along the road in the Normalized Difference Frame. An illustration is shown in (Fig. 4). The size of each chip was 65 × 65 pixels. In general, the chip size should be selected based on knowledge of the target’s size and the path. For example, the region should be big enough to cover the width of the path, with enough margin to account for path uncertainties. In addition, the region should be large enough to include non-target areas.
The continuous VMF process [20, 21] can be implemented in discrete form by shift-and-add operations with different velocity hypotheses along the path region in both the forward and backward directions. For instance, suppose an object moves within the camera’s view over a sequence of time steps, as illustrated in (Fig. 5). For each processing chip, we can perform shift-and-add operations over a range of velocity hypotheses in an attempt to match the target’s movement over a period of time (Fig. 6).
The “sum chip” is the summation of individual chips over the temporal window. Mathematically, this can be expressed as the following:
$$S_{k} \left( {i,j,t} \right) = \mathop \sum \limits_{n = - w}^{w} C\left( {i + n\Delta i,j + n\Delta j,t + n} \right)$$
(5)
where \(S_{k}\) is the summation over pixel \(\left( {i,j} \right)\) across multiple frames, (\(\Delta i, \,\Delta j\)) corresponds to the shift per frame determined by the hypothesized velocity, \(w\) represents the frame window for the summation, and \(k\) corresponds to the index of the matched hypothesis. The total number of matched hypotheses \(K\) can be expressed as:
$$K = M \times N$$
(6)
where \(M\) is the number of directional hypotheses and \(N\) is the number of velocity hypotheses. Since the movement of each target is constrained to a pre-determined path, \(M\) is 2 in most cases (either the forward or backward direction); \(M\) can be greater than 2 when the chip is at an intersection. The number of velocity hypotheses depends on the target’s speed, and the target’s velocity is generally expressed in fractions of a pixel per frame. We started with an initial set of velocities and allowed for further refinement once a track had been established.
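The shift-and-add integration of Eq. (5) for one velocity hypothesis can be sketched as follows, assuming whole-pixel shifts for simplicity (the method itself allows fractions of a pixel per frame):

```python
import numpy as np

def sum_chip(chips, vi, vj):
    """Shift-and-add over a temporal window for one velocity hypothesis.
    chips    : list of 2w+1 chips centered on time t
    (vi, vj) : hypothesized velocity in whole pixels per frame
    The chip at frame offset n is shifted by -n*(vi, vj) so that a target
    moving at (vi, vj) aligns at its frame-t position before summation."""
    w = len(chips) // 2
    S = np.zeros_like(chips[0], dtype=float)
    for n in range(-w, w + 1):
        S += np.roll(chips[n + w], (-n * vi, -n * vj), axis=(0, 1))
    return S
```

When the hypothesis matches the target’s true motion, the target energy integrates coherently over the 2w + 1 frames, while mismatched hypotheses smear it across different pixels.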
To find detections in the sum chip \(S\) for a given hypothesis \(k\), we first normalize the sum chip to form a Z-score chip by computing the mean \(\mu_{s}\) and standard deviation \(\sigma_{s}\) of the sum chip \(S\). For dense target scenarios, it is recommended to use a trimmed mean instead, to avoid high-SNR targets inflating the mean estimate.
$$\mu_{s} = \frac{1}{P}\mathop \sum \limits_{p = 1}^{P} S\left( p \right)$$
(7)
$$\sigma_{s} = \sqrt {\frac{1}{P}\mathop \sum \limits_{p = 1}^{P} (S\left( p \right) - \mu_{s} )^{2} }$$
(8)
Then, we compute the \(Z\) score of the sum chip \(Z_{s}\) for each pixel \(\left( {i,j} \right)\) using the following equation:
$$Z_{s} \left( {i,j} \right) = \frac{{S\left( {i,j} \right) - \mu_{s} }}{{\sigma_{s} }}$$
(9)
The following thresholding logic is applied to perform detection.
If (\(|Z_{s} \left( {i,j} \right)| \ge T\)), then pixel \(\left( {i,j} \right)\) is a candidate detection.
Detected pixel locations are generated from all hypotheses and consolidated to eliminate redundant detections within each chip. Adjacent pixel detections are clustered to represent a single target.
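The Z-score normalization of Eqs. (7)–(9) and the thresholding rule can be sketched as follows, using the plain mean rather than the trimmed mean recommended for dense scenes; `T` is the detection threshold:

```python
import numpy as np

def detect(S, T=5.0):
    """Normalize a sum chip to Z-scores (Eqs. 7-9) and threshold:
    pixels with |Z| >= T are candidate detections.
    Returns an array of (i, j) candidate locations."""
    mu = S.mean()                 # Eq. (7); a trimmed mean for dense scenes
    sigma = S.std()               # Eq. (8)
    Z = (S - mu) / sigma          # Eq. (9)
    return np.argwhere(np.abs(Z) >= T)
```

Clustering of adjacent detections into a single target can then be done with any connected-component labeling over the returned locations.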
The centroid of each target cluster is then fed to the Multiple Target Tracker (MTT) for association and tracking. For simplicity, the MTT is implemented using a simple 4-state constant-velocity model [35]. An object’s dynamic motion can be expressed mathematically using the following equations:
$$\begin{aligned} {\varvec{x}}\left( t \right) = & {\varvec{A}} {\varvec{x}}\left( {t - 1} \right) + {\varvec{q}}\left( {t - 1} \right),\quad {\varvec{q}}\left( t \right)\sim N\left( {0,{\varvec{Q}}} \right) \\ {\varvec{y}}\left( t \right) = & {\varvec{H}} {\varvec{x}}\left( t \right) + {\varvec{r}}\left( t \right),\quad {\varvec{r}}\left( t \right)\sim N\left( {0,{\varvec{R}}} \right) \\ \end{aligned}$$
(10)
where \({\varvec{x}}\) corresponds to the state vector, \({\varvec{y}}\) to the output vector, \({\varvec{A}}\) to the system matrix, and \({\varvec{H}}\) to the output matrix. The system includes additive process noise \({\varvec{q}}\) and measurement noise \({\varvec{r}}\), both modeled as zero-mean white Gaussian noise. The constant-velocity model can be expressed in the following form:
$$\begin{aligned} x_{1} \left( t \right) & = x_{1} \left( {t - 1} \right) + \Delta T x_{3} \left( {t - 1} \right) + q_{1} \\ x_{2} \left( t \right) & = x_{2} \left( {t - 1} \right) + \Delta T x_{4} \left( {t - 1} \right) + q_{2} \\ x_{3} \left( t \right) & = x_{3} \left( {t - 1} \right) + q_{3} \\ x_{4} \left( t \right) & = x_{4} \left( {t - 1} \right) + q_{4} \\ \end{aligned}$$
(11)
where \(x_{1}\), \(x_{2}\) represent the position components of the object, \(x_{3}\), \(x_{4}\) the corresponding velocity components, and \(\Delta T\) the time elapsed between state updates.
In matrix form, this can be expressed as:
$$\begin{aligned} {\varvec{x}}\left( t \right) = & \left[ {\begin{array}{*{20}c} 1 & 0 & {\Delta T} & 0 \\ 0 & 1 & 0 & {\Delta T} \\ 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 1 \\ \end{array} } \right]{\varvec{x}}\left( {t - 1} \right) + {\varvec{q}}\left( {t - 1} \right), \\ {\varvec{y}}\left( t \right) = & \left[ {\begin{array}{*{20}c} 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \\ \end{array} } \right]{\varvec{x}}\left( t \right) + {\varvec{r}}\left( t \right) \\ \end{aligned}$$
(12)
where \({\varvec{Q}}\) is the process noise covariance matrix and \({\varvec{R}}\) is the measurement noise covariance matrix. Kalman filtering can be used to predict and update the state estimate and its covariance \({\varvec{P}}\) at each time step.
Prediction steps:
$$\begin{aligned} \hat{\user2{x}}(k|k - 1) = & {\varvec{A}}\hat{\user2{x}}(k - 1|k - 1) \\ {\varvec{P}}\left( {k|k - 1} \right) = & {\varvec{A}} {\varvec{P}}(k - 1|k - 1){\varvec{A}}^{{\varvec{T}}} + {\varvec{Q}} \\ \end{aligned}$$
(13)
Update steps:
$$\begin{aligned} {\varvec{K}}\left( k \right) = & {\varvec{P}}\left( {k{|}k - 1} \right){\varvec{H}}^{{\varvec{T}}} ({\varvec{HP}}(k|k - 1){\varvec{H}}^{{\varvec{T}}} + {\varvec{R}})^{ - 1} \\ \hat{\user2{x}}(k|k) = & \hat{\user2{x}}(k|k - 1) + {\varvec{K}}\left( k \right)\left( {{\varvec{y}}\left( k \right) - {\varvec{H}}\hat{\user2{x}}(k|k - 1)} \right) \\ {\varvec{P}}(k|k) = & \left( {{\varvec{I}} - {\varvec{K}}\left( k \right){\varvec{H}}} \right){\varvec{P}}(k|k - 1) \\ \end{aligned}$$
(14)
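The constant-velocity Kalman recursion of Eqs. (12)–(14) can be sketched as follows, assuming a unit frame interval (\(\Delta T = 1\)) and illustrative noise covariances:

```python
import numpy as np

# State x = [px, py, vx, vy], measurement y = [px, py], per Eq. (12).
dT = 1.0
A = np.array([[1, 0, dT, 0],
              [0, 1, 0, dT],
              [0, 0, 1, 0],
              [0, 0, 0, 1]], dtype=float)
H = np.array([[1, 0, 0, 0],
              [0, 1, 0, 0]], dtype=float)

def kf_step(x, P, y, Q, R):
    """One predict/update cycle, Eqs. (13)-(14)."""
    # Prediction, Eq. (13)
    x_pred = A @ x
    P_pred = A @ P @ A.T + Q
    # Update, Eq. (14)
    Kg = P_pred @ H.T @ np.linalg.inv(H @ P_pred @ H.T + R)
    x_new = x_pred + Kg @ (y - H @ x_pred)
    P_new = (np.eye(4) - Kg @ H) @ P_pred
    return x_new, P_new
```

Fed the cluster centroids as measurements, the filter’s predicted state and covariance are exactly the quantities fed back to the CVMF as the motion constraint.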
As targets are being tracked, the state vectors \(\hat{\user2{x}}\) and associated covariances \({\varvec{P}}\) (the motion constraint) are fed back to the CVMF process to fine-tune the pre-defined velocity bins and improve matching accuracy. Feedback from the tracker to the CVMF also adds robustness for maintaining tracks on moving objects through temporary occlusions (e.g., a car momentarily obscured by a tree): the tracker’s state can be propagated to the next time step, assuming the target travels at a similar speed, without the need to re-initialize the VMF filters. Different applications might require more sophisticated modeling of dynamic behavior, such as the target’s acceleration [35].