 Open access
Out-of-distribution and location-aware PointNets for real-time 3D road user detection without a GPU
Journal of Big Data volume 11, Article number: 2 (2024)
Abstract
3D road user detection is an essential task for autonomous vehicles and mobile robots, and it plays a key role, for instance, in obstacle avoidance and route planning tasks. Existing solutions for detection require expensive GPU units to run in real-time. This paper presents a light algorithm that runs in real-time without a GPU. The algorithm combines a classical point cloud proposal generator approach with a modern deep learning technique to achieve a small computational requirement and accuracy comparable to the state-of-the-art. Typical downsides of this approach, such as many out-of-distribution proposals and loss of location information, are examined, and solutions are proposed. We have evaluated the performance of the method with the KITTI dataset and with our own annotated dataset collected with a compact mobile robot platform equipped with a low-resolution LiDAR (16-channel). Our approach reaches real-time inference on a standard CPU, unlike other solutions in the literature. Furthermore, we achieve superior speed on a GPU, which indicates that our method has a high degree of parallelism. Our method enables low-cost mobile robots to detect road users in real-time.
Introduction
Detection of road users is an essential task for autonomous vehicles and mobile robots, and it is crucial, for instance, in obstacle avoidance and route planning tasks. The task focuses solely on detecting road users, i.e., cars, pedestrians, and cyclists. This allows the utilization of class-specific biases; for example, objects are expected to be on the ground surface. The task differs from the more general object detection task, which aims to detect any object. Many object detection methods have been developed for detecting road users from point clouds. However, much of the effort has been directed toward methods that achieve high accuracy, and significantly less toward low-latency approaches. The present study addresses this issue with a novel architecture designed with attention to latency. An algorithm with low latency and computational cost has many benefits. It allows better use of high-frequency (> 20 Hz) sensors, as the algorithm itself runs at a high frequency. A light algorithm decreases energy consumption, resulting in a longer operation time for mobile robots. It also removes the requirement for a GPU, enabling lighter and less expensive hardware, which is crucial in many applications.
To tackle the issue of latency in neural networks, attention should be directed toward memory (DRAM) reads, as stated by Horowitz [1]. This paper takes a unique data-dependent approach to 3D road user detection to achieve low latency. Unlike most methods in the literature, our approach saves memory by discarding irrelevant parts of the input point cloud. Specifically, a simple algorithm removes coarse parts of the point cloud, such as the ground plane. Then, more complex algorithms, deep neural networks, make the detections. Therefore, as the amount of data decreases, the complexity of the algorithm increases, and the neural network processes only the filtered, relevant data. In practice, object proposals are generated with ground removal and clustering, and novel out-of-distribution and location-aware PointNets then make the detections from the proposals. We filter out-of-distribution proposals throughout the pipeline to reduce the memory and computational footprint. We are also the first to implement an out-of-distribution detection method on classification and bounding box estimation tasks with point cloud object proposals. This approach achieves superior latency and accuracy comparable to the state-of-the-art in 3D road user detection. Huang et al. [2] studied out-of-distribution detection methods directly on 3D object detection, showing the applicability of such methods.
As stated, the literature has many point cloud object detection methods tested for 3D road user detection. However, most of them require a powerful GPU unit to run in real-time, which some mobile robots do not have due to, e.g., power, cost, weight, or size restrictions. To complement this underexplored area, we develop an algorithm that runs in real-time on a standard CPU. We summarize the contributions of this paper as follows.

A novel architecture for 3D road user detection on point cloud data running in real-time on a CPU.

To the best of our knowledge, we are the first to implement an out-of-distribution detection method on classification and bounding box estimation tasks with point cloud object proposals.

A novel proposal voxel location encoder, which improves the accuracy of the models by a significant margin.

A ground segmentation method, which outperforms competitive methods. We present simple convolutional filters for sampling ground points for the plane fit.

A study on a 3D road user detection task with models trained only with the KITTI high-resolution point cloud data and tested with low-resolution and low-perspective point cloud data.
Related work
This section presents frequently used 3D object detection approaches that are well-proven and provide a fair comparison to our approach. The approaches are divided into voxel, graph, projection image, point, and bird’s eye view methods. VoxelNet [3] partitions the point cloud into voxel partitions. Points within a voxel partition are encoded into a vector representation characterizing the shape information. 3D convolution is performed on the voxels to predict bounding boxes and classes. SECOND [4] improved the accuracy and reduced the computational cost of VoxelNet by dropping voxels that contain no points during preprocessing.
Authors of [5,6,7,8] use methods based on a range image. A range image is a point cloud projection on a spherical surface. The benefit of using projection is to preserve the information of neighboring measurements. However, the authors claim that some information is lost during the projection. The authors use a 2D convolutional neural network to predict bounding boxes and classes.
One approach is to use a bird’s eye view image as input for the detection algorithm [9,10,11,12]. PointPillars [9] is a popular method using this approach. It constructs pillar-like features from the point cloud with a neural network and forms a bird’s eye view image. Then, a 2D convolutional neural network is applied to this image, making predictions. Bird’s eye view is a convenient representation of the point cloud because objects are rarely stacked on top of each other in, for example, outdoor driving scenarios, given an optimal vertical field-of-view of the sensor.
More recently, graph neural networks (GNNs) have been implemented into the point cloud object detection task. Shi et al. [13] were the first ones to do so. In their method, point clouds are preprocessed into a graph representation utilizing a grouping method similar to PointNet++ [14]. Then, a GNN architecture predicts the classes and bounding boxes. At the time of publication, they achieved superior results in several public benchmarks, proving that graphbased preprocessing works well in the object detection task from point clouds.
Point-based methods use raw point clouds as an input [15,16,17]. One benefit is that no preprocessing is needed, e.g., voxelization or range- or bird’s-eye-view projection. Typically, this results in a lower computational cost overall. However, preserving the geometrical context is challenging since points are processed individually. Ngiam et al. [18] proposed that targeting computation to certain regions benefits computational cost and generalizability. They generated proposals with the furthest and random point sampling and utilized a neural network architecture for estimating bounding boxes.
Methods
A general schematic of the proposed architecture is presented in Fig. 1. The basic principle is to generate simple, unclassified proposals and utilize classifiers that differentiate between in-distribution (ID) and out-of-distribution (OOD) proposals. This is implemented using novel energy-based OOD PointNets. The first PointNet predicts the class probability vectors and energy scores, which are used for discarding the first batch of OODs. This is done in the first ID passthrough module. Then, 3D bounding boxes are predicted for the remaining proposals with an alternative PointNet, which also predicts energy scores. Further, the second ID passthrough module filters out the remaining OODs based on the bounding box energy scores. In addition, a novel proposal voxel location encoder (PVLE) is utilized to preserve location information otherwise lost in the proposal normalization process. PVLE aims to increase the accuracy of the neural networks without adding a significant amount of computation. The increased accuracy would demonstrate that proposal location information has value in the proposal classification and bounding box estimation tasks. The architecture achieves low computational requirements by discarding information hierarchically while preserving information helpful to the road user detection task.
Ordered point cloud representation
Point coordinates from a typical LiDAR sensor \(\textbf{p} = (x, y, z)\) are mapped \(\Pi :\mathbb {R}^{n \times 3} \mapsto \mathbb {R}^{s_h \times s_w \times 3}\) to spherical coordinates, and finally to image coordinates, as defined by
where \((s_h, s_w)\) are the height and width of the desired projection image representation, \(f_v\) is the total vertical field-of-view of the sensor, and \(f_{vup}\) is the vertical field-of-view spanning upwards from the horizontal origin plane. The resulting list of image coordinates is used to construct a (x, y, z)-channel image, which is the input for the next stage of the architecture.
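As a sketch, this mapping can be implemented with the standard range-image projection used for spinning LiDARs. The field-of-view values below are illustrative assumptions (roughly HDL-64E-like defaults), not the paper's exact parameters:

```python
import numpy as np

def project_to_image(points, s_h=64, s_w=1024,
                     f_v=np.radians(26.9), f_v_up=np.radians(2.0)):
    """Map an (n, 3) point cloud to an (s_h, s_w, 3) ordered point image.

    Sketch of the standard spherical (range-image) projection; the
    field-of-view defaults are assumptions for illustration.
    """
    x, y, z = points[:, 0], points[:, 1], points[:, 2]
    r = np.linalg.norm(points, axis=1)
    yaw = np.arctan2(y, x)                    # azimuth in [-pi, pi]
    pitch = np.arcsin(z / r)                  # elevation angle
    u = 0.5 * (1.0 - yaw / np.pi) * s_w       # column index from azimuth
    v = (f_v_up - pitch) / f_v * s_h          # row index, 0 at the top beam
    u = np.clip(np.floor(u), 0, s_w - 1).astype(int)
    v = np.clip(np.floor(v), 0, s_h - 1).astype(int)
    image = np.zeros((s_h, s_w, 3), dtype=points.dtype)
    image[v, u] = points                      # later points overwrite earlier ones
    return image
```

The ordered image preserves sensor-neighborhood structure, which the ground filters and clustering below rely on.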
Ground segmentation
As an additional contribution, we introduce a novel ground segmentation method, which is ultra-fast and more accurate than most of the methods in the literature. Our ground segmentation combines a novel point sampling with the well-proven RANSAC plane fitting method. Figure 2 shows a graphical presentation of the method.
It is essential to segment the ground at the beginning of the pipeline for the following reasons: (a) to reduce the number of points to process, (b) to remove points that are invalid and not considered in later stages of the pipeline, and (c) to improve the performance of the clustering algorithm, as some clusters would otherwise be fused through ground points. We sample the ordered point cloud for potential ground plane points with two Sobel-inspired [19] convolutional filters. These kernel filters are formulated as follows.
The filters are discrete differentiation operators that operate on the ordered point cloud tensor. In convolution, \(\textbf{S}_v\) and \(\textbf{S}_u\) yield approximations of vertical and horizontal derivatives on the ordered point cloud, respectively. The first term of \(\textbf{S}_v\) and the second term of \(\textbf{S}_u\) are the centers of the filters. The filters incorporate information from multiple neighboring points with a discretized Gaussian function, where points closer to the center have a higher effect on the result of the computation. This is done because simply computing the derivatives as a subtraction between two neighboring points is too noisy in a typical LiDAR measurement. In a typical LiDAR sensor, the horizontal resolution is significantly higher than the vertical resolution. Therefore, the shape of \(\textbf{S}_u\) is \(1 \times 4\), which means that it does not consider points on neighboring rows, as they are significantly more distant than points on neighboring columns (Fig. 2). This way, the filter captures the approximation of the local derivative more accurately. The convolutions are computed with a range channel \(\sqrt{\textbf{X}^2 + \textbf{Y}^2} = \textbf{R} \in \mathbb {R}^{s_h\times s_w}\) and a height channel \(\textbf{Z} \in \mathbb {R}^{s_h\times s_w}\).
where \(*\) denotes the 2dimensional convolution operation. Matrices \(\textbf{F}_y\) and \(\textbf{F}_x\) denote an approximation of pointwise normal, as filters produce derivatives. Approximating the pointwise normal with this method requires only a small amount of computation while achieving satisfactory accuracy. We apply a threshold to \(\textbf{F}_y\) and \(\textbf{F}_x\), which gives us a mask of ground point samples:
Then, the samples \(\textbf{G}_{ samples } \in \mathbb {R}^{s_h \times s_w \times 3}\) are computed for the RANSAC algorithm:
where \((\textbf{X}, \textbf{Y}, \textbf{Z}) \in \mathbb {R}^{s_h \times s_w \times 3}\) contain the Cartesian coordinates of each point in the point cloud, and \(\odot\) indicates Hadamard product. Only the sampled points are considered in the random sampling of the RANSAC algorithm. Thus, only a handful of iterations are needed, compared to a case where the samples are taken from the entire point cloud, which reduces the computational cost. \(RANSAC ((\textbf{X}, \textbf{Y}, \textbf{Z}) _s, \textbf{G}_{ samples }) = \{a_1, a_2, a_3, a_4\} _s\) gives the parameters of detected planes, as we run this operation on sectors of points (Fig. 2). If the distance between a point and the detected plane in the corresponding sector satisfies a threshold, it is labeled as ground.
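The pipeline above — derivative-based sampling of ground candidates followed by a RANSAC plane fit restricted to those samples — can be sketched as follows. For brevity, a plain vertical finite difference stands in for the paper's Gaussian-weighted Sobel-like kernels (whose exact weights are not reproduced here), a single plane is fitted instead of per-sector planes, and all thresholds are illustrative assumptions:

```python
import numpy as np

def segment_ground(xyz, slope_thr=0.15, n_iter=25, dist_thr=0.2, seed=0):
    """Label ground points in an ordered (s_h, s_w, 3) point image.

    Sketch: sample flat points via a slope estimate, then run a small
    RANSAC plane fit over the sampled candidates only, so few iterations
    are needed.  Kernels, thresholds, and single-plane fitting are
    simplifying assumptions relative to the paper.
    """
    rng = np.random.default_rng(seed)
    X, Y, Z = xyz[..., 0], xyz[..., 1], xyz[..., 2]
    R = np.sqrt(X**2 + Y**2)
    # Vertical finite differences approximate the slope dZ/dR between beams.
    dR = np.diff(R, axis=0)
    dZ = np.diff(Z, axis=0)
    slope = np.abs(dZ) / np.maximum(np.abs(dR), 1e-6)
    samples_mask = np.zeros(Z.shape, dtype=bool)
    samples_mask[1:] = slope < slope_thr          # nearly flat w.r.t. the beam below
    samples = xyz[samples_mask].reshape(-1, 3)
    if len(samples) < 3:
        return np.zeros(Z.shape, dtype=bool)
    # RANSAC restricted to the sampled candidates.
    best_inliers, best_plane = 0, None
    for _ in range(n_iter):
        p0, p1, p2 = samples[rng.choice(len(samples), 3, replace=False)]
        n = np.cross(p1 - p0, p2 - p0)
        norm = np.linalg.norm(n)
        if norm < 1e-9:
            continue                              # degenerate (collinear) draw
        n = n / norm
        d = -n @ p0
        count = int((np.abs(samples @ n + d) < dist_thr).sum())
        if count > best_inliers:
            best_inliers, best_plane = count, (n, d)
    if best_plane is None:
        return np.zeros(Z.shape, dtype=bool)
    n, d = best_plane
    # A point is labeled ground if it lies close to the fitted plane.
    return np.abs(xyz.reshape(-1, 3) @ n + d).reshape(Z.shape) < dist_thr
```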
Angle-based clustering
The clustering algorithm is a key component of the proposal generator. We use the angle-based clustering method [20], mainly because it is fast (a full scan from a 64-channel LiDAR in 10 ms with a single core of a 2.2 GHz CPU, as reported in [20]) but also because it is sparsity invariant. Moreover, this method is less prone to fusing nearby objects into the same cluster because the clustering is based on an angle instead of a distance measurement. This is crucial and affects the performance of our architecture. However, our architecture does not have constraints that would prevent the usage of other standard clustering algorithms such as [21,22,23,24,25]. Still, we found through experimentation that these methods caused our architecture to perform worse. The angle-based clustering algorithm is sparsity invariant because it computes the angle between neighboring points. If this angle satisfies a threshold, the point is assigned the label of the current cluster. A breadth-first search (BFS) adds points to the current cluster, and a completed BFS indicates the completion of a cluster. The algorithm is fast since it takes advantage of the order of the point cloud, which makes finding neighboring points convenient.
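The BFS growth described above can be sketched with the angle criterion of [20]: two neighboring measurements join the same cluster if the angle \(\beta\) between the laser ray and the line connecting the points exceeds a threshold. The angular resolutions and the threshold below are illustrative assumptions:

```python
import math
from collections import deque

def angle_cluster(ranges, h_res, v_res, beta_thr=math.radians(10.0)):
    """Angle-based clustering on an ordered range image (list of rows).

    Sketch of the criterion from [20]; zero range marks invalid pixels.
    Returns an image of integer cluster labels (0 = unlabeled/invalid).
    """
    s_h, s_w = len(ranges), len(ranges[0])
    labels = [[0] * s_w for _ in range(s_h)]
    current = 0

    def beta(d1, d2, alpha):
        # Angle between the longer ray and the chord to the neighbor.
        hi, lo = max(d1, d2), min(d1, d2)
        return math.atan2(lo * math.sin(alpha), hi - lo * math.cos(alpha))

    for r0 in range(s_h):
        for c0 in range(s_w):
            if labels[r0][c0] or ranges[r0][c0] <= 0.0:
                continue
            current += 1
            labels[r0][c0] = current
            queue = deque([(r0, c0)])
            while queue:                       # breadth-first growth of one cluster
                r, c = queue.popleft()
                for dr, dc, alpha in ((1, 0, v_res), (-1, 0, v_res),
                                      (0, 1, h_res), (0, -1, h_res)):
                    rn, cn = r + dr, (c + dc) % s_w   # wrap around horizontally
                    if not (0 <= rn < s_h) or labels[rn][cn] or ranges[rn][cn] <= 0.0:
                        continue
                    if beta(ranges[r][c], ranges[rn][cn], alpha) > beta_thr:
                        labels[rn][cn] = current
                        queue.append((rn, cn))
    return labels
```

Because neighbors are image-adjacent pixels, no spatial search structure is needed, which is where the speed comes from.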
Energy-based out-of-distribution detection
The energy-based out-of-distribution detection method is well-proven and performs well in image classification tasks. Its implementation is also convenient because its input is the raw output of a neural network, unlike with methods such as [26,27,28,29,30]. Moreover, it is computationally light. Therefore, we define it as the baseline method on point cloud data.
The basic idea of an energy-based function is to map each point of the input space to a non-probabilistic scalar called energy \(E(\textbf{x}; f):\mathbb {R}^L \mapsto \mathbb {R}\) [31]. In our application, the output vectors of the classifier and bounding box estimator networks are mapped to their respective energy scalars, which represent the distance to the class distribution. The method presented here is based on Liu et al. [32], where a modified version of the Helmholtz free energy from statistical mechanics is used as \(E(\textbf{x}; f)\) [33].
where \(\textbf{x}\) denotes the input of a neural network, T a temperature scalar, L the number of logits, and f a neural network. We utilize Equation (8) in training the classifier and box estimator networks and at inference time for separating IDs from OODs.
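The free-energy score of Liu et al. [32], \(E(\textbf{x}; f) = -T \log \sum_{i=1}^{L} e^{f_i(\textbf{x})/T}\), can be sketched as a numerically stabilized log-sum-exp over the logits:

```python
import numpy as np

def energy_score(logits, T=1.0):
    """Helmholtz-style free energy of a logit vector:
    E(x; f) = -T * log(sum_i exp(f_i(x) / T)).
    Lower energy indicates a more in-distribution-looking sample.
    """
    logits = np.asarray(logits, dtype=float)
    m = logits.max() / T                      # stabilize the log-sum-exp
    return -T * (m + np.log(np.exp(logits / T - m).sum()))
```

A confident prediction (one large logit) yields a low energy, while a flat, uncertain logit vector yields a higher one, which is what the passthrough thresholds exploit.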
Network architectures and training objectives
This paper presents a novel implementation of an energybased OOD detection method for point cloud classifiers and 3D bounding box estimation networks. The proposed neural network architectures (Fig. 3) build on PointNet [34], applying some applicationspecific modifications to reduce computational cost and increase performance. These modifications include concatenating the proposal voxel position encoder features to leverage the observation angle and distance and simplifying the network in the main encoders and the fully connected layers. The main modification is implementing an energybased OOD learning objective to mitigate the false positive rate since our proposal generator is simple, resulting in vast OOD instances. Furthermore, we discovered that the respective critical and the upper bound point sets for classifier and box estimation networks differ for the same cluster of points. We exploit this phenomenon by implementing two separate ID passthrough modules for improved OOD detection. The classifier and the bounding box estimator inputs are transformed with RNet and TNet networks, respectively. RNet predicts a rotational matrix \(\textbf{T}_{rot} \in \mathbb {R}^{3 \times 3}\), which normalizes the rotation angle of the samples to simplify the classification task. Similarly, TNet predicts a transformation matrix \(\textbf{T}_{tra} \in \mathbb {R}^{3}\), which normalizes the location of the samples to simplify the bounding box estimation task.
The classifier training objective is to minimize the classification cross-entropy and an ID/OOD squared hinge loss,
where \(\sigma\) is the softmax output of the classifier c. The energy loss is computed as
where \(\textbf{x}_{ ID }\) is an ID sample from the KITTI training split (points inside a ground truth bounding box), and \(\textbf{x}_{ OOD }\) is an OOD sample (clustered points outside the ground truth bounding boxes), also from the KITTI training split. The terms \(m_{ ID }\) and \(m_{ OOD }\) are the means of the ID and OOD energies of the default trained network, respectively. They are used in Equation (10) to push down the energy of IDs, lift the energy of OODs, and expand the energy gap between them. Both terms are precomputed and static during training.
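The squared hinge pattern described above — penalize ID energies above the margin \(m_{ID}\) and OOD energies below the margin \(m_{OOD}\) — can be sketched as follows; equal weighting of the two terms is an assumption:

```python
import numpy as np

def energy_hinge_loss(e_id, e_ood, m_id, m_ood):
    """Squared hinge loss on energy scores.

    Pushes ID energies below m_id and OOD energies above m_ood,
    widening the energy gap.  A 1:1 weighting of the two terms is assumed.
    """
    e_id = np.asarray(e_id, dtype=float)
    e_ood = np.asarray(e_ood, dtype=float)
    loss_id = np.maximum(0.0, e_id - m_id) ** 2     # ID energy too high
    loss_ood = np.maximum(0.0, m_ood - e_ood) ** 2  # OOD energy too low
    return loss_id.mean() + loss_ood.mean()
```

Samples already on the correct side of their margin contribute zero loss, so training focuses on the overlapping region of the two distributions.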
The bounding box estimator training objective combines a bounding box prediction objective and an energy-based OOD detection objective. Center, size, and heading \((x_c, y_c, z_c, l, w, h, \theta )\) of a bounding box are parameterized to a combination of classes and residuals: \(\textbf{c} \in \mathbb {R}^{3}, \textbf{s} \in \mathbb {R}^{ N_S }, \textbf{s}_r \in \mathbb {R}^{ N_S \times 3 }, \textbf{h} \in \mathbb {R}^{ N_H }, \textbf{h}_r \in \mathbb {R}^{ N_H }\). The goal is to minimize the following function:
Our contribution is the definition of the term \(\mathcal {L}_{ boxenergy }\), which penalizes the network depending on the energy output. \(\mathcal {L}_{ c1reg }\) and \(\mathcal {L}_{ c2reg }\) are for TNet and center prediction, respectively, \(\mathcal {L}_{ hcls }\) and \(\mathcal {L}_{ hreg }\) are for heading, and \(\mathcal {L}_{ scls }\) and \(\mathcal {L}_{ sreg }\) are for size. Classification and regression tasks use cross-entropy and Huber [35] losses, respectively. In addition, a corner loss term is used. The corner loss helps to minimize both the class and regression losses, as it penalizes the network based on the distance between the corners of the predicted and ground truth bounding boxes [36]. It is computed as:
where \(P_{h}^{**}\) denotes a corner of the flipped label box relative to the original label box \(P_{h}^{*}\).
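The corner loss pattern from [36] can be sketched as follows: sum the corner distances to the label box and to the label box flipped by \(\pi\), and take the minimum so a reversed heading is not penalized. The corner ordering and box parameterization below are illustrative assumptions:

```python
import numpy as np

def box_corners(center, size, heading):
    """Eight corners of a 3D box given (center, (l, w, h), heading)."""
    l, w, h = size
    x = np.array([1, 1, -1, -1, 1, 1, -1, -1]) * l / 2.0
    y = np.array([1, -1, -1, 1, 1, -1, -1, 1]) * w / 2.0
    z = np.array([1, 1, 1, 1, -1, -1, -1, -1]) * h / 2.0
    c, s = np.cos(heading), np.sin(heading)
    rot = np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])
    return (rot @ np.stack([x, y, z])).T + np.asarray(center, dtype=float)

def corner_loss(pred, label):
    """Min of summed corner distances to the label box and its pi-flipped twin.

    `pred` and `label` are (center, size, heading) tuples; a sketch of the
    corner loss idea, not the paper's exact (Huber-smoothed) formulation.
    """
    center, size, heading = label
    p = box_corners(*pred)
    d_orig = np.linalg.norm(p - box_corners(center, size, heading), axis=1).sum()
    d_flip = np.linalg.norm(p - box_corners(center, size, heading + np.pi), axis=1).sum()
    return min(d_orig, d_flip)
```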
Calculating the energy score is straightforward for the classifier since the logits \(c(\textbf{x})\) are simply the class scores. However, the output of the box estimation network \(b(\textbf{x}) \in \mathbb {R}^{3 + 4 \cdot N_S + 2 \cdot N_H}\) is a structure of heading and size class probabilities and heading, size, and center residuals for each object class k (car, pedestrian, and cyclist). \(N_H\) and \(N_S\) denote the number of heading and size classes, respectively. Consequently, we must find an optimal way of using this composite vector. We found experimentally that the heading \(\textbf{h} \in \mathbb {R}^{ N_H }\) and size \(\textbf{s} \in \mathbb {R}^{ N_S }\) class vectors provide the best measure for ID/OOD separation; we ignore all residual elements because they did not show significant separation in the energy distributions. After discarding the residuals, a total of \(K \cdot 2\) individual logit vectors remain, where K denotes the number of classes.
We start by examining the energy distributions of the default trained box estimator. Initial energy gaps are found in all logit vectors compared to their respective near-OOD (NOOD) pairs, which makes it easier to train for larger energy gaps. This allows us to define a loss function from \(K \cdot 2\) individual elements that encourages the model to learn larger energy gaps on each vector pair. A near-OOD is an OOD sample that has passed the classifier passthrough module. Note that the box estimator has a more challenging task than the classifier because it tries to separate NOODs from IDs, unlike the classifier, which separates OODs from IDs. The box estimator is often uncertain about the heading angle of the ID samples in \(\pi\) intervals. Thus, the value at the \(N_H /2\) offset from the maximum heading logit is set to \(-\infty\), which removes the contribution of that logit to the energy score since \(\lim _{b_i(\textbf{x}) \rightarrow -\infty } e^{b_i(\textbf{x})}=0\):
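The masking step can be sketched as follows: suppress the heading bin \(\pi\) radians away from the argmax before computing the energy, so the \(\pi\)-ambiguous logit cannot inflate the score. An even number of heading bins covering \(2\pi\) is assumed:

```python
import numpy as np

def heading_energy(h_logits, T=1.0):
    """Energy of the heading-class logits with the pi-ambiguous bin removed.

    The logit N_H/2 steps from the argmax is set to -inf, so exp(-inf) = 0
    drops it from the log-sum-exp.  Assumes an even N_H spanning 2*pi.
    """
    h = np.asarray(h_logits, dtype=float).copy()
    n_h = len(h)
    opposite = (int(np.argmax(h)) + n_h // 2) % n_h   # bin pi radians away
    h[opposite] = -np.inf
    m = np.max(h) / T                                 # stabilized log-sum-exp
    return -T * (m + np.log(np.sum(np.exp(h / T - m))))
```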
We define the loss as a weighted sum of squared hinge loss of each ID/NOOD heading and size pair.
where \(\textbf{w}_k = 1/\sqrt{\textbf{N}_k}\), \(\textbf{N}_k\) denotes the number of samples in ID class k, and \(G=2\) indicates the number of vectors per class.
We discovered that the critical and upper-bound point sets for a sample \(\textbf{x}\) differ significantly between the classification and box estimation tasks. The classifier and the box estimator learn different sets of features of \(\textbf{x}\). This would explain why the box estimator network has different energy distributions than the classifier network.
ID passthrough modules
During inference, the energy score for each sample \(\textbf{x}\) is computed from the logits \(c(\textbf{x})\), \(b(\textbf{x})_{\textbf{h}}\), and \(b(\textbf{x})_{\textbf{s}}\) of the networks using Equation (8). Since the classifier and bounding box estimator are optimized for their respective tasks, we implemented two separate ID passthrough modules for more effective ID/OOD separation. The ID/OOD detection is computed with thresholds \(\gamma _c\) and \(\gamma _b(k, g)\). The passthrough modules for the classifier and the bounding box estimator are defined, respectively, as
where a sample \(\textbf{x}\) is an ID if classifier and box estimation energies are lower than \(\gamma _c\) and \(\gamma _b\), respectively; otherwise, it is an OOD.
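The thresholding itself is a simple comparison; the practical question is how to pick \(\gamma\). A common recipe for energy-based OOD detection is to choose the threshold so that a target fraction of known ID validation samples passes. The paper does not state how \(\gamma _c\) and \(\gamma _b(k, g)\) were chosen, so the quantile rule below is an assumption:

```python
import numpy as np

def fit_threshold(id_energies, tpr=0.95):
    """Pick gamma so that a fraction `tpr` of known ID samples passes.

    A common energy-OOD calibration recipe; the paper's actual threshold
    selection procedure is not specified, so this is an assumption.
    """
    return float(np.quantile(np.asarray(id_energies, dtype=float), tpr))

def is_id(energy, gamma):
    """A sample passes the module (is treated as ID) if its energy is below gamma."""
    return energy < gamma
```

In the full pipeline, one threshold would be fitted for the classifier energies and one per (class, vector) pair for the box estimator energies.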
Proposal voxel location encoder
The road user proposals are normalized to the origin before the classifier because this increases performance [34]. However, distance and observation angle information is lost in this process (Fig. 4a). To make use of this information, we propose a proposal voxel location encoder (PVLE) module (Fig. 4b), which aims to improve the point cloud proposal classification and 3D bounding box estimation tasks. The module processes voxel coordinates of the proposals and outputs learned features. The intuition behind this method is that the observation angle and distance of an ID proposal carry useful information that can improve detection performance. In practice, the arithmetic mean point of a given proposal is first voxelized and then encoded into a small feature vector, which is concatenated with the global features of the classifier and bounding box networks as well as the RNet and TNet. The networks can then leverage the observation angle and distance. The design of this encoder (MLP: 3, 64, 32) is inspired by the first encoder layer of PointNet (MLP: 3, 64). The output feature vector has a length of 32 because it is the closest power of two to the voxel grid resolution of \(3 \times 10\). We utilize layer dimensions similar to the vanilla PointNet because the voxel coordinate input has the same shape as a point input in the vanilla PointNet.
Proposals shift to different voxel locations if the vehicle operates on uneven ground and the sensor pivots. With a spherical coordinate system, the magnitude of the shift is unrelated to the location of the proposal since angle limits define voxel boundaries. Therefore, a spherical coordinate system is more robust in this scenario than Cartesian and cylindrical coordinate systems.
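The PVLE forward pass can be sketched as follows: voxelize the proposal's mean point in (azimuth, range) and push the voxel coordinate through a tiny 3 → 64 → 32 MLP. The weights are plain inputs here (in the paper they are learned end-to-end with the rest of the network), and keeping the raw z value as the third coordinate is an assumption:

```python
import numpy as np

def pvle_features(proposal_xyz, W1, b1, W2, b2,
                  theta_res=np.radians(10.0), r_res=7.5):
    """Proposal voxel location encoder sketch (MLP: 3 -> 64 -> 32).

    Voxelizes the proposal mean point with the paper's grid resolution
    (theta = 10 deg, r = 7.5 m); the z handling and ReLU activations are
    illustrative assumptions.
    """
    mean = np.asarray(proposal_xyz, dtype=float).mean(axis=0)
    azimuth = np.arctan2(mean[1], mean[0])
    rng = np.linalg.norm(mean[:2])
    voxel = np.array([np.floor(azimuth / theta_res),   # angular voxel index
                      np.floor(rng / r_res),           # radial voxel index
                      mean[2]])                        # height kept as-is (assumption)
    h = np.maximum(0.0, W1 @ voxel + b1)               # ReLU, 3 -> 64
    return np.maximum(0.0, W2 @ h + b2)                # ReLU, 64 -> 32
```

The resulting 32-feature vector is what would be concatenated with the PointNet global features, letting the downstream heads condition on observation angle and distance.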
Experiments
Experiments are conducted on three datasets. KITTI [37] and SemanticKITTI [38] are used to measure the accuracy of the 3D object detection and ground segmentation, respectively. Moreover, detection accuracy is also measured on our dataset with annotated road users, collected with a compact mobile robot platform (Fig. 5). An in-depth quantitative analysis is carried out to validate our design choices. Lastly, qualitative results and a discussion of the strengths and weaknesses of our methods are presented.
Implementation details
Inference time is benchmarked with a 4.0 GHz CPU. Both the classifier and box estimator are trained for 200 epochs with a learning rate of 0.001 and the Adam optimizer [39]. The resolution of the voxel grid is \(\theta = 10^{\circ }\) and \(r = 7.5\) m. Proposals with more than 128 points are downsampled to 128 points.
Data
Set 1. The KITTI dataset [37] is divided into training, validation, and test splits. All models are trained with the training split. The input of our architecture is instance samples; therefore, IDs, OODs, and NOODs are extracted from the training set into a separate set in the following manner. First, points inside ground truth bounding boxes are extracted as ID samples. Second, the remaining points are fed through the proposal generator, and the output is saved into an auxiliary OOD dataset. Finally, the OODs are fed through a trained classifier, the energy threshold is applied, and the samples that pass are saved into their respective classes as NOODs.
Set 2. We also conduct tests with our dataset. It is collected with a compact mobile robot platform with a Velodyne VLP16 LiDAR from a sidewalk area. There are 2130 and 1733 3D bounding box labels for cars and pedestrians in 1096 scans. The height of the sensor mount is 0.5 m from the ground. We want to emphasize that this dataset is used only for testing, not training. The detailed specifications are listed in Table 1.
Set 3. The SemanticKITTI [38] is used to measure the accuracy of the ground segmentation methods. It includes pointwise labels for the ground surface in traffic scenarios and thus is ideal for our experiment.
Overall performance
The performance on the KITTI dataset is presented in Table 2. For the comparison, we chose state-of-the-art methods in terms of both speed and accuracy [9, 15], and frequently used methods that perform on a level similar to our method [3, 4, 40,41,42]. They use voxel and bird’s eye view modalities, which makes for a fair comparison, as our method utilizes voxel- and point-based techniques. Moreover, all methods are implemented in PyTorch [43] for a fairer comparison.
Our method achieves AP on pedestrian and cyclist detection similar to other methods, which is impressive considering the computational cost of our approach. Table 2 compares the performance in terms of FPS and mAP. Our method is the only one that achieves real-time performance on a CPU.
Figure 6 illustrates the separability between ID and OOD samples in the classifier. The plot includes \(10^4\) samples each from IDs and OODs. In the dataset, the actual proportions of IDs and OODs are approximately 6% and 94%, respectively. The energy distribution of cars is much narrower than that of the other classes because the number of car samples in the training set is significantly larger, which allows the network to learn the difference between cars and OODs better.
Figure 7 shows the energy distributions in the bounding box estimation network. The plot includes \(10^4\) samples each from IDs and NOODs. The energy distributions differ notably depending on the class and the vector type. This is a significant result, given that the classifier falsely detected these NOOD samples as ID samples. The car class has the best ID/NOOD separability. We suspect this is due to its large sample count in the training data compared to the pedestrian and cyclist classes.
Performance on the 16-channel LiDAR dataset
Models are trained with the KITTI training set. Performance is tested with our annotated 16-channel LiDAR data. Therefore, this study investigates the performance of models trained on high-resolution point clouds and tested on low-resolution point clouds. Furthermore, labels span the full 360\(^\circ\) in our dataset, unlike in the KITTI dataset, where labels are limited to an approximately 90\(^\circ\) sector. Our method performs on par with the state-of-the-art while achieving real-time performance on the CPU (Table 3). Our method performs best in the study that measures the performance difference from the KITTI dataset to the low-resolution, low-perspective point cloud dataset (Table 4).
Ablation study
Table 5 illustrates the contributions of the proposed modules to the mAP and FPS. The modules improve both metrics significantly. The first ID passthrough module improves speed considerably, as it reduces the number of samples processed by the box estimator. Based on the ablation of the PVLE modules, location carries valuable information for the 3D road user detection task, which supports the hypothesis discussed in the Methods section. The PVLE module for the box estimator slightly improves the car and cyclist classes while worsening the pedestrian class. This suggests that the module can be prone to overfitting if the total number of samples is small, as it is for the pedestrian class.
Ground segmentation
The performance test results are summarized in Table 6. Our ground segmentation method is compared to frequently used and state-of-the-art methods in the literature. It performs well in terms of computational cost, accuracy, and IOU. This is due to our effective sampling method, which reduces the number of iterations needed in the RANSAC function. Furthermore, fitting multiple planes along the sensor azimuth direction yields more accurate segmentation, especially on uneven ground surfaces.
Qualitative results and discussion
For qualitative analysis, we have randomly picked detection results, visualized in Fig. 8. Videos displaying the detection performance can be found online.^{Footnote 1} The geometrical proposal generator reduces the computational requirement significantly while still achieving mAP comparable to the state-of-the-art. The trade-off suggests that not using learned proposals is justified. The proposal generator has another benefit, too: it allows data streaming, meaning that point clouds can be processed in smaller sectors to start processing earlier than in whole-scan approaches. This decreases the latency of the detection significantly. The limitation of the proposal generator is cluster fusion when physical contact between road users is visible to the sensor. This could be solved using another proposal generator, such as furthest point sampling. However, this is not the accuracy bottleneck of our approach, since performance increases when the IOU threshold is decreased. This is especially apparent with the car class, which has a harsh 0.7 IOU threshold. Although the car class separated from the OODs better than the pedestrian and cyclist classes, the final bounding box predictions were worse for the car class. Therefore, the accuracy bottleneck is in the bounding box estimator network, caused by incorrectly predicted bounding boxes on correctly classified proposals. Thus, improvements to this network would be a good subject for future research.
How does the OOD training objective affect the accuracy of ID classification and bounding box estimation? By adding an energy-based OOD training objective, the networks learn not only the original task but also the energy-based task. This slightly decreases performance on the original task. However, many false positives are removed using the energy values, which increases AP more than the rare classification errors decrease it.
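The energy value used here follows the free-energy score of Liu et al. (energy-based OOD detection), E(x) = -T·logsumexp(f(x)/T) over the classifier logits: in-distribution inputs yield low energy, so high-energy proposals can be rejected. A minimal sketch of scoring and thresholding (the threshold value is an assumption, chosen per dataset in practice):

```python
import numpy as np

def energy_score(logits, temperature=1.0):
    """Free-energy score from classifier logits: -T * logsumexp(logits / T).

    Lower energy indicates in-distribution; higher energy indicates OOD.
    """
    z = np.asarray(logits, dtype=float) / temperature
    # Numerically stable logsumexp over the class axis.
    m = z.max(axis=-1, keepdims=True)
    lse = m.squeeze(-1) + np.log(np.exp(z - m).sum(axis=-1))
    return -temperature * lse

def filter_proposals(logits, threshold):
    """Keep proposals whose energy falls below the OOD rejection threshold."""
    return energy_score(logits) < threshold
```

A confidently classified proposal (one dominant logit) gets strongly negative energy and survives the filter, while a proposal the network is uncertain about scores near zero and is discarded as OOD; this is how the false positives mentioned above are removed.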
Is the inductive bias of the proposal voxel location encoder beneficial? The bias of the module is beneficial because the voxel grid resolution is relatively low. This results in more proposals per voxel location, which in turn yields a more general representation of each location. Hence, the model is not prone to overfitting to voxel location information, as indicated by the results in Table 5.
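The coarse-quantization effect described above can be illustrated with a toy encoder. This is not the paper's module; it is an assumed minimal version (2D grid, one-hot output, hypothetical grid bounds) showing why a low-resolution grid maps many nearby proposals to the same location code:

```python
import numpy as np

def voxel_location_feature(centroid, grid_min, grid_max, resolution):
    """One-hot encode a proposal centroid over a coarse (x, y) voxel grid,
    deliberately discarding fine-grained position."""
    centroid = np.asarray(centroid, dtype=float)
    cell = ((centroid[:2] - grid_min) / (grid_max - grid_min) * resolution).astype(int)
    cell = np.clip(cell, 0, resolution - 1)
    onehot = np.zeros(resolution * resolution)
    onehot[cell[0] * resolution + cell[1]] = 1.0
    return onehot
```

Because the grid is coarse, centroids several meters apart land in the same cell and share one feature, so the network sees many training proposals per location code rather than memorizing exact positions.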
Conclusion
This paper presented a novel architecture for the 3D road user detection task. The architecture has an extremely low computational requirement and is therefore suitable for applications with limited computational resources. An impressive 15.2 FPS was achieved with a 4.0 GHz CPU-only implementation while maintaining accuracy comparable to the state-of-the-art. Furthermore, our architecture performed the best on the low-resolution LiDAR dataset. The architecture is based on a geometrical proposal generator and out-of-distribution and location-aware PointNets. To our surprise, the accuracy bottleneck was not the proposal generator but the bounding box estimator. In future work, improvements to the bounding box estimator could be carried out, and other OOD detection methods could be studied in the 3D road user detection task.
Availability of data and materials
The KITTI dataset is publicly available, and the dataset generated during the current study is available from the corresponding author upon reasonable request.
References
Horowitz M. 1.1 computing’s energy problem (and what we can do about it). In: 2014 IEEE International Solid-State Circuits Conference Digest of Technical Papers (ISSCC), 2014; pp. 10–14. IEEE.
Huang C, Nguyen VD, Abdelzad V, Mannes CG, Rowe L, Therien B, Salay R, Czarnecki K. Out-of-distribution detection for lidar-based 3d object detection. In: 2022 IEEE 25th International Conference on Intelligent Transportation Systems (ITSC), 2022; pp. 4265–4271.
Zhou Y, Tuzel O. Voxelnet: end-to-end learning for point cloud based 3d object detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018; pp. 4490–4499.
Yan Y, Mao Y, Li B. Second: sparsely embedded convolutional detection. Sensors. 2018;18(10):3337.
Fan L, Xiong X, Wang F, Wang N, Zhang Z. Rangedet: in defense of range view for lidar-based 3d object detection. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021; pp. 2918–2927.
Chai Y, Sun P, Ngiam J, Wang W, Caine B, Vasudevan V, Zhang X, Anguelov D. To the point: Efficient 3d object detection in the range image with graph convolution kernels. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021; pp. 16000–16009.
Liang Z, Zhang M, Zhang Z, Zhao X, Pu S. Rangercnn: towards fast and accurate 3d object detection with range image representation. arXiv preprint, 2020; arXiv:2009.00206.
Meyer GP, Laddha A, Kee E, Vallespi-Gonzalez C, Wellington CK. Lasernet: an efficient probabilistic 3d object detector for autonomous driving. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019; pp. 12677–12686.
Lang AH, Vora S, Caesar H, Zhou L, Yang J, Beijbom O. Pointpillars: Fast encoders for object detection from point clouds. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019; pp. 12697–12705.
Yang B, Luo W, Urtasun R. Pixor: real-time 3d object detection from point clouds. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018; pp. 7652–7660.
Zheng W, Tang W, Jiang L, Fu CW. Se-ssd: self-ensembling single-stage object detector from point cloud. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021; pp. 14494–14503.
Zheng W, Tang W, Chen S, Jiang L, Fu CW. Cia-ssd: confident iou-aware single-stage object detector from point cloud. arXiv preprint, 2020; arXiv:2012.03015.
Shi W, Rajkumar R. Point-gnn: graph neural network for 3d object detection in a point cloud. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020; pp. 1711–1719.
Qi CR, Yi L, Su H, Guibas LJ. Pointnet++: deep hierarchical feature learning on point sets in a metric space. Adv Neural Inf Proc Syst. 2017;30.
Zhang Y, Hu Q, Xu G, Ma Y, Wan J, Guo Y. Not all points are equal: learning highly efficient point-based detectors for 3d lidar point clouds. Accepted to CVPR 2022, arXiv preprint, 2022; arXiv:2203.11139.
Qi CR, Litany O, He K, Guibas LJ. Deep Hough voting for 3d object detection in point clouds. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, 2019; pp. 9277–9286.
Shi S, Wang X, Li H. Pointrcnn: 3d object proposal generation and detection from point cloud. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019; pp. 770–779.
Ngiam J, Caine B, Han W, Yang B, Chai Y, Sun P, Zhou Y, Yi X, Alsharif O, Nguyen P, et al. Starnet: targeted computation for object detection in point clouds. arXiv preprint, 2019; arXiv:1908.11069.
Sobel I. An isotropic 3x3 image gradient operator. Presentation at Stanford A.I. Project 1968; 2014.
Bogoslavskyi I, Stachniss C. Fast range image-based segmentation of sparse 3d laser scans for online operation. In: 2016 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 2016; pp. 163–169. IEEE.
Zhao Y, Zhang X, Huang X. A technical survey and evaluation of traditional point cloud clustering methods for lidar panoptic segmentation. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021; pp. 2464–2473.
Rusu RB. Semantic 3d object maps for everyday manipulation in human living environments. KI-Künstliche Intelligenz. 2010;24(4):345–8.
Rusu RB, Cousins S. 3d is here: point cloud library (pcl). In: 2011 IEEE International Conference on Robotics and Automation, 2011; pp. 1–4. IEEE.
Papon J, Abramov A, Schoeler M, Worgotter F. Voxel cloud connectivity segmentation-supervoxels for point clouds. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2013; pp. 2027–2034.
Zermas D, Izzat I, Papanikolopoulos N. Fast segmentation of 3d point clouds: a paradigm on lidar data for autonomous vehicle applications. In: 2017 IEEE International Conference on Robotics and Automation (ICRA), 2017; pp. 5067–5073. IEEE.
Dong X, Guo J, Li A, Ting WT, Liu C, Kung H. Neural mean discrepancy for efficient out-of-distribution detection. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022; pp. 19217–19227.
Liang S, Li Y, Srikant R. Enhancing the reliability of out-of-distribution image detection in neural networks. arXiv preprint, 2017; arXiv:1706.02690.
Hsu YC, Shen Y, Jin H, Kira Z. Generalized odin: detecting out-of-distribution image without learning from out-of-distribution data. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020; pp. 10951–10960.
Ren J, Fort S, Liu J, Roy AG, Padhy S, Lakshminarayanan B. A simple fix to mahalanobis distance for improving near-ood detection. arXiv preprint, 2021; arXiv:2106.09022.
Lee K, Lee K, Lee H, Shin J. A simple unified framework for detecting out-of-distribution samples and adversarial attacks. Adv Neural Inf Proc Syst. 2018;31.
LeCun Y, Chopra S, Hadsell R, Ranzato M, Huang F. A tutorial on energy-based learning. Predicting Structured Data. 2006;1(0).
Liu W, Wang X, Owens J, Li Y. Energy-based out-of-distribution detection. Adv Neural Inf Process Syst. 2020;33:21464–75.
Hinton GE, Zemel R. Autoencoders, minimum description length and Helmholtz free energy. Adv Neural Inf Proc Syst. 1993;6.
Qi CR, Su H, Mo K, Guibas LJ. Pointnet: deep learning on point sets for 3d classification and segmentation. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017; pp. 652–660.
Huber PJ. Robust estimation of a location parameter. In: Breakthroughs in Statistics, 1992; pp. 492–518. Springer.
Qi CR, Liu W, Wu C, Su H, Guibas LJ. Frustum pointnets for 3d object detection from rgb-d data. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018; pp. 918–927.
Geiger A, Lenz P, Urtasun R. Are we ready for autonomous driving? The Kitti vision benchmark suite. In: 2012 IEEE Conference on Computer Vision and Pattern Recognition, 2012; pp. 3354–3361. IEEE.
Behley J, Garbade M, Milioto A, Quenzel J, Behnke S, Gall J, Stachniss C. Towards 3D LiDAR-based semantic scene understanding of 3D point cloud sequences: the SemanticKITTI Dataset. Int J Robot Res. 2021;40(8–9):959–67.
Kingma DP, Ba J. Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
Beltrán J, Guindel C, Moreno FM, Cruzado D, Garcia F, De La Escalera A. Birdnet: a 3d object detection framework from lidar information. In: 2018 21st International Conference on Intelligent Transportation Systems (ITSC), 2018; pp. 3517–3523. IEEE.
Barrera A, Beltrán J, Guindel C, Iglesias JA, García F. Birdnet+: two-stage 3d object detection in lidar through a sparsity-invariant bird’s eye view. IEEE Access. 2021;9:160299–316.
Ku J, Mozifian M, Lee J, Harakeh A, Waslander SL. Joint 3d proposal generation and object detection from view aggregation. In: 2018 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 2018; pp. 1–8. IEEE.
Paszke A, Gross S, Massa F, Lerer A, Bradbury J, Chanan G, Killeen T, Lin Z, Gimelshein N, Antiga L, et al. Pytorch: an imperative style, high-performance deep learning library. Adv Neural Inf Proc Syst. 2019;32.
Ouyang Z, Dong X, Cui J, Niu J, Guizani M. Pv-enconet: fast object detection based on colored point cloud. IEEE Trans Intell Transport Syst. 2021.
Liu K, Wang W, Tharmarasa R, Wang J, Zuo Y. Ground surface filtering of 3d point clouds based on hybrid regression technique. IEEE Access. 2019;7:23270–84.
Velas M, Spanel M, Hradis M, Herout A. Cnn for very fast ground segmentation in velodyne lidar data. In: 2018 IEEE International Conference on Autonomous Robot Systems and Competitions (ICARSC), 2018; pp. 97–103. IEEE.
Rummelhard L, Paigwar A, Nègre A, Laugier C. Ground estimation and point cloud segmentation using spatiotemporal conditional random field. In: 2017 IEEE Intelligent Vehicles Symposium (IV), 2017; pp. 1105–1110. IEEE.
Paigwar A, Erkent Ö, Sierra-Gonzalez D, Laugier C. Gndnet: fast ground plane estimation and point cloud segmentation for autonomous vehicles. In: 2020 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 2020; pp. 2150–2156. IEEE.
Acknowledgements
Not applicable.
Funding
The Helsinki Institute of Physics funded this work.
Author information
Authors and Affiliations
Contributions
AS developed the research idea, implemented the algorithm, designed and executed the experiments, and drafted the manuscript. EA helped with the data collection and labeled the dataset. GD provided fruitful comments on the interpretability of the schematic figures. RO and KT provided general research guidance, managed the research workflow, and revised the initial versions of the manuscript. All authors read and approved the final manuscript.
Corresponding author
Ethics declarations
Ethics approval and consent to participate
Not applicable.
Consent for publication
Not applicable.
Competing interests
The authors declare that they have no competing interests.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
About this article
Cite this article
Seppänen, A., Alamikkotervo, E., Ojala, R. et al. Outofdistribution and locationaware PointNets for realtime 3D road user detection without a GPU. J Big Data 11, 2 (2024). https://doi.org/10.1186/s40537023008595
Received:
Accepted:
Published:
DOI: https://doi.org/10.1186/s40537023008595