No GPU? No problem: an ultra fast 3D detection of road users with a simple proposal generator and energy-based out-of-distribution PointNets

This paper presents a novel architecture for point cloud road user detection, based on a classical point cloud proposal generator that uses simple geometrical rules. New methods are coupled with this technique to achieve an extremely small computational requirement and an mAP comparable to the state-of-the-art. The idea is to exploit geometrical rules specifically to gain speed. The typical downsides of this approach, e.g., the loss of global context, are addressed, and solutions are presented. The approach allows real-time performance on a single-core CPU, which is not the case with the end-to-end solutions in the state-of-the-art. We evaluate the method on the public KITTI dataset and on our own annotated dataset collected with a small mobile robot platform. Moreover, we present a novel ground segmentation method, which is evaluated on the public SemanticKITTI dataset.


Introduction
Detection of road users is an essential task for autonomous vehicles and mobile robots, and it is crucial, for instance, in obstacle avoidance and route planning tasks. The task focuses solely on detecting road users, i.e., cars, pedestrians, and cyclists. This allows the utilization of class-specific biases; for example, objects are expected to be on the ground surface. The task differs from the more general object detection task, which aims to detect any object. Many object detection methods have been developed for detecting road users from point clouds. However, much of the effort has been directed toward developing methods that achieve high accuracy, and significantly less effort has been directed toward low-latency approaches. The present study addresses this issue with a novel architecture developed with attention to latency. An algorithm that has low latency and computational cost has a large number of benefits. It allows better use of high-frequency (> 20 Hz) sensors, as the algorithm is also high-frequency. A light algorithm decreases energy consumption, resulting in a longer operation time for mobile robots. It also removes the requirement for a GPU, enabling lighter and less expensive hardware, which is crucial in many applications.
To tackle the issue of latency in neural networks, attention should be directed towards memory (DRAM) reads, as stated by Horowitz et al. [1]. This paper takes a unique data-dependent approach to 3D road user detection to achieve low latency. Unlike most methods in the literature, our approach saves memory by discarding irrelevant parts of the input point cloud. Specifically, a simple algorithm removes coarse parts of the point clouds, such as the ground plane. Then, more complex algorithms, deep neural networks, are used for making the detections. Therefore, as the amount of data decreases, the complexity of the algorithm increases, and the neural network processes only the filtered relevant data. In practice, object proposals are generated with ground removal and clustering, and then novel out-of-distribution- and location-aware PointNets make the detections from the proposals. We filter out-of-distribution proposals throughout the pipeline to reduce the memory and computational footprint. We are also the first to implement an out-of-distribution detection method on classification and bounding box estimation tasks with point cloud object proposals. This approach achieves superior latency and comparable accuracy in 3D road user detection to the state-of-the-art. Huang et al. [2] studied out-of-distribution detection methods directly on 3D object detection, showing the applicability of such methods.
As stated, the literature has many point cloud object detection methods tested for 3D road user detection. However, most of them require a powerful GPU to run in real time, which some mobile robots do not have due to, e.g., power, cost, weight, or size restrictions. To complement this under-explored area, we develop an algorithm that runs in real time on a standard CPU. We summarize the contributions of this paper as follows.
• A novel architecture for 3D road user detection on point cloud data running in real time on a CPU.
• To the best of our knowledge, we are the first to implement an out-of-distribution detection method on classification and bounding box estimation tasks with point cloud object proposals.
• A novel proposal voxel location encoder, which improves the accuracy of the models by a significant margin.
• A ground segmentation method, which outperforms competitive methods. We present simple convolutional filters for sampling ground points for the plane fit.
• A study on a 3D road user detection task with models trained only with the KITTI high-resolution point cloud data and tested with low-resolution and low-perspective point cloud data.

Related work
This section presents frequent 3D object detection approaches that are well-proven and provide a fair comparison to our approach. The approaches are divided into voxel, graph, projection image, point, and bird's eye view methods. VoxelNet [3] divides the point cloud into voxel partitions. Points within a voxel partition are encoded into a vector representation characterizing the shape information. 3D convolution is performed on the voxels to predict bounding boxes and classes. SECOND [4] improved the accuracy and reduced the computational cost of VoxelNet by dropping voxels that include no points when preprocessing the point cloud.
Authors of [5][6][7][8] use methods based on a range image. A range image is a point cloud projection on a spherical surface. The benefit of using a projection is that it preserves the information of neighboring measurements. However, the authors claim that some information is lost during the projection. The authors use a 2D convolutional neural network to predict bounding boxes and classes.
One approach is to use a bird's eye view image as input for the detection algorithm [9][10][11][12]. PointPillars [9] is a popular method using this approach. It constructs pillar-like features from the point cloud with a neural network and forms a bird's eye view image. Then, a 2D convolutional neural network is applied to this image, making predictions. Bird's eye view is a convenient representation of the point cloud because objects are rarely stacked on top of each other in, for example, outdoor driving scenarios, given an optimal vertical field-of-view of the sensor.
More recently, graph neural networks (GNNs) have been applied to the point cloud object detection task. Shi et al. [13] were the first to do so. In their method, point clouds are preprocessed into a graph representation utilizing a grouping method similar to PointNet++ [14]. Then, a GNN architecture predicts the classes and bounding boxes. At the time of publication, they achieved superior results in several public benchmarks, showing that graph-based pre-processing works well in the object detection task from point clouds.
Point-based methods use raw point clouds as an input [15][16][17]. One benefit is that no pre-processing is needed, e.g., voxelization or range or bird's eye view projection. Typically, this results in a lower computational cost overall. However, preserving the geometrical context is challenging since points are processed individually. Ngiam et al. [18] proposed that targeting computation to certain regions benefits computational cost and generalizability. They generated proposals with furthest and random point sampling and utilized a neural network architecture for estimating bounding boxes.

Methods
A general schematic of the proposed architecture is presented in Fig. 1. The basic principle is to generate simple, unclassified proposals and utilize classifiers that differentiate between the in-distribution (ID) and out-of-distribution (OOD) proposals. This is implemented using novel energy-based OOD PointNets. The first PointNet predicts the class probability vectors and energy scores, which are used for discarding the first batch of OODs. This is done in the first ID pass-through module. Then, 3D bounding boxes are predicted for the remaining proposals with an alternative PointNet, which also predicts energy scores. Further, the second ID pass-through module filters out the remaining OODs based on the bounding box energy scores. In addition, a novel proposal voxel location encoder (PVLE) is utilized to preserve location information otherwise lost in the proposal normalization process. PVLE attempts to increase the accuracy of the neural networks without adding a significant amount of computation. An increase in accuracy would indicate that proposal location information has value in the proposal classification and bounding box estimation tasks. The architecture aims to achieve low computational requirements by discarding information hierarchically while preserving helpful information regarding the road user detection task.

Ordered point cloud representation
Point coordinates from a typical LiDAR sensor, p = (x, y, z), are mapped $\mathbb{R}^{n \times 3} \to \mathbb{R}^{s_h \times s_w \times 3}$ to spherical coordinates and finally to image coordinates. The mapping is parameterized by $(s_h, s_w)$, the height and width of the desired projection image representation; $f_v$, the total vertical field-of-view of the sensor; and $f_{vup}$, the vertical field-of-view spanning upwards from the horizontal origin plane. The resulting list of image coordinates is used to construct an (x, y, z)-channel image, which is the input for the next stage of the architecture.
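The mapping described above can be illustrated with a minimal sketch. It assumes the commonly used spherical range-image projection (azimuth mapped to columns, elevation mapped to rows between $f_{vup}$ and $f_{vup} - f_v$); the exact formula of the paper may differ. The function name `to_ordered_cloud` and the default sensor values are illustrative assumptions, not taken from the paper.

```python
import numpy as np

def to_ordered_cloud(points, s_h=64, s_w=2048,
                     f_v=np.radians(26.8), f_vup=np.radians(2.0)):
    """Map an (n, 3) point cloud to an (s_h, s_w, 3) image of (x, y, z) values."""
    points = np.asarray(points, dtype=float)
    r = np.linalg.norm(points, axis=1)
    points, r = points[r > 0], r[r > 0]            # drop empty returns
    yaw = np.arctan2(points[:, 1], points[:, 0])   # azimuth in (-pi, pi]
    pitch = np.arcsin(points[:, 2] / r)            # elevation angle
    # Columns follow the azimuth, rows follow the elevation (top row = f_vup).
    u = np.floor(0.5 * (1.0 - yaw / np.pi) * s_w).astype(int)
    v = np.floor((f_vup - pitch) / f_v * s_h).astype(int)
    u = np.clip(u, 0, s_w - 1)
    inside = (v >= 0) & (v < s_h)                  # discard points outside the vertical FOV
    image = np.zeros((s_h, s_w, 3))
    image[v[inside], u[inside]] = points[inside]
    return image
```

When several points fall into the same pixel, this sketch simply keeps the last one written; a real implementation would likely keep the closest return.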

Ground segmentation
As an additional contribution, we introduce a novel ground segmentation method, which is ultra-fast and more accurate than most of the methods in the literature. Our ground segmentation combines a novel point sampling with the well-proven RANSAC plane fitting method. Figure 2 shows a graphical presentation of the method.
It is essential to segment the ground at the beginning of the pipeline for the following reasons: (a) to reduce the number of points to process, (b) to remove points that are invalid and not considered in later stages of the pipeline, and (c) to improve the performance of the clustering algorithm, as some clusters would otherwise be fused through ground points. We sample the ordered point cloud for potential ground plane points with two convolutional filters, $S_v$ and $S_u$, inspired by the Sobel operator [19].
The filters are discrete differentiation operators that operate on the ordered point cloud tensor. In convolution, $S_v$ and $S_u$ yield approximations of the vertical and horizontal derivatives on the ordered point cloud, respectively. The first term of $S_v$ and the second term of $S_u$ are the centers of the filters. The filters incorporate information from multiple neighboring points with a discretized Gaussian function, where points closer to the center have a higher effect on the result of the computation. This is done because simply computing the derivatives with a subtraction between two neighboring points is too noisy in a typical LiDAR measurement. In a typical LiDAR sensor, the horizontal resolution is significantly higher than the vertical resolution. Therefore, the shape of $S_u$ is 1 × 4, which means that it does not consider points on neighboring rows, as they are significantly more distant compared to points on neighboring columns (Fig. 2). This way, the filter can capture the approximation of the local derivative more accurately. The convolutions of the filters with the range channel produce the matrices $F_y$ and $F_x$ (with $*$ denoting the 2-dimensional convolution operation), which approximate the point-wise normal, as the filters produce derivatives. Approximating the point-wise normal with this method requires only a small amount of computation while achieving satisfactory accuracy. We apply a threshold to $F_y$ and $F_x$, which gives us a mask of ground point samples. Then, the samples $G_{samples} \in \mathbb{R}^{s_h \times s_w \times 3}$ are computed for the RANSAC algorithm as the Hadamard product ($\odot$) of this mask and $(X, Y, Z) \in \mathbb{R}^{s_h \times s_w \times 3}$, which contains the Cartesian coordinates of each point in the point cloud. Only the sampled points are considered in the random sampling of the RANSAC algorithm. Thus, only a handful of iterations are needed, compared to a case where the samples are taken from the entire point cloud, which reduces the computational cost. $\mathrm{RANSAC}((X, Y, Z)_s, G_{samples}) = \{a_1, a_2, a_3, a_4\}_s$ gives the parameters of the detected plane in each sector $s$, as we run this operation on sectors of points (Fig. 2). If the distance between a point and the detected plane in the corresponding sector satisfies a threshold, the point is labeled as ground.
Fig. 2 An illustration of the ground segmentation method. On the left, points that passed the filter are indicated by green, and points between the dashed lines are segmented ground points. On the right, points surrounded by ellipses are considered by the $S_v$ and $S_u$ filters when computing the point marked with a star. A plane is fitted to each sector of points. Note that $R = \sqrt{X^2 + Y^2}$, i.e., the distance to the z-axis
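The plane-fitting step above can be sketched as follows. This is a minimal, generic RANSAC plane fit that draws its random triplets only from pre-sampled ground candidates and then labels points by their distance to the plane. The filter kernels and thresholds of the paper are not reproduced here; the function names, iteration count, and distance thresholds are illustrative assumptions.

```python
import numpy as np

def ransac_plane(samples, iters=50, inlier_dist=0.1, rng=None):
    """Fit a plane a1*x + a2*y + a3*z + a4 = 0 to candidate ground samples."""
    rng = np.random.default_rng(0) if rng is None else rng
    best_plane, best_count = None, -1
    for _ in range(iters):
        p0, p1, p2 = samples[rng.choice(len(samples), 3, replace=False)]
        normal = np.cross(p1 - p0, p2 - p0)
        norm = np.linalg.norm(normal)
        if norm < 1e-9:                      # skip (nearly) collinear triplets
            continue
        normal = normal / norm
        plane = np.append(normal, -normal @ p0)
        dist = np.abs(samples @ plane[:3] + plane[3])
        count = int(np.sum(dist < inlier_dist))
        if count > best_count:
            best_plane, best_count = plane, count
    return best_plane

def label_ground(points, plane, dist_thresh=0.2):
    """Boolean mask of points closer to the plane than the threshold."""
    return np.abs(points @ plane[:3] + plane[3]) < dist_thresh
```

In the sector-wise scheme of the text, `ransac_plane` would be called once per azimuth sector with that sector's samples, and `label_ground` would be applied to all points of the same sector.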

Angle-based clustering
The clustering algorithm is a key component of the proposal generator. We use the angle-based clustering method [20], mainly because it is fast (a full scan from a 64-channel LiDAR in 10 ms with a single core of a 2.2 GHz CPU, as reported in [20]) but also because it is sparsity invariant. Moreover, this method is less prone to fusing nearby objects under the same cluster because the clustering is based on an angle instead of a distance measurement. This is crucial and affects the performance of our architecture. However, our architecture does not have constraints that would prevent the usage of other standard clustering algorithms such as [21][22][23][24][25]. Still, we found through experimentation that these methods caused our architecture to perform worse. The angle-based clustering algorithm is sparsity invariant because it computes the angle between neighboring points. If this angle satisfies a threshold, the point is assigned the label of the currently computed cluster. A breadth-first search (BFS) is implemented to add points to the current cluster, and a completed BFS indicates the completion of a cluster. The algorithm is fast since it takes advantage of the order of the point cloud, which means that finding the neighboring points is convenient.
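A minimal sketch of this kind of clustering on a range image is given below. It assumes the angle criterion of the angle-based method [20], $\beta = \operatorname{atan2}(d_2 \sin\alpha,\; d_1 - d_2 \cos\alpha)$, where $d_1 \ge d_2$ are the ranges of two neighboring beams and $\alpha$ is the angle between them; two points are joined when $\beta$ exceeds a threshold. The function name, the 4-neighborhood, the horizontal wrap-around, and the default threshold are assumptions made for illustration.

```python
from collections import deque
import math

def angle_cluster(ranges, alpha_h, alpha_v, theta=math.radians(10)):
    """Label a 2D range image; 0 means unlabeled (e.g., removed ground)."""
    rows, cols = len(ranges), len(ranges[0])
    labels = [[0] * cols for _ in range(rows)]
    next_label = 0
    steps = ((-1, 0, alpha_v), (1, 0, alpha_v), (0, -1, alpha_h), (0, 1, alpha_h))
    for r0 in range(rows):
        for c0 in range(cols):
            if labels[r0][c0] or ranges[r0][c0] <= 0:
                continue
            next_label += 1                      # start a new cluster
            labels[r0][c0] = next_label
            queue = deque([(r0, c0)])
            while queue:                         # breadth-first search
                r, c = queue.popleft()
                for dr, dc, alpha in steps:
                    rn, cn = r + dr, (c + dc) % cols   # wrap around in azimuth
                    if not 0 <= rn < rows:
                        continue
                    if labels[rn][cn] or ranges[rn][cn] <= 0:
                        continue
                    d1 = max(ranges[r][c], ranges[rn][cn])
                    d2 = min(ranges[r][c], ranges[rn][cn])
                    beta = math.atan2(d2 * math.sin(alpha), d1 - d2 * math.cos(alpha))
                    if beta > theta:             # flat enough: same object
                        labels[rn][cn] = next_label
                        queue.append((rn, cn))
    return labels
```

Because the criterion depends on the angle rather than on the Euclidean gap, two neighboring returns at similar range stay connected regardless of how far the object is from the sensor.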

Energy-based out-of-distribution detection
The energy-based out-of-distribution detection method performs well in image classification tasks, and it has been well-proven. Its implementation is also convenient because its input is the raw output of a neural network, unlike with methods such as [26][27][28][29][30]. Moreover, it is computationally light. Therefore, we define it as the baseline method on point cloud data. The basic idea of an energy-based function is to map each point of the input space to a non-probabilistic scalar called energy, $E(x; f): \mathbb{R}^L \to \mathbb{R}$ [31]. In our application, the output vectors of the classifier and bounding box estimator networks are mapped into their respective energy scalars that represent the distance to the class distribution. The method presented here is based on Liu et al. [32], where a modified version of the Helmholtz free energy from statistical mechanics is used as $E(x; f)$ [33]:

$$E(x; f) = -T \log \sum_{i=1}^{L} e^{f_i(x)/T} \qquad (8)$$

where x denotes the input of a neural network, T the temperature scalar, L the number of logits, and f a neural network. We utilize Equation (8) in training the classifier and the box estimator networks and during inference for separating IDs from OODs.
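As a small illustration, the free-energy score of Liu et al. [32], $E(x; f) = -T \log \sum_{i=1}^{L} e^{f_i(x)/T}$, can be computed from raw logits as sketched below. The function name `energy_score` is an assumption; the max-subtraction is a standard numerical stabilization and does not change the result.

```python
import numpy as np

def energy_score(logits, T=1.0):
    """Free energy -T * log(sum(exp(logits / T))) over the last axis."""
    logits = np.asarray(logits, dtype=float)
    m = logits.max(axis=-1, keepdims=True)          # stabilize the exponentials
    log_sum = np.log(np.sum(np.exp((logits - m) / T), axis=-1))
    return -T * (m.squeeze(-1) / T + log_sum)
```

A confident prediction (one dominant logit) yields a lower energy than a flat logit vector, which is what the ID/OOD thresholding relies on.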

Network architectures and training objectives
This paper presents a novel implementation of an energy-based OOD detection method for point cloud classifiers and 3D bounding box estimation networks. The proposed neural network architectures (Fig. 3) build on PointNet [34], applying some application-specific modifications to reduce computational cost and increase performance. These modifications include concatenating the proposal voxel location encoder features to leverage the observation angle and distance, and simplifying the network in the main encoders and the fully connected layers. The main modification is implementing an energy-based OOD learning objective to mitigate the false positive rate, since our proposal generator is simple and therefore produces a vast number of OOD instances. Furthermore, we discovered that the respective critical and upper bound point sets for the classifier and box estimation networks differ for the same cluster of points. We exploit this phenomenon by implementing two separate ID pass-through modules for improved OOD detection. The classifier and the bounding box estimator inputs are transformed with R-Net and T-Net networks, respectively. R-Net predicts a rotational matrix $T_{rot} \in \mathbb{R}^{3 \times 3}$, which normalizes the rotation angle of the samples to simplify the classification task. Similarly, T-Net predicts a transformation matrix $T_{tra} \in \mathbb{R}^{3}$, which normalizes the location of the samples to simplify the bounding box estimation task.
The classifier training objective is to minimize the classification cross-entropy and an ID/OOD squared hinge loss on the energies, where σ is the softmax output of the classifier c. In the energy loss, $x_{ID}$ is an ID sample from the KITTI training split (points inside a ground truth bounding box), and $x_{OOD}$ is an OOD sample (clustered points outside the ground truth bounding boxes), also from the KITTI training split. Terms $m_{ID}$ and $m_{OOD}$ are the means of the ID and OOD energies of the default trained network, respectively. They are used in Equation (10) to push down the energy of IDs, lift the energy of OODs, and expand the energy gap between them. Terms $m_{ID}$ and $m_{OOD}$ are pre-computed and static during the training.
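The squared hinge loss on the energies can be sketched as below. The form follows the energy-bounded objective of Liu et al. [32], on which the method is based; whether the paper uses exactly this form, and how it is weighted against the cross-entropy, is an assumption. The function name is illustrative.

```python
import numpy as np

def energy_hinge_loss(e_id, e_ood, m_id, m_ood):
    """Penalize ID energies above m_id and OOD energies below m_ood."""
    e_id = np.asarray(e_id, dtype=float)
    e_ood = np.asarray(e_ood, dtype=float)
    id_term = np.mean(np.maximum(0.0, e_id - m_id) ** 2)     # push ID energy down
    ood_term = np.mean(np.maximum(0.0, m_ood - e_ood) ** 2)  # lift OOD energy up
    return id_term + ood_term
```

Samples that already lie on the correct side of their margin contribute nothing, so the loss only acts on the overlapping part of the two energy distributions.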
The bounding box estimator training objective combines a bounding box prediction objective and an energy-based OOD detection objective. The center, size, and heading $(x_c, y_c, z_c, l, w, h, \theta)$ of a bounding box are parameterized as a combination of classes and residuals. The goal is to minimize a loss function with the following terms. Our contribution is the definition of the term $L_{box-energy}$, which penalizes the network depending on the energy output. $L_{c1-reg}$ and $L_{c2-reg}$ are for T-Net and center prediction, respectively, $L_{h-cls}$ and $L_{h-reg}$ are for heading, and $L_{s-cls}$ and $L_{s-reg}$ are for size. Classification and regression tasks have cross-entropy and Huber [35] losses, respectively. In addition, a corner loss term is used. The corner loss helps to minimize both the class and regression losses, as it penalizes the network based on the distance between the corners of the predicted and ground truth bounding boxes [36]. In its computation, $P^{**}_h$ denotes a corner of the flipped label box relative to the original label box $P^{*}_h$. Calculating the energy score is straightforward with the classifier since the logits $c(x)$ correspond directly to the class probabilities. However, the output of the box estimation network, $b(x) \in \mathbb{R}^{3 + 4 N_S + 2 N_H}$, is a structure of heading and size class probabilities and heading, size, and center residuals for each object class k (car, pedestrian, and cyclist). $N_H$ and $N_S$ denote the number of heading and size classes, respectively. Consequently, we must find an optimal way of using this special vector. We found experimentally that the heading $h \in \mathbb{R}^{N_H}$ and the size $s \in \mathbb{R}^{N_S}$ class vectors provide the best measure for ID/OOD separation. We ignore all residual elements because they did not show significant separation in the energy distributions. By ignoring all residual elements, a total of $K \cdot 2$ individual logit vectors remain, where K indicates the number of classes.
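To make the class-plus-residual parameterization concrete, the sketch below encodes a heading angle as a bin index and a residual from the bin center, in the style used by PointNet-based box estimators. The number of bins, the bin layout, and the function names are assumptions for illustration; the paper's exact parameterization is not reproduced here.

```python
import math

def heading_to_class(theta, n_h=12):
    """Encode an angle as (bin index, residual from the bin center)."""
    bin_size = 2 * math.pi / n_h
    # Shift by half a bin so that bin 0 is centered on angle 0.
    shifted = (theta % (2 * math.pi) + bin_size / 2) % (2 * math.pi)
    cls = int(shifted // bin_size)
    residual = shifted - (cls * bin_size + bin_size / 2)
    return cls, residual

def class_to_heading(cls, residual, n_h=12):
    """Decode (bin index, residual) back to an angle in [0, 2*pi)."""
    bin_size = 2 * math.pi / n_h
    return (cls * bin_size + residual) % (2 * math.pi)
```

The classification loss then acts on the bin index and the Huber loss on the residual, which is bounded by half a bin width.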
We start by examining the energy distributions of the default trained box estimator. Initial energy gaps are found in all logit vectors compared to their respective near-OOD (NOOD) pairs, which makes it easier to train for larger energy gaps. This allows us to define a loss function from $K \cdot 2$ individual elements that encourages the model to learn larger energy gaps on each vector pair. A near-OOD is an OOD sample that has passed the classifier pass-through module. Note that the box estimator has a more challenging task than the classifier because it tries to separate NOODs from IDs, unlike the classifier, which separates OODs from IDs. The box estimator is often uncertain about the heading angle of the ID samples in π intervals. Thus, the value at the $N_H/2$ offset from the maximum heading logit is changed to $-\infty$, which removes the contribution of that logit to the energy score since $\lim_{b_i(x) \to -\infty} e^{b_i(x)} = 0$. We define the loss as a weighted sum of the squared hinge losses of each ID/NOOD heading and size pair, where the weight is $w_k = 1/\sqrt{N_k}$, $N_k$ denotes the number of samples in an ID class k, and G = 2 indicates the number of vectors per class.
We discovered that the critical and upper bound point sets for x differ significantly between the classification and box estimation tasks. The classifier and the box estimator learn different sets of features of x. This would explain why the box estimator network has different energy distributions than the classifier network.
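The heading-logit masking described above can be sketched as follows: the logit half a turn away from the strongest heading bin is set to $-\infty$ before the energy is computed, so it contributes $e^{-\infty} = 0$ to the sum. The wrap-around of the offset with a modulo and the function name are assumptions.

```python
import numpy as np

def heading_energy(heading_logits, T=1.0):
    """Energy of a heading logit vector with the opposite-direction bin masked."""
    h = np.array(heading_logits, dtype=float)     # copy, the input stays intact
    n_h = h.shape[-1]
    opposite = (int(np.argmax(h)) + n_h // 2) % n_h
    h[opposite] = -np.inf                          # exp(-inf) = 0, no contribution
    m = h.max()
    return -T * (m / T + np.log(np.sum(np.exp((h - m) / T))))
```

With this masking, an ID sample whose heading is ambiguous by π is scored on its strongest bin instead of on two competing peaks.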

ID pass-through modules
During inference, the energy score for each sample x is computed from the logits $c(x)$, $b(x)_h$, and $b(x)_s$ of the networks using Equation (8). Since the classifier and the bounding box estimator are optimized for their respective tasks, we implemented two separate ID pass-through modules to have more effective ID/OOD separation. The ID/OOD detection is computed with thresholds $\gamma_c$ and $\gamma_b(k, g)$. The pass-through modules for the classifier and the bounding box estimator are defined such that a sample x is an ID if the classifier and box estimation energies are lower than $\gamma_c$ and $\gamma_b$, respectively; otherwise, it is an OOD.
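A minimal sketch of the two pass-through checks is shown below. It assumes that the box estimator provides G = 2 energies per sample (heading and size) and that both must fall below the class-specific thresholds $\gamma_b(k, g)$; this "all vectors must pass" rule and the function names are assumptions.

```python
import numpy as np

def classifier_pass(e_cls, gamma_c):
    """First module: keep proposals whose classifier energy is below gamma_c."""
    return np.asarray(e_cls, dtype=float) < gamma_c

def box_pass(e_box, classes, gamma_b):
    """Second module: e_box is (n, G), classes is (n,), gamma_b is (K, G)."""
    e_box = np.asarray(e_box, dtype=float)
    thresholds = np.asarray(gamma_b, dtype=float)[np.asarray(classes)]
    return np.all(e_box < thresholds, axis=1)
```

Only proposals that pass `classifier_pass` need to be fed to the box estimator, which is where the reduction in computation comes from.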

Proposal voxel location encoder
The road user proposals are normalized to the origin before the classifier because it increases performance [34]. However, distance and observation angle information is lost during this process (Fig. 4a). To make use of this information, we propose a proposal voxel location encoder (PVLE) module (Fig. 4b), which aims to improve the point cloud proposal classification and 3D bounding box estimation tasks. The module processes voxel coordinates of the proposals and outputs learned features. The intuition behind this method is that the observation angle and the distance of an ID proposal carry useful information that can be used to improve the detection performance. In practice, the arithmetic mean point of a given proposal is first voxelized and then encoded into a small feature vector, which is concatenated with the global features of the classifier and bounding box networks as well as the R-Net and T-Net. Now, the networks can leverage the observation angle and distance. The design of this encoder (MLP: 3, 64, 32) is inspired by the first encoder layer of the PointNet (MLP: 3, 64). The output feature vector has a length of 32 because it is the closest $2^n$ to the voxel grid resolution 3 × 10. We utilize similar layer dimensions to the vanilla PointNet because the voxel coordinate input has the same shape as a point input in the vanilla PointNet.
Proposals shift to different voxel locations if the vehicle operates on uneven ground and the sensor pivots. With a spherical coordinate system, the magnitude of the shift is unrelated to the location of the proposal since angle limits define voxel boundaries. Therefore, a spherical coordinate system is more robust in this scenario than Cartesian and cylindrical coordinate systems.
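The voxelization of a proposal's mean point in spherical coordinates can be sketched as below. The sketch assumes that azimuth and elevation are binned with the same angular step and the range with a metric step; the defaults mirror the 10° and 7.5 m values given in the implementation details, while the choice of axes, the index origin, and the function name are assumptions.

```python
import numpy as np

def proposal_voxel(points, angle_res=np.radians(10.0), range_res=7.5):
    """Voxel index (azimuth, elevation, range) of a proposal's mean point."""
    cx, cy, cz = np.mean(np.asarray(points, dtype=float), axis=0)
    dist = np.sqrt(cx**2 + cy**2 + cz**2)
    azimuth = np.arctan2(cy, cx)
    elevation = np.arcsin(cz / dist) if dist > 0 else 0.0
    return (int(np.floor(azimuth / angle_res)),
            int(np.floor(elevation / angle_res)),
            int(np.floor(dist / range_res)))
```

The resulting three integers play the role of the 3-element input of the encoder MLP, in the same way a single (x, y, z) point enters the first PointNet layer.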

Experiments
Experiments are conducted on three datasets. KITTI [37] and SemanticKITTI [38] are used to measure the accuracy of the 3D object detection and the ground segmentation, respectively. Moreover, detection accuracy is also measured on our dataset with annotated road users, collected with a compact mobile robot platform (Fig. 5). An in-depth quantitative analysis is carried out to validate our design choices. Lastly, qualitative results and a discussion of the strengths and weaknesses of our methods are presented.

Implementation details
Inference time is benchmarked with a 4.0 GHz CPU. Both the classifier and the box estimator are trained for 200 epochs with a learning rate of 0.001 and with the Adam optimizer [39]. The resolution of the voxel grid is θ = 10° and r = 7.5 m. Proposals that have more than 128 points are sampled down to 128.
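Capping the proposal size can be sketched as follows. The text does not state how the 128 points are selected, so uniform random sampling without replacement is used here purely as an assumption; the function name is illustrative.

```python
import numpy as np

def cap_points(points, max_points=128, rng=None):
    """Randomly keep at most max_points rows of an (n, 3) proposal."""
    rng = np.random.default_rng(0) if rng is None else rng
    points = np.asarray(points)
    if len(points) <= max_points:
        return points                      # small proposals are left untouched
    idx = rng.choice(len(points), size=max_points, replace=False)
    return points[idx]
```

A fixed upper bound on the number of points per proposal keeps the per-proposal cost of the PointNets bounded.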

Data
Set 1. The KITTI dataset [37] is divided into training, validation, and test splits. All models are trained with the training split. The input of our architecture is instance samples; therefore, IDs, OODs, and NOODs are extracted from the training set into a separate set in the following manner. First, points inside ground truth bounding boxes are extracted as ID samples. Second, the remaining points are fed through the proposal generator. The output is saved into an auxiliary OOD dataset. Finally, OODs are fed through a trained classifier, the energy threshold is applied, and the samples that pass are saved into their respective class as NOODs.
Set 2. We also conduct tests with our dataset. It is collected from a sidewalk area with a compact mobile robot platform equipped with a Velodyne VLP-16 LiDAR. There are 2130 and 1733 3D bounding box labels for cars and pedestrians, respectively, in 1096 scans. The height of the sensor mount is 0.5 m from the ground. We want to emphasize that this dataset is used only for testing, not training. The detailed specifications are listed in Table 1.
Set 3. The SemanticKITTI dataset [38] is used to measure the accuracy of the ground segmentation methods. It includes point-wise labels for the ground surface in traffic scenarios and thus is ideal for our experiment.

Overall performance
The performance on the KITTI dataset is presented in Table 2. For the comparison, we chose state-of-the-art methods both in terms of speed and accuracy [9,15], and frequent methods that perform on a similar level to our method [3,4,40,41,42]. They use voxel and bird's eye view modalities, which makes for a fair comparison, as our method utilizes voxels and point-based methods. Moreover, all methods are implemented using PyTorch [43] to have a fairer comparison.
Our method achieves similar AP on pedestrian and cyclist detection to other methods, which is impressive considering the computational cost of our approach. Table 2 compares the performance in terms of FPS and mAP. Our method is the only one that achieves real-time performance on a CPU.
Figure 6 illustrates the separability between ID and OOD samples in the classifier. The plot includes $10^4$ samples from IDs and OODs, respectively. In the dataset, the actual partitions of IDs and OODs are approximately 6% and 94%, respectively. The energy distribution of cars is much narrower compared to other classes because the number of car samples in the training set is larger than that of the other classes. This allows the network to learn the difference between cars and OODs better. Figure 7 shows the energy distributions in the bounding box estimation network. The plot includes $10^4$ samples from IDs and NOODs, respectively. The energy distributions have notable differences depending on the class and the vector type. This is a significant result, given that the classifier falsely detected these NOOD samples as ID samples. The car class has the best ID/NOOD separability. We suspect this is due to a large sample count in the training data compared to the pedestrian and cyclist classes.

Performance on the 16-channel LiDAR dataset
Models are trained with the KITTI training set. Performance is tested with our annotated 16-channel LiDAR data. Therefore, this study investigates the performance of models on a low-resolution point cloud when trained with a high-resolution point cloud. Furthermore, labels span the full 360° in our dataset, unlike in the KITTI dataset, where labels are limited to an approximately 90° sector. Our method performs on par with the state-of-the-art while achieving real-time performance on the CPU (Table 3). Our method performs the best in the study that measures the performance difference from the KITTI dataset to the low-resolution and low-perspective point cloud dataset (Table 4).

Ablation study
Table 5 illustrates the contributions of the proposed modules to the mAP and FPS. The modules improve the mAP and FPS significantly. The first ID pass-through module improves the speed significantly, as it reduces the number of samples processed by the box estimator. Based on the ablation of the PVLE modules, the location carries valuable information regarding the 3D road user detection task, which supports the hypothesis discussed in the methods. The PVLE module for the box estimator slightly improves the results for the car and cyclist classes while worsening them for the pedestrian class. This suggests that the module can be prone to overfitting if the total number of samples is small, as it is for the pedestrian class.

Ground segmentation
The performance test results are summarized in Table 6. Our ground segmentation method is compared to frequent and state-of-the-art methods in the literature. It performs well in terms of computational cost, accuracy, and IOU. This is due to our effective sampling method, which reduces the iterations needed in the RANSAC function. Furthermore, fitting multiple planes in the sensor azimuth direction yields more accurate segmentation, especially on an uneven ground surface.

Qualitative results and discussion
For qualitative analysis, we have randomly picked detection results, which are visualized in Fig. 8. Videos displaying the detection performance are available online (see footnote 1). The geometrical proposal generator reduces the computational requirement significantly while still achieving mAP comparable to the state-of-the-art. The trade-off suggests that not using learned proposals is justified. The proposal generator has another benefit, too. It allows data streaming, meaning that the point clouds can be processed in smaller sectors to start the processing earlier than in whole-scan approaches. This will decrease the latency of the detection significantly. The limitation of the proposal generator is cluster fusion when physical contact between road users is visible to the sensor. This could be solved using another proposal generator, such as furthest point sampling. However, this is not the accuracy bottleneck of our approach since the performance increases when the IOU threshold is decreased. This is especially apparent with the car class, which has a harsh 0.7 IOU threshold. Although the car class separated better from the OODs than the pedestrian and cyclist classes, the final bounding box predictions were worse with the car class. Therefore, the accuracy bottleneck is in the bounding box estimator network, caused by incorrectly predicted bounding boxes on correctly classified proposals. Thus, improvements to this network would be a good subject for future research.
How does the OOD training objective affect the accuracy of the ID classification and bounding box estimation? By adding an energy-based OOD training objective, networks learn not only the original task but also the energy-based task. This decreases the performance of the original task slightly. However, many false positives are removed using the energy values, which increases AP more than the rare classification errors decrease it.
Is the inductive bias of the proposal voxel location encoder beneficial? The bias of the module is beneficial as the voxel grid resolution is relatively low. This results in more proposals for a single voxel location, which results in a more general representation of the location. Hence, the model is not prone to overfitting to voxel location information. This is indicated by the results in Table 5.

Conclusion
This paper presented a novel architecture for the 3D road user detection task. The architecture has an extremely low computational requirement; therefore, it is suitable for applications with limited computational resources. A frame rate of 15.2 FPS was achieved with a 4.0 GHz CPU-only implementation while having comparable accuracy to the state-of-the-art. Furthermore, our architecture performed the best on the low-resolution LiDAR dataset. The architecture is based on a geometrical proposal generator and out-of-distribution- and location-aware PointNets. To our surprise, the accuracy bottleneck was not the proposal generator but the bounding box estimator. In the future, improvements to the bounding box estimator could be carried out, and other OOD detection methods could be studied in the 3D road user detection task.

Fig. 1 The proposed architecture. The input point cloud is organized by mapping $\mathbb{R}^{n \times 3} \to \mathbb{R}^{s_h \times s_w \times 3}$. Then, the ground segmentation, coupled with a clustering algorithm, generates simple proposals fed into the classifier neural network. Then, the first ID pass-through module discards coarse OOD proposals, which enables low computational requirements for the box estimation network. Similarly, the second ID pass-through module discards boxes that are OOD. In parallel, the PVLE encodes the locations of the proposals and feeds them into the classifier and the box estimator. The final output of the pipeline is 3D bounding boxes and class probabilities for the objects of interest

Fig. 3 Proposed classifier PointNet is shown in (a) and the bounding box estimator PointNet is shown in (b)

Fig. 4 Proposal location normalization is shown in (a) and the proposal voxel location encoder is shown in (b)

Fig. 5 The mobile robot platform equipped with a 16-channel LiDAR and the environment used for data collection

Fig. 6 Energy distributions from the out-of-distribution-optimized classifier display ID/OOD separability of each class

Fig. 7 Bounding box estimator energy distributions with the $L_{box-energy}$ term in the training objective. The distributions display the NOOD/ID separability of each vector pair

Table 1 The numbers of 3D bounding box labels in our 16-channel LiDAR point cloud dataset. This set is only used for testing

Table 2 3D detection on the KITTI dataset

Table 3 3D detection on the 16-channel LiDAR mobile robot dataset

Table 5 Ablations of different modules and their contribution to the AP and the FPS on the KITTI dataset. PVLE: proposal voxel location encoder, IDP: ID pass-through module. When IDP(cls) is -, it is replaced with a typical softmax confidence threshold