
Attribute annotation and bias evaluation in visual datasets for autonomous driving

Abstract

This paper addresses the often overlooked issue of fairness in the autonomous driving domain, particularly in vision-based perception and prediction systems, which play a pivotal role in the overall functioning of Autonomous Vehicles (AVs). We focus our analysis on biases present in some of the most commonly used visual datasets for training person and vehicle detection systems. We introduce an annotation methodology and a specialised annotation tool, both designed to annotate protected attributes of agents in visual datasets. We validate our methodology through an inter-rater agreement analysis and provide the distribution of attributes across all datasets. These include annotations for the attributes age, sex, skin tone, group, and means of transport for more than 90K people, as well as vehicle type, colour, and car type for over 50K vehicles. Generally, diversity is very low for most attributes, with some groups, such as children, wheelchair users, or personal mobility vehicle users, being extremely underrepresented in the analysed datasets. The study contributes significantly to efforts to consider fairness in the evaluation of perception and prediction systems for AVs. This paper follows reproducibility principles. The annotation tool, scripts and the annotated attributes can be accessed publicly at https://github.com/ec-jrc/humaint_annotator.

Introduction

Autonomous Driving Systems (ADS) rely on accurate detection of persons and vehicles, as well as prediction of their future actions and motions, in order to operate safely and efficiently. Indeed, perception and prediction (a.k.a. dynamic scene understanding) is one of the key operational layers of Autonomous Vehicles (AVs) [1]. This task primarily utilizes information obtained from digital cameras and range sensors, such as LiDAR or radar. Generally, the most sophisticated perception and prediction systems are built upon deep learning models, which significantly depend on datasets for their development and evaluation [2]. Although these approaches are becoming increasingly advanced and effective, current design and evaluation methods overlook the ethical principle of fairness. In the machine learning domain, this principle is generally interpreted as equality of opportunity [3] or the absence of unfair bias [4], and it refers to the potential variation of performance levels across different population demographics [5].

The risk of unequal performance can be attributed to two main components. The first is bias in the datasets. The performance of machine learning models is heavily influenced by the number of training samples; therefore, categories that are underrepresented in datasets will inherently pose a greater challenge for the model to learn [6, 7]. The second is the lack of algorithmic fairness mechanisms that counteract the bias in the datasets. Algorithmic solutions encompass various strategies, including pre-processing, in-processing and post-processing mechanisms [8]. However, as stated in [9], algorithmic interventions alone are unlikely to be the most effective path towards fairness in machine learning, making dataset interventions necessary. Furthermore, most algorithmic approaches are supervised and require explicit annotation of protected attributes, which further underscores the need for intervention at the dataset level [9].

Key attributes for studying bias in visual datasets (Footnote 1) for autonomous driving may include sex/gender, age, or skin tone for pedestrians, and colour, type, or size for vehicles. For example, a dataset that primarily features lighter-skinned people [10] or certain age groups [11] may not accurately reflect the diversity of people on the road in other scenarios. Similarly, a dataset that primarily features small vehicles of certain colours may not accurately represent the range of vehicles encountered in other real-world situations. The consequences of dataset bias in the field of autonomous driving can be significant. Inaccurate detection of certain types of pedestrians or vehicles, as well as imprecise motion prediction, can result in varying behaviours of autonomous vehicles towards different road agents [5]. This could potentially lead to disparities in accident and injury rates, thereby producing an unfair outcome.

Despite its importance, the study of bias in the context of perception for AVs has been relatively neglected in the literature, with some exceptions [10,11,12]. This is not the case in the computer vision community, where there is a considerable body of work focused on addressing the issue of underrepresentation of certain demographic groups in datasets [13].

In the general field of machine learning, proposed standards for model [14] and data [15] transparency reporting advocate the disclosure of performance metrics, disaggregated by relevant population groups. Furthermore, numerous strategies to mitigate bias require labelling attributes in at least a portion of the training dataset [16]. To categorize different demographic groups, it is necessary to have a certain level of granularity in the attributes of the agents. In terms of fairness, some attributes may be protected, that is, the performance of the model should be agnostic with respect to them [3].

Attribute labelling is thus a crucial step in identifying biases in visual datasets. The annotation of attributes such as age, sex/gender, and skin tone in person datasets, or car model in vehicle datasets, can pose significant challenges. These challenges stem from the subjective perception of these characteristics, potential biases of the annotators (including unconscious biases), and the ethical and privacy considerations involved in the process. Therefore, the annotation procedure plays a pivotal role, especially when labelling large amounts of data that require multiple annotators working in parallel on non-overlapping datasets.

The main contributions of this paper are as follows:

  • We present a new set of annotations for subsets of some of the most commonly utilised visual datasets for training perception and prediction systems for different road agents in the autonomous driving domain. We focus on both person and vehicle datasets that meet an inclusion criterion, taking into account minimum image resolution and diversity. Attributes of over 90K persons and more than 50K vehicles have been annotated, including previously unconsidered attributes such as the individual’s means of transport or group assignment. As depicted in Table 1, the annotations provided in our work markedly surpass the scale and variety found in previous works.

  • We introduce a specialised annotation tool that allows multiple users to simultaneously annotate equity-relevant attributes of road agents in visual datasets for AVs, including the proposal for a common standardised file format for the recorded data and associated attributes.

  • Additionally, we have developed an annotation methodology designed to minimize common errors, biases, and discrepancies among annotators. We have validated this methodology through an analysis of inter-rater agreement, ensuring its reliability and effectiveness, and identifying the most sensitive issues when labelling such attributes in this context.

  • Finally, we present the final distribution of attributes across all datasets, and identify the most important biases.

In addition, it is important to highlight that our contribution is not focused on the general domain of image annotation tools, but rather on the specific context of annotating equity-related or protected attributes of road agents using available visual datasets that have already been annotated. Numerous tools for both region (e.g., bounding boxes) and semantic labelling (e.g., pixel level) have been available for a long time [17], offering a variety of functionalities (e.g., collaborative labelling, video annotation), metadata and types of labels [18]. However, to our knowledge, there is no specialised tool capable of reading the wide range of label formats found in the most significant visual datasets in the context of autonomous driving, allowing concurrent annotation of potentially equity-relevant attributes, and establishing the minimum levels of overlap required for inter-rater agreement analysis. Moreover, our tool cannot be dissociated from the methodology proposed for the annotation of these attributes, so the main contribution comes from the combination of both.

The annotation tool, scripts and the annotated attributes are publicly available. The presented work serves as a significant step towards future work in examining biases and addressing potential measures to ensure fairness of perception systems for autonomous vehicles.

Related work

In this section, we review the principal studies related to attribute annotation and bias analysis in datasets within the broader context of computer vision, and specifically in the case of autonomous driving. While this topic has also been explored in other non-visual domains [19], such analysis falls outside the scope of this paper as we focus on the annotation of visual attributes.

Table 1 Comparison of person and vehicle annotated attributes in visual datasets in the autonomous driving domain

Computer vision

The problem of dataset bias has been widely addressed in the computer vision context in multiple application domains [13]. For example, in face recognition and gender classification [20], face attribute detection [21], facial expression and emotion recognition [22, 23], human activity recognition [24], fine-grained vehicle type classification [7], or image captioning [25], among others.

Similarly, numerous studies focus on bias-aware annotation of protected attributes in new or pre-existing datasets. In [20], the authors released a new balanced dataset, the Pilot Parliaments Benchmark (PPB), by collecting and annotating data related to gender and skin tone from photos of members of six different parliaments, with a total of 1.2K images. They also annotated these attributes in two existing datasets, IJB-A and Adience. A subset of about 108K images, extracted from the YFCC-100 M dataset, was utilised in [26] to label race, gender and age attributes for face recognition systems. They used crowd workers for the annotation process, with three annotators per image. Consensus among annotators determined the ground truth. Disagreements led to reassignment or discarding of images. A model trained on these initial annotations was used to refine them, with manual re-verification for any discrepancies. As stated in [13], this process does not guarantee correct annotations. The same YFCC-100 M dataset was used in [27] to create a subset of almost 1 M images including age, gender, skin tone, a set of craniofacial ratios, pose and illumination as protected attributes. However, they used automatic detection tools for the annotation process, which incorporated the biases of the datasets used to train these tools. Similar attributes were manually annotated for a smaller number of images (approximately 40K) in [28], also in the context of face analysis.

In [9], the authors employed crowdsourcing to manually annotate gender, skin colour and age attributes from the “person” subtree of the ImageNet dataset. Each image was annotated by at least two workers and consensus was necessary to accept the annotations. They also implemented a quality control process to exclude workers with high error rates. Similarly, in [16] a randomly selected subset of images from the Open Images dataset of about 100K images was annotated for the ‘person’ class, with a focus on attributes of gender and age. Specifically, the categories of person, man, woman, boy and girl were selected. The annotators were instructed about the different labels, including instances where the ’unknown’ label might be the appropriate choice. In [29], a new dataset was introduced, comprising videos from over 3K subjects. The subjects self-annotated the attributes of gender and age, while a group of annotators labelled the skin tone.

Finally, it is worth noting that although there are specific tools for the automatic analysis and detection of biases in visual datasets, such as Amazon SageMaker Clarify [30], Google’s Know Your Data [31], or more recently, the REVISE tool [32], most of the aforementioned works did not specify the particular annotation tool used for the dataset bias identification process. The use of specific ad-hoc tools for each case is the most common approach. Moreover, most of the studies reviewed have somewhat neglected a thorough analysis of annotator reliability, such as inter-rater agreement, particularly when multiple annotators label the same or distinct data.

Autonomous driving

In the field of autonomous driving, there is a limited body of work focusing on fairness or bias analysis. In [11], the author augmented the annotations of the INRIA Person Dataset by including gender (male or female) and age (child or adult) categories. The findings highlighted a significant underrepresentation of the child category compared to adults, and a lesser number of female agents compared to male ones.

In [10], the authors annotated pedestrians in the BDD100K dataset according to two different skin tones (light or dark). They used bounding boxes of a minimal size and collected the annotations using crowd workers for the training set (\(\sim 5000\) pedestrians), while using their own annotations for the validation set (\(\sim 500\) pedestrians). For the crowdsourced annotations, they only took into account those agents where a consensus was reached. They showed that the annotated subset of the BDD100K dataset contained more than three times as many individuals with light skin compared to those with dark skin. Recently, the same dataset was utilised in [32] to conduct a geography-based bias analysis, which was correlated to income and weather conditions.

Finally, we highlight a recent study [12] that focused on gender, age and skin tone as protected attributes in the following pedestrian datasets: CityPersons, EuroCity Persons (Day and Night) and BDD100K. Two annotators independently labelled gender (male or female) and age (adult or child) for the same agents above a certain bounding box size. They annotated approximately 16K and 20K labels corresponding to gender and age, respectively, and used the available annotations for skin tone from the BDD100K dataset [10]. They found a slight difference in the representation of males compared to females, and a significant bias towards adults compared to children across all datasets.

To better highlight the contributions of our work in the context of autonomous driving, in Table 1 we compare the available annotations of person and vehicle attributes in visual datasets. As can be seen, the scale and breadth of our annotations substantially improve upon previous efforts.

Datasets

Current state-of-the-art scene understanding or visual perception systems for autonomous vehicles are learning-based or data-driven, and rely heavily on the availability of public datasets. In addition, these datasets serve as the basis for various benchmarking processes that are very useful for the academic and industrial community. Unlike previous works that have focused the bias analysis only on pedestrian detection, and using a single dataset (i.e., the INRIA Person Dataset in [11] and the BDD100K in [10]), we propose to broaden the scope by also considering the detection of cyclists and vehicles, and including multiple datasets. Thus, our goal is more ambitious, as we aim to develop a tool and validate a methodology that can be applied holistically to annotate additional attributes in any visual dataset for the autonomous driving domain. With this approach, we also obtain more general bias analysis results.

We have selected up to six different datasets that are highly representative of the current computer vision data ecosystem in the field of autonomous driving. Our inclusion criteria for the selection of the datasets take into account several parameters, such as minimum image resolution, and diversity in geographical distribution and type of agents, including datasets from Europe, the US and China, and agents such as pedestrians, riders, users of personal mobility devices, wheelchairs and different types of vehicles.

It is important to note that most datasets were published as a challenge for researchers with an associated benchmarking process. In such cases, annotations (i.e., bounding boxes) are only available for training and validation datasets, so attributes cannot be annotated for test datasets. The six datasets that have been selected for the annotation and bias study process are briefly described below. An overall but detailed comparison is presented in Table 2, and the general distribution is depicted in Fig. 1.

Table 2 Comparison of person and vehicle annotated datasets in the autonomous driving domain

KITTI dataset

The KITTI dataset is probably one of the most well-known datasets in the context of autonomous vehicles. Presented in 2012 [33], it provides a suite of vision tasks such as stereo, optical flow and visual odometry, built using an autonomous driving platform. It contains an object detection dataset, including mid-resolution images (1240 \(\times\) 376) and bounding boxes for pedestrians and vehicles. It was recorded in the city of Karlsruhe and nearby areas in Germany. The image resolution, number of agents and diversity are very limited. Although it does not reflect the current state of the art, this dataset is by far the most widely used in this domain. For the object detection task, the training set provides the location of about 6336 agents corresponding to people, including the pedestrian, sitting person and cyclist categories, and about 33261 vehicles, including the car, van, truck and tram categories.

Tsinghua-Daimler dataset

The Tsinghua-Daimler Cyclist (TDC) dataset mainly focuses on cyclists and other riders. It was presented in 2016 [34], and recorded using a moving vehicle in the urban traffic of Beijing, China. In this case, the labels are provided for the training, validation and test datasets, including a total of about 31115 agents corresponding to the person category, mainly cyclists, but also pedestrians.

Fig. 1

Original distribution of annotated samples for the persons and vehicles categories

CityPersons dataset

The CityPersons dataset was presented in 2017 [35] as a new set of person annotations on top of the Cityscapes dataset [39]. Cityscapes, originally presented in 2016, contains a diverse set of stereo video sequences recorded in 50 different cities. The CityPersons annotations correspond to a subset of 27 different cities (most of them in Germany), and they include training and validation datasets with about 23089 bounding boxes and labels for persons, including pedestrians, riders, sitting persons and others (unusual postures).

EuroCity Persons dataset

The EuroCity Persons dataset, presented in 2019 [36], was specifically conceived to provide highly diverse annotations of pedestrians, cyclists and other riders in urban traffic scenes, covering 31 cities in 12 different European countries. It also provides data in night-time conditions. Including daytime and nighttime conditions, the training and validation datasets contain about 171145 annotations corresponding to persons.

BDD100K dataset

The Berkeley Deep Drive dataset (BDD) was presented in 2020 [37]. It consists of over 100K video clips recorded from more than 50K rides covering multiple cities and nearby areas in the US (e.g., New York, Oakland, Berkeley, San Francisco, San Jose, etc.). It includes up to six different weather conditions, and three different times of day, including nighttime sequences. For each video, they provide bounding box annotations for one reference frame, including pedestrians and riders for the person category, cars, trucks, buses, trains and motorcycles for the vehicle category, as well as other elements such as traffic lights and signs. In total, the training and validation datasets include about 109777 and 875024 samples from the person and vehicle categories respectively.

nuImages dataset

The nuTonomy Scenes (nuScenes) dataset was recorded in Boston (US) and Singapore. Presented in 2020 [40], it was conceived as the first multimodal dataset providing data from a complete autonomous vehicle sensor suite (6 cameras, 5 radars, and 1 lidar). The nuScenes dataset was complemented with the nuImages dataset [38], which offers a total of 93K annotated 2D images from the 6 cameras, with higher variability. These annotated images encompass conditions such as rain, snow and nighttime, and provide up to 23 different classes of objects. This includes pedestrians and riders within the person category, and cars, trucks, trailers, motorcycles, and buses within the vehicle category. It also offers object attributes, some of them very well aligned with our annotations, such as adult, child, personal mobility device or wheelchair. This is a very large-scale dataset that contains a huge number of annotations. The training and validation datasets include about 165587 annotated samples for the person category, and about 332637 for the vehicle category.

Fig. 2

Distribution of annotated samples filtered by area for persons datasets

Fig. 3

Distribution of annotated samples filtered by area for vehicles datasets

Data selection

Manual data annotation is a costly and time-consuming process. The manual effort and associated costs required to label images also depend on many factors, such as the complexity of the images, the number and type of agents, the annotation attributes, etc. [41]. In this project, taking into account the available resources and the number of attributes to be annotated per agent, we were able to target \(\sim\)140K agents, plus a minimum of \(6\%\) of the samples annotated by at least 5 different annotators to carry out the inter-agreement analysis. One of our hypotheses is that possible biases will be more prevalent for persons. In addition, the number of datasets for persons is larger. Therefore, we assigned roughly 2/3 of the target samples to the person class (\(\sim\)90K) and 1/3 to the vehicle class (\(\sim\)50K).

For attribute labelling to be effective, agents must be visible at a reasonable resolution. There is preliminary evidence that, for small image sizes, the level of disagreement between several annotators when labelling attributes (such as skin tone) can be considerably higher [10]. As the number of agents available in all datasets is well above the target of 140K, we first studied the distribution of the samples according to the size of the bounding box area. For the person category, we counted the number of agents with bounding box areas greater than or equal to 3000, 6000 and 10000 pixels. For the vehicle category, our preliminary analysis showed that smaller sizes still allowed the attributes under study to be clearly identified, so we analysed the samples with bounding box areas greater than or equal to 2000, 5000 and 8000 pixels. The distributions of annotated samples filtered by area for the persons and vehicles datasets are depicted in Figs. 2 and 3 respectively. As can be seen, the relative number of person samples with a resolution higher than 10000 pixels is particularly low for the BDD100K and nuImages datasets, which can be explained by the lower image resolution used (see Table 2).

The specific distribution of annotated samples filtered by area for the persons and vehicles datasets is shown in Tables 3 and 4 respectively, including the tentative final targets, the coverage with respect to the available samples for the selected bounding box area, and the minimum number of samples to be annotated by the annotators to perform the inter-rater reliability analysis. On the one hand, for the vehicles datasets, the number of samples available with the maximum bounding box area (8000 pixels) is more than enough to reach a minimum of 50000 samples. Therefore, we filtered out all samples with an area of less than 8000 pixels (a minimum square bounding box of about 89 \(\times\) 89 pixels). On the other hand, for the persons datasets, the number of samples with an area greater than 10000 pixels (as in [10] for skin tone annotation) is less than 75000. Therefore, to reach the goal of 90000 samples, it was necessary to reduce the filtering threshold to 6000 pixels (a minimum bounding box of about 55 \(\times\) 110 pixels). In a preliminary study, we found that this size was sufficient to correctly identify the attributes under study, provided that visibility conditions were adequate. With the aforementioned thresholds, we attempted to annotate on average 83% of the person samples with a bounding box area greater than 6000 pixels, and 16% of the vehicle samples with a bounding box area greater than 8000 pixels. Two examples corresponding to different bounding box sizes for the persons and vehicles datasets are depicted in Figs. 4 and 5 respectively. As can be observed, the smaller resolutions do not allow attributes such as sex or skin tone for persons, or type and colour for vehicles, to be clearly differentiated.
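As an illustration of this filtering step, the following Python sketch applies the selected area thresholds to a list of agents. The agent dictionary layout (a bbox field given as [x, y, width, height]) is an assumption made for illustration only and does not correspond to the native format of any particular dataset.

```python
# Minimal sketch of the area-based filtering described above. The agent layout
# (a "bbox" given as [x, y, width, height]) is an illustrative assumption.

PERSON_MIN_AREA = 6000   # pixels, threshold selected for the person datasets
VEHICLE_MIN_AREA = 8000  # pixels, threshold selected for the vehicle datasets


def bbox_area(agent):
    """Return the bounding-box area in pixels."""
    _, _, width, height = agent["bbox"]
    return width * height


def filter_by_area(agents, min_area):
    """Keep only the agents whose bounding box is large enough to annotate."""
    return [a for a in agents if bbox_area(a) >= min_area]


agents = [
    {"id": 1, "category": "person", "bbox": [100, 50, 60, 120]},  # 7200 px
    {"id": 2, "category": "person", "bbox": [300, 80, 40, 90]},   # 3600 px
]
print([a["id"] for a in filter_by_area(agents, PERSON_MIN_AREA)])  # -> [1]
```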

The large number of samples to be labelled requires the involvement of multiple non-overlapping annotators, each labelling different subsets of the datasets (one-way model [42]). In total, 5 different annotators were involved. However, before carrying out the non-overlapping labelling, it is necessary to assess the degree of agreement between the different annotators and, once validated, to continue with the labelling of the remaining samples. For this purpose, and based on the available resources, a minimum of 6% of the total number of samples was defined to be labelled by all annotators (two-way model [42]). As shown in Tables 3 and 4, this amounts to a total of 5400 labelled samples per annotator for the person datasets (27000 annotations in total) and a total of 3000 samples per annotator for the vehicle datasets (15000 annotations in total). These samples were used to perform the inter-rater reliability analysis.
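The per-annotator quotas used for the inter-rater analysis follow directly from applying the \(6\%\) threshold to the final annotation targets:

$$\begin{aligned} 0.06 \times 90000 = 5400 \text{ persons per annotator}, \qquad 5 \times 5400 = 27000 \text{ annotations}, \end{aligned}$$

$$\begin{aligned} 0.06 \times 50000 = 3000 \text{ vehicles per annotator}, \qquad 5 \times 3000 = 15000 \text{ annotations}. \end{aligned}$$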

Table 3 Distribution of annotated samples filtered by area for persons datasets, including the final goal and the number of required annotations per annotator
Table 4 Distribution of annotated samples filtered by area for vehicles datasets, including the final goal and the number of required annotations per annotator
Fig. 4

Two examples of different sizes of bounding boxes for persons datasets (resolutions and areas). Smaller resolutions do not allow for clear identification of sex or skin tone

Fig. 5

Two examples of different sizes of bounding boxes for vehicles datasets (resolutions and areas). Smaller resolutions do not allow for clear identification of vehicle type or colour

Annotation

This section describes the data formatting and annotation tool, the attributes annotated for persons and vehicles, and the methodology followed during the annotation process.

Data formatting and annotation tool

For this work, we created a dedicated tool that enables multiple users to annotate additional attributes of agents (i.e., persons and vehicles) in visual datasets designed for autonomous driving. The tool was designed to seamlessly interact with the datasets described in "Datasets" Section. The configuration of the tool requires the pre-definition of the final goals, the number of annotators, and the minimum number of samples for inter-rater agreement analysis for each dataset (Tables 3 and 4). After these requirements are pre-defined, the annotation process becomes fully transparent to annotators for both inter-agreement and non-overlapping samples. That is, once one of the users has reached the minimum number of inter-agreement samples, the next annotated samples for that user are automatically assigned to the non-overlapping set.
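A minimal sketch of this routing logic is shown below; the names used (the per-annotator targets and counters) are illustrative assumptions, and the actual implementation in the public repository may differ.

```python
# Sketch of how a newly annotated sample could be routed either to the shared
# inter-agreement pool or to the annotator's own non-overlapping pool.
# Names and data structures are illustrative assumptions.

from collections import defaultdict

# Samples per annotator reserved for the inter-rater agreement analysis
INTER_AGREEMENT_TARGET = {"persons": 5400, "vehicles": 3000}

inter_counter = defaultdict(int)  # (annotator, category) -> shared samples so far


def assign_pool(annotator, category):
    """Decide to which pool the next sample annotated by `annotator` belongs."""
    if inter_counter[(annotator, category)] < INTER_AGREEMENT_TARGET[category]:
        inter_counter[(annotator, category)] += 1
        return "inter-agreement"
    return "non-overlapping"


print(assign_pool("annotator_1", "vehicles"))  # -> "inter-agreement"
```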

There is currently no standardised format for storing annotation data for these datasets. Each dataset uses its own format for storing image labels and metadata. Therefore, we cannot build a single interpreter to interact with all datasets. To make the data more portable, easy to read and use, and hopefully establish a standard for future datasets, we created a specific parser to convert the annotation data files of all datasets into a common dictionary, organising the data in JSON files. The parser is publicly available, and the proposed standardised data format for image metadata, generic agent data and new attributes is detailed in Appendix D.
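The sketch below illustrates the idea of such a conversion into a common dictionary serialised as JSON; the field names shown are illustrative assumptions, and the actual standardised schema is the one described in Appendix D.

```python
# Illustrative sketch of wrapping dataset-specific annotations into a common
# dictionary stored as JSON. Field names are assumptions made for illustration;
# the actual standardised schema is described in Appendix D.

import json


def to_common_format(image_path, dataset_name, agents):
    """Combine image metadata and generic agent data into a single record."""
    return {
        "dataset": dataset_name,
        "image": {"path": image_path},
        "agents": [
            {
                "id": agent["id"],
                "category": agent["category"],  # e.g. "person" or "vehicle"
                "bbox": agent["bbox"],          # [x, y, width, height]
                "attributes": agent.get("attributes", {}),  # filled by the tool
            }
            for agent in agents
        ],
    }


record = to_common_format(
    "images/000001.png", "KITTI",
    [{"id": 0, "category": "person", "bbox": [120, 40, 80, 160]}],
)
with open("000001.json", "w") as f:
    json.dump(record, f, indent=2)
```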

The annotation tool was implemented using a backend based on Flask, which mainly serves as a communication interface between the database, the frontend and the storage. The frontend was developed in JavaScript as a web application. It includes all visualisation and user interaction functionalities. The entire system was deployed on an internal local server. The tool is publicly available.

Broadly speaking, the annotation procedure implemented in the tool is as follows. Once the user has logged into the annotation web interface and selected a dataset, the tool automatically provides images of the dataset to be annotated. As can be seen in Fig. 6, the interface displays the bounding box of each agent that meets the minimum size criterion. By clicking on each agent or using the agent tabs below the image, the user can select the agent and assign different attributes to it from the attributes menu. In addition, a zoom function is available to inspect the image in detail. Once all agents have been annotated, the system will provide a new image, until the final target for each dataset has been met.

Fig. 6

Annotation tool interface. Example of a persons dataset. All active agents (minimum size) must be annotated by clicking and selecting the different attributes. In this example agents have been pre-configured in two different groups. The tool is available at: https://github.com/ec-jrc/humaint_annotator

Finally, two features have been automated in the tool. The first is the pre-assignment of agents to groups, which is based on a similarity criterion derived from the position and size of the agents’ bounding boxes. This automatic assignment can be manually corrected during the annotation process. The second is the automatic copying of agent attributes across a sequence in the nuImages dataset. In this case, the key frame is labelled, and the same attributes are copied to the rest of the frames in the sequence using the unique agent identifier.
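As an illustration, a possible pre-grouping rule based on the position and size of bounding boxes is sketched below; the specific criterion and threshold are assumptions made for illustration, and the rule implemented in the tool may differ (in any case, pre-assignments can be corrected manually).

```python
# Sketch of a possible group pre-assignment rule based on the position and size
# of the bounding boxes. The threshold and criterion are illustrative
# assumptions; the tool's actual rule may differ.


def centre_and_size(bbox):
    """Return the centre of the box and its larger side, from [x, y, w, h]."""
    x, y, w, h = bbox
    return (x + w / 2.0, y + h / 2.0), max(w, h)


def same_group(bbox_a, bbox_b, factor=1.5):
    """Pre-group two agents if their centres are closer than `factor` times
    the larger of the two box sizes."""
    (xa, ya), sa = centre_and_size(bbox_a)
    (xb, yb), sb = centre_and_size(bbox_b)
    distance = ((xa - xb) ** 2 + (ya - yb) ** 2) ** 0.5
    return distance < factor * max(sa, sb)


print(same_group([100, 50, 60, 120], [180, 55, 55, 115]))  # -> True (nearby)
print(same_group([100, 50, 60, 120], [700, 60, 50, 100]))  # -> False (far apart)
```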

Attributes

In this work, we focus on the intrinsic attributes of agents that have an impact on vision-based perception and prediction systems. The appearance of the agent is the most prominent feature typically used by detection systems of this type, which learn from samples in datasets collected from real-world driving. The attributes of agents, including their appearance, can have a significant effect on the functioning of the perception system. Therefore, if the datasets used to train the model are not well balanced, the performance of the system may be compromised depending on the type of agent.

This effect can also occur in systems for predicting the motion and actions of agents. Most predictive perception systems are based on learning an agent behaviour model from data. The behaviour of the agents, including their possible future actions and motions, depends on both intrinsic and extrinsic factors [43]. While our focus is on intrinsic features that may affect agent behaviour, other extrinsic factors such as scene layout, traffic, lighting, and weather conditions also play a key role.

In the following, we describe the attributes considered for the two main types of agents: persons (see Fig. 7) and vehicles (see Fig. 8).

Persons datasets

Age This attribute can have an impact on both the appearance [11] and the behavior model [44] of a person-type agent. Given the typical image resolution available for pedestrian detection, bounding box size is the most important appearance factor, and it differs noticeably between children and adults. That is, at the same distance, the size of the pedestrian projected in the image is significantly smaller for children than for adults, and pedestrian detection systems are well known to report worse results for small agent sizes [45]. With respect to behavior, numerous studies have identified significant differences between children and adults [46], and in some cases with respect to elderly people [47]. However, as in [11], only the child and adult categories have been considered, since distinguishing elderly people from younger adults is quite challenging for both an annotator and a perception system [48], especially at medium or low bounding box resolutions.

Sex/gender Consistent with previous studies [11, 20], we treat this attribute as a binary variable according to traditional or stereotypical male/female physical appearance and morphology. The visual appearance of individuals is influenced by many factors, including both sex (e.g., sexual dimorphism, body and facial morphology [49]) and gender (e.g., clothing, hairstyle) (Footnote 2). Significant discrepancies in sample distribution between female and male pedestrians may affect the performance of vision-based detection systems [11]. Furthermore, there is evidence that this attribute can affect pedestrian behavior as well [50], and thus any potential bias in the data can significantly impact the accuracy of predictive systems.

Fig. 7

Distribution of labelled persons attributes

Fig. 8

Distribution of labelled vehicles attributes

Skin tone This attribute refers to skin pigmentation of person-type agents. As proposed in [10], we broadly divide the Fitzpatrick skin type scale [51] into two groups: light skin tone corresponding to types I-III and dark skin tone corresponding to types IV-VI. Higher levels of granularity are not necessary for the purpose of the analysis and would be too complex to annotate accurately. Previous evidence on skin tone for pedestrians indicates that when individuals with dark skin tone are underrepresented in datasets, pedestrian detection systems tend to perform less accurately compared to the detection of light skin tone individuals [10]. However, this attribute is less relevant for predictive systems as, to the best of our knowledge, there is currently no evidence indicating different behaviors depending on skin tone.

Group/individual Pedestrians can move either individually or in a group (i.e., two or more people moving together). This attribute can affect both detection and prediction. On the one hand, when moving in a group, different degrees of overlap are more likely, meaning that some parts of pedestrians may not be fully visible from the camera, or that the main region of a person may contain parts of other persons. These effects may degrade the performance of the vision-based perception system, resulting in poorer performance when detecting people in a group compared to individuals [52]. On the other hand, there is evidence that when pedestrians are moving or crossing as a group, they tend to behave differently than when they are alone [53], for example, accepting shorter gaps between vehicles to cross, or not looking at oncoming traffic. Group size exerts some form of social force over individual pedestrians [54]. It also affects pedestrian flow and speed. Therefore, action and motion prediction systems will be affected by this attribute.

The annotation tool automatically pre-assigns agents to groups, each with a unique identifier. However, in the event of an error such as two close pedestrians crossing in opposite directions, annotators can undo the pre-selection, remove groups and re-assign agents to different groups as needed.

Means of transport Appearance, motion dynamics, and decision making differ according to the means of (non-motorized) transport used by the person. This category includes pedestrians, wheelchair users, bicyclists, and users of Personal Mobility Devices (PMD) such as electric scooters, hoverboards, unicycles or segways [55]. Therefore, this attribute impacts both detection and prediction systems.

Vehicles datasets

Vehicle type The appearance of vehicles is influenced by the type of vehicle, affecting its shape and size. There is evidence of bias in the detection and classification of vehicles when certain types of vehicles are underrepresented in the datasets [7]. But the type of vehicle also influences the dynamics of movement, with behaviours that can be very diverse (for example, between a motorcycle and a bus). As depicted in Fig. 8, the following categories have been considered for this attribute: car, motorcycle, van/minivan, truck, bus and others. An additional layer is used for the car category, which includes different car types and segments (more details can be seen in Table 9).

Vehicle colour Apart from shape and size (vehicle type) the only attribute that influences the appearance of the vehicle for a specific pose as seen from the camera is the colour. Colour, as a feature, has been used to detect vehicles [56], and vehicle colour recognition is a classic problem within the field of vehicle detection and identification [57], which shows that it is a sufficiently distinctive feature. On the other hand, although preliminary evidence suggests a correlation between vehicle colour and crash risk [58], this effect cannot be directly attributed to driver behavior. Rather, it is linked to the visibility of the vehicle itself and is closely related to appearance and detection. Therefore, this attribute is not considered relevant for predictive systems.

The colour labelling of vehicles is highly dependent on lighting conditions and camera characteristics, conditions that change both at the dataset level and between datasets. After several preliminary studies, it was decided to limit the number of colour categories to eight: black, white, grey, blue, red, yellow, green and others (see Fig. 8). A higher level of granularity would result in larger inter-rater disagreements and labelling errors.

Uncertain and conflicting cases

We have provided the annotator with the category “unknown” for all attributes, except for the group, in anticipation of the possibility of uncertain or conflicting decisions. In this context, “uncertain” refers to situations in which, even with the zoom functionality, it is extremely difficult to select a specific attribute, such as the vehicle colour or skin tone during nighttime conditions. Additionally, the assignment of vehicle and car types can be challenging in certain cases, depending on the perspective and vehicle pose. This uncertainty is also present in the boundaries between categories, such as adult/child or car types. On the other hand, “conflicting decisions” refer to cases where the annotator feels that some kind of stereotypical decision is being made based on his or her own prejudices. For example, in some cases the attributes of sex/gender, age or skin tone are based on certain stereotypical characteristics that may not correspond to reality. As explained in "Persons datasets" Section, we allow the annotator to freely select “unknown” in such cases or make a decision based on such stereotypes.

Annotation methodology

There is growing concern and awareness of widespread problems in the annotation of commonly used image benchmarks, which are subject to errors and biases [59, 60]. Recent work highlights the importance of labelling instructions for high quality annotation [61].

To establish common guidelines and a methodology for the different annotation attributes shared by all annotators, we implemented the following procedure (see Fig. 9). First, a basic tutorial was created for annotators to interact with the annotation tool and identify the different attributes. Next, an initial control set of annotated samples was prepared that was as representative as possible of each dataset, while maintaining a manageable size. To define the control set, a random but representative subset of all datasets (15 images each) was selected using the annotation tool. Based on the instructions and the control set, a synchronous labelling session was held in which all annotators and the coordinator proceeded to label the control set at the same time in the same session, on the same projected screen. For each image, each annotator manually wrote down the attributes of the agents on a piece of paper, and all results were immediately shared, compared and discussed.

This synchronous labelling session pursued two main objectives. The first was to instruct the annotators in the annotation process. Initially, the differences in annotation criteria were larger, but after comparing and discussing several images, a consensus began to emerge, and the process became much faster as there were fewer and fewer differences in annotation. The second was to gather information on the consensus reached in order to incorporate it into the general annotation guidelines (see Appendices A-C). All annotators should follow these guidelines to maximize the quality of the annotations, ensuring consistency between different annotators (high inter-rater agreement) and among annotations by the same annotator (high intra-rater agreement).

Fig. 9

Proposed approach to develop the annotation methodology and guidelines

Results

The annotation process included a control group of 5400 person and 3000 vehicle agents (the so-called inter-agreement subset) that was annotated by all 5 annotators. The remaining agents were annotated only once, by a single annotator. This inter-agreement subset was used, in a first phase, to detect biases among annotators and take corrective actions. In a second phase, it was used to perform an inter-rater analysis and to estimate the expected agreement on the annotations of the remaining agents.

Inter-agreement results

To measure the inter-rater agreement we selected two different metrics. The first one, the percentage of disagreement, represents the probability of an agent not being labelled unanimously, per attribute. It is defined as the number of agents labelled differently over the total number of agents labelled.

$${PD_i} = \frac{1}{{N}\, {T_i}}\sum_{n=1}^{N}{{T}^{D}_{n,i}}$$
(1)

where i denotes the annotated attribute ({Age, Group, MoT, Gender, Skin} for pedestrians and {VehType, Colour, CarType} for vehicles), N is the total number of annotators, \(T^{D}_{n,i}\) is the number of agents labelled by annotator n differently to any other annotator for attribute i, and \(T_i\) is the total number of agents annotated for attribute i.
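A direct implementation of Eq. (1) could look as follows; this is a Python sketch assuming that the labels of the shared subset are stored as one list per annotator, with one entry per agent.

```python
# Sketch of Eq. (1): percentage of disagreement for one attribute.
# `labels` holds one row per annotator and one column per agent.


def percentage_of_disagreement(labels):
    n_annotators = len(labels)
    n_agents = len(labels[0])
    disagreeing = 0  # sum over annotators of agents labelled differently to any other
    for n in range(n_annotators):
        for t in range(n_agents):
            others = (labels[m][t] for m in range(n_annotators) if m != n)
            if any(labels[n][t] != other for other in others):
                disagreeing += 1
    return disagreeing / (n_annotators * n_agents)


# Example: 5 annotators, 3 agents; one annotator differs on the last agent.
labels = [
    ["adult", "adult", "adult"],
    ["adult", "adult", "adult"],
    ["adult", "adult", "adult"],
    ["adult", "adult", "adult"],
    ["adult", "adult", "child"],
]
print(percentage_of_disagreement(labels))  # -> 5 / 15 = 0.333...
```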

Table 5 Average percentage of disagreement per label and dataset

Results displayed in Table 5 can be interpreted as the expected probability of disagreement when labelling each attribute in each dataset. However, these figures have been calculated in a pessimistic way, as all labels have been considered to have the same certainty, including “unclear” and “not clear”. Disagreements where some of the raters choose “unclear” could have been considered to contribute less to the overall disagreement than those where opposite labels are chosen (i.e., male and female).

The second one, Fleiss’ Kappa [62], is a statistical measure for assessing the reliability of agreement between a fixed number of raters when assigning categorical ratings to several items. It is a generalization of Scott’s Pi evaluation metric for two annotators, extended to multiple annotators. Whereas Scott’s pi and Cohen’s kappa work only for two raters, Fleiss’ kappa works for any number of raters giving categorical ratings to a fixed number of items. In addition, not all raters are required to annotate all items. The measure calculates the degree of agreement in classification over that which would be expected by chance. Landis and Koch [63] created a table for interpreting kappa values for a 2-annotator, 2-class example, but this interpretation is not universally accepted and is based on personal opinion rather than evidence. The magnitude of the kappa is known to be affected by the number of categories and subjects. Fleiss provided equally arbitrary guidelines for his kappa: over 0.75 as excellent, 0.40-0.75 as fair to good, and below 0.40 as poor.

The formula for the computation of Fleiss’ Kappa is

$$\begin{aligned} k = \frac{P-P_e}{1-P_e} \end{aligned}$$
(2)

where P is a measure of the agreement in the annotation and \(P_e\) is a measure of the balance in the labels of the annotated set. In this way, for a perfect agreement between annotators, P is 1, decreasing as the agreement reduces. On the other hand, \(P_e\) is 1 in a set where all the samples have the same label, and takes smaller values for more balanced sets, with a value that depends on the number of possible labels: the higher the number of possible labels, the smaller the value of \(P_e\). In this way, \(P_e\) acts as a modifier of the significance of the agreement measured by P, reducing the value of k if the labels in the set are unbalanced.
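For reference, the following Python sketch computes Fleiss’ Kappa from the shared annotations, assuming that every agent in the subset was labelled by the same number of annotators (the degenerate case \(P_e = 1\) is not handled).

```python
# Sketch of Fleiss' Kappa (Eq. 2). `labels` holds one row per annotator and one
# column per agent, as in the previous sketch.

from collections import Counter


def fleiss_kappa(labels):
    n_raters = len(labels)
    n_items = len(labels[0])
    categories = sorted({lab for row in labels for lab in row})

    # counts[i][c]: number of annotators assigning category c to agent i
    counts = [Counter(row[i] for row in labels) for i in range(n_items)]

    # P: mean observed agreement per agent
    p_items = [
        (sum(c[cat] ** 2 for cat in categories) - n_raters)
        / (n_raters * (n_raters - 1))
        for c in counts
    ]
    p_mean = sum(p_items) / n_items

    # P_e: agreement expected by chance, from the overall label proportions
    p_cat = [sum(c[cat] for c in counts) / (n_items * n_raters) for cat in categories]
    p_e = sum(p ** 2 for p in p_cat)

    return (p_mean - p_e) / (1 - p_e)


labels = [
    ["adult", "adult", "child", "adult"],
    ["adult", "adult", "child", "adult"],
    ["adult", "adult", "adult", "adult"],
    ["adult", "adult", "child", "adult"],
    ["adult", "adult", "child", "adult"],
]
print(round(fleiss_kappa(labels), 3))  # -> 0.688
```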

Table 6 Fleiss’ Kappa per label

The results in Table 6 indicate a good to excellent level of agreement, with P ranging from \(70\%\) to \(95\%\), and a poor to fair balance of labels in the annotated datasets, with fair \(P_e\) values for gender, car colour and car type, and very unbalanced labels for age and means of transport. The combination of the level of agreement and the balance in the dataset yields k values ranging from excellent to fair, the most problematic cases being age, due to very unbalanced labelling, and skin tone, due to a combination of fair agreement and poor-to-fair balance.

The computation of Fleiss’ Kappa assumes equal importance for any kind of disagreement but, as mentioned earlier, in our sets not all disagreements carry the same meaning. To give some insight into the nature of the disagreements, we first break down some statistics about the different disagreements in Table 7. Specifically, we separate the disagreements by the number of raters that disagree, and we remove some of the “softer” labels such as “unknown” or “not clear”. Given the number of possible labels M for an attribute and the number of raters N, the different outcomes are given by the combinations with replacement of N elements taken from a set of M, denoted as \(C^R(M,N)\). Table 7 shows the number of samples for each labelling outcome split by attribute. The first row of each category shows the labels without any filtering. The second row shows the same labelling discarding the softer disagreements.
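For completeness, the number of possible outcomes \(C^{R}(M,N)\) mentioned above can be computed as

$$\begin{aligned} C^{R}(M,N) = \binom{M+N-1}{N}, \end{aligned}$$

so, for example, an attribute with \(M=3\) possible labels rated by \(N=5\) annotators admits \(\binom{7}{5}=21\) distinct labelling outcomes.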

Table 7 Distribution of agreed and disagreed samples for the inter-agreement subset per category

The disagreement “level” is measured by means of the \(k_{score}\), which evaluates the number of repeated labels for a single annotation. For a perfect agreement, its value is 1, and it decreases with the number of different labels and their dispersion. Since the number of possible outcomes is limited by the number of possible labels M and the number of annotators N, the \(k_{score}\) is a discrete and tabulated value, as can be observed in Table 7. The \(k_{score}\) is computed according to Eq. 3.

$$\begin{aligned} k_{score} = \frac{l_1^2 + l_2^2 + \cdots + l_{m-1}^2 + l_{m}^2 - N}{N(N-1)} \end{aligned}$$
(3)

where \(l_j\) is the number of annotators that assigned label j (out of the m possible labels) to a given agent for a specific attribute, and N is the number of annotators.
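Since \(N=5\) in our case, Eq. (3) can only take a small number of tabulated values, one per labelling split; the short check below enumerates the splits discussed in this section (e.g., 4/1, 3/2, 3/1/1, 2/2/1).

```python
# Tabulated k_score values (Eq. 3) for N = 5 annotators. Each tuple is a
# labelling split, i.e. how many annotators chose each distinct label.

N = 5


def k_score(split):
    return (sum(l ** 2 for l in split) - N) / (N * (N - 1))


for split in [(5,), (4, 1), (3, 2), (3, 1, 1), (2, 2, 1), (2, 1, 1, 1), (1, 1, 1, 1, 1)]:
    print(split, k_score(split))
# (5,) -> 1.0, (4, 1) -> 0.6, (3, 2) -> 0.4, (3, 1, 1) -> 0.3,
# (2, 2, 1) -> 0.2, (2, 1, 1, 1) -> 0.1, (1, 1, 1, 1, 1) -> 0.0
```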

Qualitative analysis

In addition to the numerical results, we present some qualitative examples of representative disagreements in Fig. 11. The first one, in the third picture of the top row of Fig. 11, is what we consider a “soft” or “fair” disagreement, in which some annotators were reluctant to assign a label in a difficult labelling situation. The picture shows a person looking at a shop window, with male morphology but carrying a bag; the face and hair are not visible. Three of the annotators decided that the morphology and the bag were enough to assign a gender, while the other two decided that it was not clear. Here we find a problem in training the annotators on the degree of certainty they need in order to decide to assign a label. Although this was previously discussed with the annotators, it is very difficult to establish a solid criterion. The second type corresponds to what we consider errors in annotation, which usually appear when 4 annotators agree (4/1). In the second picture of the top row of Fig. 11, pedestrian number 3 received {bicycle, pedestrian, pedestrian, pedestrian, pedestrian} and pedestrian number 4 received {pedestrian, bicycle, bicycle, bicycle, bicycle}, in what seems to be a clear mistake in the agent annotation by annotator 1. The gender annotation of agent 4 provides another example of “soft” disagreement, as it received a 3/1/1 split with {female, female, unknown c, unknown nc, female}. Examples from the vehicles datasets lead to equivalent conclusions. In the last picture of the bottom row of Fig. 11, we can see a vehicle that received 3 grey and 2 unknown labels for colour, which is considered a “soft” disagreement. On the other hand, the car type received 2 large, one medium, one small and one unknown. The size of the car is not visible in the image, and it can only be inferred from the previous knowledge of the annotator or from the appearance of the visible area. In this case, it seems to be a small/medium car, and the two “large” labels are considered annotation errors.

Fig. 10

Final distribution of annotated samples in the persons and vehicles datasets. The number of samples per dataset is more balanced than the original distribution of samples (depicted in Fig. 1)

Fig. 11

Examples of disagreements for pedestrian agents (top row) and vehicle agents (bottom row). Starting from the top left, example of “soft” disagreement on “age” {kid, unknown, adult, kid, kid}. Top row second picture is an example of a disagreement probably by confusion on “means of transport” {bicycle, pedestrian, pedestrian, pedestrian, pedestrian} for agent #3 and {pedestrian, bicycle, bicycle, bicycle, bicycle} for agent #4. Top row third picture is an example of soft disagreement in gender {female, female, unclear, male, unclear}. Top row last picture is an example of soft disagreement on “skin” {light skin, dark skin, dark skin, dark skin, dark skin}. On the bottom row first picture there is an example of “vehicle type” disagreement {car, bus, bus, car, bus}. On the bottom row second picture there is an example of “vehicle type” disagreement {truck, truck, truck, unknown, truck}. On the bottom row third picture there is an example of “soft” disagreement on “colour” {black, unknown, black, unknown, black}. On the bottom row last picture there is an example of “car type” disagreement {large, small, medium, unknown, large}

Complete results

Fig. 12

Annotated “Age”, “Group”, “MoT”, “Sex”, and “Skin tone” distribution for the full datasets, per dataset. Diversity across “Age” and “MoT” is notably low for all datasets. “Group” diversity is reasonable in most datasets. “Sex/gender” shows a slight bias towards males in most datasets. The most diverse datasets with respect to “skin tone” are BDD100K and nuImages, although there is a considerable bias towards light skin

Fig. 13

Annotated “vehicle type”, “car type”, and “colour” distribution for the full datasets, per dataset. The most diverse dataset in terms of “vehicle type” is nuImages, although there is a considerably high bias towards cars. The distribution of “car types” largely depends on the geographical location of the datasets. The distribution of “vehicle colour” is fairly uniform across all datasets

After analysing the inter-agreement results, in this section we present the statistical results of the annotation of the full datasets. The final distribution of annotated samples is depicted in Fig. 10.

Figure 12 depicts the percentages of annotation for the categories “age”, “group”, “means of transport”, “sex”, and “skin tone” in the six datasets. Equivalently, Fig. 13 depicts the percentages of annotation for the vehicle-related categories “vehicle type”, “car type”, and “colour” in the BDD100K, KITTI, and nuImages datasets.

Regarding the “age” category, and as expected from the results in "Inter-agreement results" Section, we can see that the datasets are extremely unbalanced, with over \(90\%\) of the samples being “adult”. On average, only \(1\%\) of the samples are considered kids under 14 years, and most of the differences between datasets come from the samples marked as “unknown”.

The analysis of the “means of transport” attribute shows that most of the samples have been annotated as pedestrians across all datasets, with the exception of TDC, which is a bicycle riders dataset. KITTI has \(\sim 20\%\) of bicycle riders, and BDD100K, CityPersons and EuroCity have \(\sim 7\%\). nuImages has very few samples labelled as riders. Wheelchairs and personal mobility devices are extremely underrepresented, with less than \(1\%\) of the samples across datasets.

For the “sex” attribute, datasets are slightly biased towards males, with an average of \(\sim 55\%\) of the samples labelled “male” and \(\sim 35\%\) labelled “female”. It is worth noticing, however, that the uncertainty (“unknown” labels) is high, and that the datasets with the lowest uncertainty show the highest bias between “male” and “female”. This might indicate that if there were not as many “unknown” samples in EuroCity and TDC, the bias could be higher.

In general, there is an underrepresentation of the “dark skin” label in the “skin tone” attribute, with BDD100K and nuImages presenting a higher percentage (\(12\%\) and \(18\%\)), as both were recorded at least partially in the USA. The high “unknown” percentage in TDC (\(\sim 50\%\)) is also noticeable, most probably because it is a riders dataset recorded during cold weather, with riders wearing heavy clothing.

Moving to the vehicles datasets, most of the vehicles are cars (\(\sim 75\%\)), with a higher presence of trucks in nuImages (\(\sim 10\%\)) and BDD100K (\(\sim 5\%\)), probably because they were recorded in the USA. Motorcycles have a low representation, with nuImages having the highest percentage (\(\sim 5\%\)).

For the “car type” category there is a general prevalence of the “large” car type, especially in BDD100K and nuImages. KITTI has a better balance between “large”, “medium” and “small”, but large vehicles still nearly double any other type.

The dominant labels for the “colour” attribute are black, grey, and white across the datasets. The remaining colours are also balanced, with a lower representation.

Conclusions

In this paper we presented a new set of annotations for subsets of some of the most commonly utilised visual datasets in the autonomous driving domain. To do so, we developed a specialised annotation tool that allowed multiple users to simultaneously annotate more than 90K individuals and 50K vehicles. In order to minimize common errors, biases, and discrepancies among annotators, we measured different inter-agreement statistics and developed a methodology aimed at reducing the differences in annotation between raters. The reported inter-annotator agreement was fair to excellent according to the computed Fleiss’ Kappa. Among the attributes with the lowest agreement are gender and skin tone, probably due to the difficulty of establishing a solid criterion between the binary labelling (i.e., male–female) and the unclear cases. This is supported by the fact that the number of 4/1 disagreements is reduced drastically in these categories when the “softer” labels such as “unclear” are removed. We also presented a qualitative analysis of the disagreements that confirmed the above conclusions, with most disagreements coming from the difficulty of establishing the limits between labels (normally 3/2, 3/1/1 and 2/2/1 splits) and from plain errors in the annotation (most of them 4/1).

Regarding the balance of the different categories in the datasets, we found strongly underrepresented labels such as “kid”, with approximately \(1\%\) across datasets, KITTI (\(2.71\%\)) being the highest; “dark skin”, with some significant representation only in BDD100K (\(11.84\%\)) and nuImages (\(18.01\%\)), both recorded in the USA; “personal mobility devices” and “wheelchairs”, with an average of less than \(1\%\); and “buses” and “motorcycles”, with a small representation in nuImages (\(\sim 4\%\)). In general, gender is biased towards “male” (\(50-60\%\) vs \(20-35\%\)), but with a smaller gap in the datasets where the percentage of “unknown” is higher, which might indicate that the closer the appearance is to female, the higher the probability that an annotator will choose “unclear”. Looking at the vehicles, “car” is clearly predominant, especially in BDD100K and KITTI with more than \(85\%\), while in nuImages there is a fair amount of “truck” (\(\sim 10\%\)) and “unknown” (\(\sim 11\%\)), which reduces the “car” labels to \(63.28\%\). Among the samples labelled as “car”, the predominant “car type” is “large” in BDD100K and nuImages, with a fair representation of “medium” and “small” only in KITTI. Finally, the distribution of vehicle colour is fairly uniform across datasets, with “black”, “white” and “grey” being the most common (\(\sim 20\%\) each).

Although identifying bias in training datasets is arguably the most crucial step when addressing algorithmic fairness, it is only a first step that paves the way for further research in this field. Our ongoing and future work focuses on evaluating publicly available pre-trained models for object detection and person/vehicle detection and prediction with respect to the annotated equity-relevant attributes across datasets. The goal is to investigate potential performance differences depending on the type of agent. Additionally, we plan to examine the sources of biased performance and explore various strategies to mitigate such biases and guarantee fairness. We hope that our proposed annotation methodology and tool, as well as the publicly available annotated attributes, will contribute to the integration of fairness metrics in future evaluations of perception and prediction systems in the autonomous driving domain.

Availability of data and materials

This paper follows reproducibility principles. The annotation tool, scripts and the annotated attributes can be accessed publicly at https://github.com/ec-jrc/humaint_annotator.

Notes

  1. While the potential bias in datasets from other data modalities, such as range-based sensors (e.g., LiDAR, radar) or the infrared spectrum, may also be significant, their analysis is beyond the scope of this paper.

  2. We acknowledge that gender is not a binary construct and that an individual’s gender identity may not align with their perceived or intended gender presentation. However, for the purposes of this study, we instruct annotators to categorize based on predominantly feminine or masculine presentations even though these may not align with the individual’s self-identified gender.

References

  1. Fernández-Llorca D, Gómez E. Trustworthy artificial intelligence requirements in the autonomous driving domain. Computer. 2023;56(2):29–39.


  2. Padmaja B, Moorthy C, Venkateswarulu N, Madhu-Bala M. Exploration of issues, challenges and latest developments in autonomous cars. J Big Data. 2023;10(61):1–24.


  3. Hardt M, Price E, Srebro N. Equality of opportunity in supervised learning, In: Proceedings of the 30th International Conference on Neural Information Processing Systems (NIPS'16), 2016:3323–3331.

  4. Tolan S. Fair and unbiased algorithmic decision making: Current state and future challenges, Digital Economy Working Paper - JRC Technical Reports, 2019;JRC113750.

  5. Fernández Llorca D, Gómez E. Trustworthy autonomous vehicles, EUR 30942 EN, Publications Office of the European Union, Luxembourg, 2021;JRC127051.

  6. Liu Z, Miao Z, Zhan X, Wang J, Gong B, Yu S. X. Large-scale long-tailed recognition in an open world, In: Computer Vision and Pattern Recognition, 2019:2537–2546.

  7. Sánchez HC, Parra NH, Alonso IP, Nebot E, Fernández-Llorca D. Are we ready for accurate and unbiased fine-grained vehicle classification in realistic environments? IEEE Access. 2021;9:116338–116355.


  8. Pessach D, Shmueli E. Algorithmic fairness, arXiv preprint arXiv:2001.09784, 2020.

  9. Yang K, Qinami K, Fei-Fei L, Deng J, Russakovsky O. Towards fairer datasets: filtering and balancing the distribution of the people subtree in the imagenet hierarchy, In: Conference on fairness, accountability, and transparency (FAT), 2020;547-558.

  10. Wilson B, Hoffman B, Morgenstern J. Predictive inequity in object detection, In: Workshop on fairness accountability transparency and ethics in computer vision at CVPR, 2019.

  11. Brandao M. Age and gender bias in pedestrian detection algorithms, In: Workshop on fairness accountability transparency and ethics in computer vision at CVPR, 2019.

  12. Li X, Chen Z, Zhang J M, Sarro F, Zhang Y, Liu X. Bias behind the wheel: Fairness analysis of autonomous driving systems, arXiv:2308.02935, 2024.

  13. Fabbrizzi S, Papadopoulos S, Ntoutsi E, Kompatsiaris I. A survey on bias in visual datasets. Comput Vision Image Underst. 2022;223: 103552.


  14. Mitchell M, Wu S, Zaldivar A, Barnes P, Vasserman L, Hutchinson B, Spitzer E, Raji ID, Gebru T. Model cards for model reporting, In: Proceedings of the conference on fairness, accountability, and transparency, ser. FAT* ’19, 2019:220-229.

  15. Gebru T, Morgenstern J, Vecchione B, Vaughan JW, Wallach H, Daumé III H, Crawford K. Datasheets for datasets, arXiv:1803.09010, 2021.

  16. Schumann C, Ricco S, Prabhu U, Ferrari V, Pantofaru C. A step toward more inclusive people annotations for fairness, In: Proceedings of the 2021 AAAI/ACM conference on AI, ethics, and society, ser. AIES ’21, 2021:916-925.

  17. Dasiopoulou S, Giannakidou E, Litos G, Malasioti P, Kompatsiaris Y. A survey of semantic image and video annotation tools, In: Knowledge-Driven multimedia information extraction and ontology evolution: bridging the semantic gap, 2011:196–239.

  18. Pande B, Padamwar K, Bhattacharya S, Roshan S, Bhamare M. A review of image annotation tools for object detection. Int Conf Appl Artif Intell Comput (ICAAIC). 2022;2022:976–82.


  19. Fabris A, Messina S, Silvello G, Susto GA. Algorithmic fairness datasets: the story so far. Data Min Knowl Discov. 2022;36:2074–152.


  20. Buolamwini J, Gebru T. Gender shades: Intersectional accuracy disparities in commercial gender classification, Proceedings of machine learning research. conference on fairness, accountability, and transparency, 2018;81:1-15.

  21. Ryu HJ, Adam H, Mitchell M. Inclusivefacenet: Improving face attribute detection with race and gender diversity, In: 2018 workshop on fairness, accountability, and transparency in machine learning (FAT/ML 2018), 2018.

  22. Rhue L. Racial influence on automated perceptions of emotions, Social Science Research Network (SSRN), 2018.

  23. Xu T, White J, Kalkan S, Gunes H. Investigating bias and fairness in facial expression recognition, in Computer Vision - ECCV 2020. Workshops. 2020;506–23.

  24. Choi J, Gao C, Messou CEJ, Huang J-B. Why can’t i dance in the mall? learning to mitigate scene bias in action recognition, In: 33rd conference on neural information processing systems (NeurIPS), 2019.

  25. Hendricks LA, Burns K, Saenko K, Darrell T, Rohrbach A. Women also snowboard: Overcoming bias in captioning models, In: European conference on computer vision (ECCV), 2018:793–811.

  26. Kärkkäinen K, Joo J. Fairface: Face attribute dataset for balanced race, gender, and age for bias measurement and mitigation, In: IEEE/CVF winter conference on applications of computer vision (WACV), 2021;1548–1558.

  27. Merler M, Ratha N, Feris RS, Smith JR. Diversity in faces, arXiv:1901.10436, 2019.

  28. Georgopoulos M, Panagakis Y, Pantic M. Investigating bias in deep face analysis: the kanface dataset and empirical study. Image Vision Comput. 2020;102: 103954.


  29. Hazirbas C, Bitton J, Dolhansky B, Pan J, Gordo A, Ferrer CC. Towards measuring fairness in ai: the casual conversations dataset. IEEE Trans Biom Behav Ident Sci. 2022;4(3):324–32.


  30. Amazon, Sagemaker clarify, 2023. https://aws.amazon.com/sagemaker/clarify/. Accessed 26 Aug 2024.

  31. Google, Know your data, 2023. https://knowyourdata.withgoogle.com/. Accessed 26 Aug 2024.

  32. Wang A, Liu A, Zhang R, Kleiman A, Kim L, Zhao D, Shirai I, Narayanan A, Russakovsky O. Revise: a tool for measuring and mitigating bias in visual datasets. Int J Comput Vis. 2022;130:1790–810.


  33. Geiger A, Lenz P, Urtasun R. Are we ready for autonomous driving? The KITTI vision benchmark suite, In: 2012 IEEE conference on computer vision and pattern recognition. 2012;3354–61.

  34. Li X, Flohr F, Yang Y, Xiong H, Braun M, Pan S, Li K, Gavrila DM. A new benchmark for vision-based cyclist detection, In: 2016 IEEE intelligent vehicles symposium (IV). 2016;1028–33.

  35. Zhang S, Benenson R, Schiele B. CityPersons: A Diverse Dataset for Pedestrian Detection, In: 2017 IEEE conference on computer vision and pattern recognition (CVPR). 2017;4457–65.

  36. Braun M, Krebs S, Flohr F, Gavrila DM. EuroCity persons: a novel benchmark for person detection in traffic scenes. IEEE Trans Pattern Anal Mach Intell. 2019;41(8):1844–61.


  37. Yu F, Chen H, Wang X, Xian W, Chen Y, Liu F, Madhavan V, Darrell T. BDD100K: A Diverse Driving Dataset for Heterogeneous Multitask Learning, In: 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). 2020;2633–42.

  38. Motional, Nuimages, 2020. https://www.nuscenes.org/nuimages. Accessed 26 Aug 2024.

  39. Cordts M, Omran M, Ramos S, Rehfeld T, Enzweiler M, Benenson R, Franke U, Roth S, Schiele B. The cityscapes dataset for semantic urban scene understanding, In: 2016 IEEE conference on computer vision and pattern recognition (CVPR). 2016;3213–23.

  40. Caesar H, Bankiti V, Lang AH, Vora S, Liong VE, Xu Q, Krishnan A, Pan Y, Baldan G, Beijbom O. nuScenes: A Multimodal dataset for autonomous driving, In: 2020 IEEE/CVF conference on computer vision and pattern recognition (CVPR), 2020;11 618–11 628.

  41. Vijayanarasimhan S, Grauman K. What’s it going to cost you?: Predicting effort vs. informativeness for multi-label image annotations, In: IEEE conference on computer vision and pattern recognition, 2009;2262–2269.

  42. ten Hove D. Interrater reliability for incomplete and dependent data, PhD thesis, Universiteit van Amsterdam, 2023.

  43. Gonzalo RI, Maldonado CS, Ruiz JA, Alonso IP, Llorca DF, Sotelo MA. Testing predictive automated driving systems: lessons learned and future recommendations. IEEE Intell Transp Syst Mag. 2022;14(6):77–93.


  44. Escobar DA, Cardona S, Hernández-Pulgarin G. Risky pedestrian behaviour and its relationship with road infrastructure and age group: an observational analysis. Saf Sci. 2021;143: 105418.


  45. Parra Alonso I, Fernandez Llorca D, Sotelo MA, Bergasa LM, Revenga de Toro P, Nuevo J, Ocana M, Garcia Garrido MA. Combination of feature extraction methods for svm pedestrian detection. IEEE Trans Intell Transp Syst. 2007;8(2):292–307.


  46. Zeedyk M, Kelly L. Behavioural observations of adult-child pairs at pedestrian crossings. Accid Anal Prev. 2003;35(5):771–6.


  47. Lord S, Cloutier M-S, Garnier B, Christoforou Z. Crossing road intersections in old age-with or without risks? perceptions of risk and crossing behaviours among the elderly. Transp Res Part F: Traffic Psychol Behav. 2018;55:282–96.


  48. Wang X, Zheng S, Yang R, Zheng A, Chen Z, Tang J, Luo B. Pedestrian attribute recognition: a survey. Pattern Recognit. 2022;121:108220.


  49. Wells JC. Sexual dimorphism of body composition. Best Pract Res Clin Endocrinol Metab. 2007;21(3):415–30.


  50. Holland C, Hill R. Gender differences in factors predicting unsafe crossing decisions in adult pedestrians across the lifespan: a simulation study. Accid Anal Prev. 2010;42(4):1097–106.


  51. Fitzpatrick TB. Soleil et peau. J Med Esthet. 1975;2:33–4.


  52. Zhang S, Wen L, Bian X, Lei Z, Li SZ. Occlusion-aware r-cnn: detecting pedestrians in a crowd. Comput Vision - ECCV. 2018;2018:657–74.


  53. Rasouli A, Tsotsos JK. Autonomous vehicles that interact with pedestrians: a survey of theory and practice. IEEE Trans Intell Transpo Syst. 2020;21(3):900–18.


  54. Rosenbloom T. Crossing at a red light: behaviour of individuals and groups. Transp Res Part F: Traffic Psychol Behav. 2009;12(5):389–94.


  55. Laverdet C, Malola P, Meyer T, Delhomme P. Electric personal mobility device driver behaviors, their antecedents and consequences: a narrative review. J Saf Res. 2023. https://doi.org/10.1016/j.jsr.2023.07.006.


  56. Tsai L-W, Hsieh J-W, Fan K-C. Vehicle detection using normalized color and edge map. IEEE Trans Image Process. 2007;16(3):850–64.


  57. Chen P, Bai X, Liu W. Vehicle color recognition on urban road by feature context. IEEE Trans Intell Transp Syst. 2014;15(5):2340–6.


  58. Newstead S, D’Elia A. Does vehicle colour influence crash risk? Saf Sci. 2010;48(10):1327–38.


  59. Paullada A, Raji ID, Bender EM, Denton E, Hanna A. Data and its (dis)contents: a survey of dataset development and use in machine learning research. Patterns. 2021;2(11):100336.


  60. Northcutt C G, Athalye A, Mueller J. Pervasive label errors in test sets destabilize machine learning benchmarks, In: 35th conference on neural information processing systems (NeurIPS), 2021.

  61. Rädsch T, Reinke A, Weru V, Tizabi MD, Schreck N, Kavur AE, Pekdemir B, Roß T, Kopp-Schneider A, Maier-Hein L. Labelling instructions matter in biomedical image analysis. Nat Mach Intell. 2023;5(3):273–83.


  62. de Bruijn L. Inter-Annotator Agreement (IAA), 2020. https://towardsdatascience.com/inter-annotator-agreement-2f46c6d37bf3. Accessed 29 Oct 2023.

  63. Landis JR, Koch GG. The measurement of observer agreement for categorical data. Biometrics. 1977;33(1):159–74.



Acknowledgements

The authors would like to thank all those who participated in the annotation process.

Funding

This work was mainly funded by the HUMAINT project by the Directorate-General Joint Research Centre of the European Commission. It was also partially funded by Research Grants PID2020-114924RB-I00 and PDC2021-121324-I00 (Spanish Ministry of Science and Innovation).

Author information

Authors and Affiliations

Authors

Contributions

Conceptualisation: DFL, EG; Methodology: DFL, PF; Formal analysis and investigation: All authors; Annotation process: IP, RI; Writing—original draft preparation: DFL; Writing—review and editing: All authors; Funding acquisition: DFL, EG. All authors read and approved the final manuscript.

Corresponding author

Correspondence to David Fernández Llorca.

Ethics declarations

Ethics approval and consent to participate

Not applicable.

Consent for publication

Not applicable.

Competing interests

The authors declare no competing interests. The views expressed in this article are purely those of the authors and may not, under any circumstances, be regarded as an official position of the European Commission.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Appendices

Appendix 1

Annotation session guidelines

To ensure the quality of the annotation, please follow these guidelines:

  • Plan the annotation sessions in advance. It’s better to have short, regular sessions rather than long, asynchronous ones.

  • Set realistic, not ambitious, annotation objectives for each session, and stop once you reach them. Since the number of agents isn’t always immediately accessible, consider setting time-based objectives instead.

  • Before beginning the annotation process, prepare your table and PC. Ensure that the lighting is uniform and that you’re sitting in a comfortable position.

Appendix 2

Persons annotation guidelines

See Table 8.

Table 8 General persons annotation guidelines

Appendix 3

Vehicle annotation guidelines

See Table 9.

Table 9 General Vehicles Annotation Guidelines

Appendix 4

Standardised data format

In order to keep track of every aspect of the image, the first level of the dictionary contains image metadata, including the elements listed in Fig. 14. We include an identifier or nickname of the annotator, as well as a flag that allows the image to be discarded due to an error in the original labelling of the image (all datasets contain a small number of mislabeled instances). Each agent is uniquely identified by an agent image id and a unique identifier (uuid). We also include the bounding box coordinates and the identity (main label). We then record all attributes for each agent and whether or not it has any sub-entity. For the attributes, we differentiate between our own annotations and attributes that were already provided by the dataset itself (sandbox tags). Sub-entities constitute agents by themselves, but depend on other agents. For example, in certain datasets we might have information about vehicles and their drivers: the driver would be the main agent, and the vehicle the sub-entity of that agent. Figure 15 depicts the format of the data for the agents. Finally, we add the new annotated attributes. An example of the attributes of the person type agent is shown in Fig. 16, and an illustrative sketch of the overall structure is given after Fig. 16.

Fig. 14 Image metadata format

Fig. 15 Format of the generic attributes for each agent

Fig. 16 Proposed format for the additional annotated attributes for persons. The format for vehicle type agents is equivalent
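The following is a hedged, illustrative sketch of the per-image dictionary described above, written as a Python literal. The field names are placeholders chosen for readability and are assumptions; the exact keys are those defined in Figs. 14–16 and in the annotation files released in the repository.

```python
# Illustrative sketch of the standardised per-image annotation dictionary.
# Field names and values are placeholders, not the exact released schema.
example_image = {
    # Image-level metadata (cf. Fig. 14)
    "image_id": "000123",
    "dataset": "KITTI",
    "annotator": "rater_3",      # identifier or nickname of the annotator
    "discard": False,            # flag for errors in the original labelling
    "agents": [
        {
            # Generic per-agent attributes (cf. Fig. 15)
            "agent_image_id": 0,
            "uuid": "b1f6c3e2-0000-0000-0000-000000000000",
            "bbox": [712.4, 143.0, 810.7, 307.9],  # bounding box in pixels
            "identity": "person",                  # main label
            "sandbox_tags": {"occluded": True},    # attributes from the source dataset
            "sub_entities": [],                    # e.g. the vehicle of a driver
            # Newly annotated protected attributes (cf. Fig. 16)
            "attributes": {
                "age": "adult",
                "sex": "unclear",
                "skin_tone": "light",
                "group": False,
                "means_of_transport": "pedestrian",
            },
        }
    ],
}
```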

Appendix 5

Statistical annotation result

See Tables 10 and 11.

Table 10 Percentage for each of the pedestrian labels in the different datasets
Table 11 Percentage for vehicle labels in the different datasets

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.


About this article


Cite this article

Fernández Llorca, D., Frau, P., Parra, I. et al. Attribute annotation and bias evaluation in visual datasets for autonomous driving. J Big Data 11, 137 (2024). https://doi.org/10.1186/s40537-024-00976-9


Keywords