A systematic scrutiny of artificial intelligence-based air pollution prediction techniques, challenges, and viable solutions

Malhotra, Meenakshi; Walia, Savita; Lin, Chia-Chen; Aulakh, Inderdeep Kaur; Agarwal, Saurabh

doi:10.1186/s40537-024-01002-8

Survey
Open access
Published: 09 October 2024

Journal of Big Data volume 11, Article number: 142 (2024)

Abstract

Clean air is an essential human need, and inhaling polluted air poses a significant health risk. Air pollution is one of the most severe hazards to human health, and appropriate precautions should be taken to monitor air quality and anticipate it in advance. Air quality in India, in particular, is deteriorating daily, which is a matter of concern for health authorities. Many studies use machine learning and deep learning methods to predict atmospheric pollutant levels, prioritizing accuracy over interpretability. The sheer number of studies can leave researchers and readers unsure how to proceed with further research. This paper details the air pollutants considered and briefly describes the techniques used, their advantages, and the challenges faced during pollutant prediction, giving readers a better understanding of the techniques before they begin research on air pollutant prediction. It poses several research questions on air pollution that motivated the study and discusses various machine learning methods, deep learning methods, and optimization techniques. The paper concludes that, beyond the techniques discussed, more datasets, better learning techniques, and a variety of suggestions would enhance interpretability while maintaining high accuracy for air pollution prediction. The review also aims to show how the family of neural network algorithms has helped researchers across the globe predict air pollutants.

Introduction

Pollution is a significant issue, affecting water, air, noise levels, land, and every other aspect of our environment. Every human being seeks a healthy life, which requires clean water, fresh air, peaceful days, and eco-friendly surroundings. But imagine: what if all these vital resources were contaminated? Drinking water becomes laced with chemicals, breathing becomes difficult due to toxins in the air, and the surroundings turn noisy and chaotic. How difficult would life become? Pollution in any form is a concern. However, water and air pollution are the most alarming of all.

With the rapid increase in the worldwide population, the demand for necessities like food, land, and water has also peaked. China and India top the list of the most populous countries. Still, according to various online sources, the population of China is moderately high compared with others [1]. The rate of population growth directly affects the environment, and air pollution has emerged as a significant challenge in recent years. India ranks among the most populous countries, with 1.40 billion people [2], accounting for 17.71% of the global population. In the past few years, the birth and death rates, and especially the death rate due to pollution, have increased drastically, as shown in Table 1.

Table 1 Birth rate and death rate over the last eight years in India

Air pollutants

Air pollution has reached crisis levels in most major cities, necessitating air quality predictions. The variety of temporal sequence information, combining one-dimensional and multidimensional panels, makes air quality prediction challenging. Before delving into the details of pollutant prediction, it is essential to identify the types of pollutants that exist and understand how they react to various situations (Table 2).

Table 2 Air pollutants details

a. Primary Pollutants: These pollutants are emitted directly into the air. Automobiles, power plants, biomass burning, forest fires, volcanoes, and other sources all contribute to them. In transportation, gasoline and diesel are the key sources of air pollution. Industries produce primary pollutants such as NO₂, CO, SOx, PM, mercury, and volatile organic compounds (VOCs) in large quantities.
b. Secondary Pollutants: These form when primary pollutants react with each other or with molecules in the atmosphere, rather than being emitted directly from sources. O₃, sulfuric acid, and nitric acid are secondary pollutants, as are peroxyacyl nitrates (PANs), smog, PM, and NO₂.
c. Airborne Pollutants: Pollutants such as persistent organic pollutants (POPs), CO, sulfur oxides, and nitrogen oxides exist in the atmosphere. They can be present in various forms: as volatile gases like ozone and benzene, as droplets such as sulfuric acid and nitrogen dioxide, or as particulate matter including diesel exhaust and aromatic hydrocarbons.
d. Aerosol Pollutants: Fog, dust, forest exudates, and geyser steam are naturally occurring aerosols. Most aerosol particles are microscopic, comprising only a few molecules, and are invisible; some are visible but still very small. Haze, fine particulate matter, and smoke are examples of anthropogenic aerosols. Aerosols have an indirect or negligible influence on greenhouse gases but can cause cooling: they scatter light and alter the reflectance of the earth's surface. Common aerosols include sea salt, dust, and volcanic ash.
e. Particulate Matter: It can be caused by both human and natural factors. Human activities that create PM include raising road dust into the air, agriculture (stubble burning, woodland burning, etc.), industrial processes, and fossil fuel combustion. Naturally, wind-blown dust and wildfires contribute to the production of PM. Particles with diameters of 10 micrometers or less (PM₁₀) or 2.5 micrometers or less (PM₂.₅) are produced by forest fires, volcanoes, dust storms, and human activities such as stubble burning and the burning of fossil fuels in vehicles and power stations, as well as by road dust.
f. Meteorological Data: These data describe weather patterns. They help locate pollution sources and predict days with high pollutant concentrations. Computational models also contribute to the estimation of air quality. Weather data can include, for example, solar radiation, temperature, humidity, wind direction, and wind speed.
g. Air Quality Index: It indicates the cleanliness of the air and its impact on public health. According to the Indian Government (CPCB), calculating the AQI requires data for at least three air pollutants, one of which must be PM₁₀ or PM₂.₅ (both may be present). The index is used to protect public health. The AQI scale ranges from 0 to 500; a higher number denotes more polluted air and a greater danger to health, and the impact on people's health can be identified from the AQI level (Table 3).
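
The CPCB eligibility rule described above can be sketched in a few lines of Python. This is a minimal illustration, assuming the sub-index of each pollutant has already been computed from its concentration; the function name `overall_aqi`, the dictionary keys, and the sample values are hypothetical. Taking the largest sub-index as the overall AQI follows the CPCB convention as commonly described and is an assumption here, since this section does not state it.

```python
from typing import Dict, Optional

# Hypothetical pollutant labels; at least one of these must be reported.
PM_KEYS = {"PM10", "PM2.5"}


def overall_aqi(sub_indices: Dict[str, Optional[float]]) -> Optional[float]:
    """Return the overall AQI, or None if the eligibility rule is not met.

    sub_indices maps a pollutant name to its sub-index (0 to 500), with
    None marking a pollutant that was not measured.
    """
    available = {k: v for k, v in sub_indices.items() if v is not None}
    # Rule 1: at least three pollutants must be available.
    if len(available) < 3:
        return None
    # Rule 2: at least one of them must be PM10 or PM2.5.
    if PM_KEYS.isdisjoint(available):
        return None
    # Assumption: the overall AQI is the worst (largest) sub-index.
    return max(available.values())


if __name__ == "__main__":
    # Illustrative sub-index values, not measured data.
    print(overall_aqi({"PM2.5": 180.0, "NO2": 60.0, "O3": 45.0}))
```

Returning `None` rather than a number when the rule fails keeps an ineligible reading from being mistaken for clean air.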

Table 3 Air pollutants range as per AQI levels in µg/m³

Pollution affects three broad areas

a. Pollution Affects Ecosystem: Acid rain is caused mainly by air pollution, and it has the most significant impact on plants that rely on rainwater. The most dangerous pollutants for plants include ammonia, nitrogen dioxide, sulfur dioxide, and ozone, which can kill entire flora.

i. Sulfur: Excessive sulfur levels in lakes and streams can harm plants and forest soil.
ii. Nitrogen: When deposited on surface water, it may affect aquatic vegetation and species.
iii. Ozone: It damages tree leaves and degrades scenic views in protected natural areas.
iv. Mercury: It is a heavy metal that is released into the environment and deposited on land and water, accumulating in plants and animals, some of which people consume directly or indirectly.
v. Particulate Matter: When particle pollution levels are high in one place, they can seriously affect forests, wildlife, and coastal areas. Large areas of dead trees are common in PM-affected ecosystems because when groundwater becomes too acidic, critical minerals drain out of the soil, preventing plants from growing.

b. Pollution Affects Public Health: Air pollution, whether indoor or outdoor, impacts people's lives. Particulate matter, sulfur dioxide, nitrogen dioxide, and ozone are the pollutants for which there is the strongest evidence of harm to public health. The health risks associated with particulate matter smaller than 10 and 2.5 micrometers in diameter are detailed in Table 4. PM can enter the lungs and circulatory system, causing cardiovascular, cerebrovascular, and respiratory consequences [8, 9]. Particulate matter (PM), methane, carbon monoxide, polyaromatic hydrocarbons (PAH), and volatile organic compounds (VOC) are produced when wood and coal are burned in inefficient stoves or open hearths. When kerosene burns in classic wick lamps, many fine particles and other contaminants are created [10]. Indoor air pollution, which significantly influences health issues, is caused by vapors from building materials, paints, and furniture, and by cooking, heating, and tobacco smoking [11, 12]. A study of indoor contaminants discovered that 80% of particle mass concentrations were created [13]. Office buildings, schools, houses, cafeterias, and restaurants are examples of indoor locations. Most pollutants had substantially greater concentrations indoors than outdoors, and new regulations and recommendations should be created to cope with PM concentrations in kindergartens and schools [14]. Because there is always a link between indoor and outdoor air quality, it is critical to keep track of the pollutants on any given day or at any given moment [15]. NO₂, PM₂.₅, and PM₁₀ concentrations in classrooms and outdoors have distinct compositions, sources, and contributions [16]. Outdoor air quality, as well as other factors such as the kind of windows, has a significant impact on indoor PM concentrations [17]. Seasons affect indoor and outdoor pollutants; for example, during the rainy season, the indoor/outdoor ratio of PM₂.₅, PM₁₀, and TVOC increases [18]. Researchers worldwide are now conducting studies, from small to large in scale, to learn more about air pollutants [19,20,21].
c. Pollution Affects Sustainable Development: Sustainable development satisfies current requirements without compromising future generations' ability to meet their own needs. It balances economic progress, environmental preservation, and human well-being. Many organizations are developing sustainable development goals to protect the environment and restore ecological balance, since environmental contamination seriously threatens human growth and sustainable development.

Table 4 AQI Range, Color, and their health impact

Research questions (RQs)

When discussing air pollution, several questions commonly arise. Some of these questions include:

RQ1: How does air pollution affect different aspects of society?
RQ2: How does one city's pollution level affect the neighboring city's AQI?
RQ3: What methods are used to analyze future air quality?
RQ4: What steps are the government or organizations taking to deal with the air pollution problem?

The first question that springs to mind is how air pollution influences different aspects of life: does it affect only a few areas, or society as a whole? This topic was addressed in the Introduction section. The second question asks how other cities or places are affected when one city suffers from air pollution. Since pollutant particles have some (very small) weight, do they travel? The answer lies in the correlation between air pollutants and meteorological variables, which may be positive or negative depending on the values of the climatic variables. This relationship changes with the season (spring, summer, autumn, or winter). Temperature has a negative relationship with PM₂.₅, whereas precipitation has a positive relationship, and wind transports pollutants to neighboring areas at a rate that depends on the wind speed [22]. The pre-monsoon, monsoon, and post-monsoon periods all affect the correlation, and different geographies can also produce different correlation results [23]. For winter, wind speed (WS), air temperature (AT), relative humidity (RH), and solar radiation were chosen to determine the relationship; WS and AT were the most prevalent meteorological components, while the highest concentrations of SO₂, NO₂, NO, and NOx were found in December, January, and February [24]. Seasonal and geographical characteristics are the two most challenging aspects of determining this correlation [25], and both geographical location and seasonal change influence the relationship [26]. Temperature is inversely related to PM₂.₅ and PM₁₀ levels and to humidity for most of the year, but the relationship becomes positive during the monsoon [27].

The third research question focuses on the methodologies used to analyze air quality or anticipate air pollution levels and on the obstacles researchers encountered during pollutant prediction. The fourth examines whether the government or responsible authorities have taken action to address poor air quality. The literature review section explores these issues, and the Need for Prediction section seeks potential answers.
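
The season-dependent relationship between a meteorological variable and a pollutant can be examined with a Pearson coefficient computed separately for each season. The following is a minimal sketch in plain Python; the record layout, the function names `pearson` and `correlation_by_season`, and all sample values are illustrative assumptions and are not data from the cited studies.

```python
from collections import defaultdict
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences.

    Raises ZeroDivisionError if either sequence is constant.
    """
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


def correlation_by_season(records, met_key, pollutant_key):
    """Group records by their 'season' field and correlate within each group."""
    groups = defaultdict(lambda: ([], []))
    for rec in records:
        xs, ys = groups[rec["season"]]
        xs.append(rec[met_key])
        ys.append(rec[pollutant_key])
    return {season: pearson(xs, ys) for season, (xs, ys) in groups.items()}


# Made-up observations for illustration only.
records = [
    {"season": "winter", "temp": 8.0, "pm25": 160.0},
    {"season": "winter", "temp": 12.0, "pm25": 130.0},
    {"season": "winter", "temp": 15.0, "pm25": 115.0},
    {"season": "monsoon", "temp": 27.0, "pm25": 35.0},
    {"season": "monsoon", "temp": 29.0, "pm25": 42.0},
    {"season": "monsoon", "temp": 31.0, "pm25": 46.0},
]
result = correlation_by_season(records, "temp", "pm25")
```

With these invented values the winter coefficient is negative and the monsoon coefficient is positive, which mirrors the sign change reported above; real data would need far more observations per season before the sign could be trusted.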

The main contributions of the paper are as follows:

a. The paper starts with a systematic review, thoroughly examining existing literature to gather insights on the causes, health impacts, and recent advancements in air pollution prediction. This approach ensures that the paper is grounded in existing knowledge and can provide a holistic view of the subject matter.
b. The paper aims to illuminate the various factors crucial for accurately predicting air pollution levels, which helps identify the key variables and methodologies that contribute to effective forecasting. By detailing aspects such as datasets, data sources, and the artificial intelligence methods used for prediction, the paper equips future researchers with essential information to navigate this complex field.
c. By discussing the initiatives undertaken by the Indian government to address air pollution, the paper contextualizes the research within real-world efforts to mitigate its adverse effects.
d. The paper serves as a valuable resource for researchers and policymakers, offering perspectives on the complexities of air pollution prediction and advocating for continued efforts toward its effective management.

Many machine learning and deep learning methods used to predict air pollutant concentrations have struggled to manage data uncertainty and to address the spatial-temporal aspect of the dataset, which has negatively affected the performance of the prediction models. Current ANN models have improved performance by using hybrid methods and optimized parameters, which results in more accurate predictions while maintaining high precision.

The article is structured as follows: the Literature review on air quality prediction section reviews recent studies on single and multiple air pollutants using machine and deep learning techniques. The Need for Prediction section addresses the current need for air pollutant prediction and outlines the measures being taken by the Indian Government to address this critical issue. Finally, the last section concludes with future perspectives.

Literature review on air quality prediction

Data collection is the first and most essential part of analysis and prediction. For air pollution prediction, the data can be traffic data (fuel consumption, vehicle age, emission rate, and traffic volume), environmental data (pollutant concentration, humidity, wind speed, wind direction, etc.), geographical data (longitude, latitude, land use, etc.), and socioeconomic data (population density and rate of urbanization). The collected data may contain outliers, missing values, and similar defects.

Data pre-processing prepares the data before it is fed to the model. This step refines the raw data into usable information by eliminating the outliers, missing data, erroneous data, irrelevant data, and inconsistent data carried over from the data collection step; it is also known as the data cleaning step. Different techniques are available to process raw data, including normalization/scaling and interpolation methods for handling missing data. Feature selection can also be included in the pre-processing step to enable the model to train faster: only a subset of features that improves the model's accuracy is selected, so the outcome depends on choosing the proper subset.
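
Two of the pre-processing techniques named above, interpolation of missing values and normalization/scaling, can be sketched as follows. This is a minimal illustration in plain Python, assuming missing readings are marked with `None` and that linear interpolation and min-max scaling are acceptable choices; the function names are hypothetical, and the reviewed studies may have used other variants.

```python
def interpolate_missing(values):
    """Fill None gaps by linear interpolation between the nearest known points.

    Gaps at the start or end of the series are filled with the nearest
    known value, since there is nothing to interpolate towards.
    """
    known = [i for i, v in enumerate(values) if v is not None]
    if not known:
        raise ValueError("series has no observed values")
    filled = list(values)
    for i, v in enumerate(values):
        if v is not None:
            continue
        left = max((k for k in known if k < i), default=None)
        right = min((k for k in known if k > i), default=None)
        if left is None:
            filled[i] = values[right]
        elif right is None:
            filled[i] = values[left]
        else:
            weight = (i - left) / (right - left)
            filled[i] = values[left] + weight * (values[right] - values[left])
    return filled


def min_max_scale(values):
    """Rescale values linearly to the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant series carries no scale information.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


# Illustrative hourly readings with gaps (not real measurements).
raw = [10.0, None, 30.0, None, None, 60.0]
clean = min_max_scale(interpolate_missing(raw))
```

Interpolating before scaling matters: scaling first would require skipping the gaps when finding the minimum and maximum, and the filled values would then need rescaling anyway.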

Choosing the proper technique is one of the trickiest parts of research. Various techniques exist for real-world problems, and the choice between machine learning and deep learning depends on the research problem. Artificial intelligence is the superset of machine learning, and deep learning is a subset of machine learning. Whichever is used, the primary focus is always on achieving the highest accuracy or the best possible results from the model. Any machine learning or deep learning algorithm can be used, depending on the size of the dataset and the problem requirements. Deep learning includes several varieties, including neural networks (Convolutional Neural Networks, Recurrent Neural Networks, Back Propagation Neural Networks [28], Long Short-Term Memory Neural Networks, etc.), Stacked Autoencoders, Deep Boltzmann Machines, Deep Belief Networks, and others [29,30,31].

In stage 1 (Fig. 1), we conducted keyword searches in Google Scholar, IEEE Xplore, and ScienceDirect, which initially yielded a broad range of studies. In stage 2, the keyword search results were screened by title and abstract, and studies with insufficient focus on AI applications in air pollutant prediction were filtered out. In stage 3, we read the full texts thoroughly to identify the methodologies used and the challenges the researchers faced. In the last stage, a comprehensive list of studies was compiled for in-depth analysis. More details can be seen in Table 5.

Table 5 State-of-art literature review methodology

Both single and multiple air pollutants can be forecast using machine learning and deep learning approaches. The literature review below on air pollutant prediction is divided into three categories: single air pollutant prediction, multiple air pollutant prediction, and single and multiple air pollutant prediction using deep learning approaches. The essential points of the studies in each category are summarized in Tables 6, 7, and 8.

Table 6 Single air pollutant prediction techniques

Table 7 Multiple air pollutant prediction techniques

Table 8 Single and multiple air pollutants deep learning-based prediction techniques

Single air pollutant prediction

Single air pollutant prediction involves forecasting future concentrations of a specific pollutant using historical data and relevant environmental factors. Zhou et al. [32] first used the Kendall tau coefficient to identify essential spatiotemporal features from regional meteorological and air quality data. Second, Multi-Task Learning (MTL) was used to train a Multi-Output Support Vector Machine (M-SVM) model to detect nonlinear interactions and share correlation information across tasks. Finally, the M-SVM model was evaluated using PM₂.₅ concentration together with meteorological and air quality information. Spatial and temporal correlation was also considered so that a correlated station could compensate for a missing value at another station. The conclusion focused on addressing the robustness and uncertainty of the discussed models in the future. The Auto-Regressive Integrated Moving Average with Explanatory Variable-Multi-Layer Perceptron (ARIMAX-MLP) approach [33] was employed to train the model using expected values, achieving high accuracy. One or two hidden layers were used depending on the input, with up to twenty hidden neurons in each layer. Four different training algorithms were examined with a population size of 50, and the stopping criterion was 30 generations. The authors concluded that uncertainty affects the results and that the model can predict the data 10 days in advance. Cabaneros et al. [34] collected data from two monitoring sites and processed it using a nonlinear activation function. Stepwise regression, Principal Component Analysis (PCA), and Classification and Regression Trees (CART) were used for feature selection. The network was trained using the Levenberg-Marquardt backpropagation technique, which adjusted the weights of the interconnected neurons. Gradient descent was used as a nonlinear optimization method. The process was repeated three times to account for the random initialization of weights across neurons.
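
The use of the Kendall tau coefficient for feature identification, as attributed to Zhou et al. [32] above, can be illustrated with a small ranking routine. This is a sketch under stated assumptions, not the authors' implementation: it computes the simple tau-a variant (ties count as neither concordant nor discordant), the function names `kendall_tau` and `rank_features` are hypothetical, and the sample values are invented.

```python
from itertools import combinations


def kendall_tau(xs, ys):
    """Kendall tau-a: (concordant - discordant) / total number of pairs."""
    concordant = discordant = 0
    for (x1, y1), (x2, y2) in combinations(zip(xs, ys), 2):
        sign = (x1 - x2) * (y1 - y2)
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    n = len(xs)
    return (concordant - discordant) / (n * (n - 1) / 2)


def rank_features(features, target):
    """Sort candidate features by the absolute value of their tau with the target."""
    scores = {name: kendall_tau(col, target) for name, col in features.items()}
    return sorted(scores.items(), key=lambda item: abs(item[1]), reverse=True)


# Invented daily values for illustration only.
pm25 = [80.0, 95.0, 60.0, 120.0, 70.0]
features = {
    "wind_speed": [3.0, 2.5, 4.0, 1.5, 3.5],
    "humidity": [40.0, 55.0, 50.0, 45.0, 60.0],
}
ranking = rank_features(features, pm25)
```

Ranking by absolute value keeps strongly negative relationships, such as wind speed dispersing particulates in this toy data, at the top of the list alongside strongly positive ones.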

To estimate the PM₂.₅ concentration at six stations in China, a hybrid-GARCH (Generalized Auto-Regressive Conditional Heteroskedasticity) model was considered [35]. An ARIMA model captured the linear part of the data, while an SVM dealt with the non-linearity. The data set was small, which led to an accuracy problem. An ANN approach and an ARIMAX model were used to predict the concentration of NO₂ [36]; the relationship between the NO₂ concentration and various meteorological variables and traffic pollutants was also considered in the prediction. The correlation between PM₂.₅ and various meteorological data was identified using the MLP method [37]. The wavelet transform of the time series was used to forecast the average daily concentrations of PM₂.₅ based on the concentration from the day before and the expected values of the day's meteorological data. The sigmoid function was used as the activation function. The model predicted PM₂.₅ values two days ahead. The novel aspect of the method was adding a trajectory-based geographic parameter to the ANN algorithm as an additional input predictor.

A scheme was proposed to estimate PM₁₀ concentrations for the subsequent 24 h using a combination of a multilayer perceptron (MLP) and a clustering algorithm [38]. The clustering algorithm explored the relationship between meteorological data and PM₁₀ pollutant data. An interpolation method was applied to handle missing values, and a log-sigmoid activation function was used. It was found that it would be challenging to add relevant information to the ANN's input patterns by identifying groups with similar data features and uncovering associations between them. A combination of three techniques was recommended [39]. The goal was to forecast precisely the geographical distribution of ozone across a sizable area while handling noisy input using the hidden neuron output information from the preceding iteration. Another scheme [40] used forecast weather, actual weather, and pollution data. The model incorporated all data, including those with inaccurate, chaotic, and unpredictable values. Weather forecasts were first sorted, and then certain meteorological conditions were processed into fuzzy sets, characterized using fuzzy numbers. A collection of pollutant concentrations was derived using fuzzy grouping, followed by the projection of an aerosanitary scenario using standardized approaches. To achieve the best results, predictors of comparable quality that operate independently were selected [41]. Four neural predictors were considered: MLP, RBFN, ERN, and SVM. The MLP employed the sigmoid activation function, the RBF networks used a local (Gaussian) function with a distinct learning technique, and the SVR used the kernel concept and a robust statistical learning method. Another study [42] used meteorological data as input to anticipate ozone concentrations in the atmosphere. The SVM algorithm was used to classify the data into specific categories. With sigmoid as the activation function, a BPNN combined with a genetic algorithm was employed to achieve higher forecast stability. It was noted that this approach might not perform well in cities with less traffic and more factories.

Challenges:

Dealing with the uncertainty that affects prediction performance.
Identifying the relationships between variables so that the model is fed suitable inputs for better prediction.
Lack of spatio-temporal consideration in the studies.
Dealing with sudden changes in input data.
Improving prediction accuracy over longer forecast horizons.
Reducing computational time relative to model complexity.

Because both natural climate changes and human activities are involved, air pollution data are inherently ambiguous. Better prediction requires an enhanced algorithm that can handle this uncertainty. To predict the quantity or concentration of air pollutants, it is crucial to understand how each pollutant behaves and whether the relationship between pollutants is positive, negative, or neutral. A deeper understanding of the links between pollutants helps determine which features should be added to the model. Forecast performance also depends on time and area: the atmospheric behavior of data gathered from different locations, such as the sea, mountains, and deserts, varies. Time series data such as air pollution data may fluctuate significantly over time, so a model that is not well trained may perform poorly. A sufficient amount of data is required for a prediction model to produce accurate results, and if this is not considered from the beginning, the system's processing time may increase.

Multiple air pollutants prediction

Multiple air pollutant prediction integrates advanced models and techniques to forecast contaminant concentrations accurately. Two models, the Multi-Agent Evolutionary Genetic Algorithm (MAEGA) and the Nonlinear Auto-Regressive Network with Exogenous Input (NARX) neural network, were combined and achieved good performance with minimal error [43]. To estimate pollutant concentrations, Empirical Wavelet Transformation (EWT) was used to decompose the time series data, followed by MAEGA to optimize the weights in a multi-step process. The improved NARX neural network then forecast the contaminants. A hybrid method combining differential evolution and random forest techniques was devised to forecast contaminants [44]. The differential evolution approach selected the candidate solution based on fitness value, and the random forest method created multiple trees to predict pollutants using the chosen candidate. Multiple linear regression (MLR) and ANN, two approaches with linear and nonlinear behavior respectively, were used to forecast pollutant concentrations [45]. A leave-one-out cross-validation process was employed to produce accurate prediction results. The dataset spanned 13 years, with 60% used for training, 15 to 20% for validation, and the final two years for testing. The hyperbolic tangent activation function was used for the input and hidden layers, while the output layer used a linear function. Seven air pollution variables, fifteen combinations of two parameters, and twenty combinations of three parameters were used to train the proposed algorithm and achieve reliable forecasting results [46]. It was claimed that selecting the right pollutant is crucial for the accuracy of the forecasting model. An SVM was used to forecast global and diffuse solar radiation (Rs and Rd), starting with no pollutants as input and incrementally adding one, two, and three pollutants. Fifty-nine potential predictors were analyzed to forecast various air contaminants [47]. Comparisons were made between air pollution data and meteorological factors, between industrial emissions data and the population density of the study region, and between topographical data and meteorological data. The preference for a Land Use Regression (LUR) model was demonstrated in the prediction process. An attempt was made to increase the accuracy of the proposed prediction system with the help of an independent predictor [48].
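
The leave-one-out cross-validation mentioned above for the MLR and ANN comparison [45] can be illustrated on a much simpler model. The sketch below assumes a single-predictor least-squares line in place of MLR or an ANN, purely to keep the example self-contained; the function names `fit_line` and `loocv_mae` and the sample values are hypothetical.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def loocv_mae(xs, ys):
    """Mean absolute error under leave-one-out cross-validation.

    Each observation is held out once; the model is refitted on the
    remaining points and scored on the held-out one.
    """
    errors = []
    for i in range(len(xs)):
        train_x = xs[:i] + xs[i + 1:]
        train_y = ys[:i] + ys[i + 1:]
        slope, intercept = fit_line(train_x, train_y)
        prediction = slope * xs[i] + intercept
        errors.append(abs(ys[i] - prediction))
    return sum(errors) / len(errors)


# Invented predictor/response pairs for illustration only.
traffic_volume = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
no2_level = [21.0, 24.5, 26.0, 31.0, 33.5, 35.0]
score = loocv_mae(traffic_volume, no2_level)
```

Because every observation serves as test data exactly once, the procedure suits small datasets, at the cost of refitting the model once per observation.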

PM₂.₅ was first predicted at 8 stations for the same day and the next day, followed by the prediction of ozone at two stations with 4 class variables. A Bayesian network was used to show that the multilabel classifier outperformed the independent approach, with improved AUC and success index. O₃, SO₂, CO, PM₁₀, and PM₂.₅ were inputs, with responses indicating air quality levels from excellent to dangerous [49]. A Fuzzy Inference System (FIS) method was presented for assessing air quality, prioritizing parameters by assigning greater weights to the more damaging ones. A membership function was implemented to cope with ambiguity, and the Analytic Hierarchy Process (AHP) was used to analyze the more problematic pollutants. Wavelet transformation with BPNN was performed on the input dataset to forecast air contaminants [50]. The relationship between air pollutants was considered since it could affect the accuracy of the suggested forecasting system. Stationary wavelet transformation was used to decompose the time series, and the wavelet coefficients were then used to adjust the weights in the BPNN model for future forecasting. Kumar and Pande [51] forecast air quality by analyzing six years of air pollution data from 23 Indian cities. The dataset was pre-processed, and feature selection was conducted using correlation analysis. Exploratory data analysis revealed hidden patterns and identified the key pollutants affecting the air quality index. A significant reduction in most pollutants was observed in 2020, attributed to the pandemic. Five machine learning models were used for prediction, with resampling techniques addressing data imbalance. Model outcomes, compared using standard metrics, showed the Gaussian Naive Bayes model with the highest accuracy and the Support Vector Machine model with the lowest. Performance evaluation indicated that the XGBoost model had the strongest correlation between predicted and actual data.
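
The decomposition step of the stationary wavelet approach described above [50] can be illustrated with a one-level undecimated Haar transform. This is a sketch under assumptions: the section does not name the wavelet family or the number of levels, so the Haar filter, the single level, the circular boundary handling, and the function names are all choices made here for illustration.

```python
from math import sqrt


def haar_swt_level1(series):
    """One-level undecimated (stationary-style) Haar transform.

    Returns approximation and detail coefficients, each the same length
    as the input. The series is treated as circular at the boundary.
    """
    n = len(series)
    approx = [(series[i] + series[(i + 1) % n]) / sqrt(2) for i in range(n)]
    detail = [(series[i] - series[(i + 1) % n]) / sqrt(2) for i in range(n)]
    return approx, detail


def haar_iswt_level1(approx, detail):
    """Invert haar_swt_level1: (a[i] + d[i]) / sqrt(2) recovers sample i."""
    return [(a + d) / sqrt(2) for a, d in zip(approx, detail)]


# Invented hourly concentrations for illustration only.
signal = [42.0, 45.0, 51.0, 48.0, 60.0, 58.0, 47.0, 44.0]
approx, detail = haar_swt_level1(signal)
restored = haar_iswt_level1(approx, detail)
```

The approximation coefficients carry the smooth trend and the detail coefficients carry the sample-to-sample fluctuation; a forecasting model can then be trained on each component, or on features derived from them, instead of on the raw series.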

Challenges:

Examining the predictive ability of more advanced ANN models.
Considering knowledge-based systems with improved computational time.
Analyzing more deeply how to optimize the inference for better assessment.
Analyzing the membership function more thoroughly.

The increasing amount of data necessitates advanced artificial neural networks (ANNs), such as deep learning models or hybrids of two advanced machine learning models, because conventional machine learning models perform poorly as data volume grows (particularly when both meteorological and air pollution data are considered). Knowledge-based systems are required to manage the processing time and performance demands of large volumes of data. The massive amount of data also necessitates parameter optimization for improved algorithm performance. A membership function transforms non-fuzzy inputs into fuzzy ones by assigning each input a value between 0 and 1: an element is regarded as belonging to a class if its value is closer to 1 and as not belonging to the class if its value is closer to 0.
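
The mapping from a non-fuzzy input to a membership degree between 0 and 1 can be made concrete with a triangular membership function. The triangular shape is only one common choice and is an assumption here, as are the function name and the breakpoints in the usage line; the reviewed studies do not specify which shape they used.

```python
def triangular_membership(x, a, b, c):
    """Degree of membership of x in a triangular fuzzy set.

    The set has zero membership at or outside a and c, full membership
    at the peak b, and changes linearly in between (requires a < b < c).
    """
    if not a < b < c:
        raise ValueError("breakpoints must satisfy a < b < c")
    if x <= a or x >= c:
        return 0.0
    if x <= b:
        return (x - a) / (b - a)
    return (c - x) / (c - b)


# Hypothetical fuzzy set "moderate PM2.5" spanning 30 to 90 with peak at 60.
degree = triangular_membership(45.0, 30.0, 60.0, 90.0)
```

A reading of 45 sits halfway up the rising edge of this hypothetical set, so it belongs to "moderate" only partially, which is exactly the graded membership the paragraph above describes.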

Single and multiple air pollutant prediction using deep learning approaches

Several methods have been used in the literature, ranging from statistical approaches to more recent developments in machine learning. Deep feedforward and recurrent neural networks are the most prevalent network topologies [52,53,54]. Deep learning has proven effective in discovering hidden relationships within complicated problems, and the more specialized recurrent neural network architecture has proven to be a helpful tool for time series prediction. Additionally, ensemble learning is advantageous since it is less susceptible to noise and variation. Researchers have opted for various deep learning methods for prediction [55,56,57]. Air quality data is continuous, so an effective and efficient prediction model is needed. Unlike traditional machine learning techniques, deep learning can learn long-term dependencies in the data. Thanks to the gates included in LSTM [58] and GRU, these techniques are now used worldwide for predicting single air pollutants, multiple air pollutants, or air quality indexes. Table 8 shows the deep learning techniques used by researchers for single or multiple air pollutant prediction.
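
The role of the gates in an LSTM can be seen in a single time step of the standard cell equations. The sketch below uses scalar inputs and states so that it runs in plain Python; the parameter names (`wf`, `uf`, `bf`, and so on) and the example values are hypothetical, and practical models use weight matrices learned by a deep learning library rather than hand-set scalars.

```python
from math import exp, tanh


def sigmoid(z):
    """Logistic function, squashing z into (0, 1)."""
    return 1.0 / (1.0 + exp(-z))


def lstm_step(x, h_prev, c_prev, p):
    """One step of a scalar LSTM cell; returns the new (hidden, cell) state."""
    forget = sigmoid(p["wf"] * x + p["uf"] * h_prev + p["bf"])   # how much old memory to keep
    write = sigmoid(p["wi"] * x + p["ui"] * h_prev + p["bi"])    # how much new content to store
    output = sigmoid(p["wo"] * x + p["uo"] * h_prev + p["bo"])   # how much memory to expose
    candidate = tanh(p["wg"] * x + p["ug"] * h_prev + p["bg"])   # proposed new content
    c = forget * c_prev + write * candidate
    h = output * tanh(c)
    return h, c


# Hand-picked illustrative parameters, not trained weights.
example_params = {
    "wf": 0.5, "uf": 0.1, "bf": 1.0,
    "wi": 0.4, "ui": 0.2, "bi": 0.0,
    "wo": 0.3, "uo": 0.1, "bo": 0.0,
    "wg": 0.8, "ug": 0.3, "bg": 0.0,
}
h, c = 0.0, 0.0
for reading in [0.2, 0.4, 0.6]:  # a short, scaled input sequence
    h, c = lstm_step(reading, h, c, example_params)
```

When the forget gate stays near 1 and the write gate near 0, the cell state passes through a step almost unchanged, which is the mechanism that lets the network retain information over long sequences.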

An innovative hybrid model was proposed [59] by combining the strengths of two deep learning approaches. To forecast PM concentrations, the hybrid deep-learning approaches CNN-LSTM and CNN-GRU were compared with several individual methods, including LSTM and GRU. The models were trained on hourly air pollution and meteorological data. The experimental findings demonstrated that the model could forecast PM pollutant concentrations for seven days. In five randomly selected regions, the hybrid models outperformed the single models for PM₁₀ and PM₂.₅ prediction, achieving the lowest RMSE and MAE values. It was found that the CNN–GRU model performed better for PM₁₀ prediction, and the CNN–LSTM model performed well for PM₂.₅ prediction. To increase the accuracy of PM₂.₅ prediction, an early warning system was developed to predict PM₂.₅ concentration based on extraction and optimization mechanisms [60]. First, a feedback VMD method was created to decompose the PM₂.₅ concentration sequence, and fuzzy entropy was employed to reconstruct components of comparable complexity. Copula entropy was then utilized to identify the most significant factors influencing PM₂.₅. Following that, the reconstructed components and influencing factors were fed into three separate forecasting models, i.e., LSTM, GRU, and TCN. The outputs of the individual prediction models were nonlinearly merged and optimized using Gaussian process regression and multi-objective grey wolf optimization. Lastly, the predictions for the various reconstructed components were nonlinearly combined to obtain the final PM₂.₅ predictions.
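Since the comparison above rests on RMSE and MAE, a short sketch of how these two error measures are commonly computed may be helpful. This is a generic illustration and not the evaluation code of the cited study; the function names and the sample values are hypothetical.

```python
import math


def rmse(observed, predicted):
    """Root mean squared error: penalises large errors more strongly."""
    squared = [(o - p) ** 2 for o, p in zip(observed, predicted)]
    return math.sqrt(sum(squared) / len(squared))


def mae(observed, predicted):
    """Mean absolute error: average magnitude of the errors."""
    absolute = [abs(o - p) for o, p in zip(observed, predicted)]
    return sum(absolute) / len(absolute)


# Hypothetical hourly concentrations and two competing forecasts.
observed = [10.0, 20.0, 30.0]
forecast_a = [12.0, 18.0, 33.0]
forecast_b = [10.5, 19.0, 31.0]
better = "b" if rmse(observed, forecast_b) < rmse(observed, forecast_a) else "a"
```

Lower values of either measure indicate a closer fit, which is the sense in which the hybrid models are reported to have the lowest RMSE and MAE.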

A graph convolutional temporal sliding long short-term memory (GT-LSTM) scheme [61] was proposed to predict various air contaminants. The graph convolutional and LSTM components were merged using a temporal sliding method. To study the dynamic changes of air pollution over time, LSTM was applied with a temporal sliding technique. Experiments showed that the scheme could extract high-level spatiotemporal characteristics more accurately and efficiently than the existing benchmark. To anticipate and analyze air pollution from Combined Cycle Power Plants, an innovative hybrid intelligence model combining the Multi-verse Optimization Algorithm (MVO) and LSTM was developed [62]. In this model, the MVO approach was used to tune the LSTM parameters, which decreased the forecasting error, while the LSTM functioned as the forecasting engine to anticipate the amounts of NO₂ and SO₂ released by the Combined Cycle Power Plant. The model’s effectiveness was then assessed using actual data gathered from a combined cycle power plant in Kerman, Iran. Over five months (May to September 2019), measurements of wind speed, air temperature, NO₂, and SO₂ levels were taken every three hours.

A multi-point deep learning scheme based on convolutional LSTM (ConvLSTM) was proposed [63] for highly dynamic air quality prediction. ConvLSTM integrates LSTM with CNN, allowing the extraction of both temporal and spatial data characteristics. Furthermore, uncertainty quantification schemes were built on top of the model’s architecture, and its performance was investigated. The ConvLSTM outperformed state-of-the-art machine learning and deep learning schemes. A method [64] was developed to forecast air pollution and create an early warning system. This approach integrates advanced optimization, feature selection, and feature extraction methods. The PM₂.₅ sequence was initially decomposed into several sub-sequences using complete ensemble empirical mode decomposition with adaptive noise. Subsequently, fuzzy entropy was employed to reconstruct the sub-sequences into new components with varying levels of complexity. The factors influencing the reconstructed components were identified using the Max-Relevance and Min-Redundancy technique. To forecast and nonlinearly integrate the reconstructed components, a dual-phase deep learning scheme was designed using LSTM improved with the grey wolf optimization method. The proposed hybrid scheme successfully anticipated air pollution levels and provided an effective early alerting mechanism. It performed better than other approaches in terms of accuracy, warning precision, and prediction resilience, indicating that the framework could be valuable for forecasting air pollution and providing early alerts.

Beijing’s hourly PM₂.₅ concentrations and meteorological data were utilized as input by researchers [65]. Using the GRU model, four distinct models were trained, each corresponding to one of the four seasons: spring, summer, autumn, and winter. The effectiveness of these models in predicting seasonal PM₂.₅ levels was then evaluated using appropriate test sets for each season. The model’s prediction error and prediction accuracy were studied and compared after multiple trials. Model parameters were continually adjusted, and the benefits of this technique were confirmed. The findings showed that the model’s prediction accuracy was high.

The concentration of contaminants was forecasted by considering three primary areas [66]. First, the dataset was characterized, processed, and examined to identify crucial properties, including the autocorrelation function and evidence of non-Gaussianity through a normality test. Second, a Structural Recurrent Neural Network (SRNN) and various RNN structures linked with memory cells were evaluated to produce an accurate sequential prediction model for PM₂.₅ indoor health risk assessment. Finally, several performance indicators were established to assess each prediction model’s performance. The researchers identified various constraints of the work, such as the difficulty of implementing dynamic hyperparameterization for offline structure training and of capturing instantaneously disturbed PM₂.₅ data using a dynamic window size and sequence length. The deep-learning approach was used to forecast PM₂.₅ at a subway station in Korea. Telemonitoring systems were installed in the metro, waiting rooms, and subterranean regions. To forecast indoor PM₂.₅ concentrations, point-to-point prediction was utilized to choose the top RNN models. After identifying the best RNN for the research, multi-sequence prediction was used, yielding the best prediction accuracy. Statistical analysis revealed that PM₂.₅ concentrations were higher between February and December compared to other months.

A data-driven model was employed to estimate air pollution concentration in the atmosphere [67]. Historical data, meteorological data, weather prediction data, and day-of-the-week data from 36 monitoring stations were used. An LSTM-Fully Connected neural network focused on the spatiotemporal relationship to better predict PM₂.₅ concentrations. A basic linear interpolation approach was used to fill in the missing data. The procedure was multidimensional and multi-step, with results validated using 12-fold cross-validation. While the model successfully predicted PM₂.₅ concentration levels, it failed to capture abrupt changes in the input data. Meteorological and aerosol data from 1233 monitoring sites were collected, and Convolutional Long Short-Term Memory Extended (C-LSTME) [68] was used to extract spatial-temporal characteristics. The results were then validated using a 5-fold cross-validation method. The k-nearest neighbor technique was used to determine the influence of the center station on the nearby stations. For time lags ranging from one hour to six days, the proposed method was compared with existing methodologies and demonstrated superior prediction results. The study accounted for the spatiotemporal relationship with a time lag to effectively anticipate contaminants.
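The linear interpolation step mentioned above can be sketched as follows. This is a minimal illustration rather than the preprocessing code of the cited studies: missing readings are represented as `None`, and filling leading or trailing gaps with the nearest observed value is an assumption made here so that the function always returns a complete series.

```python
def fill_missing_linear(series):
    """Fill None entries by linear interpolation between known neighbours.

    Gaps at the start or end of the series have only one known neighbour,
    so they are filled with that nearest value (an assumption of this sketch).
    """
    values = list(series)
    known = [i for i, v in enumerate(values) if v is not None]
    if not known:
        raise ValueError("series contains no observed values")

    # Edge gaps: copy the nearest observed value.
    for i in range(known[0]):
        values[i] = values[known[0]]
    for i in range(known[-1] + 1, len(values)):
        values[i] = values[known[-1]]

    # Interior gaps: straight line between the surrounding observations.
    for left, right in zip(known, known[1:]):
        span = right - left
        for i in range(left + 1, right):
            weight = (i - left) / span
            values[i] = values[left] + weight * (values[right] - values[left])
    return values
```

For example, `fill_missing_linear([10.0, None, 20.0])` returns `[10.0, 15.0, 20.0]`. Because interpolation draws a straight line across a gap, it smooths over any sudden spike that occurred while data was missing.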

The objective [69] was to address the abovementioned constraints by proposing a hybrid model to enhance PM₂.₅ concentration forecasts. A Graph Convolutional Network (GCN) was utilized to identify spatial dependencies between stations, whereas an LSTM network was used to capture temporal dependencies within the data over various periods. The study utilized geographical and air quality data, including historical air quality records, meteorological factors, and geographical and temporal predictors, to visually represent spatio-temporal fluctuations. Airnet data was utilized to find the best-performing predictor among RNN-based models [70]. Several deep learning models, namely RNN, LSTM, and GRU, were trained on the data, and the findings showed that GRU performed better than the other two predictors. Four layers were evaluated for each deep learning model, and RMSE and MAPE values for each prediction layer were considered. A novel deep learning-based system [71] was created to forecast future air quality by leveraging historical air quality and weather data. Granger causality represented the spatial interdependence between two stations, with formulated relative stations and relative areas reflecting the spatial connection. The study divided the temporal features of air quality into short-term and long-term dependencies, both learned using LSTM. The method was tested using air quality data and one-year weather forecasts from Beijing.

A spatiotemporal correlation was used [72] to forecast air pollution concentrations with LSTM neural networks. A total of 20,196 records of historical data, meteorological data, and supplementary data were employed for the research. After the data was standardized, missing values were filled using a simple linear interpolation approach. Time-stamping was done using one-hot encoding. While the approach effectively improved multi-score predictions, it did not enhance prediction performance. To address issues like spatiotemporal instability and time-lag effects, the Deep multi-output LSTM (DM-LSTM) technique [73] was combined with three deep learning algorithms: mini-batch gradient descent, dropout neuron, and L2 regularization. The DM-LSTM model was designed to predict multiple future time steps. The correlation between the input and output variables was ascertained using the Kendall tau coefficient.
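As an illustration of the rank correlation mentioned above, the following sketch computes the Kendall tau coefficient for two variables. It assumes the tau-b variant, which corrects for ties; the cited work does not state which variant was used, and the function name is hypothetical. The implementation compares every pair of observations, so it is intended for small samples only.

```python
import math
from itertools import combinations


def kendall_tau_b(x, y):
    """Kendall tau-b rank correlation between two equal-length sequences."""
    concordant = discordant = ties_x = ties_y = 0
    for (x1, y1), (x2, y2) in combinations(zip(x, y), 2):
        dx, dy = x1 - x2, y1 - y2
        if dx == 0 and dy == 0:
            continue            # tied in both variables: ignored
        elif dx == 0:
            ties_x += 1         # tied only in x
        elif dy == 0:
            ties_y += 1         # tied only in y
        elif dx * dy > 0:
            concordant += 1     # both variables move in the same direction
        else:
            discordant += 1     # variables move in opposite directions
    denominator = math.sqrt(
        (concordant + discordant + ties_x) * (concordant + discordant + ties_y)
    )
    if denominator == 0:
        return float("nan")     # undefined when one variable is constant
    return (concordant - discordant) / denominator
```

A value near 1 indicates that an input variable and the output rise together, a value near -1 indicates that one falls as the other rises, and a value near 0 indicates no monotonic association.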

The analysis of atmospheric contaminants was conducted [74] using time series regression forecasting at two major monitoring stations in India, Shadipur and Anandvihar, out of a total of 20 sites during the period from 2016 to 2017. The examination involved both descriptive and predictive analysis utilizing RStudio and Tableau. Another study [75] focused on determining atmospheric contaminant concentrations with a deep learning system using datasets from Denmark and Romania collected between 2013 and 2015. The model was trained on 69.5% of the data from 17,568 samples, with the remainder used for testing. This prediction model consisted of four steps: sensor data input, computation by the deep learning model, data labeling, and final prediction by the decision unit. The k-means algorithm with the Mahout library was utilized to improve air quality prediction using the Citypulse dataset as a benchmark for Romania. A new dataset was generated by combining five air pollution data components with four meteorological data components. The fuzzy clustering technique was recommended for future research to achieve better results. A study [76] employed a two-phase decomposition approach combining CEEMD and VMD to achieve high accuracy while handling dispersed signal information. Following this decomposition, the Extreme Learning Machine (ELM) with Differential Evolution (DE) was utilized to achieve accurate prediction performance. The training data comprised data from China from July 1, 2014, to May 31, 2016, with testing data from June 1, 2016, to June 30, 2016. Two models [77] for predicting air pollutant concentrations were introduced: the EMD-SVR and the EMD-IMFs model. These models were used to model AQI data, and S-ARIMA was utilized to forecast IMFs. However, the specific pollutant concentrations impacting the environment were not defined. The Mean Absolute Percentage Error (MAPE) was used to identify the best model by comparing the two proposed hybrid models with six others, using data from China between June 2014 and August 2015.

In previous studies, only hourly data was used for forecasting, and yearly data was often overlooked. To address this, a Grey model-based method was proposed to predict yearly data on air pollutants such as PM₂.₅, PM₁₀, SO₂, and O₃ [78]. Irulegi et al. [79] employed the NetZero energy-building technique to measure room temperature at various intervals, with sensors installed in different locations. The readings over a week showed significant temperature variations on only two days. A system [80] for predicting future levels of air pollutants (SO₂, NO₂, O₃, PM₂.₅) was suggested based on synoptic forecasting, using data from 20 monitoring sites in Israel collected between 2002 and 2006. The processing utilized a numerical weather prediction categorization model, although the method’s main flaw was its use of constant pollutant coefficients. A time series model (ARIMA) was used to estimate SO₂ and PM levels accurately, analyzing data from Hawaii obtained between January 1, 2014, and August 21, 2018, and considering wind direction and speed [81].
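To make the Grey model idea for short yearly series more concrete, the sketch below implements the classical GM(1,1) model. This is an assumption: the cited work is described only as using "the Grey model", and GM(1,1) is simply its most widely used form. The function name and the sample series are hypothetical.

```python
import math


def gm11_forecast(series, steps=1):
    """Forecast the next `steps` values of a short positive series with GM(1,1)."""
    n = len(series)
    if n < 4:
        raise ValueError("GM(1,1) is usually applied to at least four points")

    # 1) Accumulated generating operation: running sum of the raw series.
    cumulative, total = [], 0.0
    for value in series:
        total += value
        cumulative.append(total)

    # 2) Background values: mean of consecutive accumulated values.
    z = [0.5 * (cumulative[k] + cumulative[k - 1]) for k in range(1, n)]
    y = series[1:]

    # 3) Least-squares fit of y = slope * z + b, where slope = -a.
    m = len(z)
    sum_z, sum_y = sum(z), sum(y)
    sum_zz = sum(v * v for v in z)
    sum_zy = sum(zi * yi for zi, yi in zip(z, y))
    slope = (m * sum_zy - sum_z * sum_y) / (m * sum_zz - sum_z ** 2)
    b = (sum_y - slope * sum_z) / m
    a = -slope
    if abs(a) < 1e-12:
        raise ValueError("development coefficient is zero; series is flat")

    # 4) Time response of the accumulated series (k = 0 is the first point).
    def accumulated_hat(k):
        return (series[0] - b / a) * math.exp(-a * k) + b / a

    # 5) Inverse accumulation: difference consecutive accumulated forecasts.
    return [accumulated_hat(k) - accumulated_hat(k - 1) for k in range(n, n + steps)]
```

The model fits a single exponential trend to the accumulated series, so it suits short, smoothly changing yearly records and is not designed for hourly data with strong fluctuations.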

The impact of noise on a two-degree-of-freedom (DOF) pitch-plunge aeroelastic system was investigated in [82], where variations in the inlet flow velocity were modeled using the Ornstein-Uhlenbeck (O-U) stochastic process. The study utilized a 2D airfoil with linear spring-based aeroelastic characteristics to investigate Limit Cycle Oscillations (LCOs) in dynamic stall circumstances. According to the study, adding stochastic noise causes notable qualitative changes in the LCO regime and reveals intrinsic volatility absent from deterministic models. A more accurate representation of environmental influences was obtained by introducing noise to the fluid velocity, demonstrating that fluctuations in these parameters cause LCO limit values to oscillate around the mean.

The researchers in [83] used Artificial Neural Network (ANN), Gaussian process regression (GPR), Decision Tree (DT), Ensemble Learning (EL), Support Vector Machine (SVM), and Linear Regression (LR) algorithms, including optimized versions of GPR, EL, DT, and SVM, to assess the amount of CO₂ in an office space. The study used 169 real-time data sets that included variables such as air quality index, temperature, wind speed, relative humidity, occupancy, and area per person. After the data had been scaled and normalized between 0 and 1, 70% was used for training and 30% for testing. A 5-fold cross-validation procedure was performed. With R, RMSE, MAE, NS, and 20-index values of 0.98874, 4.20068 ppm, 3.35098 ppm, 0.9817, and 1, respectively, the results demonstrated that the optimized GPR achieved excellent accuracy. The model can help to estimate CO₂ levels and their effect on health.
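The scaling and splitting steps described above can be sketched in a few lines. This is a generic illustration, not the cited study's code: the function names, the fixed random seed, and the use of rounding to pick the split point are assumptions.

```python
import random


def min_max_scale(values):
    """Rescale values linearly so the minimum maps to 0 and the maximum to 1."""
    lowest, highest = min(values), max(values)
    if highest == lowest:
        return [0.0 for _ in values]  # constant feature: no spread to scale
    return [(v - lowest) / (highest - lowest) for v in values]


def train_test_split(rows, train_fraction=0.7, seed=0):
    """Shuffle rows reproducibly and split them into training and test sets."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


# Hypothetical CO2 readings in ppm, scaled to the 0..1 range.
scaled = min_max_scale([400.0, 600.0, 800.0])

# 169 placeholder samples split 70/30, mirroring the sample count above.
train_rows, test_rows = train_test_split(range(169))
```

With 169 samples, this split yields 118 training rows and 51 test rows. A common refinement, not described in the source, is to compute the scaling minimum and maximum on the training rows only, so that no information from the test set influences preprocessing.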

Here are some suggestions and guidelines for extending the utility of these models to manage and effectively utilize large datasets, especially in scenarios involving rapid data influx:

Assess each model’s current scalability to determine how well it can handle increased data volumes without a significant loss in performance or speed. Outline the architectural limits of simpler models and identify whether enhancements or replacements are necessary for big data applications.
Implement data preprocessing steps such as data normalization, dimensionality reduction, and feature selection to manage large datasets more efficiently before they are input into the model. Use data batching and mini-batch gradient descent for training models on large datasets to manage memory resources effectively and speed up the learning process.
Upgrade hardware or migrate to cloud-based solutions to ensure that the infrastructure can handle the computational demands of big data analytics. Utilize distributed computing environments like Apache Hadoop or Spark to process large datasets effectively, particularly when dealing with real-time data streams.
Incorporate streaming algorithms that can process data in real time, which is crucial for scenarios involving sudden data surges. Adapt models to use incremental learning techniques, where the model updates its parameters continuously as new data arrives without the need for retraining from scratch.
Evaluate model performance under varying conditions of data volume, velocity, and variety to identify potential performance degradation or failures. Test models against scenarios of sudden data influx to gauge their responsiveness and accuracy in real-time predictions.
Conduct stress tests to identify potential bottlenecks or weaknesses in data processing and model predictions. Consider model enhancements such as hybrid architectures that combine simpler models with more complex algorithms to improve handling and predictive accuracy with large datasets.
Integrate advanced machine learning techniques such as deep learning and ensemble methods, which are generally more adept at handling complex and large-scale data. Employ adaptive algorithms that can adjust to changes in data patterns over time, enhancing model flexibility and durability against the variability in big data streams.
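Two of the guidelines above, mini-batch gradient descent and incremental learning, can be illustrated together with a minimal sketch. It assumes a plain linear model trained on a squared-error loss; the class name, method names, and learning rate are hypothetical and stand in for whatever model a real system would use.

```python
class IncrementalLinearModel:
    """Linear model y ≈ w·x + b that is updated one mini-batch at a time."""

    def __init__(self, n_features, learning_rate=0.05):
        self.weights = [0.0] * n_features
        self.bias = 0.0
        self.learning_rate = learning_rate

    def predict(self, features):
        return sum(w * x for w, x in zip(self.weights, features)) + self.bias

    def partial_fit(self, batch_x, batch_y):
        """Take one gradient step using only the rows in this mini-batch."""
        n = len(batch_x)
        grad_w = [0.0] * len(self.weights)
        grad_b = 0.0
        for features, target in zip(batch_x, batch_y):
            error = self.predict(features) - target
            for j, x in enumerate(features):
                grad_w[j] += error * x / n
            grad_b += error / n
        self.weights = [
            w - self.learning_rate * g for w, g in zip(self.weights, grad_w)
        ]
        self.bias -= self.learning_rate * grad_b


def mini_batches(xs, ys, batch_size):
    """Yield consecutive slices so the full dataset never sits in one update."""
    for start in range(0, len(xs), batch_size):
        yield xs[start:start + batch_size], ys[start:start + batch_size]
```

Because `partial_fit` only ever sees one batch, the same call can be applied to newly arriving readings without retraining from scratch, and memory use is bounded by the batch size rather than by the size of the dataset.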

Need for prediction

Air pollution is impacting countries all over the world. Delhi is India’s capital city and one of the most polluted. Air pollution in Delhi and the NCR is often so severe that even a slight improvement in air quality makes little difference. A variety of sources cause smog and air pollution in Delhi and the NCR, including emissions from motor vehicles, industry, and building activities. During the winter, the wind pattern across northern India leaves New Delhi and the NCR sensitive to air patterns. The reasons behind the increase in air pollution in Delhi, as well as the efforts taken by the government to address the issue, are detailed below.

a. Border Sharing: Delhi borders Haryana and Uttar Pradesh and lies close to Punjab, and rice stubble burning in any of these states contributes to Delhi’s pollution levels. Climatic elements play a vital role in exacerbating the pollution problem. For example, any polluting activity in one city affects the pollution level of neighboring cities. Stubble burning is the practice of clearing an agricultural field by burning the residue left on the soil after harvesting to prepare it for the next seeding cycle. While it is a significant factor in air pollution, it is not the sole contributor.

Step(s) taken by government: The government is taking significant measures to address the problem of stubble burning. The authorities have created a public awareness campaign to educate people about the dangers of crop burning. New frameworks and action plans for preventing and managing stubble burning are also being established; the Turbo Happy Seeder (THS) machine, the Pusa bio-decomposer, and custom hiring centers (CHCs) are some solutions to the problem. The government has raised funds to purchase crop residue management equipment [84]. In addition, breaking the regulations can result in significant fines, prison time, or both [85].
b. Vehicular Emission: One of society’s significant difficulties today is pollution due to vehicle emissions. Any vehicle, such as a car, bus, truck, or scooter, can contribute to air pollution. The most crucial consideration is which sort of vehicle emits the least pollution. Compared to other modes of transportation, the bus is a viable option since it consumes less fuel per person.

Step(s) taken by government: The government banned the sale of diesel personal vehicles with engines larger than 2000 cc and levied a “green tax” of 4% on larger cars and 2.5% on compact cars. The authorities considered steps such as introducing Bharat Stage IV and Bharat Stage VI norms, electrification of vehicles, and the usage of LNG-powered buses to tackle the issue [86]. On January 1, 2016, the Delhi government introduced the odd-even scheme, under which vehicles with odd-numbered plates were allowed to drive, park, and purchase gasoline on odd-numbered dates, and those with even-numbered plates on even-numbered dates [87,88,89]. The scheme resulted in a reduction in traffic congestion, a slight decrease in pollution levels, and a boost in public transportation demand, including more buses and increased metro frequency. Other essential actions include expanding green areas, implementing automated road sweeping, particularly close to building sites, and starting a public awareness campaign to encourage non-motorized mobility within a 5 to 6-km radius of the city.
c. Seasons: In winter, dust particles and airborne contaminants become stationary. The lack of wind causes these pollutants to remain trapped in the atmosphere, impacting meteorological conditions and forming smog. In the winter, coal combustion also raises indoor and outdoor pollution levels [22,23,24,25,27]. Meteorological factors can have either a negative or a positive relationship with air pollution.

Step(s) taken by government: The government advises citizens to avoid going outside in the early morning and late evening when pollution levels are high. Several groups have developed protective products that can be purchased in stores or online. Air purifiers can be used to combat indoor pollution, or indoor plants can be placed within the house. For protection against outdoor pollution, masks are readily available on the market.
d. Overpopulation: In 2019, Delhi’s population was 33 million, making it the world’s fifth most populous metropolis. As the population grows, so does the need for food, land, and transportation, all of which impact pollution directly or indirectly.

Step(s) taken by government: The government has proposed various strategies to address the growing population problem. Programs to raise awareness have been launched and are spreading across the country [90].
e. Crackers burning on Festivals: India is characterized by diverse ethnicities, religions, and languages, and people joyfully celebrate each holiday. Every year during Diwali (the Festival of Lights), people ignite many crackers, contributing to pollution.

Step(s) taken by government: The government has introduced green crackers in association with CSIR (Council of Scientific and Industrial Research) [91]. Every year, the government sets specific hours for cracker burning and runs awareness campaigns encouraging people to celebrate the festival with lights rather than fireworks. Challans/fines, city-level bans on cracker burning, awareness campaigns against crackers, and fixed hours for cracker burning are some steps the government considers for the betterment of the people [92].

Limitation of the work

Despite extensive research on air pollutant prediction, collecting datasets remains challenging; datasets covering air pollutants, meteorological data, traffic data, and satellite data are still lacking.
When the relationship between air pollutants and weather data is not modeled, prediction accuracy is low. Researchers should focus more on correlation factors, as air pollution is affected by many factors.
Predictions should be made with regard to geographical and temporal scope. Other regions should also be considered to make predictions more accurate.
Uncertainty in the data can reduce the reliability of predictions.

Conclusion and future scope

Air pollution is a growing problem that significantly impacts current and future generations. Long-term solutions are necessary to combat this issue, not just during peak pollution months but throughout the year. Addressing major contributors to air pollution requires infrastructure for electric vehicles, enhanced public transit networks, and prioritization of renewable energy sources year-round.

This paper comprehensively reviews air quality, covering various pollutants, meteorological factors, and statistical data on death rates due to air pollution. It examines the impact of air pollution on three broad areas of society, offering readers a deeper understanding of its consequences. Additionally, the study explores several approaches to predicting air quality using machine learning and deep learning, highlighting the challenges in each methodology to guide future research.

While traditional artificial intelligence outperforms statistical methods, hybrid models show superior performance. The accuracy and robustness of chemical weather and air quality forecasts are improved via ensemble approaches and data decomposition. Effective prediction requires integrating spatiotemporal components, meteorological features, and geographical considerations, which many researchers have neglected. Different techniques are needed to forecast air pollution depending on the contaminants and regions involved, since no single method can guarantee accurate predictions. Large datasets can make training computationally expensive, while small datasets may reduce accuracy. Despite these challenges, AI technology is becoming a preferred alternative to traditional mathematical techniques due to its quick and reliable response.

In addition to the advantages mentioned above, this paper has some theoretical and practical implications. The study provides a deeper understanding of the mechanisms behind air pollution formation, dispersion, and transformation. Introducing new methodologies or data analysis techniques may push theoretical boundaries and set new standards in the field. It may enhance theories regarding the broader environmental impacts of air pollution. It may help regulatory agencies better monitor and enforce compliance with air quality standards, ensuring that industries and other sources adhere to regulations. The research may inform the design of public health campaigns and protective measures such as air quality alerts and recommendations for vulnerable populations. The research might drive innovation in air quality monitoring technologies, leading to more accurate and real-time data collection tools. The study can contribute to educational efforts to raise awareness about air pollution and its health impacts, encouraging behavioral changes and community action. The research might encourage collaborations between researchers, policymakers, industry leaders, and non-governmental organizations to address air pollution challenges more effectively.

Most studies reviewed used diverse performance metrics to evaluate model prediction accuracy, yet more research is needed to explore other aspects of model efficacy. Additionally, researchers often overlook other environmental contaminants impacting human health.

Despite significant progress in air quality prediction, many obstacles remain. Machine learning and deep learning advances yield good predictions, but uncertainty can undermine these results. For instance, during the COVID-19 pandemic, reduced industrial and transportation activities led to healthier air quality. Conversely, cities with poor air quality experienced higher health risks. Understanding data patterns is crucial for developing effective algorithms and raising public awareness in critical situations.

Future research should address the following gaps identified in this study:

Insufficient research addressing uncertainty factors affecting prediction outcomes highlights the need for enhanced models to incorporate real-time data updates and feedback mechanisms.
There is a need for a more inclusive investigation of meteorological and pollutant parameters in forecasting accuracy, which should include a thorough investigation of the formation and behavior of secondary pollutants to understand their complex dynamics.
There are few analyses on the effectiveness of sophisticated Artificial Neural Network (ANN) models in improving computational efficiency in complex prediction systems.
Satellite data, meteorological data, and spatial-temporal resolution of air pollution forecasts should be utilized to create a comprehensive and accurate prediction model.

Data availability

No datasets were generated or analysed during the current study.

References

Total Population by Country. 2019 [internet]. [cited 2019 December 12]. http://worldpopulationreview.com/countries/
India Population. (2019) - Worldometers [Internet]. [cited 2019 December 12]. https://www.worldometers.info/world-population/india-population/
Air pollution claims over 10 lakh lives every year and yet we tend to ignore it [Internet]. [cited 2022 December 2]. https://www.firstpost.com/health/air-pollution-claims-over-10-lakh-lives-every-year-and-yet-we-tend-to-ignore-it-10051431.html
Air pollution linked to 12.4L deaths in India in ’17: Report | India News - Times of India [Internet]. [cited 2022 December 2]. https://timesofindia.indiatimes.com/india/air-pollution-linked-to-12-4l-deaths-in-india-in-17-report/articleshow/66978223.cms
12.4 lakh deaths reported in India due to air pollution - India Today [Internet]. [cited 2022 December 2]. https://www.indiatoday.in/education-today/gk-current-affairs/story/12-4-lakh-deaths-reported-in-india-due-to-air-pollution-1404411-2018-12-07
Explained: India topped air pollution death toll in 2019, says report [Internet]. https://indianexpress.com/article/explained/explained-india-topped-air-pollution-death-toll-2019-7922560/
120,000 Deaths in India Due to Air Pollution in 2020 [Internet]. https://smartairfilters.com/en/blog/120000-deaths-in-india-due-to-air-pollution-in-2020/
WHO | Ambient air pollution: Health impacts [internet]. WHO, World O. 2018 [cited 2019 December 10]. https://www.who.int/teams/environment-climate-change-and-health/air-quality-energy-and-health/sectoral-interventions/ambient-air-pollution/health-risks#:~:text=4.2%20million%20people%20die%20prematurely,and%206%25%20to%20lung%20cancer
Kazemi Z, Jonidi Jafari A, Farzadkia M, Amini P, Kermani M. Evaluating the mortality and health rate caused by the PM2.5 pollutant in the air of several important Iranian cities and evaluating the effect of variables with a linear time series model. Heliyon. 2024;10:e27862.
WHO | Household air pollution. Health impacts [internet]. [cited 2019 December 10]. https://www.who.int/news-room/fact-sheets/detail/household-air-pollution-and-health#:~:text=The%20combined%20effects%20of%20ambient,(COPD)%20and%20lung%20cancer
Bai L, He Z, Li C, Chen Z. Investigation of yearly indoor/outdoor PM2.5 levels in the perspectives of health impacts and air pollution control: Case study in Changchun, in the northeast of China. Sustain Cities Soc. 2020;53:101871.
Cincinelli A, Martellini T. Indoor air quality and health. Int J Environ Res Public Health. MDPI AG; 2017.
Alves CA, Vicente ED, Evtyugina M, Vicente AM, Nunes T, Lucarelli F, Calzolai G, Nava S, Calvo AI, del Blanco Alegre C, Oduber F. Indoor and outdoor air quality: A university cafeteria as a case study. Atmospheric Pollution Res. 2020;11(3):531–44.
Alves C, Nunes T, Silva J, Duarte M. Comfort parameters and Particulate Matter (PM 10 and PM 2.5) in School classrooms and Outdoor Air. Aerosol Air Qual Res. 2013;13:1521–35.
Buczyńska AJ, Krata A, Van Grieken R, Brown A, Polezer G, De Wael K, et al. Composition of PM2.5 and PM1 on high and low pollution event days and its relation to indoor air quality in a home for the elderly. Sci Total Environ. 2014;490:134–43.
Bennett J, Davy P, Trompetter B, Wang Y, Pierse N, Boulic M, et al. Sources of indoor air pollution at a New Zealand urban primary school; a case study. Atmos Pollut Res. 2019;10:435–44.
Scibor M. Are we safe inside? Indoor air quality in relation to outdoor concentration of PM10 and PM2.5 and to characteristics of homes. Sustain Cities Soc. 2019;48.
Chamseddine A, Alameddine I, Hatzopoulou M, El-Fadel M. Seasonal variation of air quality in hospitals with indoor–outdoor correlations. Build Environ. 2019;148:689–700.
Malhotra M, Aulakh IK, Kaur N, Aulakh NS. Air Pollution Monitoring Through Arduino Uno. Advances in Intelligent Systems and Computing. 2020.
Kumar A, Malhotra S, Kaur DP, Gupta L. Weather Monitoring and Air Quality Prediction using Machine Learning. 2022 1st International Conference on Computational Science and Technology, ICCST 2022 - Proceedings. Institute of Electrical and Electronics Engineers Inc.; 2022. pp. 364–8.
Pandithurai O, Bharathiraja N, Pradeepa K, Meenakshi D, Kathiravan M, Vinoth Kumar M. Air Pollution Prediction using Supervised Machine Learning Technique. Proceedings of the 3rd International Conference on Artificial Intelligence and Smart Energy, ICAIS 2023. Institute of Electrical and Electronics Engineers Inc.; 2023. pp. 542–6.
Wang J, Ogawa S. Effects of meteorological conditions on PM2.5 concentrations in Nagasaki, Japan. Int J Environ Res Public Health. 2015;12:9089–101.
Article Google Scholar
Giri D, Krishna Murthy V, Adhikary PR. The influence of meteorological conditions on PM10 concentrations in Kathmandu Valley. Int J Environ Res. 2008;2:49–60.
Google Scholar
Zyromski A, Biniak-Pieróg M, Burszta-Adamiak E, Zamiar Z. Evaluation of relationship between air pollutant concentration and meteorological elements in winter months. J Water Land Dev. 2014;22:25–32.
Article Google Scholar
Yang Q, Yuan Q, Li T, Shen H, Zhang L. The relationships between PM2. 5 and meteorological factors in China: seasonal and regional variations. Int J Environ Res Public Health. 2017;14(12):1510.
Malhotra M, Aulakh IK. Meteorological Factors Correlation with Air pollutants: a Case Study in Delhi. Int J Environ Sci Dev. 2023;14:91–105.
Article Google Scholar
Kayes I, Shahriar SA, Hasan K, Akhter M, Kabir MM, Salam MA. The relationships between meteorological parameters and air pollutants in an urban environment. Global J Environ Sci Manage. 2019;5:265–78.
Google Scholar
Lin X, Fu Y, Peng DZ, Liu C-H, Chu M, Chen Z, et al. CFD- and BPNN- based investigation and prediction of air pollutant dispersion in urban environment. Sustain Cities Soc. 2024;100:105029.
Article Google Scholar
Yang F, Huang G. An optimized decomposition integration model for deterministic and probabilistic air pollutant concentration prediction considering influencing factors. Atmos Pollut Res. 2024;15:102144.
Article Google Scholar
Shaban WM, Dongxi X, Daef KS, Elbaz K. Real-time early warning and the prediction of air pollutants for sustainable development in smart cities. Atmos Pollut Res. 2024;15:102162.
Article Google Scholar
Monteiro TO, Alves PAA da S, de AN, Barradas Filho AO, Villa-Vélez HA, Cruz G. Estimation of the main air pollutants from different biomasses under combustion atmospheres by artificial neural networks. Chemosphere. 2024;352:141484.
Zhou Y, Chang FJ, Chang LC, Kao IF, Wang YS, Kang CC. Multi-output support vector machine for regional multi-step-ahead PM 2.5 forecasting. Sci Total Environ. 2019;651:230–40.
Article Google Scholar
Ahani IK, Salari M, Shadman A. Statistical models for multi-step-ahead forecasting of fine particulate matter in urban areas. Atmospheric Pollution Res. 2019;10(3):689–700.
Cabaneros SMLS, Calautit JKS, Hughes BR. Hybrid Artificial Neural Network Models for Effective Prediction and Mitigation of Urban Roadside NO2 Pollution. Energy Procedia. 2017;142:3524–30.
Article Google Scholar
Wang P, Zhang H, Qin Z, Zhang G. A novel hybrid-Garch model based on ARIMA and SVM for PM2. 5 concentrations forecasting. Atmospheric Pollution Research. 2017;8(5):850-60.
Catalano M, Galatioto F, Bell M, Namdeo A, Bergantino AS. Environmental Science & Policy Improving the prediction of air pollution peak episodes generated by urban transport networks. Environ Sci Policy. 2016;60:69–83.
Article Google Scholar
Feng X, Li Q, Zhu Y, Hou J, Jin L, Wang J. Artificial neural networks forecasting of PM2.5 pollution using air mass trajectory based geographic model and wavelet transformation. Atmos Environ. 2015;107:118–28.
Article Google Scholar
Cortina–Januchs MG, Quintanilla–Dominguez J, Vega–Corona A, Andina D. Development of a model for forecasting of PM10 concentrations in Salamanca, Mexico. Atmos Pollut Res. 2015;6:626–34.
Article Google Scholar
Wahid HA, Ha QP, Duc H, Azzi M. Neural network-based meta-modelling approach for estimating spatial distribution of air pollutant levels. Applied Soft Comput. 2013;13(10):4087–96.
Domańska D, Wojtylak M. Application of fuzzy time series models for forecasting pollution concentrations. Expert Syst Appl. 2012;39:7673–9.
Article Google Scholar
Siwek K, Osowski S. Engineering Applications of Artificial Intelligence improving the accuracy of prediction of PM 10 pollution by the wavelet transformation and an ensemble of neural predictors. Eng Appl Artif Intell. 2012;25:1246–58.
Article Google Scholar
Feng Y, Zhang W, Sun D, Zhang L. Ozone concentration forecast method based on genetic algorithm optimized back propagation neural networks and support vector machine data classi fi cation. Atmos Environ. 2011;45:1979–85.
Article Google Scholar
Liu H, Wu H, Lv X, Ren Z, Liu M, Li Y, et al. An intelligent hybrid model for air pollutant concentrations forecasting: case of Beijing in China. Sustain Cities Soc. 2019;47:101471.
Article Google Scholar
Kumar D. ScienceDirect ScienceDirect Evolving Differential evolution method with random forest for Evolving Differential evolution method with random forest for prediction of Air Pollution prediction of Air Pollution. Procedia Comput Sci. 2018;132:824–33.
Article Google Scholar
Alimissis A, Philippopoulos K, Tzanis CG, Deligiorgi D. Spatial estimation of urban air pollution with the use of arti fi cial neural network models. Atmos Environ. 2018;191:205–13.
Article Google Scholar
Fan J, Wu L, Zhang F, Cai H, Wang X, Lu X. Evaluating the e ff ect of air pollution on global and di ff use solar radiation prediction using support vector machine modeling based on sunshine duration and air temperature. Renew Sustain Energy Rev. 2018;94:732–47.
Article Google Scholar
Huang L, Zhang C, Bi J. Development of land use regression models for PM 2. 5, SO 2, 2 and O 3 in. Environ Res. 2017;158:542–52.
Article Google Scholar
Corani G, Scanagatta M. Air pollution prediction via multilabel classification. Environ Model Softw. 2016;80:259–64.
Article Google Scholar
Olvera-garcía MÁ, Carbajal-hernández JJ, Sánchez-fernández LP. Hernández-bautista I. Ecological Informatics Air quality assessment using a weighted fuzzy inference system. Ecol Inf. 2016;33:57–74.
Article Google Scholar
Bai Y, Li Y, Wang X, Xie J, Li C. Air pollutants concentrations forecasting using back propagation neural network based on wavelet decomposition with meteorological conditions. Atmospheric pollution res. 2016;7(3):557–66.
Kumar K, Pande BP. Air pollution prediction with machine learning: a case study of Indian cities. International Journal of Environmental Science and Technology [Internet]. 2023;20:5333–48. https://link.springer.com/https://doi.org/10.1007/s13762-022-04241-5
Lin Y-C, Lin Y-T, Chen C-R, Lai C-Y. Meteorological and traffic effects on air pollutants using bayesian networks and deep learning. J Environ Sci. 2025;152:54–70.
Article Google Scholar
Wu Z, Tian Y, Li M, Wang B, Quan Y, Liu J. Prediction of air pollutant concentrations based on the long short-term memory neural network. J Hazard Mater. 2024;465:133099.
Article Google Scholar
Li D, Wang J, Tian D, Chen C, Xiao X, Wang L, et al. Residual neural network with spatiotemporal attention integrated with temporal self-attention based on long short-term memory network for air pollutant concentration prediction. Atmos Environ. 2024;329:120531.
Article Google Scholar
Wu Z, Tian Y, Li M, Wang B, Quan Y, Liu J. Prediction of air pollutant concentrations based on the long short-term memory neural network. J Hazard Mater. 2023;133099.
Ma Z, Wang B, Luo W, Jiang J, Liu D, Wei H, et al. Air pollutant prediction model based on transfer learning two-stage attention mechanism. Sci Rep. 2024;14:7385.
Article Google Scholar
Bekkar A, Hssina B, Douzi S, Douzi K. Air-pollution prediction in smart city, deep learning approach. J Big Data [Internet]. 2021;8:161. https://journalofbigdata.springeropen.com/articles/https://doi.org/10.1186/s40537-021-00548-1
Drewil GI, Al-Bahadili RJ. Air pollution prediction using LSTM deep learning and metaheuristics algorithms. Measurement: Sensors [Internet]. 2022;24:100546. https://linkinghub.elsevier.com/retrieve/pii/S2665917422001805
Yang G, Lee H, Lee G. A hybrid deep learning model to forecast particulate matter concentration levels in Seoul, South Korea. Atmosphere. 2020;11(4):348.
Wang J, Xu W, Zhang Y, Dong J. A novel air quality prediction and early warning system based on combined model of optimal feature extraction and intelligent optimization. Chaos Solitons Fractals. 2022;158:112098.
Article Google Scholar
Mao W, Jiao L, Wang W, Wang J, Tong X, Zhao S. A hybrid integrated deep learning model for predicting various air pollutants. GIsci Remote Sens. 2021;58:1395–412.
Article Google Scholar
Heydari A, Majidi Nezhad M, Astiaso Garcia D, Keynia F, De Santoli L. Air pollution forecasting application based on deep learning model and optimization algorithm. Clean Technol Environ Policy. 2022;24:607–21.
Article Google Scholar
Mokhtari I, Bechkit W, Rivano H, Yaici MR. Uncertainty-aware deep learning architectures for highly dynamic air Quality Prediction. IEEE Access. 2021;9:14765–78.
Article Google Scholar
Wang J, Xu W, Dong J, Zhang Y. Two-stage deep learning hybrid framework based on multi-factor multi-scale and intelligent optimization for air pollutant prediction and early warning. Stoch Env Res Risk Assess. 2022;36:3417–37.
Article Google Scholar
Zhou X, Xu J, Zeng P, Meng X. Air Pollutant Concentration Prediction based on GRU Method. J Phys Conf Ser. 2019;1168.
Loy-Benitez J, Vilela P, Li Q, Yoo C. Sequential prediction of quantitative health risk assessment for the fine particulate matter in an underground facility using deep recurrent neural networks. Ecotoxicol Environ Saf [Internet]. 2019;169:316–24. https://linkinghub.elsevier.com/retrieve/pii/S0147651318311606
Zhao J, Deng F, Cai Y, Chen J. Chemosphere Long short-term memory - fully connected (LSTM-FC) neural network for PM 2. 5 concentration prediction. Chemosphere. 2019;220:486–92.
Article Google Scholar
Wen C, Liu S, Yao X, Peng L, Li X, Hu Y, et al. Science of the total environment a novel spatiotemporal convolutional long short-term neural network for air pollution prediction. Sci Total Environ. 2019;654:1091–9.
Article Google Scholar
Qi Y, Li Q, Karimian H, Liu D. A hybrid model for spatiotemporal forecasting of PM 2.5 based on graph convolutional neural network and long short-term memory. Sci Total Environ. 2019;664:1–10.
Article Google Scholar
Athira V, Geetha P, Vinayakumar R, Soman KP. ScienceDirect ScienceDirect DeepAirNet: applying recurrent networks for Air Quality Prediction. Procedia Comput Sci. 2018;132:1394–403.
Article Google Scholar
Wang J, Song G, Neurocomputing A, D eep. patial- T emporal E nsemble M odel for a ir Q uality P rediction. Neurocomputing. 2018;314:198–206.
Article Google Scholar
Li X, Peng L, Yao X, Cui S, Hu Y, You C. Long short-term memory neural network for air pollutant concentration predictions: Method development and evaluation *. Environ Pollut. 2017;231:997–1004.
Article Google Scholar
Zhou Y, Chang F, Chang L, Kao I, Wang Y. Explore a deep learning multi-output neural network for regional multi-step-ahead air quality forecasts. J Clean Prod. 2019;209:134–45.
Article Google Scholar
Sharma N, Taneja S, Sagar V, Bhatt A. Forecasting air pollution load in Delhi using data analysis tools. Procedia Comput Sci. Elsevier BV; 2018. pp. 1077–85.
Kök I, Şimşek MU, Özdemir S. A deep learning model for air quality prediction in smart cities. Proceedings – 2017 IEEE International Conference on Big Data, Big Data. 2017. 2017. pp. 1983–90.
Wang D, Wei S, Luo H, Yue C, Grunder O. Science of the total environment a novel hybrid model for air quality index forecasting based on two-phase decomposition technique and modi fi ed extreme learning machine. Sci Total Environ. 2017;580:719–33.
Article Google Scholar
Zhu S, Lian X, Liu H, Hu J, Wang Y, Che J. Daily air quality index forecasting with hybrid models: a case in. Environ Pollut. 2017;231:1232–44.
Article Google Scholar
Wu L, Li N, Yang Y. Prediction of air quality indicators for the Beijing-Tianjin-Hebei region. 2018;196:682–7.
Irulegi O, Serra A, Hernández R. Data in brief data on records of indoor temperature and relative humidity in a University building. Data Brief. 2017;13:248–52.
Article Google Scholar
Broday DM, Alpert P. Exploring the applicability of future air quality predictions based on synoptic system forecasts. Environ Pollut. 2012;166:65–74.
Article Google Scholar
Reikard G. Atmospheric Environment: X volcanic emissions and air pollution : forecasts from time series models. Atmos Environ X. 2019;1:100001.
Google Scholar
Ketseas D. Stochastic response of an Airfoil and its effects on Lco’s Behavior under Stall Flutter Regime. Int J Math Stat Comput Sci. 2024;2:168–72.
Kapoor NR, Kumar A, Kumar A, Kumar A, Mohammed MA, Kumar K et al. Machine Learning-Based CO2Prediction for Office Room: A Pilot Study. Wirel Commun Mob Comput. 2022;2022.
Measures to Reduce Pollution Due to Stubble Burning. 2020. pp. 151–6.
All about stubble. burning, its alternatives and steps taken by Centre and state govts.
A look at. key govt initiatives to keep air pollution from vehicles under check.
Odd–even rationing. - Wikipedia [Internet]. [cited 2019 December 9]. https://en.wikipedia.org/wiki/Odd–even_rationing
What is Delhi’s new odd-even vehicle rule all about? Where did it come from? India Today. 2015.
Pollution in Delhi dips 62%. in one day, thanks to high wind speed and odd-even [internet]. [cited 2019 December 10]. https://www.indiatoday.in/diu/story/pollution-in-delhi-dips-62-in-one-day-thanks-to-high-wind-speed-and-odd-even-1615717-2019-11-04
Population Control [Internet]. [cited 2022 June 7]. https://pib.gov.in/newsite/PrintRelease.aspx?relid=194837
India launches. Green Crackers in its bid to curb air pollution [internet]. [cited 2022 June 7]. https://pib.gov.in/newsite/PrintRelease.aspx?relid=193646
No Firecracker, Only Green Crackers This Diwali. : How These States Taking Measures to Control Air Pollution [Internet]. [cited 2022 June 7]. https://www.india.com/news/india/no-firecracker-only-green-crackers-this-diwali-how-these-states-taking-measures-to-control-air-pollution-5072307/

Download references

Acknowledgements

We thank the anonymous reviewers for their insightful suggestions, which greatly enhanced the quality of this article.

Funding

This work was supported in part by the National Science and Technology Council under Grants NSC 111-2410-H-167-005-MY2 and NSC 112-2634-F-005-001-MBK.

Author information

Authors and Affiliations

Chitkara University Institute of Engineering and Technology, Chitkara University, Rajpura, Punjab, India
Meenakshi Malhotra & Savita Walia
Department of Information Computer Science and Information Engineering, National Chin-Yi University of Technology, Taichung, 411, Taiwan
Chia-Chen Lin
University Institute of Engineering and Technology, Panjab University, Chandigarh, India
Inderdeep Kaur Aulakh
Department of Information and Communication Engineering, Yeungnam University, Gyeongsan, 38541, Republic of Korea
Saurabh Agarwal

Contributions

Conceptualization, M.M., S.W., C.L., I.K.A., and S.A.; writing – original draft preparation, M.M., S.W., C.L., I.K.A., and S.A.; writing – review and editing, M.M., S.W., C.L., I.K.A., and S.A. All authors have read and agreed to the published version of the manuscript.

Corresponding authors

Correspondence to Chia-Chen Lin or Saurabh Agarwal.

Ethics declarations

Institutional review board statement

Not applicable.

Informed consent

Not applicable.

Competing interests

The authors declare no competing interests.

Additional information

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.

About this article

Cite this article

Malhotra, M., Walia, S., Lin, CC. et al. A systematic scrutiny of artificial intelligence-based air pollution prediction techniques, challenges, and viable solutions. J Big Data 11, 142 (2024). https://doi.org/10.1186/s40537-024-01002-8

Received: 10 June 2024
Accepted: 22 September 2024
Published: 09 October 2024
DOI: https://doi.org/10.1186/s40537-024-01002-8
