Estimating the carbon content of oceans using satellite sensor data

The impact of chemical processes in ocean surface waters is far-reaching. Increased significance has recently been placed on the concentration of Carbon and its compounds, and on the effects these may have on climate change. Remote sensing enables near real-time measurement of key sea-surface data from which Carbon levels can be estimated. We illustrate this using hybrid satellite sensor data. To validate our results, we use data collected from cruise ships as the ground truth when training our algorithms. The error rate of our predictor is found to be small, and hence the proposed approach can be used to estimate Carbon levels in any ocean. This work improves upon previous research in several ways, including the use of sea-water salinity as a proxy for Carbon estimates. Binary combinations of typically unary predictor attributes are used to predict the Carbon content of surface water, and an inherently non-linear model is used to quantify the relationship.

Sooknanan and Hosein Journal of Big Data (2022) 9:93
The magnitude and rate of anthropogenic Carbon sequestered by the oceans has exceeded the extent of variation due to natural sources for the past millennium [21], which further indicates oceanic saturation. The longest record of in-situ Carbon measurements began in 1960 and shows that the current rate of increase is as much as 30 times faster than in preindustrial times [10]. Fossil fuels and cement production account for approximately 48% of the world's global Carbon emissions [52]. In addition, deforestation, industrialization and land-use changes have led to an unprecedented increase in Carbon emissions over the past 200 years [15].
Ocean acidification is the change in ocean chemistry driven by the oceanic uptake of Carbon [15]. It is characterised by a series of chemical reactions initiated when CO2 is absorbed by seawater. The dissolved CO2 dissociates into Bicarbonate and Carbonate ions, resulting in an increase in hydrogen ion concentration. According to Körtzinger [28], ocean acidity has increased by 26% since the start of the industrial revolution. Additionally, the atmospheric-oceanic carbon concentration gradient (the difference between the Carbon concentration in the atmosphere and in surface layers of the ocean) is likely to further affect climate change and warming scenarios [29]. Ocean acidification has significant negative impacts on fundamental bio-ecological ocean processes [30,52]. More alarmingly, past extinction events have been linked to ocean acidification [15], and the current rate of change in seawater chemistry is unprecedented [52].
The ability to quickly establish a baseline measurement for the Carbon content of large oceanic bodies is crucial in determining the levels of acidification, in addition to assessing the rate of temperature rise in oceanic surface waters. However, there remain significant gaps in our ability to consistently and reliably estimate the carbon content of various sources and sinks, such as oceans [32]. Prediction and detection of carbon sinks are important issues with implications for all of humankind [21].
Laboratory measurements are the gold standard for assessing the carbon content of seawater; however, research vessel time is costly and limited in coverage [31]. Remote sensing has the advantages of being fast, effective and near real-time [46]. The recent development of wide-band satellite imaging sensors has resulted in large quantities of high-resolution imagery becoming available [64]. Orbiting platforms additionally have the advantages of a large observation range and a high observation frequency [62], surpassing those of alternative techniques such as in-situ buoys and marine vessels.
Remote sensing is yet to be fully exploited and has significant potential for providing extensive global measurements. However, there are currently no orbiting platforms capable of directly measuring oceanic Carbon levels. According to Tollefson [59], multiple technical and political issues plague the development and launch of additional Carbon-monitoring instruments. This is evidenced by the multiple studies [3,11,20,25,26,45,50,65] outlining various approaches to developing proxies and underscoring the challenges of oceanic measurement from space.
Surface-water Carbon Dioxide is influenced by thermodynamic and biological factors and is adequately represented by the partial pressure of Carbon Dioxide, pCO2, at the surface. The most significant influences arise from physical mixing processes driven by sea surface temperature (SST), sea surface salinity (SSS), chlorophyll-a (Chl-a), mixed layer depth (MLD), colored dissolved organic matter (CDOM), net primary productivity (NPP), photosynthetically active radiation (PAR), wind speed and other factors.
Although sea-surface measurements may not fully encompass biological processes, observations at the surface are relevant proxies for oceanic Carbon content since changes in carbonate chemistry due to atmospheric CO2 occur at the surface first [31]. Thus, remote-sensing data derived via orbiting platforms hold great potential as a tool for monitoring changes in oceanic chemistry.
The remainder of this paper is organized as follows. In Section II, we review related research in the field of earth observation and previous approaches to quantifying carbon levels, with special focus on how machine-learning methods facilitate the modelling of carbon levels. A description of the datasets used and the choice of carbon proxy is given in Section III. Section IV highlights the model-development process, with particular focus on the choice of loss function. Section V details the results of model development and training, and presents comparisons using permutations of predictor variables across three different models. Section VI underscores the scientific implications of our work, in addition to a few limitations and potential future work. Finally, our conclusions are presented in Section VII.

Related work
In the past decade, a large number of new earth-observation orbiting platforms have been launched. As such, much effort has been placed on utilizing the unique, superior viewpoint of orbiting platforms for observation of both natural and human phenomena. In the following, we focus on works targeting large-scale oceanic measurements. In particular, we investigate works that do not derive oceanic carbon-related parameters (strictly) from first principles. Instead, we analyse works that employ data-driven approaches and statistical techniques to infer or derive patterns from observations.
To examine the oceanic carbon content, Poli et al. [48] aimed to calculate the constants for the dissociation of carbonic acid. These constants define the tendency of a higher-order molecule (such as Carbon Dioxide) to decompose into its constituents (ions of Hydrogen and various carbonates) and are typically a function of pressure and temperature. The partial pressure of CO2 in ocean surface waters was then determined from Dissolved Inorganic Carbon (DIC) and Total Alkalinity. These were used to validate previous measurements of the constants of dissociation. However, Poli et al. [48] concluded that the optimal choice for these constants is subject to significant variability. Therefore, the usage of any one set of previously defined measurements for such derivations at a large scale was not recommended. This further justifies our usage of a data-driven approach, as the physiochemical relationships vary depending on several inter-dependent parameters. We can instead measure statistical significance by means of inference, with a higher degree of confidence for a given region than the alternative first-principles derivation.
Bates et al. [2] modelled seawater carbonate chemistry in the North Atlantic Ocean using in-situ measurements collected from Hydrostation S sites and a number of cruises. These measurements were collected at a minimum once per month since 1983 and were analysed for Dissolved Inorganic Carbon (DIC) by a variety of methods. The resulting DIC values were then used to establish a time-series for the region. However, according to Bates et al. [2], the data collected were heavily biased towards spring-time conditions owing to a sampling bias. This was evidenced by an apparent decrease in sea-surface temperature in the in-situ measurements, which does not agree with other independent studies [27,57]. From this work, it was made clear that in-situ measurements must cover at least an annual cycle to overcome seasonal sampling biases.
Zui et al. [66] compared models for Dissolved Inorganic Carbon at the ocean surface using both satellite and in-situ data. Specifically, the Moderate Resolution Imaging Spectroradiometer (MODIS) array of satellite sensors was used to establish a relationship with DIC measurements at two point locations over the course of 9 years. Zui et al. [66] did not derive a purely unknown relationship between oceanic parameters and DIC, but rather compared earlier models' performance against new data. In addition, their work regarded validation using in-situ measurements as the most precise method for confirming relationships found in their observations. This was a key theme in multiple published works, as the ability to directly derive sea-surface chemical parameters via remote sensing is not yet possible.
Dixit et al. [9] analysed the partial pressure of Carbon Dioxide pCO2 at the air-sea interface. A single autonomous system was deployed at 15°N 90°E in order to collect in-situ measurements of pCO2. A linear, large-margin separation model was shown to estimate the relationship between SST and Salinity more accurately than a multiple-linear-regression (MLR) model. According to Krishna et al. [29], the influence of Sea-Surface Salinity (SSS), Sea-Surface Temperature (SST) and Chl-a on pCO2 varies depending on the domains of SST, SSS and Chl-a. Sabia et al. [51] therefore justified the development of a multi-parametric (bracketed) non-linear model. It was shown that DIC values can be parameterized by Chlorophyll-a, Sea-Surface Salinity and Sea-Surface Temperature. However, the measurement values for Chlorophyll-a were not verified with in-situ data. Additionally, notwithstanding the use of multiple ranges for input attributes, accuracy was not comparable to that of previous regional models for CO2 [51].
Fugacity is the surface signature of ocean acidity, dynamics, and bio-geochemistry [36]. Liu W. Timothy [36] developed a statistical model based on a linear Support-Vector Machine using Chlorophyll and Sea-Surface Temperature for predicting surface-level oceanic carbon content. Data was sourced from NASA's MODIS satellites (Aqua and Terra) at a resolution of 0.25°. Owing to the large data gaps on these platforms, however, significant smoothing and averaging was done in order to extrapolate point measurements from the satellite data.
Similarly, Liu and Xie [35] modelled Carbon Dioxide partial pressure at the ocean surface. However, according to the authors, their choice of model may not have fully captured the desired relationships, owing to a relatively low temporal resolution (250,000 over 8 years) and the usage of a linear kernel. Additionally, large gaps existed in the satellite mapping, which was interpolated heavily to compensate for lower-resolution images (as compared to what is currently available). Therefore, using their model for a specific region requires additional training with recent, region-specific in-situ measurements [35]. Moreover, according to the authors, sufficient salinity measurements were not available at the time of writing (similar to Liu W. Timothy [36]) at sufficient spatio-temporal resolutions. This resulted in a lack of trend in the model predictions, where predicted levels of Carbon Dioxide at the ocean surface did not follow the general increase evidenced by in-situ measurements.
Following from the aforementioned, this work seeks to utilize recent advances in remote-sensing and integrated data sources to provide a Carbon model for the Caribbean region. This model exploits the relationship [58] between surface-level Carbon content and ocean-surface temperature.
Additionally, this work improves upon previous approaches (such as Liu W. Timothy [36]) through the use of sea-surface-salinity data and in-situ data of significantly higher spatial density. Finally, this work utilizes an inherently non-linear method to robustly model the aforementioned relationships (Table 1).

Description of datasets
The HYCOM dataset
HYCOM is a data-assimilative hybrid isopycnal-sigma-pressure model and part of a multi-institutional effort sponsored by the National Ocean Partnership Program (NOPP) as part of the US Global Ocean Data Assimilation Experiment (GODAE). Within non-extreme latitudes (non-polar regions between 80.48°N and 80.48°S), HYCOM data are available on a standardized grid.
Data is assimilated from a combination of remote-sensing platforms GFO [1], ENVI-SAT [37] and Jason-1 [39] which provide information on space-time variability of surface-wind stress, temperature and specific humidity. Vertical profiles from eXpendable BathyThermographs, Argo floats and Conductivity-Temperature-Depth sensors enhance subsurface variability mapping. However, these profiles are typically too sparse to be used by themselves [5].
The resolution of HYCOM data is 0.08 arc-degrees (approximately 9 km square). The main advantage of the HYCOM data is its indexing of ocean-surface parameters by means of z-coordinates. z-coordinates index ocean depth at standard levels and allow for a smooth transition from upper-ocean to deep-ocean layers. This results in an ease of comparative computations at differing sea-depths. Additionally, there is currently no single orbiting platform with the necessary spatio-temporal density for inference of parameters which define latent relationships between observed variables [14,42,43]. The closest comparable earth-observing platforms are NASA's Orbiting Carbon Observatories (OCO-2 and OCO-3). However, these platforms are targeted at measuring atmospheric carbon levels and do not target the intended, ocean-based carbon. HYCOM data contain oceanic parameters indexed by z-coordinate, latitude, longitude and reference date/time. For this work, the z-coordinate (depth), location (latitude and longitude), sea water salinity s_i and sea water temperature t_i are used, where the subscript i ∈ {0, 2} indicates depth in meters. Table 2 shows latitude, longitude, temperature (°C) and salinity (psu) measurements from the HYCOM data for the uppermost z-levels (0 m and 2 m respectively). In Fig. 1 we provide the mean salinity and mean temperature for the Caribbean region at a depth of 0 m.

NOAA labelled data
The Ocean Chemistry and Ecosystems Division (OCED) of the National Oceanic and Atmospheric Administration (NOAA) focuses on understanding the ocean's role within the context of the global environment. Automated systems for pCO2 measurement were installed by the NOAA on cruise ships of Royal Caribbean International Cruises and subsidiaries (Wanninkhof et al. [60]). This system provides measurements of multiple ocean water parameters beginning in 2002 and continuing to the present. The instruments consist of equilibrators, a condenser, a water flow meter, drying tubes and additional equipment for analysing the output of the equilibrator [47]. For this study, data from the Allure of the Seas was used for the period 2019-2020 within the Caribbean region, covering a range of latitudes beginning at 11.677°N. The following parameters were recorded:
- Mole fraction of CO2 in the equilibrator headspace (dry) at equilibrator temperature
- Mole fraction of CO2 measured in dry outside air
- Mole fraction of CO2 in outside air associated with each water analysis (ppm)
- Barometric pressure in the equilibrator headspace
- Barometric pressure measured outside
- Water temperature in the equilibrator
- Sea surface temperature
- Sea surface salinity

The mean values for the above parameters were found such that the spatial resolution matched the HYCOM grids. Two flags in this data indicated nominal operation of the equilibrator. Values out of range (negative) and other anomalous instances were removed by filtering against these quality flags. The latitude-longitude positions were then used to cross-reference the HYCOM data, resulting in the attributes above (specifically Fugacity) being indexed by spatial (latitude/longitude) and z (depth) coordinates. A sample of equilibrator attributes eq, atmospheric attributes atm and sea attributes sea is given in Table 3. All pressures p are measured in standard hPa, temperatures t are given in degrees Celsius and Carbon-Dioxide/Fugacity values f are given in units of micro-atmospheres.
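The flag-based filtering and gridding steps described above can be sketched as follows. The record fields, the flag value taken to mean nominal operation, and the grid-cell keying are illustrative assumptions, not the actual NOAA data schema:

```python
# Hedged sketch: cleaning equilibrator records against quality flags and
# binning them onto a 0.08-degree grid to match HYCOM resolution.
from collections import defaultdict

GRID = 0.08  # HYCOM grid spacing in degrees

def clean_and_grid(records):
    """Keep records whose quality flag marks nominal equilibrator operation
    and whose fugacity is in range, then average fCO2 per grid cell keyed
    by (lat_index, lon_index)."""
    cells = defaultdict(list)
    for rec in records:
        if rec["flag"] != 2:          # assumed: flag value 2 == nominal operation
            continue
        if rec["fco2"] < 0:           # negative values are sensor anomalies
            continue
        key = (round(rec["lat"] / GRID), round(rec["lon"] / GRID))
        cells[key].append(rec["fco2"])
    return {k: sum(v) / len(v) for k, v in cells.items()}

obs = [
    {"lat": 12.003, "lon": -61.702, "fco2": 401.5, "flag": 2},
    {"lat": 12.001, "lon": -61.698, "fco2": 402.5, "flag": 2},
    {"lat": 12.001, "lon": -61.698, "fco2": -999.0, "flag": 2},  # anomalous value
    {"lat": 13.500, "lon": -60.100, "fco2": 405.0, "flag": 3},   # bad flag
]
gridded = clean_and_grid(obs)  # one cell, mean of the two valid records
```

The two valid records fall in the same 0.08° cell and are averaged; the anomalous value and the bad-flag record are dropped.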

Carbon fugacity
In order to measure the air-sea exchange of gases, the partial pressure of gas in the ocean surface is first determined. The concentration of gas in the equilibrator headspace (gaseous) is directly proportional to the concentration in the equivalent volume of seawater (liquid), parameterized by temperature and salinity [24], and so C_w = α C_e, where C_w is the seawater concentration, C_e is the equilibrator headspace concentration and α is the temperature- and salinity-dependent proportionality factor.
The instrumentation installed in the ships of Royal Caribbean measures the mole fraction of CO2 in dry air, which is converted to fugacity by correcting for non-ideal gas behaviour and the water vapour level [22]. An alternative approach to determining the air-sea flux uses the difference in partial pressures between the air and sea CO2, F = k_g (pCO2,sea − pCO2,air). However, the wind-speed-dependent gas exchange coefficient k_g is not precisely known, and there exists a discrepancy between the global mean values obtained by the two different methods [56]. In Fig. 2 we show the relationship between Carbon Fugacity and temperature (left) and salinity (right) at 0 m.
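The partial-pressure-difference formulation of air-sea flux can be sketched as below. The sign convention and the numeric values are assumptions for illustration only, since the gas exchange coefficient itself is not precisely known:

```python
def air_sea_flux(pco2_sea, pco2_air, k_g):
    """Air-sea CO2 flux from the partial-pressure difference.
    Under the sign convention assumed here, positive values indicate
    outgassing (sea -> air); k_g is the wind-speed-dependent gas
    exchange coefficient, which is not precisely known."""
    return k_g * (pco2_sea - pco2_air)

# Illustrative values: supersaturated surface water relative to the air.
flux = air_sea_flux(pco2_sea=420.0, pco2_air=410.0, k_g=0.05)
```

With these hypothetical inputs the ocean is a net source, so the flux is positive.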

Methods
A supervised learning algorithm models the implicit relationship existing in labelled data by means of a set of equations. Given a set of labelled data D, a supervised learning algorithm aims to learn the relationship between input attributes X and an output attribute y in order to predict the output ŷ given previously unseen X. Supervised learning algorithms receive feedback from a loss function, which quantitatively informs how closely the model matches the relationships within the data [54]. In our case, the output attribute y is the fugacity of the partial pressure of Carbon Dioxide, which is dependent on the input attributes as determined by a feature-selection process. We use a gradient-boosting regression tree, which provides a continuous-valued output as a non-linear function of its input attributes.

Decision-tree regression
The supervised gradient-boosting regression tree (GBRT) iteratively defines a series of mappings from a labelled training dataset and progressively refines itself by means of an explicit loss function [41]. The GBRT learns an input-output relation by means of a series of conditional thresholding operations. It is able to break down a complex decision-making process into a collection of simpler decisions, thus providing a solution which is often easier to interpret [53]. In other terms, at each stage in the regression tree, the output domain is refined by means of an information criterion. The stochastic Gradient Boosted Regression Tree is adaptable, easy to interpret and produces highly accurate models [61]. The stochastic GBRT starts with a single decision tree and iteratively appends new trees based on their performance with respect to some objective function.
The heuristic used to define branching decisions generates splits based on the distribution of input attributes, and greedily selects new decision trees f based on an objective function Z = Σ_i l(y_i, ŷ_i) + Σ_k Ω(f_k). The regularization term Ω is included to prevent over-fitting [63] by penalizing tree coefficients. In order to offset the overhead of the Gradient Boosted Decision Tree (GBDT) approach on large datasets, the NVIDIA RAPIDS [18] and XGBoost [6] libraries were used on a GPU.
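The stage-wise tree-appending procedure can be illustrated with a minimal pure-Python sketch in which each stage fits a single-split "stump" to the current residuals under a squared-error objective. This is only a teaching sketch: it omits the regularization term, the stochastic subsampling and the GPU acceleration of the actual XGBoost implementation used here:

```python
# Minimal stage-wise gradient boosting for regression with squared error:
# each stage fits a one-split stump to the current residuals and is added
# with a learning rate.

def fit_stump(x, residuals):
    """Find the single threshold on x that best reduces squared error."""
    best = None
    for thr in sorted(set(x)):
        left = [r for xi, r in zip(x, residuals) if xi <= thr]
        right = [r for xi, r in zip(x, residuals) if xi > thr]
        if not left or not right:
            continue
        lm, rm = sum(left) / len(left), sum(right) / len(right)
        sse = sum((r - lm) ** 2 for r in left) + sum((r - rm) ** 2 for r in right)
        if best is None or sse < best[0]:
            best = (sse, thr, lm, rm)
    _, thr, lm, rm = best
    return lambda xi: lm if xi <= thr else rm

def boost(x, y, n_stages=20, lr=0.5):
    """Start from the mean prediction and iteratively append stumps
    fitted to the residuals of the current ensemble."""
    base = sum(y) / len(y)
    stumps = []
    for _ in range(n_stages):
        pred = [base + lr * sum(s(xi) for s in stumps) for xi in x]
        residuals = [yi - pi for yi, pi in zip(y, pred)]
        stumps.append(fit_stump(x, residuals))
    return lambda xi: base + lr * sum(s(xi) for s in stumps)

# Toy data: a step function the ensemble should recover closely.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [10.0, 10.0, 10.0, 20.0, 20.0, 20.0]
model = boost(x, y)
```

Each stage halves the remaining residual on this toy step function, so the ensemble converges quickly to the two plateau values.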

Loss function
A loss function is used to quantify the performance of a given learning algorithm. Many loss functions have been proposed for supervised learning; however, the Mean-Absolute Error (MAE) and Root Mean-Squared Error (RMSE) have become popular in the field of geosciences [4]. The MAE metric assigns equal weighting to all errors (known as l1 optimization). If we denote ground-truth values as y and predicted values as ŷ, the MAE is given by MAE = (1/n) Σ_i |y_i − ŷ_i|, so the loss is directly proportional to the magnitude of divergence of the predicted value ŷ from the ground-truth label y. The Root Mean-Squared Error attempts to penalize variance in output predictions by squaring the error term from the MAE (also known as l2 optimization): RMSE = √((1/n) Σ_i (y_i − ŷ_i)²). This results in unequal weighting of error terms, where larger errors are more heavily penalized than smaller ones. While the MAE and RMSE have been standard metrics of model performance for many years, there is no consensus on the most appropriate metric for modelling errors [4]. Additionally, measures based on the sum-of-squares calculation do not describe average error alone; the distribution of error magnitudes becomes more variable in a non-monotonic fashion with increasing error [4].
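The two metrics can be sketched directly from their definitions; the numeric values below are purely illustrative:

```python
import math

def mae(y, y_hat):
    """Mean Absolute Error: equal (l1) weighting of all residuals."""
    return sum(abs(a - b) for a, b in zip(y, y_hat)) / len(y)

def rmse(y, y_hat):
    """Root Mean Squared Error: squaring (l2) penalizes large residuals
    more heavily than small ones."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(y, y_hat)) / len(y))

y = [400.0, 410.0, 420.0]      # illustrative ground-truth fugacities
y_hat = [402.0, 410.0, 416.0]  # illustrative predictions
```

For the same residuals, RMSE exceeds MAE whenever the errors are unequal, reflecting its heavier penalty on the larger residual.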
As a result, the Huber loss is used to optimize the regression tree. The Huber loss behaves quadratically for small residual errors and linearly for large residual errors [19]. For a residual r = y − ŷ it is given by L_α(r) = r²/2 for |r| ≤ α, and L_α(r) = α(|r| − α/2) otherwise. This is equivalent to minimizing the Kullback-Leibler divergence [40] and is hence used in robust regression to take advantage of the desirable properties of both l1 and l2 penalties. The Huber loss is regulated by the hyper-parameter α > 0. For absolute residuals smaller than α the corresponding distribution resembles the normal distribution, while for values outside this range it resembles the Laplace distribution. This is the equivalent of a data-defined transition from a quadratic to an absolute-valued function [40]. This affords the Huber loss a significant advantage in managing outliers (common in remote-sensing data) when compared to the MAE and RMSE.
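The piecewise definition, quadratic inside the threshold α and linear outside it, can be sketched as:

```python
def huber(residual, alpha=1.0):
    """Huber loss: quadratic for |residual| <= alpha, linear beyond,
    so outliers (common in remote-sensing data) are not over-penalized.
    alpha is the transition hyper-parameter described in the text."""
    r = abs(residual)
    if r <= alpha:
        return 0.5 * r * r
    return alpha * (r - 0.5 * alpha)
```

The two branches agree at |r| = α (both give α²/2), so the loss and its derivative are continuous across the transition.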
This loss function is used to guide and actively correct the learning of the stochastic GBRT during training, and is additionally used to validate the stochastic GBRT during testing. A loss function concisely quantifies the performance of a machine-learning model, and its derivative quantifies the magnitude of correction required for the model parameters. During training, the model is evaluated using the loss function and updated based on the magnitude of the errors made. This process is complete when the model's increase in performance (or change in parameters) does not exceed some predefined threshold.
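The stopping rule described above can be sketched as follows; `train_step` and `val_loss` are hypothetical stand-ins for the actual training and validation code, not part of the paper's implementation:

```python
# Stop training once the improvement in validation loss falls below a
# predefined threshold (or a maximum iteration budget is exhausted).
def train_until_converged(train_step, val_loss, tol=1e-4, max_iters=1000):
    prev = val_loss()
    for i in range(max_iters):
        train_step()
        cur = val_loss()
        if prev - cur < tol:   # improvement too small -> converged
            return i + 1, cur
        prev = cur
    return max_iters, prev

# Dummy training process whose validation loss halves every step.
state = {"loss": 1.0}
def step(): state["loss"] *= 0.5
def val(): return state["loss"]

iters, final = train_until_converged(step, val, tol=1e-4)
```

With the loss halving each step, the improvement drops below 1e-4 on the 14th iteration, at which point training stops.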

Data selection
Water temperature and Sea-Surface Salinity at depths of 0 m, 2 m, 4 m, 6 m and 8 m from the HYCOM platform were used for regression [55]. The target data are Fugacity values (fCO2) for ocean surface water from the NOAA data. Binary interaction attributes were derived from the single-attribute values by means of multiplicative combination, for a total of 32+5 predictor variables. In order to determine an optimal set of predictor attributes, statistical p-value testing was used to determine the significance of the relationship between predictor attributes and the target data. A lower p-value indicates a decreased probability of the observed relationship occurring by pure chance, and thus stronger evidence of a non-trivial relationship between a predictor attribute and the target. Both unary and binary predictor attributes were explored for the prediction task. The significance level α = 0.05 was used as a threshold to filter non-predictive attributes in both the unary and binary cases. Stage-wise variable selection was performed by means of Ordinary Least Squares Regression, resulting in the predictor attributes in Tables 4 and 5, with feature importance for unary predictors displayed in Fig. 3.
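Deriving binary interaction attributes by multiplicative combination can be sketched as below. The attribute names follow the s_i/t_i convention used earlier, but the values and naming scheme are illustrative assumptions:

```python
# Expand a sample with pairwise multiplicative interaction attributes.
from itertools import combinations

def add_interactions(sample):
    """Return the sample extended with the product of every attribute pair."""
    out = dict(sample)
    for a, b in combinations(sorted(sample), 2):
        out[f"{a}*{b}"] = sample[a] * sample[b]
    return out

# Illustrative temperature (deg C) and salinity (psu) at 0 m depth.
row = {"t0": 28.1, "s0": 35.9}
expanded = add_interactions(row)  # adds the binary attribute "s0*t0"
```

With n unary attributes this yields n(n−1)/2 binary candidates, which are then filtered by the p-value test described above.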

Model selection
Note that we initially investigated three models: Linear Regression, a Gradient-Boosted Regression Tree and a Deep Neural Network consisting of 15 fully-connected layers. These models were trained using GPU libraries and evaluated with the Huber loss function. Since the GBRT model performed the best, we focused solely on that approach. However, in order to demonstrate the differences, we provide MSE values for the three approaches in Table 6. The GBRT was implemented using the XGBoost library and NVIDIA RAPIDS on a GTX 1060M GPU. Enumerative grid search was used to tune the parameters of the gradient-boosted tree. After 1000 iterations for each combination in the parameter subspace, the optimal values listed in Fig. 7 were obtained.
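An enumerative grid search of this kind can be sketched as follows. The parameter names and the toy scoring function are assumptions for illustration; in the actual setup the score would come from training the GBRT and evaluating the Huber loss:

```python
# Exhaustively evaluate every combination in a hyper-parameter grid and
# keep the combination with the lowest loss.
from itertools import product

def grid_search(param_grid, score):
    """Evaluate every combination in the grid; return the best
    (lowest-loss) parameter dict and its score."""
    best_params, best_score = None, float("inf")
    keys = sorted(param_grid)
    for values in product(*(param_grid[k] for k in keys)):
        params = dict(zip(keys, values))
        s = score(params)
        if s < best_score:
            best_params, best_score = params, s
    return best_params, best_score

grid = {"max_depth": [3, 5, 7], "learning_rate": [0.05, 0.1]}
# Dummy loss surface minimized at max_depth=5, learning_rate=0.1.
best, loss = grid_search(
    grid, lambda p: (p["max_depth"] - 5) ** 2 + (p["learning_rate"] - 0.1) ** 2
)
```

Because the search is enumerative, its cost is the product of the grid sizes, which is why the paper restricts it to a parameter subspace.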

Model validation
Ground-truth in-situ measurements were used to validate our model. This target data consisted of fCO2 in units of micro-atmospheres (µ-atm). The domain of values for fCO2 as measured by the equilibrator (for all valid measurements, as indicated by the quality flag) was [397.68, 492.46], with a mean of 400.73 and a standard deviation of 19.78. Roughly one-third of the training data was reserved for model validation. The model was trained using cross-validation and evaluated using the Huber loss function. Using the selected attributes and the optimal GBRT, the Huber loss for fCO2 was found to be small. As shown in Fig. 4, the usage of binary interaction variables significantly improved the residual distribution of our model's predictions: the residual (error) terms for binary interaction variables are zero-centred with little spread. This is desirable, particularly when compared against the case of unary predictors, whose wider, inconsistent spread (high variance, an undesirable property) leads to inconsistent model behaviour. In contrast, the symmetric, narrow residual distribution of our model with interaction variables is indicative of consistent model performance (low variance).
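Reserving roughly one-third of the labelled data for validation can be sketched as below. The fixed seed and the plain random shuffle are illustrative choices, not necessarily those used in the paper:

```python
# Hold out roughly one-third of the labelled samples for validation.
import random

def holdout_split(data, val_fraction=1 / 3, seed=0):
    """Shuffle indices reproducibly, then split into train/validation."""
    idx = list(range(len(data)))
    random.Random(seed).shuffle(idx)
    n_val = int(len(data) * val_fraction)
    val = [data[i] for i in idx[:n_val]]
    train = [data[i] for i in idx[n_val:]]
    return train, val

samples = list(range(30))          # stand-in for labelled (X, y) pairs
train, val = holdout_split(samples)
```

The split is disjoint and exhaustive, so every labelled sample is used exactly once, either for fitting or for validation.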

Scientific implications
Traditional remote-sensing (satellite-based) approaches do not measure ocean Carbon levels directly. In this work, we investigate the impact of sea-surface salinity and sea-surface temperature on the fugacity of Carbon Dioxide, fCO2, at the surface layer of the ocean. This work confirms that SST at 0 m is the primary unary predictor of fCO2 [35,58].
This work does not rely on multiple stationary in-situ installations for the purpose of making predictions. Therefore, the model developed in this work can be applied to sea-surface areas where in-situ data is not currently available. Moreover, the periodic nature of satellite observations enables our model to serve as the basis for spatial and temporal analysis. Similarly, our work may be used for the discovery and quantification of both Carbon sources and sinks in the open ocean. In this way, the model developed may be used to find novel Carbon sinks in the open ocean and quantify the rate of sequestration over time, without the need for earth-based, surface-level measurements.
The increased longitudinal spread of the in-situ data used greatly decreases the influence of coastal anomalies on our model's derivation. This work describes the methods by which remote-sensing data can be used to indirectly estimate surface oceanic carbon content. Moreover, the ability of remotely-sensed data sources to operate during anomalous weather conditions on a regular, periodic basis establishes a non-trivial advantage when compared to earth-borne surface methods. As of writing, there is no consensus on the relationship between SST, SSS and fCO2. In order to account for annual weather events, the data period captured an entire annual cycle, thereby controlling for seasonal oceanic events. This enables our model to be used year-round in non-extreme latitudes, notwithstanding the non-availability of in-situ cruise-ship measurements.
The low variance and error terms observed from our model are validated using in-situ surface data. The validation described in Section 5.3 above shows accuracy greater than that achieved by the similar global models of Liu W. Timothy [36] and Krishna et al. [29]. Our work improves upon both of these approaches by incorporating SSS for all data. In addition, this work improves upon previous approaches through the consideration of binary factors of influence (see Table 5) and the application of an inherently non-linear model.

Some limitations and potential future work
The target area for this work was the Caribbean region and, as a result, the model is limited to non-extreme (non-polar) latitudes. Additionally, the large-scale, inter-sea mixing of waters is left for future investigation. Several studies have investigated the relationship between Chlorophyll-a surface concentrations and surface-level Carbon content. Moreover, the impact of wind speed and direction has been shown to influence CO2 flux at the sea-air boundary. Future work may explore the usage of multi-modal approaches [44], particularly in the field of deep learning [33], which has proven useful in handling heterogeneous inputs across multiple fields, including remote sensing [17,38,49]. Notwithstanding, the relationship between SST, SSS and fCO2 holds, as validated in Section 5.3. However, the relationship between surface winds and fCO2 at the ocean surface may explain outliers observed during the validation process [36].
The Orbiting Carbon Observatory-2 (OCO-2) and OCO-3 are NASA's first Earth-orbiting satellites dedicated to measuring atmospheric Carbon Dioxide. At the time of writing, these instruments are dedicated strictly to observing near-infrared CO2 and A-band molecular oxygen. However, future orbiting platforms may investigate the relationships between surface-level atmospheric concentration and the rate of transfer between air and ocean [35].

Conclusion
In this paper, a non-linear model is implemented for satellite data and validated using in-situ measurements in the Caribbean region. This model exploited the relationship between temperature, salinity and sea-surface Carbon concentration to remotely predict surface-water Carbon content. We found that limiting the scope of observations to non-extreme latitudes yielded lower losses when compared to global approaches.
We also found that the usage of binary predictor attributes significantly reduced prediction errors when compared to previous approaches. Additionally, the usage of an inherently non-linear model with the Huber loss function improved upon previous approaches that used linear penalised models. Finally, our results were validated using a rich source of in-situ measurements made available by the NOAA. Future work may seek to incorporate Chl-a and surface wind velocity, as well as other anthropogenic factors, into prediction models.