Reversible data hiding with segmented secrets and smoothed samples in various audio genres

In this age, information technology has grown significantly. Computer networks, which connect a device to others, have made it easier for people to transfer data than before. Moreover, smart devices have the capability of supporting this data transmission, including those in the cloud that may contain massive data. However, the security factor can be a severe issue if sensitive big data, such as military and medical data, do not have enough protection. Furthermore, an attacker may be able to disclose such data. Some algorithms have been introduced to solve that problem, one of which is the data hiding method. Nevertheless, some factors are still challenging, concerning the capacity of the secret data and the quality of the generated data, which are represented by bit and Peak Signal-to-Noise Ratio (PSNR), respectively. Besides, some techniques are not reversible, which means that they cannot reconstruct the carrier (cover). In this research, we investigate those problems by taking audio as the carrier. It is done by sampling the audio file before being interpolated to present spaces for accommodating the secret. Meanwhile, the secret is segmented before the embedding. Later, the embedded audio is smoothed according to the required level. The experimental result is obtained by using a public data set containing various audio genres and instruments, and 11 secret sizes, from 1 to 100 kb. It shows that the proposed method outperforms the others. This higher PSNR value means that the proposed method can generate more similar stego data; it also implies that at a certain quality level, the number of bits that can be hidden in the audio cover is higher than that of others.

to compromise some properties at once, such as in [5], where several security properties are illegally taken over. Moreover, advanced technology has made limited devices more popular than ever [6], which may attract attackers to access it [1].
In many cases, a system comprising big data has been targeted for attack, which may devastate their availability and integrity [2,5,7]. Various methods have been introduced to protect those security properties, specifically the confidentiality. For example, cryptography and steganography, which is also called data hiding, are widely implemented. In this paper, the terms steganography and data hiding are used interchangeably. Although the purpose of cryptography and steganography are the same, those two schemes have different characteristics. In the application level, they can be implemented in a system, such as in [8][9][10][11][12][13]. However, this combination affects its performance [14]. On its progress, it is also possible to use data hiding for forensics [15] or secret sharing [16]. In general, the concept of data or information hiding can be depicted in Fig. 1, where the secret (payload) is hidden in a cover (carrier) to produce a stego file.
The types of cover vary. It can be an image, video, or audio, where the last is less popular than those two other types [17]. Recently, the text has also been investigated [18] to complement the previous media. In fact, audio has unique characteristics to explore further. The algorithms being applied can be extensions and variations of either imageor video-based methods, such as Echo Data Hiding (EDH) [19] and Histogram Shifting (HS) [20,21] which is then extended in [22] by applying a statistical histogram. Other examples are Prediction Error Expansion (PEE) [23], Noncausal Prediction (NP) [24], and Difference Expansion (DE) along with its variations, such as Reduced Difference Fig. 1 The overall concept of audio-based data hiding Expansion (RDE) [25,26]. An extended method has been developed by implementing machine learning, which is claimed to be suitable for smart devices [27].
Some recent algorithms have been introduced to get effective results. For example, Jung and Yoo [28,29] propose the use of an interpolation algorithm for the embedding space. Next, Ahmad and Fiqar [30] explore the quality of the stego audio, which is then further refined in [31]. Previously, Andra et al. [32] focused on the protection of medical data, also by using audio, while Bobeica et al. [33] extended the concept of PEE.
Nevertheless, those existing methods still have challenging problems: the capacity of the secret, the quality of the stego, and the reversibility of both the secret and the cover. Capacity refers to the size of the data or the number of bits that can be accommodated by the cover. Data hiding methods may have different capacities when they are implemented to the same cover. Quality refers to the similarity level between the cover and the stego data. Like the capacity factor, the higher the quality, the better the method. As predicted, however, those two factors are typically inversely proportional. In this research, we work on these issues by designing an extended embedding and smoothing approaches in the audio environment. Here, an audio signal is first sampled and normalized before being embedded by the segmented secret. Next, the stego is smoothed to make it as similar to the cover as possible.
The remaining parts of the paper are organized as follows. In "Audio-based data hiding techniques" section, we present the earlier research on data hiding techniques. "The design of reversible data hiding" section describes how the proposed reversible data hiding method works. The experimental results and the corresponding analysis are provided in "Experimental results" section. Moreover, this proposed method is compared with related methods to evaluate its performance. Next, the discussion is presented in "Discussion" section. Lastly, the conclusion of this research is drawn in "Conclusions" section.

Audio-based data hiding techniques
As previously described, audio is not as popular as other media for carrying the secret [17]. Nevertheless, it has unique characteristics, which have the potential to be explored. Some state-of-the-art audio-based research can be described as below.
Depending on its specific purpose, data hiding is designed by considering the basic requirements of security, including its reversibility. In case only the secret can be extracted from the stego, the method is called irreversible; on the other hand, if both the secret and the cover can be reconstructed the same as their initial data, then the method is reversible data hiding (RDH). In most cases, RDH is crucial for fulfilling confidentiality and authentication requirements [34].
The size of the secret can be embedded in a cover is one of the challenges faced by RDH and also represents the possible types of data to protect. Therefore, most RDH methods take it as an essential consideration. To meet this requirement, [28,29] implement Neighbor Mean Interpolation (NMI) to expand the space. On the other hand, [29] takes various numbers of the bit to hide, while [28] sets it as a fixed number. Next, this feature is adopted in [30] and [31], where respectively Newton's Divided Difference Interpolation (NDDI) and linear interpolation methods are implemented.
The research in [28,29] pays more attention to the capacity of the secret. Therefore, the method does not do much to improve quality. Nevertheless, its stego quality is relatively good enough that ordinary users may not be able to distinguish noise from the original file. Andra et al. [32] refine this possible quality problem by modifying the intelligent partitioning method before the embedding stage. In this case, a 2D array of data is developed based on the sampled signal. It is done by arranging the audio samples in a particular order. This approach is different from the others, where the audio signal is treated as 1D. For the embedding, [32] employs improved RDE. Their experiment is performed by embedding several sizes of secret medical data into various audio covers. It is shown that their scheme works well.
Similar to [32], the research in [33] does not implement interpolation. Instead, it reserves some audio samples to carry the specific data, where the cross and dot areas are defined at the beginning. It needs to find appropriate threshold values, which are used for the embedding. This stage, however, may not be simple since those values depend on some other factors. Furthermore, as [32], the number of bits to conceal in each sample is static. The experiment shows that their stego quality is relatively stable.
Considering the stego quality, Ahmad and Fiqar [30] introduce the smoothing stage to reduce the difference between the stego and the cover. It is claimed that this scheme is able to work on it, which is proven in their experimental results. Nevertheless, its performance relies on the genre of the audio cover. It can also be an advantage, considering that a user may have a unique preference. In the next research, the noise in the stego audio is decreased by applying multiple-smoothing [31]. Furthermore, an additional step is also implemented, called a reducing step. In most conditions, this method generates a better stego audio.
Generally, this previous research is to be the base of our proposed method. Some stages are taken and refined to get better results. Initially, the NMI and NDDI interpolation algorithms, which are implemented in [28,29,30], respectively, have inspired us to use them. Nevertheless, the linear method employed in [31] potentially delivers an equivalent result. Therefore, we take this simpler algorithm in our design. Moreover, [32,33], which do not take the interpolation at all, may have a relatively lower capacity. Each sample's payload size in [32,33] is static, which is different from that of [30,31]. So, the payload's dynamic (non-static) size and the smoothing step have also been our focus in this research. Unlike [31,32], we integrate it with the reducing step. Briefly, we take a simple interpolation algorithm, improve each sample's dynamic capacity, and design the smoothing to include the reducing step. Accordingly, the extraction and reconstruction follow this embedding. More details of the proposed algorithm and its comparison with the baseline methods are given in "The design of reversible data hiding" section.

The design of reversible data hiding
As shown in Fig. 1, the secret data protection is carried out in two stages: embedding and reconstruction / extraction. These processes can be described as follows.

Embedding stage
In this study, the pre-processing of both the audio cover and the secret is carried out in parallel. Once those data are ready, the embedding stage is performed, followed by post-processing. This pre-processing step includes binarization and segmentation of the secret and providing an embedding space in each sample, while post-processing includes smoothing and combining samples for preparing the stego audio. Optionally, data compression can be applied to reduce its size. In more detail, this embedding process is depicted in Fig. 2, where its stages can be described as follows.  (1), where y j is the interpolating sample resulted from i th and (i + 1) th samples, where i and j are the index of original and interpolating samples, respectively. The y j is to be a pivoting point where the embedding in the next step will be done. After the interpolation, the vector of samples is s T = (y 1 , y 2 , y 3 , y 4 , ..., y n−2 , y n−1 , y n ) , whose illustration is depicted in Fig. 3. It is shown that the index of both the original and the interpolating audio samples are shifted.
3. Calculating sampling space. The number of bits that can be protected by each sample may not be the same. It depends on the characteristics of the cover as well as the sample itself. Based on the vector s T , firstly, we need to find the average of the difference between two consecutive samples of the original cover ( s o ) before being interpolated by using (2). It is worth noting that now there is a sample y i between y (i−1) and y (i+1) . Next, the multiplication factor, m, is determined by using (3), where b is the bit-depth of the sample, obtaining from the sampling process in step 1, which in this research is 16. Finally, the sampling space of each sample is found by designing (4), where c i is the space (maximum capacity) of the interpolating sample y i , generated based on the original samples y (i−1) and y (i+1) .
(1) To find the position of where the embedding should be carried out, it needs to locate the position of the sampling space of each corresponding sample, whether it is above or below the magnitude of the pivoting point. For this purpose, we define p i and recalculate c i whenever needed, by designing Algorithm 1. 4. Segmenting payload. Each sample may not fully take the maximum number of payloads. Some factors should be considered, such as the distribution across those interpolating samples, which affects the quality of the generated stego audio file [35]. For this reason, c i is further processed using (5), where N i is the number of bits hidden in the corresponding sample.
At this stage, the binary payload is segmented according to N i of i th sample. In case it is bigger than the payload to hide, then it is likely that the remaining interpolating samples are not used. Thus, each sample may hold different numbers of bits, which are then converted to decimal ( d i ). 5. Embedding. The embedding is performed by adding or subtracting the payload from the interpolating sample, relying on the embedding space's position p i . This process is depicted in (6). It can be seen that there is a difference between y i and y ′ i , which is resulted from this embedding process. In case the value of d i is relatively large, the quality of the audio dramatically drops.

6.
Smoothing. This step aims to reduce the difference caused by the embedding. To make the difference in the sample value between before and after embedding as small as possible, we design a smoothing step which can be formulated in (7), where y ′′ (i,j) is the interpolating sample after the j th smoothing is applied to the embedded sample. The smoothing coefficient (e) and the maximum number of smoothing instances (h) should be defined. A larger e means that less h is required to achieve the specified quality; consequently, less computation should be done, which is good. However, it needs more numbers to store for extraction purposes.
Along with reducing of the noise, a remainder (r) is taken by using (8). This value is then required for extraction. It is shown that its value relies on e. It needs to find its appropriate value, which is neither too low nor too high.
Combining samples. The smoothed-embedded sample y ′′ (i,h) is then put in s ′′ T . Therefore, the vector of samples is s ′′ T = (y 1 , y ′′ (2,h) , y 3 , y ′′ (4,h) , ..., y n−2 , y ′′ (n−1,h) , y n ) , which is then de-normalized to make it in the original range value. Its frame rate is now 88.2kHz. Let the binary secret be 1010000111. An example of this embedding stage can be given as follows. Here, the smoothing is only performed once for simplicity (h = 1) ; to achieve better quality, this smoothing step can be done several times. The embedding is applied to the first interpolating sample because its maximum capacity is higher than the number of bits.

Extraction stage
This stage aims to extract the payload and to reconstruct the cover as the proposed method is designed to be reversible. Parameter values are directly generated from the stego file, such as the sampling space obtained after the original samples have been split from the interpolating ones. This general process is presented in Fig. 4. Based on the received values, the reverse smoothing is performed by using (9). This process is done as many times as the number of smoothing instances in the embedding stage. Therefore, in the last iteration we have y ′′ (i,j) = y ′ i , which is a non-smoothing embedded sample.
The hidden payload of i th sample is obtained by using (10). Along with other payloads, this decimal value is then converted into binary, the number of digits of which is Reconstruction of the audio cover is performed after the odd-indexed samples have been de-normalized back to the original range values. The reconstructed audio is in *.WAV file whose sampling rate is 44.1 kHz. In general, the comparison between the proposed and other methods is summarized in Table 1.

Experimental results
As in other research, such as [29][30][31][32][33], the proposed method is evaluated based on the following factors. The first is the level of similarity between the cover and the generated stego files. As previously described, this is to be the quality of the produced stego audio measured by the Peak Signal-to-Noise Ratio (PSNR) that is represented in dB. It is calculated by using (11), whose Mean Squared Error (MSE) value is taken from (12). The constant MAX is specified by the maximum possible value of the sample. It is worth to note that audio samples are 1D, which is different from either image or video. So, the variables m and n are adjusted.
Overall, this PSNR value is likely to be affected by the second evaluation factor (i.e., the size of the secret message represented in bits). Therefore, those two factors should be calculated concurrently.

Evaluation factor
Research in [28,29] Research in [30] Research in [32] Research in [33] Method of [31] Proposed method Embedding For this measurement, we obtain a public data set of audio files from [36,37]. It comprises 15 files, consisting of 3 genres, each of which is made up of 5 instruments, as shown in Table 2. For the secret, we generate 11 files with various sizes, starting from 1 to 100 kb, whose detail is provided in Table 3. We believe that these numbers of bits reflect the actual need in the real environment. Furthermore, we implement those experimental data to other research [29][30][31][32][33], such that the comparison can be performed as fairly as possible.

Multiplier
As previously described in "Embedding stage" section, the embedding result is influenced by the multiplier m, whose value is dynamically determined by (3). According to the experimental results, we find that smaller m means a higher quality of the stego data. Therefore, in the next experiment, we set m statically. For this second  type of m, we take m ∈ {1, 2, 3, 4, 5, 6} , whose experimental result is given in Fig. 5 where e = 3 and h = 1 . In Fig. 5, it is depicted that by setting m = 1 , the generated stego data are the best; on the contrary, m = 6 produces the lowest PSNR. On the other hand, the capacity of the secret that can be embedded is lower. This pattern also works on other values of e and h. In evaluating the effect of this multiplier, we take e = 3 and h = 1 because those values produce the minimum quality of the stego. It means that it is very likely to obtain a better stego quality than what has been achieved in this subsection (see the next analysis in "Smoothing coefficient and smoothing level" section). On average, the PSNR is provided in Fig. 6. It describes that each audio genre has the same effects on the quality of the stego. The Audio3 (pop-rock_cello) should not be used if the quality is to be the concern, while Audio2 (classical_cello), Audio11 (classical_saxophone), Audio5 (classical_acoustic-guitar), Audio13 (country-folk_ voice) and Audio8 (classical_piano) are more appropriate to use. It is found that, in general, the classical genre produces better stego quality than the others. Nevertheless, as previously predicted, there is a trade-off between the quality and the capacity. That is, Audio3 is the best choice if relatively bigger data are to be protected. This trend also works on Audio12, Audio15, and Audio9, whose genre is pop-rock. The other genres may be applicable if a balance between the quality and the capacity is required.
In more detail, the experiment also shows that lower m causes more space to hold the secret in each sample. Consequently, embedding mainly occurs in fewer samples. It is different from higher m, which provides fewer spaces in each sample; this condition has caused the secret to be widely spread amongst samples. It can be inferred that the number of embedded samples affects the quality of the stego file. An example of this condition is illustrated in Fig. 7, which plots the maximum number of bits that can be embedded and the respective number of sampling spaces. This figure shows this characteristic that the capacity is inversely proportional to the number of spaces. On the contrary, the quality is proportional to the number of spaces.

Smoothing coefficient and smoothing level
The smoothing coefficient (e) is designed to be used in (7) in the embedding process. From the experiment, we find that the higher the e, the higher the PSNR. Nevertheless, as described in the previous section, it is inversely proportional to the capacity. For this reason, in the experiment, we determine the smoothing coefficient of e ∈ {1, 2, 3, 4, 5} , whose result is plotted in Fig. 8, where m = 1 and h = 1.
It is also depicted in that figure that for e = 4 and e = 5 , the PSNR of stego file generated from Audio13 is infinite when the payload is 1 kb. It means that it is the same as its original cover, which is an ideal condition. When the smoothing is performed three times ( h = 3 ) with e = 5 , the PSNR value of all stego files is infinite. It is shown that, at this point, increasing the e is useless because the best quality has been achieved. This ideal condition is also reached by e = 3 and e = 4 when the smoothing is implemented five and four times, respectively. For the number of smoothing less than those, only specific covers obtain the infinite value: Audio2, Audio11, Audio5, Audio13. It is consistent with the results, which have been discussed in "Multiplier" section.
It can be inferred that higher e takes fewer steps to reach the best condition, which is good. At a certain level, raising that value does not affect the quality. Therefore, finding an appropriate e is essential. It is also shown that increasing h instead of e can also improve quality. Some considerations of which parameter should be taken among them have been discussed in "The design of reversible data hiding" section. Next, the capacity of the embedding should also be a consideration. Furthermore, as predicted, the value of h is proportional to the PSNR, regardless of m and e, where a smoothing step may increase around 20-30 dB of PSNR.

Research comparison
For comparing with other research, we take some methods proposed by Bobeica et al. [33], Ahmad and Fiqar [30], Andra et al. [32], Ahmad et al. [31], and Jung and Yoo [29]. To make the comparison as fair as possible, we implement them such that we can evaluate the methods by using the same data, whose results are provided in Figs. 9, 10, and 11. In those figures, the smoothing is applied 1, 3, and 4 times, respectively, depending on their characteristics. Furthermore, in each graph, we show the performance of the proposed technique, which is evaluated without using the smoothing step, considering that not all compared algorithms are designed with the smoothing step.
It is provided in Fig. 9 that without the smoothing, the proposed method is lower than [29], even though it is still higher than the rest of the methods. By implementing onetime smoothing ( h = 1 ), the proposed method is excellent. In this environment, [29] is gently lower than the proposed algorithm. When the smoothing step is carried out three times (see Fig. 10), the proposed method is still higher than the others. Furthermore, the algorithms in [29,32,33] do not have the smoothing step; therefore, their PSNR does not change. It is different from this proposed method, which has flexibility in achieving the required level of the stego file. Finally, by doing smoothing four times (see Fig. 11), almost all PSNR values of the proposed method are infinite. This condition is shown in Fig. 9 The quality of the stego file generated by Jung and Yoo [29], Ahmad and Fiqar [30], Andra et al. [32], Bobeica et al. [33], Ahmad et al. [31], and the proposed technique with e = 3 , m = 1 , h = 0 and h = 1 Fig. 10 The quality of the stego file generated by Jung and Yoo [29], Ahmad and Fiqar [30], Andra et al. [32], Bobeica et al. [33], Ahmad et al. [31], and the proposed technique with e = 3 , m = 1 , h = 0 and h = 3 Figs. 10 and 11, where the proposed technique is mostly out of those figures since its value is infinite. Please note that here, e = 3 and m = 1 are the standard in the evaluation. As described in Section 4.2, the performance of the proposed technique is even better than that of others for a higher e.

Discussion
According to the experimental results that have been provided in the previous section, we can determine that the proposed technique outdoes the others. One of the advantages is its flexibility, which can be used to specify the level of the stego file's quality and the size of the secret to protect. Moreover, the reversibility requirement has also been met.
Despite this superiority, it has factors to refine. For the smoothing purpose, the method produces an auxiliary file containing the required information. Considering the security, we may just implement an existing cryptographic algorithm, such as Advanced Encryption Standard (AES), which is out of this data hiding research. Another factor to consider is the size of this auxiliary file, which should be as small as possible. In the case that compression is implemented, it can be performed according to Figs. 2 and 4 for the embedding and extraction, respectively. It is depicted that those two processes are independent of the main data hiding research. Therefore, any algorithm may be suitable to implement.
Considering that research on data hiding is growing fast, any stage of this proposed method may be combined with other research to produce better results. For this purpose, the stages can be grouped in blocks independently, as depicted in Figs. 2 and 4, for embedding and extraction, respectively. On the other hand, attackers may try to detect Fig. 11 The quality of the stego file generated by Jung and Yoo [29], Ahmad and Fiqar [30], Andra et al. [32], Bobeica et al. [33], Ahmad et al. [31], and the proposed technique with e = 3 , m = 1 , h = 0 and h = 4 the existence of hidden information in the stego audio [38,39]. Although it is out of the scope of this investigation, this possible attack should be anticipated.

Conclusions
In this research, we have proposed a reversible data hiding technique to protect sensitive data in the audio environment. It is carried out by interpolating the audio samples, which have been obtained from the sampling process. The proposed method is flexible in that it can be applied to meet the required performance. Regarding the quality, classical audio is more appropriate to use, combining it with either cello, saxophone, acoustic guitar, or piano. Besides, the quality is proportional to the number of spaces in each sample. If the capacity is the concern, then pop-rock is better to use than others. The implementation itself can focus on either quality or capacity. Moreover, this proposed method has satisfied the reversibility requirement.
The experimental results also signify that the proposed technique is of higher quality than the others in almost all conditions. By applying a one-time smoothing, the method achieves the best results. Moreover, most existing methods even do not have this smoothing stage, which makes their quality static.
In the future, we would like to work on additional factors that do not directly relate to the performance of the proposed technique, which includes reducing the size of the auxiliary file. We believe that it should be as small as possible. An algorithm to protect this file can also be considered, although existing cryptographic algorithms are applicable. Furthermore, the multiplier and the coefficient should be dynamically defined to find a possible value that can deliver better results. This value can be adaptive, considering the characteristics of each cover.