
Novel mathematical model for the classification of music and rhythmic genre using deep neural network


Music Genre Classification (MGC) is a crucial task that categorizes Music Genres (MG) based on auditory information. MGC is commonly employed in music information retrieval. The three main stages of the proposed system are data preparation, feature mining, and categorization. To categorize MG, a new neural network was deployed. The proposed system uses features from spectrograms derived from short clips of songs as inputs to the proposed architecture, which assigns each song to an appropriate MG. Extensive experiments on the GTZAN dataset, the Indian Music Genre (IMG) dataset, the Hindustani Music Rhythm (HMR) dataset, and the Tabala dataset show that the proposed strategy is more effective than existing methods. Indian rhythms were used to test the proposed system design. The proposed system design was compared with other existing algorithms based on time and space complexity.


Currently, huge amounts of data are being produced on many social media platforms. Because the volume of data is so large, it is critical to categorize it so that users can easily search for and promote it. Our work is mostly concerned with song data. Music-like audio can be processed or evaluated as an instrumental melody, as an acoustic song without electric instruments, or as a song with both vocals and instruments. Audio or musical signals must be converted to a binary format before being processed on a computer. Audio data is an analog signal, which is converted to a digital format by an analog-to-digital converter. Spectrograms and waveforms depict the audio signals.

Currently, music is widely distributed on the Internet, and the volume of music being uploaded and released keeps growing. Effective organization and processing of such huge amounts of information, as well as effective indexing, searching, and retrieval of such large amounts of musical data, is a major challenge. Most music organization techniques are based on music classified by genre, mood, or various semantic tags [1]. Music genre is a significant attribute used to classify music in music shops, broadcasting, and search. Automatic music genre classification is a crucial part of music information retrieval systems [2]. For humans, it is not hard to sort music into various genres; however, doing so with computers is a major challenge. Distinctive characteristics such as instruments, songs, beats, and moods are used to separate musical genres [3].

Music genre classification is particularly significant in music database management systems, where music is searched by genre. Genre classification is a highly free-form type of labelling [4]. Because of technical developments and innovations in smart portable devices such as cell phones and pods, music has reached everyone [5]. We frequently come across people wearing headphones or earbuds while walking, driving, studying, exercising, or doing yoga. As the Internet is accessible, music has become easily and readily available to people. When we listen to songs, we rarely want to listen to the full playlist. Commonly, we want to listen to songs that suit our mood. It is therefore useful to have a list of songs organized by mood or genre, so that the user can easily pick music from the immense database.

Liu et al. [6] worked on the categorization of MG based on visual representations, an approach that has been studied effectively in recent years. Convolutional neural networks (CNNs) have recently gained popularity for this task. Their CNN architecture uses multi-scale time–frequency information, transferring more appropriate semantic features to the decision-making layer to distinguish the genre of an unknown music clip. The experiments are assessed using the Ballroom, GTZAN, and Extended Ballroom benchmark datasets. According to the test results, the approach obtains 93.9 percent, 96.7 percent, and 97.2 percent, respectively [7, 8].

The linking of genre-related labels to digital music files has been studied by [9] in MGC. It can be used to organize both commercial and private melody collections. Music tracks are frequently characterized as a collection of sound textures stimulated by timbres. To depict a complete track, the selection of sound textures is frequently used. They assessed the influence of texture selection on automatic MGC and provided a unique texture selector based on K-Means, which identifies a variety of sound textures inside each recording [10, 11].

Music is an essential and inseparable aspect of most people's lives. Several MG exist, each of which is distinct from the others, resulting in a wide range of musical preferences. Consequently, classifying music and recommending new music in music listening apps and platforms is an essential and timely subject. One of the most effective approaches to solving this challenge is to categorize music by genre. Several methods are available for categorizing and recommending music [12].

According to [13], music genre categorization is one of the sub-disciplines of music information retrieval (MIR) that is gaining in favor among academics, owing to the open problems. Despite the fact that research has produced a large number of publications, the issue still has a fundamental flaw: the music type is not clearly defined. Music classifications are hazy and ambiguous, owing to human subjectivity and disagreement [14].

Using a transfer learning technique, [15] proposed an audio-based categorization strategy for 11 western MGs: rock, pop, rap, country, metal, jazz, blues, R&B, folk music, electronic music, and classical music. It achieves 0.9799 ROC-AUC and 0.8938 PR-AUC on a private collection of 1100 songs. Using a deep learning model, [16] created an automatic music genre categorization system. A CNN is used for local feature extraction, whereas Long Short-Term Memory (LSTM) sequence-to-sequence autoencoders are used to learn representations of time series data while accounting for their temporal dynamics. Using the GTZAN dataset, computational experiments yielded an overall test accuracy of 95.4 percent and a precision of 91.87 percent.

According to [17], labeling a music recording by its genre is a natural and familiar way to characterize its content. MG are useful for organizing music, creating individualized listening experiences, and creating playlists. Due to the inherent ambiguity and subjectivity, the automatic identification of an MG is a difficult task. Most music genre categorization efforts assume total label independence. Furthermore, the local classifier per node technique, which would be the best fit in this case, is time- and memory-intensive [18].

MGC is a vital and useful aspect of music information retrieval. Deep learning is increasingly used to classify MG for two reasons [19]. First, it eliminates the need for manual audio signal feature selection. Second, its hierarchical architecture is compatible with the temporal- and frequency-domain layering structure of music. The Transformer classifier performs better in MGC [20, 21] by analyzing the connection between distinct audio frames.

Jaime Ramirez [22] developed a web application that retrieves songs from YouTube and categorizes them into MG. The technique described in that paper is built on models trained on AudioSet's music collection data. The visualization output displays this sequential data in real time, in sync with the music video being played, with classification results displayed in a weighted area graph that gives the scores of the top-10 labels obtained for each block.

According to [22], automatic MGC is a basic component of music information retrieval. Indeed, music genre labels are extremely useful for categorizing albums, songs, and performers into groups with comparable features. To improve the retrieval of MG, an accurate and effective MGC system is necessary. That research uses two key processes, feature extraction and classification, to present a novel music genre categorization model. Features such as non-negative matrix factorization (NMF) features, Short-Time Fourier Transform (STFT) features, and pitch features were extracted during the feature extraction phase. Goulart et al. [23] proposed feature extraction based on entropy fractal features with an SVM classifier to classify the blues, classical, and lounge genres, achieving almost 100% accuracy.

Lima et al. [24] proposed using genre labels to organize music, albums, and performers into groups with common similarities. They describe a method for automatically determining the musical genre of Brazilian songs based solely on their lyrics. This type of categorization remains difficult.

To examine the influence of feature selection on the accuracy of music genre categorization, Moses [25] utilized a Support Vector Machine with a radial basis function kernel as a classifier. Features are merged and matched into several feature combinations (FC1, FC2, FC3, and FC4). The classification results demonstrated that accuracy varied significantly across feature combinations, with the highest accuracy being 80 percent and the lowest 67 percent. Because FC1 and FC2 produce the same accuracy of 80%, FC2 is preferred: with fewer features, its processing time is lower. For categorization, it is necessary to gather a large number of songs and index them according to their genre. The skip-gram model then locates songs with comparable context for suggestions. Consequently, the proposed system functions as a complete music recommendation system that provides an excellent user experience [26].

Lima et al. [27] note that genre labels may be used to group songs, albums, and performers by their shared characteristics. In NLP, this type of categorization remains difficult. The BLSTM technique surpasses the other models in their studies, with an average F1-score of 0.48. Gospel, funk-carioca, and sertanejo, which had F1-scores of 0.89, 0.70, and 0.69, respectively, may be regarded as the most distinctive and easiest to categorize among Brazilian musical genres [28].

They employed textual material given as lyrics of the accompanying song because genre dependence is not confined to the audio profile. To include acoustic information in spectrograms, they used a CNN-based feature extractor for spectrograms and Hierarchical Attention Network-based feature extractor for lyrics [29,29,31]. In several papers [32, 33], the active transfer MGC technique (ATMGCM) was proposed for musical genre categorization.

To create mathematical models for music genre categorization, [34] employed a machine learning technique based on hybridizing genetic programming and neural networks. They created three multi-label classifiers with varied complexity and accuracy trade-offs that can detect the degree to which audio content belongs to 10 distinct MG.

According to [35], one of the goals of the music information retrieval (MIR) community is to study novel approaches and build innovative systems that can efficiently and effectively retrieve and recommend songs from massive collections of music content. They also show how their database can help with specific MIR tasks, and compare it with other databases available in the literature to help the MIR community.

According to the literature, genre classification requires extra attention to improve its performance metrics. The classification here is based on data from the spectrograms of songs. The proposed approach will be useful in a number of industries, such as films and education. Numerous online music-playing services use MGC to improve user suggestions. Most research work applies feature extraction techniques followed by machine learning algorithms to classify music or genres. Because music is very dynamic in nature, manual extraction of features from music clips, whether by onset pattern detection or by sampling the music pattern at periodic time intervals, does not provide accurate results. Therefore, a deep neural network is used to detect features automatically, training on the given musical pattern. Guido [36, 37] proposed the concept of the mass of a discrete-time signal, based on the Teager energy operator, which enables more precise and detailed analysis of audio signals. The same author [36, 37] also proposed a model based on entropy and a deep neural network to fuse important feature information. Scalvenzi et al. [38] proposed a method using the discrete wavelet packet transform with a support vector machine (SVM) that reaches an accuracy near 90% for sorting semantic information.

The remainder of this paper is organized as follows. The “Proposed system architecture and mathematical model” section details the mathematical model of the proposed system. The proposed technique has been applied to a variety of musical styles and yielded positive results. The proposed system design is compared with a number of algorithms, and the results are reported in the “Results and discussion” section. The proposed scheme is evaluated on multiple workstations to determine its time complexity, according to [39]. The “Conclusion” section concludes the paper.

Proposed system architecture and mathematical model

Figure 1 shows the proposed system architecture. It has three components: compression, decompression, and a filter CNN. The convolutions use appropriate padding, and an appropriate stride at the end of each step, both to exploit the input features and to reduce the resolution. Fine-grained feature forwarding is combined with an optimal fixed-lag smoother, which provides the optimal estimate of \(\hat{x}_{t}\) for a given fixed lag \(N\).

Fig. 1 Proposed system architecture of the MGC

The main contributions of this research are as follows:

  1. Extracting features using a pre-trained Deep Neural Network (DNN), and evaluating the proposed method on the selected datasets using precision, recall, F1-score, and accuracy.

  2. A parallel convolution-deconvolution neural network model that extracts the finest level of features.

  3. Addressing the vanishing gradient problem using PReLU.

Steps in Proposed System Architecture:

The network architecture was chosen by trial and error: N trials were carried out to observe the statistical output. The hyperparameters are: number of input layers = 1; layer 1 = 512, layer 2 = 256, layer 3 = 128, and layer 4 = 64 neurons; number of output layers = 1; optimizer = Adam.

  1. In the first stage, convolution and deconvolution are performed. Deconvolution extracts, along with the features found by convolution, additional features that convolution treats as unimportant. This mixture of important and unimportant features is then fed to a convolution with filters and a convolution with an element-wise sum. The proposed framework is illustrated in Fig. 1.

  2. The compression, decompression, and filter CNN stages gave the best results.

  3. All convolutions are performed with the required padding and, at the end of each step, a suitable stride, to make full use of the input features and reduce the resolution.

  4. After additional convolution and pooling, the feature maps shrink. Deconvolution is then performed to recover the original image size, as shown in Fig. 1.

  5. Deconvolution is the up-sampling of images: it maps low-dimensional images to high-dimensional ones and generates high-resolution images. A Parametric Rectified Linear Unit (PReLU) activation function is used in the convolution. Unlike ReLU, which outputs zero for negative inputs, PReLU scales negative inputs by a learned slope, so they still contribute. As a nonlinear activation function, it increases the separation among classes and thereby improves performance.

  6. The vanishing gradient problem is addressed using PReLU.
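The role of PReLU in steps 5 and 6 can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the slope value `alpha = 0.25` is an assumed starting value (in a trained network the slope is a learned parameter), and the function names are hypothetical. The point it shows is that the PReLU gradient for negative inputs is `alpha` rather than zero, which is why gradients do not vanish on the negative side as they do with plain ReLU.

```python
import numpy as np


def prelu(x, alpha):
    """PReLU: identity for x > 0, slope `alpha` for x <= 0."""
    x = np.asarray(x, dtype=float)
    return np.where(x > 0, x, alpha * x)


def prelu_grad(x, alpha):
    """Derivative of PReLU with respect to its input."""
    x = np.asarray(x, dtype=float)
    return np.where(x > 0, 1.0, alpha)


def relu_grad(x):
    """Derivative of plain ReLU, shown only for comparison."""
    x = np.asarray(x, dtype=float)
    return np.where(x > 0, 1.0, 0.0)


if __name__ == "__main__":
    x = np.array([-2.0, -0.5, 0.0, 1.5])
    alpha = 0.25  # assumed slope; learned during training in practice
    print("PReLU output:  ", prelu(x, alpha))
    print("PReLU gradient:", prelu_grad(x, alpha))
    print("ReLU gradient: ", relu_grad(x))
```

For negative inputs the ReLU gradient is zero, so no error signal passes back through those units, whereas PReLU still passes a gradient of `alpha`.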

The impulse response is obtained by multiplying a Gaussian function with a sinusoidal wave. Because of the multiplication-convolution property (convolution theorem), the Fourier transform (FT) of the filter's impulse response is the convolution of the FT of the harmonic function and the FT of the Gaussian function. The filter has a real and an imaginary component, which represent orthogonal directions. The two components can be combined into a complex number or used independently.

$$x\,\left( {m,n;\lambda ,\theta ,\psi ,\sigma ,\gamma } \right)\, = \,{\text{exp}}\,\left( { - \,\frac{{m^{{{^{\prime}}2}} \, + \,\gamma^{2} n^{{{^{\prime}}2}} }}{{2\sigma^{2} }}} \right)\,{\text{exp}}\,\left( {i\,\left( {2\pi \,\frac{{m^{\prime}}}{\lambda }\, + \,\psi } \right)} \right)$$
$$x\,\left( {m,n;\lambda ,\theta ,\psi ,\sigma ,\gamma } \right)\, = \,{\text{exp}}\,\left( { - \,\frac{{m^{{{^{\prime}}2}} \, + \,\gamma^{2} n^{{{^{\prime}}2}} }}{{2\sigma^{2} }}} \right)\,{\text{cos}}\,\left( {2\pi \,\frac{{m^{\prime}}}{\lambda }\, + \,\psi } \right)$$
$$x\,\left( {m,n;\lambda ,\theta ,\psi ,\sigma ,\gamma } \right)\, = \,{\text{exp}}\,\left( { - \,\frac{{m^{{{^{\prime}}2}} \, + \,\gamma^{2} n^{{{^{\prime}}2}} }}{{2\sigma^{2} }}} \right)\,{\text{sin}}\,\left( {2\pi \,\frac{{m^{\prime}}}{\lambda }\, + \,\psi } \right)$$

where Eqs. 2.1, 2.2, and 2.3 (the three expressions above, in order) are the complex, real, and imaginary forms, respectively, and

$$m^{\prime}\, = \,m{\text{cos}}\theta \, + \,n{\text{sin}}\theta \;{\text{and}}\;n^{\prime}\, = \, - \,m{\text{sin}}\theta \, + \,n{\text{cos}}\theta$$

In these equations, \(\lambda\) is the wavelength of the sinusoidal component and \(\theta\) is the orientation of the normal to the parallel bands of the function. Here, \(\sigma\) is the standard deviation of the Gaussian envelope, \(\psi\) is the phase offset, and \(\gamma\) is the spatial aspect ratio, which specifies the ellipticity of the support of the function.
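A filter of this form (a Gaussian envelope multiplied by a complex sinusoid) is commonly known as a Gabor filter. The following NumPy sketch samples the complex form on a square grid; its real and imaginary parts then correspond to the cosine and sine forms. The function name, the grid convention (`n` along rows, `m` along columns), and the parameter values in the usage line are assumptions made for illustration, not values taken from the paper.

```python
import numpy as np


def gabor_kernel(size, lam, theta, psi, sigma, gamma):
    """Sample the complex filter on a (size x size) grid centred at the origin.

    `size` is assumed to be odd so that the grid has a centre sample.
    """
    half = size // 2
    # n runs along rows, m along columns (an assumed convention)
    n, m = np.mgrid[-half:half + 1, -half:half + 1].astype(float)

    # Rotated coordinates m' and n'
    m_rot = m * np.cos(theta) + n * np.sin(theta)
    n_rot = -m * np.sin(theta) + n * np.cos(theta)

    # Gaussian envelope times complex sinusoidal carrier
    envelope = np.exp(-(m_rot**2 + gamma**2 * n_rot**2) / (2.0 * sigma**2))
    carrier = np.exp(1j * (2.0 * np.pi * m_rot / lam + psi))
    return envelope * carrier


if __name__ == "__main__":
    kernel = gabor_kernel(size=9, lam=4.0, theta=0.0, psi=0.0, sigma=2.0, gamma=0.5)
    real_part, imag_part = kernel.real, kernel.imag  # cosine and sine forms
    print(kernel.shape, real_part[4, 4], imag_part[4, 4])
```

Convolving a spectrogram with `kernel.real` or `kernel.imag` for several values of `theta` and `lam` would give orientation- and scale-selective responses; how the paper combines these responses with the CNN is not specified in the text.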

$$\left[ \begin{gathered} \widehat{{\mathbf{f}}}_{t|t} \hfill \\ \widehat{{\mathbf{f}}}_{t - 1|t} \hfill \\ \vdots \hfill \\ \widehat{{\mathbf{f}}}_{t - N + 1|t} \hfill \\ \end{gathered} \right]\, = \,\left[ \begin{gathered} {\mathbf{I}} \hfill \\ 0 \hfill \\ \vdots \hfill \\ 0 \hfill \\ \end{gathered} \right]\,\widehat{{\mathbf{f}}}_{t|t - 1} \, + \,\left[ {\begin{array}{*{20}c} 0 & \cdots & 0 \\ {\mathbf{I}} & 0 & \vdots \\ \vdots & \ddots & \vdots \\ 0 & \cdots & {\mathbf{I}} \\ \end{array} } \right]\,\left[ {\begin{array}{*{20}c} {\widehat{{\mathbf{f}}}_{t - 1|t - 1} } \\ {\widehat{{\mathbf{f}}}_{t - 2|t - 1} } \\ \vdots \\ {\widehat{{\mathbf{f}}}_{t - N + 1|t - 1} } \\ \end{array} } \right]\, + \,\left[ {\begin{array}{*{20}c} {{\mathbf{U}}^{\left( 0 \right)} } \\ {{\mathbf{U}}^{\left( 1 \right)} } \\ \vdots \\ {{\mathbf{U}}^{{\left( {N - 1} \right)}} } \\ \end{array} } \right]\,g_{t|t - 1}$$

where \(\widehat{{\mathbf{f}}}_{t|t - 1}\) is estimated by the standard filter; \({\mathbf{g}}_{t|t - 1} \, = \,{\mathbf{z}}_{t} \, - \,{\mathbf{D}}\,\widehat{{\mathbf{f}}}_{t|t - 1}\) is the innovation computed from the standard filter's estimate; the various \(\widehat{{\mathbf{f}}}_{t - i|t}\) with \(i=1,\dots ,N-1\) are new variables; and the gains are calculated by the following scheme:

$${\mathbf{U}}^{{\left( {i + 1} \right)}} \, = \,{\mathbf{Q}}^{\left( i \right)} \,{\mathbf{D}}^{{\text{T}}} \left[ {{\mathbf{DQD}}^{{\text{T}}} \, + \,{\mathbf{R}}} \right]^{ - 1}$$


$${\mathbf{Q}}^{\left( i \right)} \, = \,{\mathbf{Q}}\left[ {\left( {{\mathbf{F}} - {\mathbf{UD}}} \right)^{{\text{T}}} } \right]^{i}$$

where \({\mathbf{Q}}\) and \({\mathbf{U}}\) are the prediction error covariance and the gain of the standard filter. If the estimation error covariance is defined as

$${\mathbf{Q}}_{i} :\, = \,E\,\left[ {\left( {{\mathbf{f}}_{t - i} \, - \,\widehat{{\mathbf{f}}}_{t - i|t} } \right)^{ * } \left( {{\mathbf{f}}_{t - i} \, - \,\widehat{{\mathbf{f}}}_{t - i|t} } \right)\,|\,Z_{1} \ldots Z_{t} } \right]$$

then the improvement in the estimate of \({\mathbf{f}}_{t-i}\) is given by:

$${\mathbf{Q}}\, - \,{\mathbf{Q}}_{i} \, = \,\sum\nolimits_{j\, = \,0}^{i} {\left[ {{\mathbf{Q}}^{\left( j \right)} \,{\mathbf{D}}^{{\text{T}}} \,\left( {{\mathbf{DQD}}^{{\text{T}}} \, + \,{\mathbf{R}}} \right)^{ - 1} \,{\mathbf{D}}\,\left( {{\mathbf{Q}}^{\left( i \right)} } \right)^{{\text{T}}} } \right]}$$
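The gain recursion above can be sketched in a few lines of NumPy. This is only an illustration of the two formulas for \({\mathbf{Q}}^{(i)}\) and \({\mathbf{U}}^{(i+1)}\) as written; the function name and the small two-state example are hypothetical, and the matrices are assumed to be time-invariant.

```python
import numpy as np


def fixed_lag_gains(Q, F, D, R, U, lag):
    """Return the list [U^(1), ..., U^(lag)] using

        Q^(i)   = Q [(F - U D)^T]^i
        U^(i+1) = Q^(i) D^T (D Q D^T + R)^(-1)
    """
    S_inv = np.linalg.inv(D @ Q @ D.T + R)  # inverse innovation covariance
    M = (F - U @ D).T
    Q_i = Q.copy()                          # Q^(0) = Q
    gains = []
    for _ in range(lag):
        gains.append(Q_i @ D.T @ S_inv)     # U^(i+1)
        Q_i = Q_i @ M                       # Q^(i+1) = Q^(i) (F - U D)^T
    return gains


if __name__ == "__main__":
    # Hypothetical two-state example (values chosen only for illustration)
    Q = np.array([[2.0, 1.0], [1.0, 1.0]])   # prediction error covariance
    F = np.array([[1.0, 1.0], [0.0, 1.0]])   # state transition
    D = np.array([[1.0, 0.0]])               # observation matrix
    R = np.array([[1.0]])                    # observation noise covariance
    U = Q @ D.T @ np.linalg.inv(D @ Q @ D.T + R)  # standard filter gain
    for i, gain in enumerate(fixed_lag_gains(Q, F, D, R, U, lag=3), start=1):
        print(f"U^({i}) =", gain.ravel())
```

Under these formulas the first gain coincides with the standard filter gain, since \({\mathbf{Q}}^{(0)} = {\mathbf{Q}}\).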

A backward pass is also used, which processes data saved from the filter's forward pass. The backward pass equations involve recursive computations that are used to determine the smoothed state and covariance at each observation time. The recursive equations are given as,

$$\begin{gathered} \tilde{\Lambda }_{k} \, = \,{\mathbf{D}}_{k}^{{\text{T}}} \,{\mathbf{S}}_{k}^{ - 1} \,{\mathbf{D}}_{k} \, + \,\widehat{{\mathbf{C}}}_{k}^{{\text{T}}} \,\widehat{\Lambda }_{k} \widehat{{\mathbf{C}}}_{k} \hfill \\ \widehat{\Lambda }_{k - 1} \, = \,{\mathbf{F}}_{k}^{{\text{T}}} \tilde{\Lambda }_{k} {\mathbf{F}}_{k} \hfill \\ \widehat{\Lambda }_{n} \, = \,0 \hfill \\ \tilde{\lambda }_{k} \, = \, - {\mathbf{D}}_{k}^{{\text{T}}} \,{\mathbf{S}}_{k}^{ - 1} {\mathbf{y}}_{k} \, + \,\widehat{{\mathbf{C}}}_{k}^{{\text{T}}} \,\widehat{\lambda }_{k} \hfill \\ \widehat{\lambda }_{k - 1} \, = \,{\mathbf{F}}_{k}^{{\text{T}}} \,\tilde{\lambda }_{k} \hfill \\ \widehat{\lambda }_{n} \, = \,0 \hfill \\ \end{gathered}$$

where \({\mathbf{y}}_{k}\) is the measurement residual, \({\mathbf{S}}_{k}\) is the residual covariance, and \(\widehat{{\mathbf{C}}}_{k} \, = \,{\mathbf{I}}\, - \,{\mathbf{U}}_{k} {\mathbf{D}}_{k}\). Substituting \(\tilde{\Lambda }_{k}\) and \(\tilde{\lambda }_{k}\) into the following equations yields the smoothed state and covariance.

$$\begin{gathered} {\mathbf{Q}}_{k|n} \, = \,{\mathbf{Q}}_{k|k - 1} \, - \,{\mathbf{Q}}_{k|k - 1} \,\tilde{\Lambda }_{k} \,{\mathbf{Q}}_{k|k - 1} \hfill \\ {\mathbf{f}}_{k|n} \, = \,{\mathbf{f}}_{k|k - 1} \, - \,{\mathbf{Q}}_{k|k - 1} \tilde{\lambda }_{k} \hfill \\ \end{gathered}$$

For a random vector \(\mathbf{f}=\left({f}_{1},\dots ,{f}_{L}\right)\), sigma points are any set of vectors

$$\left\{ {{\mathbf{S}}_{0} , \ldots ,{\mathbf{S}}_{N} } \right\}\, = \,\left\{ {\left( {{\text{S}}_{0,\,1} \,{\text{S}}_{0,\,2} \, \cdots \,{\text{S}}_{0,\,L} } \right),\, \ldots \,,\,\left( {{\text{S}}_{N,\,1} \,{\text{S}}_{N,\,2} \, \cdots \,{\text{S}}_{N,\,L} } \right)} \right\}$$

attributed with first-order weights \({W}_{0}^{a},\dots ,{W}_{N}^{a}\) that fulfill

$$\sum\nolimits_{j\, = \,0}^{N} {W_{j}^{a} \, = \,1}$$

for all \(i=1,\dots ,L\): \(E\left[{f}_{i}\right]={\sum }_{j=0}^{N} {W}_{j}^{a}{s}_{j,i}\)

Second-order weights \({W}_{0}^{c},\dots ,{W}_{N}^{c}\) that fulfill

$$\sum\nolimits_{j\, = \,0}^{N} {W_{j}^{c} \, = \,1}$$

for all pairs \((i,l)\in \{1,\dots ,L{\}}^{2}\): \(E\left[{f}_{i}{f}_{l}\right]={\sum }_{j=0}^{N} {W}_{j}^{c}{s}_{j,i}{s}_{j,l}\).

A simple choice of sigma points and weights for \(\widehat{{\mathbf{f}}}_{k-1\mid k-1}\) in the filter algorithm is

$$\begin{gathered} {\mathbf{S}}_{0} \, = \,\widehat{{\mathbf{f}}}_{k - 1|k - 1} \hfill \\ - 1\, < \,W_{0}^{a} \, = \,W_{0}^{c} \, < \,1 \hfill \\ {\mathbf{S}}_{j} \, = \,\widehat{{\mathbf{f}}}_{k - 1|k - 1} \, + \,\sqrt {\frac{L}{{1 - W_{0} }}} \,{\mathbf{A}}_{j} ,\;j\, = \,1,\, \ldots ,\,L \hfill \\ {\mathbf{S}}_{L\, + j} \, = \,\widehat{{\mathbf{f}}}_{k - 1|k - 1} \, - \,\sqrt {\frac{L}{{1 - W_{0} }}} \,{\mathbf{A}}_{j} ,\;j\, = \,1,\, \ldots ,\,L \hfill \\ W_{j}^{a} \, = \,W_{j}^{c} \, = \,\frac{{1 - W_{0} }}{2L},\;j\, = \,1,\, \ldots ,\,2L \hfill \\ \end{gathered}$$
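The construction above can be checked numerically. The sketch below builds the \(2L+1\) sigma points and their weights and verifies that the weighted mean and covariance reproduce the inputs. It assumes, as in the standard unscented transform, that \({\mathbf{A}}_{j}\) is the \(j\)-th column of a matrix square root \({\mathbf{A}}\) with \({\mathbf{Q}} = {\mathbf{A}}{\mathbf{A}}^{\text{T}}\); the text does not define \({\mathbf{A}}_{j}\), so this is an assumption, and a Cholesky factor is used for convenience. The function name and numeric values are hypothetical.

```python
import numpy as np


def sigma_points(f_hat, Q, w0):
    """Return (points, weights) for the simple 2L+1 sigma-point scheme.

    points has shape (2L+1, L); weights has shape (2L+1,).
    Requires -1 < w0 < 1.
    """
    L = f_hat.shape[0]
    A = np.linalg.cholesky(Q)             # assumed square root: Q = A A^T
    scale = np.sqrt(L / (1.0 - w0))

    points = np.empty((2 * L + 1, L))
    points[0] = f_hat                     # S_0
    for j in range(L):
        points[1 + j] = f_hat + scale * A[:, j]       # S_j
        points[1 + L + j] = f_hat - scale * A[:, j]   # S_{L+j}

    weights = np.full(2 * L + 1, (1.0 - w0) / (2 * L))
    weights[0] = w0
    return points, weights


if __name__ == "__main__":
    f_hat = np.array([1.0, 2.0])
    Q = np.array([[2.0, 0.5], [0.5, 1.0]])
    pts, w = sigma_points(f_hat, Q, w0=0.2)

    mean = w @ pts
    diff = pts - mean
    cov = (w[:, None] * diff).T @ diff
    print("weights sum to 1:", np.isclose(w.sum(), 1.0))
    print("mean recovered:  ", np.allclose(mean, f_hat))
    print("cov recovered:   ", np.allclose(cov, Q))
```

Here the same weights serve as both first-order and second-order weights, matching \(W_{j}^{a} = W_{j}^{c}\) in the scheme above.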

Given the prediction estimates \(\widehat{{\mathbf{f}}}_{k|k - 1}\) and \({\mathbf{Q}}_{k\mid k-1}\), a new set of \(N=2L+1\) sigma points \({\mathbf{s}}_{0},\dots ,{\mathbf{s}}_{2L}\) with corresponding first-order weights \({W}_{0}^{a},\dots {W}_{2L}^{a}\) and second-order weights \({W}_{0}^{c},\dots ,{W}_{2L}^{c}\) is computed. These sigma points are transformed through the measurement function \(h\).

$${\mathbf{Z}}_{j} \, = \,h\left( {{\mathbf{S}}_{j} } \right),\;j\, = \,0,\,1,\, \ldots ,\,2L$$

Then the empirical mean and covariance of the transformed points are calculated.

$$\begin{gathered} \widehat{{\mathbf{Z}}}\, = \,\sum\nolimits_{j\, = \,0}^{2L} {W_{j}^{a} } \,{\mathbf{Z}}_{j} \hfill \\ \widehat{{\mathbf{S}}}_{k} \, = \,\sum\nolimits_{j\, = \,0}^{2L} {W_{j}^{c} \,\left( {{\mathbf{Z}}_{j} \, - \,\widehat{{\mathbf{Z}}}} \right)} \,\left( {{\mathbf{Z}}_{j} \, - \,\widehat{{\mathbf{Z}}}} \right)^{{\text{T}}} \, + \,{\mathbf{R}}_{k} \hfill \\ \end{gathered}$$

where \({\mathbf{R}}_{k}\) is the covariance matrix of the observation noise, \({\mathbf{v}}_{k}\). The cross-covariance matrix is also needed:

$${\mathbf{C}}_{{{\mathbf{SZ}}}} \, = \,\sum\nolimits_{j\, = \,0}^{2L} {W_{j}^{c} \left( {{\mathbf{S}}_{j} \, - \,\widehat{{\mathbf{f}}}_{k|k - 1} } \right)} \,\left( {{\mathbf{Z}}_{j} \, - \,\widehat{{\mathbf{Z}}}} \right)^{{\text{T}}}$$

where \({\mathbf{s}}_{j}\) are the untransformed sigma points created from \(\widehat{{\mathbf{f}}}_{k|k - 1}\) and \({\mathbf{Q}}_{k\mid k-1}\). The filter gain is given as,

$${\mathbf{U}}_{k} \, = {\mathbf{C}}_{{{\mathbf{SZ}}}} \,\widehat{{\mathbf{S}}}_{k}^{ - 1}$$

The updated covariance and mean estimates are

$$\begin{aligned} \widehat{{\mathbf{f}}}_{k|k} & = \widehat{{\mathbf{f}}}_{k|k - 1} + {\mathbf{U}}_{k} \,\left( {{\mathbf{Z}}_{k} - \widehat{{\mathbf{Z}}}} \right) \\ {\mathbf{Q}}_{k|k} & = {\mathbf{Q}}_{k|k - 1} - {\mathbf{U}}_{k} \widehat{{\mathbf{S}}}_{k} {\mathbf{U}}_{k}^{{\text{T}}} \end{aligned}$$
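The measurement update can be sketched end to end in NumPy as follows. This is an illustrative sketch under stated assumptions rather than the authors' code: the sigma points use a Cholesky factor as the matrix square root, the same weights are used for first and second order, and the measurement function `h` and all numeric values are hypothetical. With a linear `h`, the result should agree with the classical linear update, which gives a simple sanity check.

```python
import numpy as np


def ukf_update(f_pred, Q_pred, z, h, R, w0=0.0):
    """One sigma-point measurement update.

    f_pred, Q_pred : predicted mean (L,) and covariance (L, L)
    z              : observed measurement (M,)
    h              : measurement function mapping (L,) -> (M,)
    R              : observation noise covariance (M, M)
    """
    L = f_pred.shape[0]
    A = np.linalg.cholesky(Q_pred)              # assumed square root of Q_pred
    scale = np.sqrt(L / (1.0 - w0))

    # Rows of A.T are the columns A_j, so each row below is one sigma point
    S = np.vstack([f_pred, f_pred + scale * A.T, f_pred - scale * A.T])
    W = np.full(2 * L + 1, (1.0 - w0) / (2 * L))
    W[0] = w0

    Z = np.array([h(s) for s in S])             # transformed sigma points
    z_hat = W @ Z                               # empirical mean
    dZ = Z - z_hat
    dS = S - f_pred
    S_hat = (W[:, None] * dZ).T @ dZ + R        # innovation covariance
    C_sz = (W[:, None] * dS).T @ dZ             # cross-covariance
    U = C_sz @ np.linalg.inv(S_hat)             # filter gain

    f_new = f_pred + U @ (z - z_hat)
    Q_new = Q_pred - U @ S_hat @ U.T
    return f_new, Q_new


if __name__ == "__main__":
    f_pred = np.array([1.0, 2.0])
    Q_pred = np.array([[2.0, 0.5], [0.5, 1.0]])
    D = np.array([[1.0, 0.0]])                  # hypothetical linear observation
    R = np.array([[0.5]])
    z = np.array([1.4])
    f_new, Q_new = ukf_update(f_pred, Q_pred, z, lambda s: D @ s, R, w0=0.2)
    print("updated mean:", f_new)
    print("updated covariance:\n", Q_new)
```

The benefit of the sigma-point form appears when `h` is nonlinear, where no Jacobian of `h` is required.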

In the right-hand part of the network's convolutional stages, we used a stacking method in which the output of a layer is added to a deeper layer. The extracted features are forwarded from the initial stages on the left side of the network to the final stages on the right side through horizontal connections, as shown in Fig. 1. In this way, we are able to improve the precision of the final silhouette prediction by recovering very fine detail that would otherwise be lost during the compression stage. These connections also improve the convergence time of the model.

Data preparation

The proposed method is applied to the GTZAN, Hindustani Music Rhythm (HMR), and Indian Music Genre (IMG) datasets. GTZAN has 1000 songs in 10 genres: blues, classical, country, disco, hip-hop, jazz, metal, pop, reggae, and rock. The HMR dataset has 3 genres and 300 songs covering the taals Ektaal, Zaptaal, Rupak, and Teentaal. The IMG dataset has 500 songs in 5 genres: Sufi, Bollypop, classical, Ghazals, and Carnatic music. The Librosa library is used for feature extraction, and the Keras, NumPy, and pandas Python libraries are used to implement the proposed method. Training and testing data are split in an 80:20 ratio. For tonality perception on a percussion instrument, the Tabala dataset is also used. It contains 561 files played on the tabla in the taals Addhatritaal, Bhajani, Deepchandi, Dadra, Rupak, Zaptaal, Ektaal, and Tritaal, with 78, 72, 60, 72, 60, 60, 60, and 99 clips, respectively. The taals differ in their number of beats: Teentaal has 16 beats, Rupak has 7, Ektaal has 12, and Zaptaal has 10.
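The two data-preparation steps described above, computing a spectrogram per clip and splitting the data 80:20, can be sketched as follows. The paper uses Librosa for feature extraction; to keep the example dependency-free, this sketch substitutes a plain NumPy short-time Fourier transform, so it is not the authors' pipeline. The frame length, hop size, per-class (stratified) splitting, and function names are all assumptions; only the Tabala clip counts come from the text.

```python
import numpy as np

# Clip counts per taal as listed for the Tabala dataset in the text
TABALA_CLIPS = {
    "addhatritaal": 78, "bhajani": 72, "deepchandi": 60, "dadra": 72,
    "rupak": 60, "zaptaal": 60, "ektaal": 60, "tritaal": 99,
}


def log_spectrogram(signal, n_fft=1024, hop=512):
    """Log-magnitude spectrogram in dB, shape (n_fft // 2 + 1, n_frames).

    n_fft and hop are assumed values, not taken from the paper.
    """
    window = np.hanning(n_fft)
    n_frames = 1 + (len(signal) - n_fft) // hop
    frames = np.stack([signal[i * hop:i * hop + n_fft] * window
                       for i in range(n_frames)])
    magnitude = np.abs(np.fft.rfft(frames, axis=1))
    return 20.0 * np.log10(magnitude + 1e-10).T


def stratified_split(labels, train_fraction=0.8, seed=0):
    """Split indices 80:20 within each class (stratification is an assumption)."""
    rng = np.random.default_rng(seed)
    labels = np.asarray(labels)
    train_idx, test_idx = [], []
    for cls in np.unique(labels):
        idx = np.flatnonzero(labels == cls)
        rng.shuffle(idx)
        cut = int(round(train_fraction * len(idx)))
        train_idx.extend(idx[:cut])
        test_idx.extend(idx[cut:])
    return np.array(train_idx), np.array(test_idx)


if __name__ == "__main__":
    labels = [name for name, count in TABALA_CLIPS.items() for _ in range(count)]
    train, test = stratified_split(labels)
    print("clips:", len(labels), "train:", len(train), "test:", len(test))

    # Synthetic one-second 440 Hz tone standing in for a real clip
    sr = 22050
    t = np.arange(sr) / sr
    spec = log_spectrogram(np.sin(2 * np.pi * 440.0 * t))
    print("spectrogram shape:", spec.shape)
```

Splitting within each class keeps the train and test sets balanced across taals even though the classes have different sizes.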

Results and discussion

The proposed approach is verified on 390 songs from the well-known GTZAN dataset, one from every category.

Figure 2 shows a typical confusion matrix, whose diagonal clearly contains the dominant values. The confusion matrix also illustrates that the proposed design reduces the chance of misclassification.

Fig. 2 The suggested algorithm's confusion matrix on the GTZAN dataset

The performance characteristics have been determined using the confusion matrix. Accuracy, recall, and F1-score were calculated using standard mathematical techniques and presented in Table 1.
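For reference, the standard per-class definitions can be computed from a confusion matrix as in the sketch below. The orientation (rows as true classes, columns as predicted classes) is an assumption, since the text does not state it, and the small matrix in the usage example is made up for illustration and is not the paper's data.

```python
import numpy as np


def per_class_metrics(cm):
    """Return (precision, recall, f1, accuracy) from a confusion matrix.

    Assumes rows are true classes and columns are predicted classes,
    and that every class is predicted at least once (no zero columns).
    """
    cm = np.asarray(cm, dtype=float)
    tp = np.diag(cm)                      # correctly classified per class
    precision = tp / cm.sum(axis=0)       # column sums: predicted as class
    recall = tp / cm.sum(axis=1)          # row sums: actually in class
    f1 = 2.0 * precision * recall / (precision + recall)
    accuracy = tp.sum() / cm.sum()
    return precision, recall, f1, accuracy


if __name__ == "__main__":
    toy_cm = [[9, 1],     # hypothetical two-class example
              [2, 8]]
    precision, recall, f1, accuracy = per_class_metrics(toy_cm)
    print("precision:", precision)
    print("recall:   ", recall)
    print("F1-score: ", f1)
    print("accuracy: ", accuracy)
```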

Table 1 Performance parameters of proposed algorithm on GTZAN dataset

The proposed architecture for song genre classification is dependable, since the accuracy of music genre categorization ranges from 99 percent to 99.85 percent. Furthermore, the minimal accuracy, recall, and F1-score values are 94 percent, 93 percent, and 96 percent, respectively. The maximum accuracy, recall, and F1-score values are 100 percent, 99 percent, and 99 percent, respectively. Figure 3 compares the performance of the proposed architecture with other algorithms.

Fig. 3 Performance parameters comparison for GTZAN dataset using proposed architecture

The proposed design is compared with several existing architectures on the same dataset. The accuracy of song genre categorization is evaluated using VGG16, AlexNet, and a Recurrent Neural Network (RNN), as well as the proposed architecture, as shown in Table 2 and Fig. 4.

Table 2 Accuracy of music genre categorization in the GTZAN dataset using various topologies

Fig. 4 Comparative analysis of classification accuracy on the GTZAN dataset using different architectures

The proposed method is also verified on Indian classical taal (rhythm). In Indian classical music, the tabla and dholak are the two main drumming instruments. The tabla is a pair of drums of different woods and sizes that are played together by tapping with the hands to produce a rhythmic sound. The sounds are then arranged in a variety of rhythm patterns to match musical performances. The proposed system architecture is verified on the Indian classical taal dataset proposed by [40]. The rhythm patterns are evaluated on the basis of four taals: Ektaal, Rupak, Teentaal, and Zaptaal.

Figure 5 shows the performance parameters graphically. According to the graph, all of the parameters considered for classification are higher with the proposed method. The accuracy of rhythm classification is at least 92 percent for each class. Rhythmic similarity is tested using several architectures, namely AlexNet, the 16-layer deep CNN VGG16, and a Recurrent Neural Network, alongside the proposed architecture, as shown in Table 3. For the classification of taals, the proposed architecture gives the best performance among the compared algorithms. The BP-Model and SP-Model proposed in [41] gave lower accuracy on the HMR dataset than the method proposed in this paper.

Fig. 5 Performance parameters comparison for the classification of Indian rhythms using proposed architecture

Table 3 Classification accuracy of Indian beats using various designs

The proposed architecture gives far better results than other similar work, as shown in Table 4.

Table 4 The result compared with other similar method

Table 5 compares the classification accuracy of the proposed System Architecture (PSA) with that of other algorithms. This study evaluated the proposed system on the IMG dataset along with the existing neural network architectures VGG16, AlexNet, and RNN. Table 6 shows the performance measures on the IMG, HMR, and Tabala datasets. The HMR dataset has song and music data with 4 taals: Ektaal, Rupak, Zaptaal, and Teentaal (Tritaal). We also used the Tabala dataset, which is an exclusively rhythmic dataset with taals played on the tabla percussion instrument. As shown in Table 6, the accuracy and overall performance on the Tabala dataset are far better than on the HMR dataset with the same taals. The lower HMR result arises because the Bol (lyrics) of the taal is the same at some places for these 4 taals. The proposed system architecture takes 372 MB of memory for classification, approximately four times less than the RNN classifier.

Table 5 Comparison of the PSA with existing algorithms based on accuracy on the IMG dataset
Table 6 Evaluation of Indian music genre per class on the proposed model

The time complexity was tested on several CPUs. Typical response times for the various hardware systems are shown in Fig. 6. This computation time is much less than the time required for music genre classification. Figure 7 shows that the proposed architecture uses 372 MB of memory, approximately four times less than the RNN classifier. Figure 8 shows the learning curve, training versus validation, for the Tabala dataset.
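A minimal sketch of how the response time and peak memory of a classification call could be measured in Python is given below. It uses only the standard library (`time.perf_counter` and `tracemalloc`); `dummy_classifier` is a hypothetical stand-in for a trained model, and the paper does not describe its own measurement procedure. Because `tracemalloc` reports only memory allocated through Python's allocator, its figures are not directly comparable to the 372 MB reported for the PSA.

```python
import time
import tracemalloc

def mean_response_time(fn, *args, repeats=5):
    """Average wall-clock seconds per call over `repeats` runs."""
    start = time.perf_counter()
    for _ in range(repeats):
        fn(*args)
    return (time.perf_counter() - start) / repeats

def peak_memory_bytes(fn, *args):
    """Peak bytes allocated through Python's allocator during one call."""
    tracemalloc.start()
    try:
        fn(*args)
        _, peak = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()
    return peak

def dummy_classifier(n_features):
    # Hypothetical stand-in for a model's forward pass on one clip.
    features = [0.5] * n_features
    return sum(features)

# Timing and memory are measured in separate runs so that the tracing
# overhead of tracemalloc does not distort the timing.
seconds = mean_response_time(dummy_classifier, 100_000)
peak = peak_memory_bytes(dummy_classifier, 100_000)
print(f"mean response time: {seconds * 1e3:.3f} ms")
print(f"peak traced memory: {peak / 1e6:.2f} MB")
```

Running the same script on each hardware platform would give one response time per platform, which is the kind of comparison Fig. 6 summarizes.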

Fig. 6

Time taken by the PSA on different hardware platforms

Fig. 7

Memory requirement graph

Fig. 8

Learning curve of the Tabala dataset


This research classifies MG using spectrogram feature values from short temporal clips of selected music, including audio samples that were not known beforehand. The proposed system architecture was evaluated on two datasets: Indian rhythms and GTZAN. The GTZAN dataset contains genres of Western music, which differs significantly from Indian music. Because this study also focuses on rhythm, we employed a rhythmic dataset with distinctive patterns to capture the varied and dynamic nature of music. On the GTZAN dataset, the proposed architecture achieves a classification accuracy of 99.41 percent, whereas the 16-layer CNN, AlexNet, and the Recurrent Neural Network achieve 90.93 percent, 94.55 percent, and 91.58 percent, respectively. The proposed architecture has an average F1 score of 96.9 percent, which is significantly higher than that of current architectures. When evaluated on Indian rhythms, the proposed system design achieves a precision of 93.44 percent, which is higher than that of earlier architectures. The system's time complexity is quite low, and when the memory required for processing is taken into account, the proposed architecture outperforms competing classification techniques. According to these findings, the proposed method performs significantly better than the alternative algorithms on the selected GTZAN and Indian rhythm datasets.
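To make the input representation concrete, the sketch below computes a magnitude spectrogram from a short clip with a plain NumPy short-time Fourier transform. The frame length, hop size, sample rate, and the synthetic 440 Hz test tone are all assumptions chosen for illustration; this section does not state the parameters the authors used, and the function `magnitude_spectrogram` is a hypothetical helper, not the paper's feature extractor.

```python
import numpy as np

def magnitude_spectrogram(signal, frame_len=1024, hop=512):
    """Short-time Fourier magnitudes: rows are frequency bins, columns are frames."""
    window = np.hanning(frame_len)
    n_frames = 1 + (len(signal) - frame_len) // hop
    frames = np.stack([
        signal[i * hop : i * hop + frame_len] * window
        for i in range(n_frames)
    ])
    return np.abs(np.fft.rfft(frames, axis=1)).T

# Assumed values: a 3-second clip sampled at 22050 Hz containing a 440 Hz sine.
sample_rate = 22050
t = np.arange(3 * sample_rate) / sample_rate
clip = np.sin(2 * np.pi * 440.0 * t)

spec = magnitude_spectrogram(clip)
log_spec = 20 * np.log10(spec + 1e-10)  # decibel scale, as often fed to a network

# The strongest frequency bin should sit close to the 440 Hz tone.
peak_bin = int(spec.mean(axis=1).argmax())
peak_hz = peak_bin * sample_rate / 1024
print(spec.shape, f"peak near {peak_hz:.1f} Hz")
```

A two-dimensional array of this kind can be treated like an image, which is why convolutional architectures such as VGG16 and AlexNet can be applied to audio clips.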

Availability of data and materials

The datasets analysed during the current study are available in [42] and in the dataset of Ajay Srinivasasmurthy et al. (2016).


  1. Y Panagakis and C Kotropoulos. Music classification by low-rank semantic mappings. Eurasip Journal On Audio, Speech And Music Processing 2013.

  2. Li, Tao & Ogihara, Mitsunori & Li, Qi. A Comparative Study on Content-Based Music Genre Classification. 2003, 282–289.

  3. C Xu, NC Maddage, X Shao, F Cao, Q Tian. Musical genre classification using support vector machines. ICASSP2003, IEEE.

  4. R Mellon, D Spaeth, E Theis. Genre classification using graph representations of music, article published, 14 November 2014.

  5. P Tyagi, A Mehrotra, S Sharma, S Kumar. Audio pattern recognition and mood detection system. In: M Pant et al., editors. Proceedings of fifth international conference on soft computing for problem solving. Advances in intelligent systems and computing. Singapore: Springer; 2016.

  6. Liu C, Feng L, Liu G, et al. Bottom-up broadcast neural network for music genre classification. Multimedia Tools Appl. 2021;80:7313–31.

  7. A Ghildiyal, K Singh, S Sharma. Music genre classification using machine learning. 2020 4th International Conference on Electronics, Communication and Aerospace Technology (ICECA), 2020; pp. 1368–1372.

  8. Cheng Yu-Huei, Chang Pang-Ching, Nguyen Duc-Man, Che-Nan K. Automatic music genre classification based on CRNN. Eng Lett. 2021;29(1):36.

  9. Foleis JH, Tavares TF. Texture selection for automatic music genre classification. Appl Soft Comput. 2020;89:106127.

  10. P Ginsel, I Vatolkin, G Rudolph. Analysis of structural complexity features for music genre recognition.

  11. M Wu, X Liu. A double weighted KNN algorithm and its application in the music genre classification. 2019 6th International Conference on Dependable Systems and Their Applications (DSA).

  12. Elbir A, Aydin N. Music genre classification and music recommendation by using deep learning. Electron Lett. 2020;56(12):627–9.

  13. Ramírez J, Flores MJ. Machine learning for music genre: multifaceted review and experimentation with audio set. J Intell INF Syst. 2020;55:469–99.

  14. Pelchat N, Gelowitz CM. Neural network music genre classification. Can J Electr Comput Eng. 2020;43(3):170–3.

  15. B Liang, M Gu. Music genre classification using transfer learning. 2020 IEEE Conference on Multimedia Information Processing and Retrieval (MIPR). 2020; pp. 392–393.

  16. SS Ghosal and I Sarkar. Novel approach to music genre classification using clustering augmented learning method (CALM). Volume 29, Issue 1, 2020.

  17. ARS Parmezan, DF Silva, GEAPA Batista. A combination of local approaches for hierarchical music genre classification. Proceedings of the 21st ISMIR conference, Montreal, Canada, October 11–16, 2020.

  18. C Yuan, Q Ma, J Chen, W Zhou, X Zhang, X Tang, J Han, S Hu. Exploiting heterogeneous artist and listener preference graph for music genre classification. Poster Session A3: Multimedia search and recommendation & multimedia system and middleware, MM '20, October 12–16, 2020, Seattle, WA, USA.

  19. Y Zhuang, Y Chen, J Zheng. Music genre classification with transformer classifier. Proceedings of the 2020 4th international conference on digital signal processing. 2020.

  20. YH Cheng, PC Chang, CN Kuo. Convolutional neural networks approach for music genre classification. 2020 International Symposium on Computer, Consumer and Control (IS3C). IEEE; 2020.

  21. N Ndou, R Ajoodha, A Jadhav. Music genre classification: a review of deep-learning and traditional machine-learning approaches. 2021 IEEE International IOT, Electronics and Mechatronics Conference (IEMTRONICS). IEEE; 2021.

  22. Ramírez Castillo, Jaime & Flores, M. Web-Based Music Genre Classification for Timeline Song Visualization and Analysis. IEEE Access. 2021, PP. 1–1.

  23. Goulart AJH, Guido RC, Maciel CD. Exploring different approaches for music genre classification. Egypt Inform J. 2012;13(2):59–63.

  24. De Araújo Lima R, de Sousa RCC, Lopes H, Barbosa SDJ. Brazilian lyrics-based music genre classification using a BLSTM network. In: Rutkowski L, Scherer R, Korytkowski M, Pedrycz W, Tadeusiewicz R, Zurada JM, editors. Artificial intelligence and soft computing. ICAISC 2020. Lecture notes in computer science. Cham: Springer; 2020.

  25. RIM Setiadi De, DS Rahardwika, EH Rachmawanto, CA Sari, A Susanto, IUW Mulyono. Effect of feature selection on the accuracy of music genre classification using SVM classifier.

  26. A Budhrani, A Patel, S Ribadiya. Music2Vec: music genre classification and recommendation system.

  27. RA Lima de, RCC Sousa de, SDJ Barbosa, HCV Lopes. Brazilian lyrics-based music genre classification using a BLSTM network. arXiv:2003.05377v1 [cs.CL] 6 Mar 2020.

  28. GP Oliveira, MO Silva, DB Seufitelli, A Lacerda, MM Moro. Detecting collaboration profiles in success-based music genre networks. Proceedings of the 21st ISMIR Conference, Montréal, Canada, October 11–16, 2020.

  29. A Nandy, M Agrawal. A novel multimodal music genre classifier using hierarchical attention and convolutional neural network. arXiv:2011.11970v1 [cs.SD] 24 Nov 2020.

  30. R Ozakar, E Gedikli. Music genre classification using novel song structure derived features.

  31. MS Ahmed, MZ Mahmud, S Akhter. Musical genre classification on the marsyas audio data using convolution NN. 2020 23rd International Conference on Computer and Information Technology (ICCIT). 19–21 December, 2020.

  32. C Chen, X Steven. Combined transfer and active learning for high accuracy music genre classification method. 2021 IEEE 2nd international conference on big data, artificial intelligence and internet of things engineering (ICBAIE 2021).

  33. S Deepak, DBG Prasad. Music classification based on genre using LSTM. Proceedings of the second international conference on inventive research in computing applications (ICIRCA-2020) IEEE Xplore Part Number: CFP20N67-ART; ISBN: 978-1-7281-5374-2.

  34. Campobello G, Dell’Aquila D, Russo M, Segreto A. Neuro-genetic programming for multigenre classification of music content. Appl Soft Comput J. 2020;94:106488.

  35. Domingues, Marcos & Pegoraro Santana, Igor & Pinhelli, Fabio & Donini, Juliano & Catharin, Leonardo & Mangolin, Rafael & Costa, Yandre & Feltrim, Valéria Delisandra. (2020). Music4All: A New Music Database and Its Applications.

  36. Guido RC. Enhancing teager energy operator based on a novel and appealing concept: signal mass. J Frankl Inst. 2018.

  37. Guido RC. A tutorial review on entropy-based handcrafted feature extraction for information fusion. Inform Fusion. 2018;41:161–75.

  38. Scalvenzi RR, Guido RC, Marranghello N. Wavelet-packets associated with support vector machine are effective for monophone sorting in music signals. Int J Semant Comput. 2019;13(3):415–25.

  39. AS Ladkat, AA Date, SS Inamdar. Development and comparison of serial and parallel image processing algorithms. 2016 International Conference on Inventive Computation Technologies (ICICT), 2016, pp. 1–4.

  40. A Srinivasasmurthy, A Holzapfel, AT Cemgil, X Serra. A generalized Bayesian model for tracking long metrical cycles in acoustic music signals. in Proc. of the 41st IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP 2016), Shanghai, China, March 2016.

  41. A Srinivasamurthy, A Holzapfel, X Serra. Informed automatic meter analysis of music recordings. 18th international society for music information retrieval conference, Suzhou, China, 2017.

  42. Tzanetakis G, Cook PR. Musical genre classification of audio signals. IEEE Trans Speech Audio Process. 2002;10:293–302.

  43. Hongdan W, SalmiJamalib S, Zhengping C, Qiaojuana S, Le R. An intelligent music genre analysis using feature extraction and classification using deep learning techniques. Comput Electr Eng. 2022;100:107978.

  44. Vaibhavi M, Krishna PR. Music genre classification using neural networks with data augmentation a make in India creation. J Innov Sci Sustain Technol. 2021;1:21–37.

  45. Yang R, Feng L, Wang H, Yao J, Luo S. Parallel recurrent convolutional neural networks based music genre classification method for mobile devices. IEEE Access. 2020.

  46. HBN Hettiarachchi, LS Lekamge, J Charles. A data mining approach, international conference on advances in computing and technology (ICACT–2020), ISSN 2756–9160 / November 2020.

  47. AJH Goulart, CD Maciel, RC Guido, KCS Paulo, IN Silva da. Music genre classification based on entropy and fractal lacunarity. 2011 IEEE International Symposium on Multimedia, Dana Point, CA, USA, 2011, pp. 533–536.

  48. Elachkar C, Couturier R, Atéchian T, Makhoul A. Combining reduction and dense blocks for music genre classification. In: Mantoro T, Lee M, Ayu MA, Wong KW, Hidayanto AN, editors. Neural information processing. ICONIP 2021. Communications in computer and information science. Cham: Springer; 2021.

  49. SH Chen, SH Chen, RC Guido. Music genre classification algorithm based on dynamic frame analysis and support vector machine. 2010 IEEE International Symposium on Multimedia.

  50. Kumaraswamy, Balachandra & Poonacha, P.G. Deep Convolutional Neural Network for musical genre classification via new Self Adaptive Sea Lion Optimization. Applied Soft Computing. 2021, 108. 107446.


This research was carried out at G H Raisoni Institute of Engineering and Business Management, Jalgaon, Maharashtra, India.


Not applicable.

Author information

Authors and Affiliations



The authors designed the proposed system, implemented it, and analysed the results on different datasets. The authors also wrote the manuscript, which was checked by all authors. All authors read and approved the final manuscript.

Corresponding author

Correspondence to Swati A. Patil.

Ethics declarations

Ethics approval and consent to participate

Not applicable.

Consent for publication

Not applicable.

Competing interests

The authors declare that they have no competing interests.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit

About this article

Cite this article

Patil, S.A., Pradeepini, G. & Komati, T.R. Novel mathematical model for the classification of music and rhythmic genre using deep neural network. J Big Data 10, 108 (2023).
