 Research
 Open access
Unit middleware for implementation of human–machine interconnection intelligent ecology construction
Journal of Big Data volume 10, Article number: 107 (2023)
Abstract
General speech recognition models require large capacity and strong computing power. Realizing speech analysis and semantic recognition with small capacity and low computing power is therefore a challenging research area for constructing the intelligent ecology of the Internet of Things. For this purpose, we set up the unit middleware for the implementation of human–machine interconnection, namely human–machine interaction based on phonetic and semantic control, for constructing the intelligent ecology of the Internet of Things. First, through calculation, theoretical derivation and verification, we present a novel deep hybrid intelligent algorithm that realizes speech analysis and semantic recognition. Second, we establish the unit middleware using the embedded chip as the core on the motherboard. Third, we develop the important auxiliary tools, a writer-burner and a cross-compiler. Fourth, we prune the procedures and system, download, burn and write the algorithms and codes into the unit middleware, and cross-compile them. Fifth, we expand the functions of the motherboard, providing more components and interfaces, including RFID (Radio Frequency Identification), ZigBee, WiFi, GPRS (General Packet Radio Service), an RS232 serial port, USB (Universal Serial Bus) interfaces and so on. Sixth, we take advantage of algorithms, software and hardware to make machines "understand" human speech and "think" and "comprehend" human intentions so as to implement human–machine interconnection, which further structures the intelligent ecology of the Internet of Things. Finally, the experimental results show that the unit middleware achieves very good performance, fast recognition, high accuracy and good stability, consequently realizing the intelligent ecology construction of the Internet of Things.
Introduction
With the coming of the intelligent era, traditional interaction modes such as switches, buttons, keyboards, mice and touch screens have had increasing difficulty meeting people's growing needs for intelligent computing and control. The Internet of Things, with the intelligent interconnection of "thing to thing" and "thing to human", is believed to be the fourth wave of world information industry development, spanning intelligent vehicles, intelligent transportation, smart logistics, intelligent offices, intelligent furniture and so on, up to almost all smart agriculture and industries. Speed, convenience, intelligence, and the integration of people and things are its biggest characteristics. Therefore, it is becoming more and more urgent to explore and implement new modes of interaction and to provide a feasible solution or equipment. We therefore set up the unit middleware for the implementation of human–machine interconnection, namely human–machine interaction based on phonetic and semantic control, for constructing the intelligent ecology of the Internet of Things.
For the unit middleware, realizing speech analysis and semantic recognition with small capacity and low computing power is the key. The recognition model, its performance, and the available capacity and computing power are all related and influence one another. Generally, a large model with good performance requires large capacity and strong computing power, and vice versa. General speech recognition models have many parameters, use a lot of data, and take a long time to train and test, which requires large capacity and strong computing power. This is therefore a constrained optimization problem, and different applications have different requirements. Clearly, realizing speech analysis and semantic recognition with small capacity and low computing power is a challenging research area. To do so, we present a novel deep hybrid intelligent algorithm, designed around three main considerations. First, the algorithm must be able to extract features efficiently, reduce the redundancy of the data, and improve the recognition rate and stability. Second, the algorithm must have a certain degree of elasticity and flexibility and be easy to expand and clip, so that the recognition model is small and light and the requirements on computing power and capacity are reduced. Third, speech data are serialized rather than modular: there is a strong correlation and dependence between earlier and later data. For different speech sequences, the dependency can be set to varying lengths: a greater length may yield greater accuracy but requires more computing power and capacity, while a smaller length may yield lower accuracy but requires less. Therefore, a serialization model is used to better capture the dependencies between sequential words and to make the corresponding choices according to actual needs.
Second, we establish the unit middleware using the embedded chip as the core on the motherboard. Third, we develop the important auxiliary tools, a writer-burner and a cross-compiler. Fourth, we prune the procedures and system, download, burn and write the algorithms and codes into the unit middleware, and cross-compile them. Fifth, we expand the functions of the motherboard, providing more components and interfaces, including RFID, ZigBee, WiFi, GPRS, an RS232 serial port, USB interfaces and so on. Sixth, we take advantage of algorithms, software and hardware to make machines "understand" human speech and "think" and "comprehend" human intentions so as to implement human–machine interconnection, which further structures the intelligent ecology of the Internet of Things. Finally, the experimental results show that the unit middleware achieves very good performance, fast recognition, high accuracy and good stability, consequently realizing the intelligent ecology construction of the Internet of Things.
Previous foreign and domestic studies
The research in this paper is multidisciplinary and broad; it is difficult and requires professional knowledge from many areas, including speech recognition and semantic control, deep hybrid intelligent algorithms, human–machine interaction, artificial intelligence, the Internet of Things, embedded development and so on. It is also a combination of algorithms, hardware and software. Although each of these is a current research hotspot in its own right, no applied realization or research of their combination has been found. In particular, realizing speech analysis and semantic recognition with small capacity and low computing power is a challenging research area for constructing the intelligent ecology of the Internet of Things. Previous work therefore needs to be surveyed for each relevant domain.
At present, speech recognition with semantic control should be the most suitable mode of human–machine interaction. For example, it can be applied to indoor equipment control, voice-controlled telephone exchanges, intelligent toys, industrial control, home services, hotel services, banking services, ticketing systems, web information queries, voice communication systems, voice navigation and all kinds of other voice control systems and self-help customer service systems. In particular, with the vigorous development of artificial intelligence technology [1,2,3,4], and compared with traditional human–machine interaction modes that mainly communicate through keyboards, mice and so on, people naturally expect machines to have highly intelligent voice communication abilities. Such intelligent machines can "understand" human speech, "think" and "comprehend" human intentions, and finally respond with speech or actions. This has always been one of the ultimate goals of artificial intelligence, and it is also one of the critical components for structuring the intelligent interconnections of the Internet of Things [5,6,7,8,9,10,11,12]. Intelligent voice interaction technology has naturally become one of the current research hotspots. Until 2006, there were no big breakthroughs in speech recognition. The most representative identification methods had long been the feature parameter-matching method, the HMM (Hidden Markov Model) and other key technologies based on the HMM, for example the MAP (Maximum A Posteriori Probability) estimation criterion [13] and MLLR (Maximum Likelihood Linear Regression) [14]. After Hinton et al. presented the layer-by-layer greedy unsupervised pre-training of deep neural networks, named deep learning, in 2006 [15,16,17,18,19], speech recognition started to make breakthroughs. Microsoft successfully applied it to its own speech recognition system.
It achieved a reduction in the word recognition error rate of approximately 30% compared with previous optimal methods [20, 21], which was a major breakthrough for speech recognition. At present, many domestic and foreign research institutions, for example Xunfei, Microsoft, Google, IBM and so on, are all actively pursuing research targeted at deep learning [22]. So far, hundreds of neural networks have been proposed, such as the SOFM (Self-Organizing Feature Map), LVQ (Learning Vector Quantization), LAM (Local Attention Mechanism), RBF (Radial Basis Function), ART (Adaptive Resonance Theory), BAM (Bidirectional Associative Memory), CMAC (Cerebellar Model Articulation Controller), CPN (Counter Propagation Network), quantum neural networks, fuzzy neural networks and so on [23, 24]. In particular, in 1995, Y. LeCun et al. proposed the CNN (Convolutional Neural Network) [25, 26]. In 2006, Hinton et al. proposed the DBN (Deep Belief Network) [24], which used the RBM (Restricted Boltzmann Machine) [27] as its construction module. Rumelhart, D.E. proposed the AENN (Automatic Encoding Neural Network) [28, 29]. At the same time, other neural networks were proposed based on these models, for example the SDBN (Sparse Deep Belief Network) [30], SSAE (Sparse Stacked Auto-Encoders) [31], DCGAN (Deep Convolutional Generative Adversarial Network) [32] and so on. All of these have become the main constituent models of deep neural networks, namely, deep learning [33, 34].
The concept of the IoT (Internet of Things) was first proposed by Professor Ashton in 1999 [35]. He presented the "intelligent interconnection of thing to thing", which uses information-sensing equipment to collect information in real time and constitutes a huge network combined with the Internet [36,37,38,39,40,41]. As early as 1999, the Chinese Academy of Sciences had launched research on sensor networks and has already made significant progress in wireless intelligent sensor network communication technology, micro-sensors, sensor terminals, mobile base stations and so on [42]. In 2010, the Beijing municipal government launched the first demonstration project of the Internet of Things, the "Perception of Beijing".
An embedded system is a kind of dedicated computer system with an application as its center. It is based on computer technology, can tailor its software and hardware, and can adapt to application systems that have stringent requirements on functions, reliability, cost, volume, power consumption and so on [43, 44]. An embedded processor is the core of an embedded system; it is the hardware unit that controls and assists the system's operations. The popular system architectures come in four kinds: the EMP (Embedded Microprocessor Unit), EMCU (Embedded Microcontroller Unit), EDSP (Embedded Digital Signal Processor) and ESoC (Embedded System on Chip) [45].
ESR (Embedded Speech Recognition) refers to speech recognition in which all processing is performed on the target device. Traditional speech recognition systems generally adopt an acoustic model based on the GMM-HMM (Gaussian Mixture Model–Hidden Markov Model) or an n-gram language model. In recent years, with the rise of deep learning, acoustic models and language models based on deep neural networks have achieved significant performance improvements compared with the traditional GMM-HMM and n-gram models [46,47,48,49]. Automatic speech recognition based on an embedded mobile platform is one of the key technologies.
The remainder of this paper is organized as follows. "Principle of speech recognition control and mathematical theory model" section discusses the principle of speech recognition control and the mathematical theory model. "Deep hybrid intelligent algorithm" section introduces the novel deep hybrid intelligent algorithm and training methods. The experimental results are presented and discussed in "Experiments and result analysis" section. "Summary and prospect" section provides the concluding remarks and prospects.
Principle of speech recognition control and mathematical theory model
Although languages differ in their degrees of complexity, the principle of speech recognition is the same in all of them. Therefore, we choose Chinese for speech recognition and semantic control, and then realize human–machine interaction control.
Speech and semantic recognition mainly includes the following steps: speech input, data acquisition, feature extraction, encoding and decoding, and speech-to-semantic recognition, as shown in Fig. 1. Using statistical theory, with conditional, prior and posterior probabilities, a relationship can be established between the words \(W\) and the speech signal \(O\); that is, the task can be considered as solving a MAP (Maximum A Posteriori Probability) problem [13]. Through processing and transformation, a sequence of speech feature vectors \(O\) is obtained, and to find the maximum of the posterior probability we establish the following formula:
We calculate the posterior probabilities of all possible word sequences and maximize them, where \(W^*\) is the sequence with maximum probability and \(\tau\) is the collection of all word sequences. Because \(P(O)\) is constant and, on the other hand, if \(W\) is determined then \(O\) is uniquely determined, the conditional probability \(P(O|W)\) is equal to \(1\). From formula (1) we can get formula (2):
Since \(W\) is a string sequence, \(P(W)\) can be decomposed into:
From formulas (2) and (3) we can get formula (4):
where \(w_{i}\) is the \(i\)-th word of the string, \(n\) is the total number of words, and \(\omega^{i-1}\) represents the word sequence \(w_{i-1}, w_{i-2}, \cdots, w_{1}\).
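The display equations referenced as formulas (1)–(4) are not reproduced in this text version. Reconstructed from the surrounding derivation (MAP decoding with Bayes' rule and the chain rule of probability), they are presumably of the form:

```latex
% Reconstructed sketch of formulas (1)-(4); symbols follow the prose above.
W^{*} = \arg\max_{W \in \tau} P(W \mid O)
      = \arg\max_{W \in \tau} \frac{P(O \mid W)\, P(W)}{P(O)}  \quad (1)
\\
W^{*} = \arg\max_{W \in \tau} P(W)                             \quad (2)
\\
P(W) = \prod_{i=1}^{n} P(w_i \mid \omega^{i-1})                \quad (3)
\\
W^{*} = \arg\max_{W \in \tau} \prod_{i=1}^{n} P(w_i \mid \omega^{i-1}) \quad (4)
```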
Considering the large number of words, it is hard to calculate the conditional probability \(P(w_i \mid \omega^{i-1})\) directly. Therefore, a finite number of words is selected as the calculation range; namely, the n-gram model is widely used, for example the 2-gram, 3-gram and so on. It assumes that the conditional probability \(P(w_i \mid \omega^{i-1})\) is only related to the preceding \(n-1\) words. As a result, it can be simplified as:
Thus, using the binary grammar model, namely the 2-gram, \(P(W)\) can be approximated as follows:
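To make the 2-gram approximation concrete, the following is a minimal sketch (hypothetical corpus and function names, not the paper's implementation) of a maximum-likelihood bigram model, where \(P(W) \approx P(w_1)\prod_{i=2}^{n} P(w_i \mid w_{i-1})\):

```python
from collections import Counter

def train_bigram(corpus):
    """Count unigrams and bigrams over tokenized sentences.
    '<s>' is a sentence-start marker, so P(w_1) = P(w_1 | <s>)."""
    unigrams, bigrams = Counter(), Counter()
    for sentence in corpus:
        tokens = ["<s>"] + sentence
        unigrams.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))
    return unigrams, bigrams

def sentence_prob(sentence, unigrams, bigrams):
    """P(W) under the 2-gram model: the product of P(w_i | w_{i-1}),
    estimated by maximum likelihood as count(w_{i-1} w_i) / count(w_{i-1})."""
    p, prev = 1.0, "<s>"
    for w in sentence:
        if unigrams[prev] == 0:
            return 0.0
        p *= bigrams[(prev, w)] / unigrams[prev]
        prev = w
    return p
```

On a toy two-sentence corpus such as `[["turn", "on", "light"], ["turn", "off", "light"]]`, the model assigns "turn on light" probability 0.5, since only the choice between "on" and "off" after "turn" is uncertain.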
Deep hybrid intelligent algorithm
Deep training and residual
The mean square error equation can be expressed as:
By taking the partial derivative of this equation with respect to each variable, a value called the "residual" is calculated for each unit and is denoted as \(\delta_{i}^{(l)}\). First, we obtain the residuals of the units in the output layer:
Then the residuals of the individual units in the other layers, for example the layers \(l = n_{l}-1, n_{l}-2, \cdots, 2\), can also be obtained; for the residuals of the layer \(l = n_{l}-1\):
where \(W\) is the weight, \(b\) is the bias, \((x,y)\) is the sample, \(h_{W,b}(x)\) is the final output and \(f(\cdot)\) is the activation function. Further, the relationship between the residuals of units at two adjacent layers can be obtained:
Finally, with all of these formulas the learning and training of the novel deep hybrid intelligent algorithm can be realized, namely:
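The residual equations referenced in this subsection are not reproduced in this text version. For a network trained on the mean square error \(J = \frac{1}{2}\left\| y - h_{W,b}(x) \right\|^2\), they presumably take the standard back-propagation form:

```latex
% Reconstructed sketch of the residual (delta) recursion; z^{(l)} is the
% weighted input and a^{(l)} = f(z^{(l)}) the activation of layer l.
\delta_i^{(n_l)} = -\bigl(y_i - a_i^{(n_l)}\bigr)\, f'\bigl(z_i^{(n_l)}\bigr)
\\
\delta_i^{(l)} = \Bigl(\sum_{j} W_{ji}^{(l)}\, \delta_j^{(l+1)}\Bigr)\, f'\bigl(z_i^{(l)}\bigr),
\qquad l = n_l - 1,\, n_l - 2,\, \cdots,\, 2
\\
\frac{\partial J}{\partial W_{ij}^{(l)}} = a_j^{(l)}\, \delta_i^{(l+1)},
\qquad
\frac{\partial J}{\partial b_i^{(l)}} = \delta_i^{(l+1)}
```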
DBNESR (Deep Belief Network Embedded with Softmax Regression, DBNESR)
The DBN uses the RBM [50, 51], an unsupervised learning network, as the basis for the multilayer learning system and uses a supervised learning algorithm named BP (Back-Propagation) for fine-tuning after the pre-training. Its architecture is shown in Fig. 2. The deep architecture is a fully interconnected directed belief network with one input layer \(v^{1}\), parameter space \(W = \{ W^{1}, W^{2}, \cdots, W^{N}\}\), hidden layers \(h^{1}, h^{2}, \cdots, h^{N}\), and one labelled layer at the top. The input layer \(v^{1}\) has \(D\) units, equal to the number of features of the samples. The label layer has \(C\) units, equal to the number of classes of the label vector \(Y\). The numbers of units in the hidden layers are currently predefined according to experience or intuition. The goal of the mapping function here is transformed into the problem of finding the parameter space \(W = \{ W^{1}, W^{2}, \cdots, W^{N}\}\) for the deep architecture [52].
The semi-supervised learning method based on the DBN architecture can be divided into two stages. First, the DBN architecture is constructed by greedy layer-wise unsupervised learning using the RBM as the basis. All samples are utilized to find the parameter space \(W\) with \(N\) layers. Second, the DBN architecture is trained according to the log-likelihood using the gradient descent method. Since it is difficult to optimize a deep architecture directly using supervised learning, the unsupervised learning stage abstracts the features effectively and prevents overfitting in the supervised training. The BP algorithm is used to pass the error from the top down for fine-tuning after the pre-training.
For unsupervised learning, the energy of the joint configuration \((h^{k-1}, h^{k})\) is defined as [53]:
where \(\theta = (W,b,c)\) are the model parameters. \(w_{ij}^{k}\) is the symmetric interaction term between unit \(i\) in layer \(h^{k-1}\) and unit \(j\) in layer \(h^{k}\), \(k = 1, \cdots, N-1\). \(b_{i}^{k-1}\) is the \(i\)-th bias of layer \(h^{k-1}\) and \(c_{j}^{k}\) is the \(j\)-th bias of layer \(h^{k}\). \(D^{k}\) is the number of units in the \(k\)-th layer. The network assigns a probability to every possible data point via this energy function. The probability of a training data point can be raised by adjusting the weights and biases to lower the energy of that data point and to raise the energy of similar, confabulated data that \(h^{k}\) would prefer to the real data. When the value of \(h^{k}\) is input, the network can learn the content of \(h^{k-1}\) by minimizing this energy function.
The probability that the model assigns to an \(h^{k-1}\) is:
where \(Z(\theta )\) denotes the normalizing constant. The conditional distributions over \(h^{k}\) and \(h^{k  1}\) are given as:
The probability that unit \(j\) turns on, which is a logistic function of the states of \(h^{k-1}\) and \(w_{ij}^{k}\), is:
The probability that unit \(i\) turns on, which is a logistic function of the states of \(h^{k}\) and \(w_{ij}^{k}\), is:
Here, the logistic function chosen is the sigmoid function:
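The RBM equations referenced above (the energy, the marginal probability, and the two conditional distributions) are not reproduced in this text version; in the notation of this section they presumably read:

```latex
% Reconstructed sketch of the RBM energy and conditionals for the
% layer pair (h^{k-1}, h^k).
E(h^{k-1}, h^{k};\theta) = -\sum_{i=1}^{D^{k-1}} b_i^{k-1} h_i^{k-1}
  - \sum_{j=1}^{D^{k}} c_j^{k} h_j^{k}
  - \sum_{i=1}^{D^{k-1}} \sum_{j=1}^{D^{k}} h_i^{k-1} w_{ij}^{k} h_j^{k}
\\
P(h^{k-1};\theta) = \frac{1}{Z(\theta)} \sum_{h^{k}} e^{-E(h^{k-1}, h^{k};\theta)}
\\
P\bigl(h_j^{k} = 1 \mid h^{k-1}\bigr) = \sigma\Bigl(c_j^{k} + \sum_i h_i^{k-1} w_{ij}^{k}\Bigr),
\qquad
P\bigl(h_i^{k-1} = 1 \mid h^{k}\bigr) = \sigma\Bigl(b_i^{k-1} + \sum_j w_{ij}^{k} h_j^{k}\Bigr)
\\
\sigma(x) = \frac{1}{1 + e^{-x}}
```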
The derivative of the log-likelihood with respect to the model parameter \(w^{k}\) can be obtained from Eq. (13):
where \(\langle \cdot \rangle_{{p_{0} }}\) denotes an expectation with respect to the data distribution and \(\langle \cdot \rangle_{{p_{Model} }}\) denotes an expectation with respect to the distribution defined by the model [54]. The expectation \(\langle \cdot \rangle_{{p_{Model} }}\) cannot be computed analytically. In practice, \(\langle \cdot \rangle_{{p_{Model} }}\) is replaced by \(\langle \cdot \rangle_{{p_{1} }}\), which denotes the distribution of samples when the feature detectors are driven by the reconstructed \(h^{k-1}\). This is an approximation of the gradient of a different objective function called CD (Contrastive Divergence) [55]. The Kullback–Leibler distance used to measure the "diversity" of two probability distributions, represented by \(KL(P\|P^{\prime})\), is shown in Eq. (21):
where \(p_{0}\) denotes the joint probability distribution of the initial state of the RBM network, \(p_{n}\) denotes the joint probability distribution of the RBM network after \(n\) transformations of the MCMC (Markov Chain Monte Carlo) chain, and \(p_{\infty }\) denotes the joint probability distribution of the RBM network at the end of the MCMC chain. Therefore, \(CD_{n}\) can be regarded as measuring the location of \(p_{n}\) between \(p_{0}\) and \(p_{\infty }\). The procedure repeatedly assigns \(p_{n}\) to \(p_{0}\) and obtains a new \(p_{0}\) and \(p_{n}\). Experiments show that \(CD_{n}\) tends to zero and that the accuracy approximates that of the full MCMC after the correction parameter \(\theta\) has been adjusted along the gradient \(r\) times. The training process of the RBM is shown in Fig. 3.
We can get Eq. (22) through the training process of the RBM using CD:
where \(\eta\) is the learning rate. Then, the parameter can be adjusted through:
where \(\mu\) is the momentum.
The above discussion is based on training the parameters between the hidden layers with one sample \(x\). For unsupervised learning, the deep architecture is constructed using all samples by inputting them one by one from layer \(h^{0}\) and training the parameters between \(h^{0}\) and \(h^{1}\). Then \(h^{1}\) is constructed: the value of \(h^{1}\) is calculated from \(h^{0}\) and the trained parameters between \(h^{0}\) and \(h^{1}\), and it can in turn be used to construct the next layer \(h^{2}\), and so on. The deep architecture is constructed layer by layer from the bottom to the top. In each iteration, the parameter space \(W^{k}\) is trained on the data calculated in the \((k-1)\)-th layer. According to the \(W^{k}\) calculated above, the layer \(h^{k}\) is obtained as below for a sample \(x\) fed from the layer \(h^{0}\):
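As a hedged sketch of the CD-based update of Eqs. (22)–(23) (illustrative names, pure Python; the momentum term \(\mu\) is omitted for brevity), one CD-1 step for a single RBM layer pair can be written as:

```python
import math
import random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def hidden_probs(v, W, c):
    """P(h_j = 1 | v) = sigmoid(c_j + sum_i v_i * w_ij)."""
    return [sigmoid(c[j] + sum(v[i] * W[i][j] for i in range(len(v))))
            for j in range(len(c))]

def visible_probs(h, W, b):
    """P(v_i = 1 | h) = sigmoid(b_i + sum_j w_ij * h_j)."""
    return [sigmoid(b[i] + sum(W[i][j] * h[j] for j in range(len(h))))
            for i in range(len(b))]

def cd1_step(v0, W, b, c, eta, rng):
    """One CD-1 update: sample h from the data, reconstruct v, and move
    the parameters along eta * (<vh>_p0 - <vh>_p1)."""
    ph0 = hidden_probs(v0, W, c)
    h0 = [1.0 if rng.random() < p else 0.0 for p in ph0]
    v1 = visible_probs(h0, W, b)          # "confabulated" reconstruction
    ph1 = hidden_probs(v1, W, c)
    for i in range(len(b)):
        for j in range(len(c)):
            W[i][j] += eta * (v0[i] * ph0[j] - v1[i] * ph1[j])
        b[i] += eta * (v0[i] - v1[i])
    for j in range(len(c)):
        c[j] += eta * (ph0[j] - ph1[j])
```

In a full implementation this step would be applied sample by sample and layer pair by layer pair, matching the greedy bottom-to-top construction described above.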
For supervised learning, the DBN architecture is trained with the \(C\)-class labelled data. The optimization problem is formulated as:
This minimizes the cross-entropy. In the equation, \(p_{k}\) denotes the real label probability and \(\hat{p}_{k}\) denotes the model label probability.
The greedy layer-wise unsupervised learning is used solely to initialize the parameters of the deep architecture, and the parameters are then updated based on Eq. (23). After the initialization, real values are used in all the nodes of the deep architecture. Gradient descent through the whole deep architecture is used to retrain the weights for optimal classification.
DLSTM (Deep Long Short-Term Memory network, DLSTM)
Speech signals are serialized data with the characteristics of consistency and causality, so a serialization model is used to better capture the dependencies between sequential words. To this end, we present a DLSTM [50] integrated with the DBNESR to constitute a novel deep hybrid intelligent algorithm, which has the advantages of modelling the dependencies between earlier and later data in a sequence and of dimensionality reduction, while overcoming the disadvantage of vanishing or exploding gradients. It can also realize the function of memory even for very long sequences, so as to better model and perform speech recognition and semantic control. The schematic diagram of the DLSTM is shown in Fig. 4.
In a recurrent neural network, the final gradient of the weight array \(W\) is the sum of the gradients at each moment, namely:
In this formula, the gradient may be almost zero at a certain moment, thus making no contribution to the final gradient value, so that the previous state is suddenly lost; that is, long-distance dependence cannot be processed. For this reason, a unit state \(c\) is added to preserve the long-term state, and at the same time a gate mechanism is used to control the contents of \(c\). The first gate is the forgetting gate, which is expressed as follows:
where \(W_{f}\) denotes the weight matrix, \([h_{t-1}, x_{t}]\) denotes joining the two vectors \(h_{t-1}\) and \(x_{t}\) together, \(b_{f}\) is the bias, and \(\sigma\) is the activation function. The inputting gate is expressed as:
Based on the previous output and the current input, the candidate cell state describing the current input can be derived:
Then the unit state at the current moment can be calculated:
The notation \(\circ\) means element-wise multiplication. The outputting gate can be expressed as:
The final output of the DLSTM is determined by the outputting gate and the cell state:
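The gate equations above can be condensed into a single forward step. The following pure-Python sketch (illustrative names, not the paper's code) treats \([h_{t-1}, x_{t}]\) as simple list concatenation:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def matvec(W, v):
    return [sum(w * x for w, x in zip(row, v)) for row in W]

def lstm_step(x_t, h_prev, c_prev, Wf, bf, Wi, bi, Wc, bc, Wo, bo):
    """One LSTM forward step: forgetting gate f, inputting gate i,
    candidate state c~, unit state c, outputting gate o, output h."""
    z = h_prev + x_t                                    # [h_{t-1}, x_t]
    f = [sigmoid(u + v) for u, v in zip(matvec(Wf, z), bf)]
    i = [sigmoid(u + v) for u, v in zip(matvec(Wi, z), bi)]
    c_tilde = [math.tanh(u + v) for u, v in zip(matvec(Wc, z), bc)]
    c = [fj * cj + ij * ctj                             # c_t = f . c_{t-1} + i . c~
         for fj, cj, ij, ctj in zip(f, c_prev, i, c_tilde)]
    o = [sigmoid(u + v) for u, v in zip(matvec(Wo, z), bo)]
    h = [oj * math.tanh(cj) for oj, cj in zip(o, c)]    # h_t = o . tanh(c_t)
    return h, c
```

With zero weights and biases every gate sits at 0.5, so one step halves the carried state: feeding `c_prev = [2.0, 2.0]` returns `c = [1.0, 1.0]`, which illustrates how the forgetting gate scales the long-term state.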
There are eight groups of parameters to be learned in DLSTM training, namely the weights and biases of the forgetting gate, inputting gate, outputting gate and the computation of the unit state: \(W_{f}\) (\(W_{fh}\) and \(W_{fx}\)) and \(b_{f}\); \(W_{i}\) (\(W_{ih}\) and \(W_{ix}\)) and \(b_{i}\); \(W_{o}\) (\(W_{oh}\) and \(W_{ox}\)) and \(b_{o}\); and \(W_{c}\) (\(W_{ch}\) and \(W_{cx}\)) and \(b_{c}\). Since the DLSTM has four weighted inputs, the error term is defined as the derivative of the loss function with respect to the output value, as shown below:
During training, the error term is transmitted backwards along time, and the error term at time \(t-1\) is set as:
Using Eqs. (30), (32) and the total derivative formula, we can obtain:
Solving each partial derivative in Eq. (33), we obtain:
According to the definitions of \(\delta_{o,t}\), \(\delta_{f,t}\), \(\delta_{i,t}\) and \(\delta_{\tilde{c},t}\), we can get:
Equations (36) and (37) are the formulas that back-propagate the error by one moment along time, so the formula for transmitting the error term back to any moment \(k\) can be obtained:
At the same time the formula for transmitting the error to the upper layer can also be obtained:
The gradients of \(W_{oh}\), \(W_{fh}\), \(W_{ih}\) and \(W_{ch}\) and the gradients of \(b_{o}\), \(b_{i}\), \(b_{f}\) and \(b_{c}\) are all sums of their gradients at each moment, from which the final gradients are obtained:
The gradients of \(W_{ox}\), \(W_{fx}\), \(W_{ix}\) and \(W_{cx}\) only need to be calculated directly from the corresponding error terms:
Through the above gradients, the values of the weights and biases can be updated so as to realize the training of the DLSTM.
Experiments and result analysis
Experimental environment

Hardware: the motherboard, an integrated development environment including the core processing unit, memory, various interfaces, and an onboard speech processing module that can amplify, filter, sample, convert with an A/D (Analog-to-Digital) or D/A (Digital-to-Analog) converter and digitize the speech signal, together with a MIC (microphone), ZigBee, RFID, GPRS, WiFi, RS232, USB and so on.

Software: a Linux system for the embedded development, combined with the important auxiliary tools SecureCRT and ESP8266, used respectively for downloading, cross-compiling, burning and writing the algorithms, codes and other data.
Experimental process and results
The implementation process of speech recognition and semantic control is as follows. First, voice signals are obtained from audio files or input devices, A/D-converted, encoded and decoded, and learned and trained by the novel deep hybrid intelligent algorithm. Second, the corresponding semantic vocabularies are obtained, realizing the language-to-semantics conversion. Third, based on the semantic information, the system achieves the corresponding I/O output controls through system call functions and performs the related system operations. For example, it can turn on and off the LED (Light-Emitting Diode) lights of the corresponding equipment. To do this, the system should implement at least the "open", "read", "write", "close" and similar system operations [56,57,58]. In the experiment, we also refer to the development boards of YueQian and the phonetic components of HKUST Xunfei.
The intelligent control system implemented in this paper has broader functionality. It can realize a wider range of recognition and interaction, for example recognition vocabularies of one, two, three, four, five and more words, with voice data coming respectively from audio files, microphone input devices or mobile phone terminals over WiFi and so on. The experimental results are shown below.

1. First of all, we performed experiments on the recognition of a variety of vocabularies, for example of one, two, three, four, five and more words, taken respectively from the audio file or the microphone input device; for generality and validity each was run 30 times, and the recognition results are shown in Tables 1 and 2 respectively. As can be seen from the results in both tables, except for the first recognition of the multi-word vocabularies from the microphone, which was wrong because of initialization, the system achieved very good and stable recognition results, with a recognition rate of almost 100%.

2. Next, for the recognition of the voice data "Turning on light" and "Turning off light" to implement intelligent interactive control, we used six lights with ID numbers from No. 1 to No. 6 to realize the operations of turning on or off and switching any light, such as No. 3 and No. 6. A correct operation is denoted as 1, and a wrong operation as 0. Each experiment was repeated 30 times for each light. To be more general and authentic, we used two types of circuit boards with these lights for the experiments, named category I and category II respectively. All recognition and interaction results are shown in Tables 3 and 4 respectively. From the results in these tables we can see that the speech recognition and semantic control system for the audio file on the category I and category II circuit boards achieved very good and stable recognition and interaction results, with a recognition and interaction rate of 100%.

3. For the speech recognition and semantic control system with the microphone on the category I and category II circuit boards, all recognition and interaction results are shown in Tables 5 and 6 respectively. From the results in these tables we can see that the system also achieved very good and stable recognition and interaction results, with a recognition and interaction rate that also reached 100%.

4. The speech recognition and semantic control system for voice data from mobile phone terminals over WiFi on the category I and category II circuit boards each produced one recognition error, namely for light No. 1 on its first trial and for light No. 6 on its first trial. The recognition and interaction results are slightly worse, but the rate is still close to 100%, namely 99.4444%. The main reason is that the signal is not stable when the WiFi is first connected. All recognition and interaction results are shown in Tables 7 and 8 respectively.

5. In addition to achieving very good and stable recognition and interaction results, we also measured the recognition time. To provide more ways of human–machine interaction, this paper realizes several kinds of recognition, namely based on audio files, on microphones and on mobile phone terminals. Considering the process of information processing, the recognition time based on mobile phone terminals is intuitively a little longer, that based on microphones is intermediate, and that based on audio files is the shortest. We therefore take the middle one, namely the microphone, for the timing experiments. Each experiment was repeated 20 times for each light. The results are shown in Tables 9 and 10; times are reported in seconds and milliseconds (1 s = 1000 ms). It can be seen that all recognition times are less than one second, which is very good, namely completely able to meet actual needs.

6.
To show how the recognition time varies, we plotted the curves in Figs. 5 and 6. Figure 5 shows that all recognition times are below one second: the maximum is only 0.982 s, the minimum is 0.447 s, and the average over all recognitions is 0.7493 s. Likewise, Fig. 6 shows that all recognition times are below one second: the maximum is only 0.968 s, the minimum is 0.624 s, and the average is 0.7767 s. Both sets of results are very good and completely able to meet practical needs.

7.
For each light, we also computed the average recognition time, which are 0.84785, 0.78010, 0.69420, 0.67705, 0.71850 and 0.77810 s, and 0.78040, 0.84755, 0.78865, 0.74245, 0.77340 and 0.72800 s; the bar charts are shown in Figs. 7 and 8, respectively. These values and figures show that all mean recognition times are below one second and vary little from light to light, which indicates that the recognition and interaction performance of the system is good and very stable, completely able to meet practical needs.
Beyond the voice data above, the system can recognize almost any vocabulary pair of opposing meanings, for example “Up and Down”, “Left and Right”, “Before and After”, “Go and Stop”, “Black and White” and so on. The effective vocabulary is therefore very large, enough to meet the needs of almost all Internet of Things applications, namely implementing human–machine interaction based on phonetics and semantics control for constructing the intelligent ecology of the Internet of Things.
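Returning to the timing results, the maximum, minimum, overall mean, and per-light mean times quoted from Figs. 5, 6, 7 and 8 are all simple aggregates. The sketch below shows the computation on hypothetical sample timings, since the raw per-trial data of Tables 9 and 10 are not reproduced here:

```python
# Sketch: per-light and overall recognition-time statistics.
# The sample times (in seconds) are illustrative, not the paper's raw data.
times_per_light = {
    1: [0.85, 0.84, 0.86],
    2: [0.78, 0.77, 0.79],
    3: [0.70, 0.69, 0.69],
}

# Flatten all trials to compute overall statistics.
all_times = [t for trials in times_per_light.values() for t in trials]
print(f"max  = {max(all_times):.3f} s")
print(f"min  = {min(all_times):.3f} s")
print(f"mean = {sum(all_times) / len(all_times):.3f} s")

# Per-light averages, as plotted in the bar charts.
for light, trials in times_per_light.items():
    print(f"light No.{light}: mean = {sum(trials) / len(trials):.5f} s")
```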
Summary and prospect
The implementation of general speech recognition models usually pursues performance alone, without considering capacity and computing power, and therefore requires large capacity and strong computing power. Realizing speech analysis and semantic recognition with small capacity and low computing power is a challenging research area for constructing the intelligent ecology of the Internet of Things. For this purpose, we set up the unit middleware for the implementation of human–machine interconnection based on small capacity and low computing power, namely human–machine interaction based on phonetics and semantics control for constructing the intelligent ecology of the Internet of Things. First, through calculation, theoretical derivation and verification, we present a novel deep hybrid intelligent algorithm that realizes speech analysis and semantic recognition. Second, we establish the unit middleware using the embedded chip as the core of the motherboard. Third, we develop the important auxiliary tools, the writer-burner and the cross-compiler. Fourth, we prune the procedures and system, download, burn and write the algorithms and code into the unit middleware, and cross-compile. Fifth, we expand the functions of the motherboard with more components and interfaces, including RFID, ZigBee, WiFi, GPRS, RS232 serial port and USB interfaces. Sixth, we take advantage of algorithms, software and hardware to make machines "understand" human speech and "think" and "comprehend" human intentions, thereby implementing human–machine interconnection and further structuring the intelligent ecology of the Internet of Things. Finally, the experimental results show that the unit middleware has a very good effect, fast recognition speed, high accuracy and good stability, thus realizing the intelligent ecology construction of the Internet of Things.
Recognition model performance on the one hand, and capacity and computing power on the other, are interrelated and influence each other. Generally, a large model with good performance requires large capacity and strong computing power, and vice versa. Speech recognition based on small capacity and low computing power is therefore much harder and a big challenge, and different applications impose different requirements. In effect, this is a constrained optimization problem. Further revealing the relationship and laws among these factors, such as how to balance them better and what their quantitative relationship is, is the direction of our future efforts.
Availability of data and materials
All data generated or analysed during this study are included in this published article.
References
Wang W, Huang H, Yin Z, Gadekallu TR, Alazab M, Su C. Smart contract token-based privacy-preserving access control system for industrial internet of things. Digit Commun Netw. 2022. https://doi.org/10.1016/j.dcan.2022.10.005.
Hwang CL, Weng FC, Wang DS, Wu F. Experimental validation of speech improvement-based stratified adaptive finite-time saturation control of omnidirectional service robot. IEEE Trans Syst Man Cybern Syst. 2022;52(2):1317–30. https://doi.org/10.1109/TSMC.2020.3018789.
Liu R, Liu Q, Zhu H, Cao H. Multistage deep transfer learning for EmIoT-enabled human-computer interaction. IEEE Internet Things J. 2022. https://doi.org/10.1109/JIOT.2022.3148766.
C. Zhang. Intelligent Internet of things service based on artificial intelligence technology [C], 2021 IEEE 2nd international conference on big data, artificial intelligence and internet of things engineering (ICBAIE), 2021, pp. 731–734, https://doi.org/10.1109/ICBAIE52039.2021.9390061.
Jin Xu, Yang G, Yin Y, Man H, He H. Sparserepresentationbased classification with structurepreserving dimension reduction. Cogn Comput. 2014;6(3):608–21.
Q. Yue. Research on smart city development and Internet of things industry innovation in the "Internet +" era [C], 2021 third international conference on inventive research in computing applications (ICIRCA). 2021; pp. 28–31, https://doi.org/10.1109/ICIRCA51532.2021.9545028.
Dahl GE, Yu D, Deng L, et al. Context-dependent pre-trained deep neural networks for large-vocabulary speech recognition. IEEE Trans Audio Speech Lang Process. 2012;20(1):30–42.
Braunschweiler N, Doddipatla R, Keizer S, Stoyanchev S. Factors in emotion recognition with deep learning models using speech and text on multiple corpora. IEEE Signal Process Lett. 2022. https://doi.org/10.1109/LSP.2022.3151551.
Michelsanti D, et al. An overview of deep-learning-based audio-visual speech enhancement and separation. IEEE/ACM Trans Audio Speech Lang Process. 2021;29:1368–96. https://doi.org/10.1109/TASLP.2021.3066303.
Zhao Z, Zhao R, Xia J, et al. A novel framework of three-hierarchical offloading optimization for MEC in industrial IoT networks. IEEE Trans Industr Inf. 2020;16(8):5424–34.
Shi L, Nazir S, Chen L, et al. Correction to: secure convergence of artificial intelligence and internet of things for cryptographic cipher: a decision support system [J]. Multimed Tools Appl. 2021;80:31465. https://doi.org/10.1007/s11042-021-10975-0.
Wicaksono MGS, Suryani E, Hendrawan RA. Increasing productivity of rice plants based on IoT (Internet of Things) to realize smart agriculture using system thinking approach. Procedia Comput Sci. 2022;197:607–16.
Li Dashe, Sun Yuanwei, Sun Jiajun, Wang Xueying, Zhang Xuan. An advanced approach for the precise prediction of water quality using a discrete hidden Markov model. J Hydrol. 2022;609:127659.
Lin J, Sironi E. Sparse logistic maximum likelihood estimation for optimal wellbeing determinants. IEEE Trans Emerg Top Comput. 2021;9(3):1316–27. https://doi.org/10.1109/TETC.2020.3009295.
Hinton GE, Osindero S, Teh Y. A fast learning algorithm for deep belief nets. Neural Comput. 2006;18(3):1527–54.
Zhang L, Wang J, Wang W, Jin Z, Zhao C, Cai Z, Chen H. A novel smart contract vulnerability detection method based on information graph and ensemble learning. Sensors (Basel). 2022;22(9):3581. https://doi.org/10.3390/s22093581.
Li X, Gao X, Wang C. A novel restricted boltzmann machine training algorithm with dynamic tempering chains. IEEE Access. 2021;9:21939–50. https://doi.org/10.1109/ACCESS.2020.3043599.
Yan Y, Cai J, Tang Y, Yaowen Yu. A Decentralized Boltzmannmachinebased fault diagnosis method for sensors of Air Handling Units in HVACs. J Build Eng. 2022;50:104130.
Hinton G, Salakhutdinov R. Reducing the dimensionality of data with neural networks. Science. 2006;313(5786):504–7.
Chen Q, Pan G, Chen W, Wu P. A novel explainable deep belief network framework and its application for feature importance analysis. IEEE Sens J. 2021;21(22):25001–9. https://doi.org/10.1109/JSEN.2021.3084846.
Zhu C, Cao L, Yin J. Unsupervised heterogeneous coupling learning for categorical representation. IEEE Trans Pattern Anal Mach Intell. 2022;44(1):533–49. https://doi.org/10.1109/TPAMI.2020.3010953.
T. Tambe et al. 9.8 A 25mm2 SoC for IoT Devices with 18ms Noise-Robust Speech-to-Text Latency via Bayesian Speech Denoising and Attention-Based Sequence-to-Sequence DNN Speech Recognition in 16nm FinFET [C], 2021 IEEE International Solid-State Circuits Conference (ISSCC), 2021, pp. 158–160, https://doi.org/10.1109/ISSCC42613.2021.9366062.
Hinton GE, Deng Li, Dong Yu, et al. Deep neural networks for acoustic modeling in speech recognition: the shared views of four research groups. IEEE Signal Process Mag. 2012;29(6):82–97.
Han L. Artificial Neural Networks Tutorial [M]. Beijing: Beijing University of Posts and Telecommunications Press; 2006. p. 330.
Schmidhuber J. Deep learning in neural networks: an overview. Neural Netw. 2015;61:85–117.
Saleem N, Gao J, Irfan M, Verdu E, Parra Fuente J. E2E–V2SResNet: deep residual convolutional neural networks for end-to-end video driven speech synthesis. Image Vis Comput. 2022;119:104389.
Gamanayake C, Jayasinghe L, Ng BKK, Yuen C. Cluster pruning: an efficient filter pruning method for edge AI Vision Applications [J]. IEEE J Sel Top Signal Process. 2020;14(4):802–16. https://doi.org/10.1109/JSTSP.2020.2971418.
Golsanami N, Jayasuriya MN, Yan W, Fernando SG, Liu X, Cui L, Zhang X, Yasin Q, Dong H, Dong X. Characterizing clay textures and their impact on the reservoir using deep learning and Lattice-Boltzmann simulation applied to SEM images. Energy. 2022;240:122599.
Cao T, Zhang H, Song J. BER performance analysis for downlink non-orthogonal multiple access with error propagation mitigated method in visible light communications. IEEE Trans Veh Technol. 2021;70(9):9190–206. https://doi.org/10.1109/TVT.2021.3101652.
Lian Z, Zeng Q, Wang W, Gadekallu TR, Su C. Blockchain-based two-stage federated learning with non-IID data in IoMT system. IEEE Trans Comput Soc Syst. 2022. https://doi.org/10.1109/TCSS.2022.3216802.
ASM, J Sejpal, P Rithvij, PS. Thridhamnae and PK. Performance analysis of Sub-Optimal LDPC decoder for 5G using belief propagation algorithm [C], 2021 10th international conference on internet of everything, microwave engineering, communication and networks (IEMECON), 2021, pp. 1–5, https://doi.org/10.1109/IEMECON53809.2021.9689078.
P Peng, W Zhang, Y Zhang, H Wang and H Zhang. Imbalanced fault diagnosis based on particle swarm optimization and sparse autoencoder [C], 2021 IEEE 24th international conference on computer supported cooperative work in design (CSCWD), 2021; pp. 210–213, https://doi.org/10.1109/CSCWD49262.2021.9437742.
Bahmei B, Birmingham E, Arzanpour S. CNN-RNN and data augmentation using deep convolutional generative adversarial network for environmental sound classification. IEEE Signal Process Lett. 2022. https://doi.org/10.1109/LSP.2022.3150258.
Bengio Yoshua. Deep learning. Cambridge: MIT Press; 2015.
Khelili MA, Slatnia S, Kazar O, Merizig A, Mirjalili S. Deep learning and metaheuristics application in internet of things: A literature review [J]. Microprocess Microsyst. 2023;98:104792.
Abdussamad Z, Agyei IT, Döngül ES, Abdussamad J, Raj R, Effendy F. Impact of internet of things (IoT) on human resource management: a review [J]. Materials Today: Proceedings. 2021.
H Kashif, MN Khan and Q Awais. Selection of network protocols for internet of things applications: a review [J], 2020 IEEE 14th international conference on semantic computing (ICSC), 2020, pp. 359–362, https://doi.org/10.1109/ICSC.2020.00072.
Emad H. Abualsauod, A hybrid blockchain method in internet of things for privacy and security in unmanned aerial vehicles network. Comput Electr Eng. 2022;99:107847.
Zhang Hongfei, Zhu Li, Zhang Liwen, Dai Tao, Feng Xi, Zhang Li, Zhang Kaiqi, Yan Yutian. Smart objects recommendation based on pretraining with attention and the thing–thing relationship in social Internet of things. Future Gener Comput Syst. 2022;129:347.
Frikha MS, Gammar SM, Lahmadi A, Andrey L. Reinforcement and deep reinforcement learning for wireless internet of things: a survey. Comput Commun. 2021;178:98–113.
Saini DK, Saini H, Gupta P, Mabrouk AB. Prediction of malicious objects using preypredator model in Internet of Things (IoT) for smart cities. Comput Ind Eng. 2022;168:108061.
Hinze A, Bowen J, König JL. Wearable technology for hazardous remote environments: smart shirt and Rugged IoT network for forestry worker health. Smart Health. 2022;23:100225.
Borcoci E, Drăgulinescu AM, Li FY, Vochin MC, Kjellstadli K. An overview of 5G slicing operational business models for internet of vehicles, maritime IoT applications and connectivity solutions. IEEE Access. 2021;9:156624–46. https://doi.org/10.1109/ACCESS.2021.3128496.
Alavikia Zahra, Shabro Maryam. A comprehensive layered approach for implementing internet of thingsenabled smart grid: a survey. Dig Commun Netw. 2022. https://doi.org/10.1016/j.dcan.2022.01.002.
Mao Z, Liu X, Peng M, Chen Z, Wei G. Joint channel estimation and activeuser detection for massive access in internet of things—a deep learning approach. IEEE Internet Things J. 2022;9(4):2870–81. https://doi.org/10.1109/JIOT.2021.3097133.
Segars S. ARM9 family: high performance microprocessors for embedded applications [C]. Proceedings of the IEEE International Conference on Computer Design: VLSI in Computers and Processors; 1998.
Nassif AB, Shahin I, Hamsa S, Nemmour N, Hirose K. CASA-based speaker identification using cascaded GMM-CNN classifier in noisy and emotional talking conditions. Appl Soft Comput. 2021;103:107141.
Devi KJ, Singh NH, Thongam K. Automatic speaker recognition from speech signals using self-organizing feature map and hybrid neural network. Microprocess Microsyst. 2020;79:103264.
Wang NYH, et al. Improving the intelligibility of speech for simulated electric and acoustic stimulation using fully convolutional neural networks. IEEE Trans Neural Syst Rehabil Eng. 2021;29:184–95. https://doi.org/10.1109/TNSRE.2020.3042655.
V Sharma, M Jaiswal, A Sharma, S Saini and R Tomar. Dynamic two hand gesture recognition using CNN-LSTM based networks [C], 2021 IEEE international symposium on smart electronic systems (iSES), 2021, pp. 224–229, https://doi.org/10.1109/iSES52644.2021.00059.
Yang X, Wu Z, Zhang Q. Bluetooth indoor localization with Gaussian-Bernoulli Restricted Boltzmann machine plus liquid state machine. IEEE Trans Instrum Meas. 2022;71:1–8. https://doi.org/10.1109/TIM.2021.3135344.
Iiduka H. Appropriate learning rates of adaptive learning rate optimization algorithms for training deep neural networks. IEEE Trans Cybern. 2022. https://doi.org/10.1109/TCYB.2021.3107415.
Zhou S, Chen Q, Wang X. Fuzzy deep belief networks for semi-supervised sentiment classification. Neurocomputing. 2014;131:312–22.
S. Sridhar and S. Sanagavarapu. Analysis and prediction of Bitcoin price using Bernoulli RBM-based deep belief networks [C], 2021 International Conference on INnovations in Intelligent SysTems and Applications (INISTA), 2021, pp. 1–6, https://doi.org/10.1109/INISTA52262.2021.9548422.
Liu F, Zhang X, Wan F, Ji X, Ye Q. Domain contrast for domain adaptive object detection. IEEE Trans Circuits Syst Video Technol. 2022. https://doi.org/10.1109/TCSVT.2021.3091620.
Bian C, Yang S, Liu J, Zio E. Robust stateofcharge estimation of Liion batteries based on multichannel convolutional and bidirectional recurrent neural networks. Appl Soft Comput. 2022;116:108401.
Schoenmakers M, Yang D, Farah H. Carfollowing behavioural adaptation when driving next to automated vehicles on a dedicated lane on motorways: a driving simulator study in the Netherlands. Transport Res F: Traffic Psychol Behav. 2021;78:119–29.
Sohaee N. Error and optimism bias regularization. J Big Data. 2023. https://doi.org/10.1186/s40537-023-00685-9.
Acknowledgements
This research was funded by the National Natural Science Foundation (Grant 61171141, 61573145), the Public Research and Capacity Building of Guangdong Province (Grant 2014B010104001), the Basic and Applied Basic Research of Guangdong Province (Grant 2015A030308018), the Main Project of the Natural Science Fund of JiaYing University (Grant Number 2017KJZ02), the key research bases jointly built by provinces and cities for humanities and social science of regular institutions of higher learning of Guangdong Province (Grant Number 18KYKT11), the cooperative education program of the Ministry of Education (Grant Number 201802153047), the college characteristic innovation project of the Education Department of Guangdong Province in 2019 (Grant Number 2019KTSCX169) and the Project of the Natural Science Fund of JiaYing University (Grant Number 2021KJY05). The authors are greatly thankful for these grants.
Funding
This study was funded by the National Natural Science Foundation (Grant Number 61171141, 61573145), the Public Research and Capacity Building of Guangdong Province (Grant Number 2014B010104001), the Basic and Applied Basic Research of Guangdong Province (Grant Number 2015A030308018), the Main Project of the Natural Science Fund of JiaYing University (Grant Number 2017KJZ02), the key research bases jointly built by provinces and cities for humanities and social science of regular institutions of higher learning of Guangdong Province (Grant Number 18KYKT11), the cooperative education program of the Ministry of Education (Grant Number 201802153047), the college characteristic innovation project of the Education Department of Guangdong Province in 2019 (Grant Number 2019KTSCX169) and the Project of the Natural Science Fund of JiaYing University (Grant Number 2021KJY05).
Author information
Authors and Affiliations
Contributions
HjZ was the lead author of this study and was responsible for collecting the data, analyzing it, creating the figures, and summarizing the study. YhC: reviewing and editing. HkZ: supervision. All authors read and approved the final manuscript.
Author’s information
Haijun Zhang is Ph.D., Professor. His research interests include artificial intelligence, machine learning, deep learning, big data processing and so on. Email: nihaoba_456@163.com.
Hankui Zhuo is Ph.D., Professor, Doctoral supervisor, Winner of Guangdong Outstanding Youth Fund. His research interests include intelligent planning, data mining, etc.
Corresponding author
Ethics declarations
Ethics approval and consent to participate
This article does not contain any studies with human participants or animals performed by any of the authors.
Consent for publication
Not applicable.
Competing interests
The authors declare that they have no competing interests.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
About this article
Cite this article
Zhang, Hj., Chen, Yh. & Zhuo, H. Unit middleware for implementation of human–machine interconnection intelligent ecology construction. J Big Data 10, 107 (2023). https://doi.org/10.1186/s40537-023-00787-4