Open Access
Using transfer learning for smart building management system
Journal of Big Data volume 6, Article number: 110 (2019)
In building management, energy optimization is one of the main concerns that needs to be automated. Automation requires an intelligent system, but such a system must be trained on a large dataset before it can be used reliably. In this paper, we present a transfer learning scheme for developing an intelligent system for smart building management. Specifically, the system counts the humans inside a room, and the count can be used to adaptively adjust the room's energy usage. The transfer learning scheme employs a deep learning model pretrained on the ImageNet dataset. To enable human counting, the model is then trained on a dataset collected specifically for this task.
The concept of the smart city is starting to be applied across the world [1,2,3,4]. Although the concept is typically implemented at city scale, it can also be adapted naturally to a more granular context such as a building. At this scale, the concept can be called a smart building. Implementing it promises more effective and efficient building management. Unfortunately, applying this concept requires a considerable cost for procuring various types of Internet of Things (IoT) devices. Therefore, typical smart building prototypes use only closed circuit television (CCTV) cameras as IoT devices, since these are usually already available in the building.
Using only CCTV poses a significant challenge for a smart building. For energy management, the straightforward implementation uses heat sensors to detect the activity level in a room, which can then be used to adjust the power usage of the electric devices in the room. If the only available IoT devices are CCTV cameras, a robust intelligent system with computer vision technology is needed.
To build such a robust computer vision system, a deep learning algorithm needs to be embedded in it. Deep learning has been shown to perform strongly in computer vision tasks such as image classification [6,7,8,9,10,11], object detection [12,13,14,15], and crowd counting [16,17,18,19,20,21,22,23]. Deep learning is also applicable to the analysis of CCTV data, a high-volume stream from which other machine learning models have difficulty extracting valuable information. However, deep learning requires a large dataset to perform reliably. Because a large dataset is not available for every problem, training a deep learning model from scratch is often impractical. To overcome this challenge, transfer learning has been broadly applied in the development of deep learning models (cite). This study introduces a transfer learning scheme that can be used to develop an intelligent system for smart building management. We focus on an intelligent system for counting the humans in a room, which can be employed to adjust appliances for energy usage optimization. In addition, we collected and shared a dataset that can be used in the proposed transfer learning scheme.
Computer vision is currently advancing astonishingly fast. This growth was initiated by the use of deep learning in the ImageNet Large Scale Visual Recognition Challenge [24, 25]. At first glance, the impressive performance of deep learning seems to be the main cause of this growth. However, the sheer size of the ImageNet dataset also contributes significantly to that performance. ImageNet has about 1.2 million labeled images, which makes it one of the largest computer vision datasets. Only after being trained on ImageNet did deep learning finally show its extraordinary performance. The particular deep learning model used for computer vision, the convolutional neural network (CNN), had not been a first choice for computer vision research in the years since its invention in 1989.
Unfortunately, collecting a massive dataset such as ImageNet requires laborious effort. As a consequence, this approach is impractical for the many problems that have no large dataset available. To cope with this, recent research that uses deep learning employs a concept called transfer learning: a model previously trained on data from one task is used as the base for developing a new model for another task. With transfer learning, a deep learning model pretrained on a large dataset can learn from a relatively small dataset. The use of this concept in deep learning was initiated by Girshick et al., who used a CNN model pretrained on ImageNet to develop a model for object detection. In the following year, Yosinski et al. exhaustively studied and demonstrated the benefit of transfer learning for deep learning models. Since then, using an ImageNet-pretrained model has been standard in many computer vision problems. Even after the development of a large dataset for object detection, transfer learning is still widely adopted for that problem.
The benefit of transfer learning is most apparent in crowd counting, one of the most extensively studied computer vision problems. The most widely used crowd counting dataset, ShanghaiTech, consists of only 1198 images. Another popular dataset, WorldExpo’10, contains only 3980 images. The smallest crowd counting dataset, UCF_CC_50, contains only 50 images. Despite this, the performance of crowd counting models has improved consistently and quickly since deep learning was first used for the problem in 2013. This fast advancement is made possible by the extensive use of transfer learning. Consequently, the state-of-the-art crowd counting models of the last 6 years have always been variants of deep learning. Following this trend, Wang et al. even developed a large simulated dataset for pretraining in crowd counting. The dataset, named GTA Crowd Counting (GCC), was generated with the game Grand Theft Auto (GTA) V and contains 15,212 synthetic crowd images.
Transfer learning scheme for intelligent human counting system
To give a comprehensive picture, we depict the whole intelligent system framework in Fig. 1. The proposed transfer learning scheme is the part of the framework highlighted in green. The scheme starts by acquiring a deep learning model that has been pretrained on the ImageNet dataset [24, 25]. To convert the pretrained model into an intelligent human counting system, the model needs to be trained on a dataset crafted for the human counting task. We therefore collected the required dataset, which we call the RHC (Room Human Counting) dataset. After training, the intelligent human counting system is ready to process video streams from a CCTV camera and output the human count. It is worth noting that the CCTV stream feeds a massive amount of data into the system. For the system to run in real time, it needs to be implemented with suitable Big Data technology. Therefore, the intelligent system should be developed with deep learning libraries that can run on Apache Spark. Based on a recent survey, TensorFlow or Caffe are excellent options, as both libraries are supported by most deep learning frameworks for Apache Spark. Finally, the predicted human count is mapped to an appliance adjustment setting in a control system.
The images of the RHC dataset were extracted from videos captured by a CCTV camera in the NVIDIA-BINUS AI R&D Center room. The dataset was collected from a single room on purpose, to challenge future AI models to learn from one room only. This is necessary for developing a system that can adapt to the different CCTV specifications of different rooms. If a model is able to learn robustly from this dataset, it can be easily retrained using videos with a different resolution from a different room, as long as the resolution within the new dataset is homogeneous.
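The last step of the framework, mapping a predicted count to an appliance setting, can be sketched as a simple lookup. This is only an illustration: the `ApplianceSetting` fields, the occupancy bands, and the function name are hypothetical and are not taken from the control system described in the paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ApplianceSetting:
    lights_on: bool
    ac_level: int  # 0 = off, larger values = stronger cooling


# Hypothetical occupancy bands: (minimum count, setting applied from that count upward).
# A real deployment would tune these to the room and its appliances.
BANDS = [
    (0, ApplianceSetting(lights_on=False, ac_level=0)),
    (1, ApplianceSetting(lights_on=True, ac_level=1)),
    (5, ApplianceSetting(lights_on=True, ac_level=2)),
    (10, ApplianceSetting(lights_on=True, ac_level=3)),
]


def setting_for_count(predicted_count: float) -> ApplianceSetting:
    """Map a (possibly fractional) predicted human count to a setting."""
    # The model outputs a real number, so round it and clamp at zero first.
    count = max(0, round(predicted_count))
    chosen = BANDS[0][1]
    for minimum, setting in BANDS:
        if count >= minimum:
            chosen = setting
    return chosen


if __name__ == "__main__":
    for prediction in (0.2, 2.83, 5.77, 12.6):
        print(prediction, setting_for_count(prediction))
```

Rounding before the lookup keeps small regression noise around zero from switching appliances on in an empty room.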
In this dataset, the videos have a resolution of 640 × 360 pixels and a frame rate of 20 frames/s. The dataset uses 44 videos with a total duration of 206 h 24 min 23 s. Figure 2 shows sample images from the dataset.
Annotating a huge amount of data manually is laborious and therefore usually infeasible. One solution for annotating a massive dataset is to develop an information system crafted specifically for the annotation task. We therefore built an information system to ease the annotation process. The system takes the videos from the acquisition step and displays them for annotation. It is described in detail by Pardamean et al. In this system, the annotators decided which frames to annotate from all videos, resulting in 1217 annotated images.
The dataset is annotated with the total human count per image. We do not annotate the location of each human, as is typically done in crowd counting research. Training a deep learning model with locations introduces unnecessary complexity, since location information is not needed for controlling appliance usage in a room. Giving the model the ability to localize humans would also reduce the speed of the system, and speed is vital for real-time CCTV stream processing.
The human counts in the RHC dataset range from 0 to 13, with the distribution shown in Fig. 3. The mean human count is 4.1249 with a standard deviation of 2.6206. The distribution is not uniform, so the dataset can be considered imbalanced, which typically calls for special treatment if a machine learning model is to learn well from it.
Following the typical machine learning training procedure, we split the dataset into three sets: training, validation, and test. The split was done randomly with stratification on the human count, with a ratio of 60:20:20 between the training, validation, and test sets. The resulting distribution is shown in Table 1.
To assess whether the current size of the RHC dataset is sufficient for transfer learning, we compared it with public crowd counting datasets. Crowd counting datasets are the most similar to our case, since they are also used for counting humans. However, crowd counting differs from our case in that the images contain a huge number of humans in outdoor settings. Crowd counting datasets are typically much smaller than those of other popular computer vision tasks such as image classification and object detection. Consequently, crowd counting research usually relies on transfer learning, which makes these datasets suitable for comparison with the RHC dataset. Table 2 lists popular crowd counting datasets together with the RHC dataset and their sizes. We omitted the GCC dataset from the list because it is synthetic and typically used only for the pretraining phase of a transfer learning scheme. From the comparison, we infer that the size of the RHC dataset should be sufficient for deep learning: it is the third largest among the popular crowd counting datasets.
We identified six possible challenges that must be solved to train a model successfully on the RHC dataset. The first challenge is whether the trained model can count persons whose hair is covered. We see this as a challenge because most persons in this dataset have their hair uncovered. The second challenge is whether the model can count humans with overlapping heads. This challenge is common in crowd counting, where the number of humans captured in an image is massive. In our dataset, a small portion of the images has overlapping heads, mostly images with a large actual count.
The third challenge is teaching the trained model to exclude humans outside the room when predicting the count. The room in this dataset has a transparent glass wall on the left side, through which the outside can be clearly seen. To produce a correct count, the model therefore needs to exclude the persons outside the room. The glass wall also causes the fourth challenge: when the outside of the room is darker, the wall turns into a mirror that reflects the persons inside the room, and the model should be able to differentiate between actual persons and their reflections. The fifth challenge relates to the lighting of the room. Part of the room can sometimes be darker, for example during a presentation session, so the model should be robust to different lighting conditions.
The last challenge we identified concerns the distribution of this dataset. As shown in Fig. 3, the dataset is not balanced across all possible counts: the further a labeled count is from the mean, the fewer images it has. This condition generally leads to poor performance for the labels with fewer images. It is known as the imbalanced data problem and is known to degrade the performance of machine learning models, including deep learning models [39,40,41]. For counting, one possible solution is to create a model that is capable of extrapolating its count predictions to count labels with fewer data.
We conducted an experiment to measure the performance of the developed intelligent human counting system. In the experiment, we considered five popular CNN models as the pretrained model: AlexNet, VGGNet, GoogLeNet, ResNet, and DenseNet. To enable all models to learn from the RHC dataset, we replaced the prediction layers with a fully connected layer consisting of one neuron, which outputs a single number as the predicted human count. Because the input image size of these networks is 224 × 224, we resized the images in the dataset to that size before feeding them to the networks. All models were trained using the Adam optimization algorithm with a learning rate of 0.001. The performance of each model is measured by the mean squared error (MSE) between the predicted count and the actual count.
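To put the annotation effort in perspective, the figures reported for the dataset can be turned into a rough frame count. The calculation below assumes every video runs at a constant 20 frames/s for its full duration, which is an idealization.

```python
# Figures reported for the RHC dataset.
FRAME_RATE = 20                       # frames per second
hours, minutes, seconds = 206, 24, 23  # total duration of all 44 videos
annotated_images = 1217

total_seconds = hours * 3600 + minutes * 60 + seconds
total_frames = total_seconds * FRAME_RATE          # assumes constant frame rate
frames_per_annotation = total_frames / annotated_images

print(f"total seconds: {total_seconds}")   # 743063
print(f"total frames:  {total_frames}")    # 14861260
print(f"roughly one annotated image per {frames_per_annotation:.0f} frames")
```

Under this assumption the videos hold close to 15 million frames, so the annotated images are a very small, hand-picked sample of the raw footage.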
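A stratified 60:20:20 split of this kind can be sketched with the standard library alone. The function name, the seed, and the toy labels below are illustrative assumptions; the paper does not state which tool produced its split, and the toy labels do not reproduce the RHC distribution.

```python
import random
from collections import defaultdict


def stratified_split(labels, ratios=(0.6, 0.2, 0.2), seed=0):
    """Return (train, val, test) index lists, split within each count label."""
    rng = random.Random(seed)
    by_count = defaultdict(list)
    for index, count in enumerate(labels):
        by_count[count].append(index)

    train, val, test = [], [], []
    for count in sorted(by_count):
        indices = by_count[count]
        rng.shuffle(indices)
        n_train = round(ratios[0] * len(indices))
        n_val = round(ratios[1] * len(indices))
        train += indices[:n_train]
        val += indices[n_train:n_train + n_val]
        # Whatever is left goes to the test set. Very rare counts may end up
        # with no validation or test image at all.
        test += indices[n_train + n_val:]
    return train, val, test


# Toy labels: counts 0..13 with 10 images each (not the real distribution).
toy_counts = [count for count in range(14) for _ in range(10)]
train_idx, val_idx, test_idx = stratified_split(toy_counts)
print(len(train_idx), len(val_idx), len(test_idx))
```

Splitting inside each count label keeps the proportion of every count roughly equal across the three sets, which matters here because the counts far from the mean have few images.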
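The role of the replaced prediction layer can be illustrated with a toy version: a single neuron that maps a feature vector to a count and is trained with Adam at a learning rate of 0.001 to minimize MSE. This is a sketch under several assumptions. The feature vectors are random placeholders rather than CNN activations, the pretrained layers are treated as a fixed feature extractor (the paper does not say whether they were frozen), and the remaining Adam hyperparameters are the defaults from the Adam paper.

```python
import random


def predict(weights, bias, x):
    """Single-neuron head: weighted sum of the features plus a bias."""
    return sum(w * xi for w, xi in zip(weights, x)) + bias


def mse(weights, bias, features, counts):
    errors = [(predict(weights, bias, x) - y) ** 2 for x, y in zip(features, counts)]
    return sum(errors) / len(errors)


def train_head(features, counts, steps=2000, lr=0.001,
               beta1=0.9, beta2=0.999, eps=1e-8):
    """Full-batch Adam on the MSE of a one-neuron regression head."""
    dim = len(features[0])
    params = [0.0] * (dim + 1)   # weights followed by the bias
    m = [0.0] * (dim + 1)        # first-moment estimates
    v = [0.0] * (dim + 1)        # second-moment estimates
    n = len(counts)
    for t in range(1, steps + 1):
        grads = [0.0] * (dim + 1)
        for x, y in zip(features, counts):
            err = predict(params[:dim], params[dim], x) - y
            for j in range(dim):
                grads[j] += 2.0 * err * x[j] / n
            grads[dim] += 2.0 * err / n
        for j in range(dim + 1):
            m[j] = beta1 * m[j] + (1 - beta1) * grads[j]
            v[j] = beta2 * v[j] + (1 - beta2) * grads[j] ** 2
            m_hat = m[j] / (1 - beta1 ** t)   # bias-corrected moments
            v_hat = v[j] / (1 - beta2 ** t)
            params[j] -= lr * m_hat / (v_hat ** 0.5 + eps)
    return params[:dim], params[dim]


# Placeholder "features" standing in for the output of a pretrained network.
rng = random.Random(0)
features = [[rng.random() for _ in range(4)] for _ in range(50)]
counts = [3.0 * x[0] + 2.0 * x[1] + 1.0 for x in features]  # synthetic targets

initial_loss = mse([0.0] * 4, 0.0, features, counts)
weights, bias = train_head(features, counts)
final_loss = mse(weights, bias, features, counts)
print(f"MSE before training: {initial_loss:.3f}, after: {final_loss:.3f}")
```

In the actual experiment the same single-output layer sits on top of each of the five CNNs, and the MSE is computed on the real test images instead of synthetic targets.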
Results and discussions
Table 3 lists the MSE of all models on the test split of the RHC dataset. The best MSE is achieved by AlexNet, which has the smallest number of layers. We can see a trend: the more layers a model has, the worse its MSE. We suspect that this is caused by overfitting in the more complex models.
To check our assumption of overfitting, we tabulate the MSE for each actual count in Table 4 and plot it in Fig. 4. We can see that the complex models tend to perform worse on the actual counts with fewer training images. Thus, we conclude that the poor performance of the complex models is caused by overfitting.
We picked several cases that correspond to the challenges identified above. These cases are tabulated in Table 5. In image (a) in the table, we see a person wearing a veil. Because the hair of each person is visible in most of the training data, we suspected that the models might be unable to count a person whose hair is covered. However, that does not seem to be the case: the models predict 5.77 persons on average for this image, while the actual count is 5 persons. The problem is instead the failure of the models to exclude a person who is actually outside the room. The average prediction approaches 6, which indicates that the models tend to count one extra person, likely the person near the leftmost person in the room. This failure is supported by another picture with a similar case, depicted in (b).
Although the models seem unable to exclude humans outside the room, this is not the case when there is more than one person outside. As seen in (c), the models do not suffer from over-counting caused by the persons outside. This is supported by the similar predictions of all models for image (d), in which the outside of the room is relatively clear.
In fact, all models instead suffer from under-counting on images (c) and (d). The average count prediction is 7.23 persons, compared to an actual count of 11 persons. This under-counting might be caused by several persons with overlapping heads, as seen in image (c). The same problem is a possible cause of the poor performance of all models in Table 4 for large counts. However, the under-counting does not appear in images with fewer humans, such as image (e). In this image, there are 2 persons with overlapping heads. The average predicted count for this image is 2.83 persons, close to the actual count of 3 persons. The models are therefore able to handle this case without notable problems.
In addition to images with a large human count, we also checked the opposite extreme: images with a small human count. The average prediction for image (f), which contains only 1 person, is 2.84. This indicates that the models are over-counting. In this case, however, the over-counting does not seem to be caused by the persons outside the room, as more than 2 persons are clearly visible outside. We therefore suspect that the over-counting is caused by the imbalanced data instead.
Trained on the RHC dataset, all models seem to be robust to different lighting. For instance, image (g) is slightly darker than most images in the RHC dataset. The performance of all models is nevertheless still reliable, with a slight under-counting that might instead be caused by overlapping. The models are also robust to the case where the outside of the room is dark, which makes the glass that separates the inside from the outside reflective. An example of this case is provided in image (h). None of the models suffers from over-counting caused by the reflections of the persons in the room.
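The per-count breakdown described here amounts to grouping squared errors by the actual count before averaging. A minimal sketch follows; the function names are illustrative and the sample predictions are made up for the example, not values from Table 4.

```python
from collections import defaultdict


def overall_mse(actual_counts, predicted_counts):
    """MSE over all images, the metric used to compare the models."""
    squared = [(p - a) ** 2 for a, p in zip(actual_counts, predicted_counts)]
    return sum(squared) / len(squared)


def mse_per_count(actual_counts, predicted_counts):
    """MSE computed separately for each actual count label."""
    squared_errors = defaultdict(list)
    for actual, predicted in zip(actual_counts, predicted_counts):
        squared_errors[actual].append((predicted - actual) ** 2)
    return {count: sum(errs) / len(errs)
            for count, errs in sorted(squared_errors.items())}


# Hypothetical predictions for five test images (illustration only).
actual = [1, 1, 5, 5, 11]
predicted = [2.0, 1.0, 5.5, 4.5, 8.0]

per_count = mse_per_count(actual, predicted)
total = overall_mse(actual, predicted)
print(per_count)
print(total)
```

A breakdown like this makes it visible when a single rare count dominates the overall MSE, which the aggregate number alone would hide.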
Conclusion and future works
In this paper, we showed that transfer learning can be used to develop an intelligent human counting system, which can be utilized for energy optimization in smart building management. To enable this development, the RHC dataset was collected to train a pretrained deep learning model to count the humans in a room. The results of this study show that AlexNet is the best choice of pretrained model in the proposed transfer learning scheme. However, the size of the dataset seems insufficient to train networks more complex than AlexNet, which indicates that the dataset should be extended with more data in the future. It would also be interesting to extend the dataset with annotations of the coordinates of each human. We believe that this additional annotation can help a complex model improve its performance.
Availability of data and materials
The RHC dataset is available at http://bdsrc.binus.ac.id/~wawan/rhc/.
Abbreviations
CCTV: closed circuit television
CNN: convolutional neural networks
GCC: GTA Crowd Counting
GTA: Grand Theft Auto
IoT: Internet of Things
MS COCO: Microsoft Common Objects in Context
RHC: room human counting
References
1. Hao L, Lei X, Yan Z, ChunLi Y. The application and implementation research of smart city in China. In: 2012 international conference on system science and engineering (ICSSE). New York: IEEE; 2012. p. 288–92.
2. Dameri RP. Searching for smart city definition: a comprehensive proposal. Int J Comput Technol. 2013;11(5):2544–51.
3. Van den Bergh J, Viaene S. Unveiling smart city implementation challenges: the case of Ghent. Inf Polity. 2016;21(1):5–19.
4. Muchtar K, Rahman F, Cenggoro TW, Budiarto A, Pardamean B. An improved version of texture-based foreground segmentation: block-based adaptive segmenter. Procedia Comput Sci. 2018;135(September):579–86.
5. Minoli D, Sohraby K, Occhiogrosso B. IoT considerations, requirements, and architectures for smart buildings-energy optimization and next-generation building management systems. IEEE Internet Things J. 2017;4(1):269–83.
6. Krizhevsky A, Sutskever I, Hinton GE. Imagenet classification with deep convolutional neural networks. In: Advances in neural information processing systems. 2012. p. 1097–105.
7. Simonyan K, Zisserman A. Very deep convolutional networks for large-scale image recognition. 2014. arXiv preprint arXiv:1409.1556.
8. Szegedy C, Liu W, Jia Y, Sermanet P, Reed S, Anguelov D, Erhan D, Vanhoucke V, Rabinovich A. Going deeper with convolutions. In: Proceedings of the IEEE conference on computer vision and pattern recognition. 2015. p. 1–9.
9. He K, Zhang X, Ren S, Sun J. Deep residual learning for image recognition. In: Proceedings of the IEEE conference on computer vision and pattern recognition. 2016. p. 770–8.
10. Szegedy C, Ioffe S, Vanhoucke V, Alemi AA. Inception-v4, inception-resnet and the impact of residual connections on learning. In: Thirty-first AAAI conference on artificial intelligence. 2017.
11. Huang G, Liu Z, Van Der Maaten L, Weinberger KQ. Densely connected convolutional networks. In: Proceedings of the IEEE conference on computer vision and pattern recognition. 2017. p. 4700–8.
12. Lin T-Y, Goyal P, Girshick R, He K, Dollár P. Focal loss for dense object detection. In: Proceedings of the IEEE international conference on computer vision. 2017. p. 2980–8.
13. Ren S, He K, Girshick R, Sun J. Faster R-CNN: towards real-time object detection with region proposal networks. IEEE Trans Pattern Anal Mach Intell. 2017;39(6):1137–49.
14. Redmon J, Divvala S, Girshick R, Farhadi A. You only look once: unified, real-time object detection. In: 2016 IEEE conference on computer vision and pattern recognition (CVPR). 2016. p. 779–88.
15. He K, Gkioxari G, Dollar P, Girshick R. Mask R-CNN. In: 2017 IEEE international conference on computer vision (ICCV). 2017. p. 2980–8.
16. Li Y, Zhang X, Chen D. Csrnet: dilated convolutional neural networks for understanding the highly congested scenes. In: Proceedings of IEEE conference on computer vision and pattern recognition (CVPR). 2018.
17. Liu L, Wang H, Li G, Ouyang W, Lin L. Crowd counting using deep recurrent spatial-aware network. In: Proceedings of international joint conferences on artificial intelligence organization (IJCAI). 2018.
18. Shi Z, Zhang L, Liu Y, Cao X, Ye Y, Cheng M-M, Zheng G. Crowd counting with deep negative correlation learning. In: Proceedings of IEEE conference on computer vision and pattern recognition (CVPR). 2018. p. 5382–90.
19. Cenggoro TW, Aslamiah AH, Yunanto A. Feature pyramid networks for crowd counting. In: To appear: 2019 international conference of computer science and computational intelligence. 2019.
20. Chen X, Bin Y, Sang N, Gao C. Scale pyramid network for crowd counting. In: 2019 IEEE winter conference on applications of computer vision (WACV). New York: IEEE; 2019. p. 1941–50.
21. Liu N, Long Y, Zou C, Niu Q, Pan L, Wu H. Adcrowdnet: an attention-injective deformable convolutional network for crowd understanding. In: Proceedings of IEEE conference on computer vision and pattern recognition (CVPR). 2019.
22. Liu W, Salzmann M, Fua P. Context-aware crowd counting. In: Proceedings of IEEE conference on computer vision and pattern recognition (CVPR). 2019.
23. Shi M, Yang Z, Xu C, Chen Q. Revisiting perspective information for efficient crowd counting. In: Proceedings of IEEE conference on computer vision and pattern recognition (CVPR). 2019.
24. Deng J, Dong W, Socher R, Li L-J, Li K, Fei-Fei L. Imagenet: a large-scale hierarchical image database. In: 2009 IEEE conference on computer vision and pattern recognition. New York: IEEE; 2009. p. 248–55.
25. Russakovsky O, Deng J, Su H, Krause J, Satheesh S, Ma S, Huang Z, Karpathy A, Khosla A, Bernstein M, et al. Imagenet large scale visual recognition challenge. Int J Comput Vis. 2015;115(3):211–52.
26. LeCun Y, Boser B, Denker JS, Henderson D, Howard RE, Hubbard W, Jackel LD. Backpropagation applied to handwritten zip code recognition. Neural Comput. 1989;1(4):541–51.
27. Girshick R, Donahue J, Darrell T, Malik J. Rich feature hierarchies for accurate object detection and semantic segmentation. In: Proceedings of the IEEE conference on computer vision and pattern recognition. 2014. p. 580–7.
28. Yosinski J, Clune J, Bengio Y, Lipson H. How transferable are features in deep neural networks? In: Advances in neural information processing systems. 2014. p. 3320–8.
29. Lin T-Y, Maire M, Belongie S, Hays J, Perona P, Ramanan D, Dollár P, Zitnick CL. Microsoft coco: common objects in context. In: European conference on computer vision. Berlin: Springer; 2014. p. 740–55.
30. Zhang Y, Zhou D, Chen S, Gao S, Ma Y. Single-image crowd counting via multi-column convolutional neural network. In: Proceedings of the IEEE conference on computer vision and pattern recognition. 2016. p. 589–97.
31. Zhang C, Li H, Wang X, Yang X. Cross-scene crowd counting via deep convolutional neural networks. In: Proceedings of the IEEE conference on computer vision and pattern recognition. 2015. p. 833–41.
32. Idrees H, Saleemi I, Seibert C, Shah M. Multi-source multi-scale counting in extremely dense crowd images. In: Proceedings of the IEEE conference on computer vision and pattern recognition. 2013. p. 2547–54.
33. Wang Q, Gao J, Lin W, Yuan Y. Learning from synthetic data for crowd counting in the wild. 2019. arXiv preprint arXiv:1903.03303.
34. Johnsirani Venkatesan N, Nam C, Shin DR. Deep learning frameworks on apache spark: a review. IETE Tech Rev. 2019;36(2):164–77.
35. Abadi M, Barham P, Chen J, Chen Z, Davis A, Dean J, Devin M, Ghemawat S, Irving G, Isard M, et al. Tensorflow: a system for large-scale machine learning. In: 12th USENIX symposium on operating systems design and implementation (OSDI 16). 2016. p. 265–83.
36. Jia Y, Shelhamer E, Donahue J, Karayev S, Long J, Girshick R, Guadarrama S, Darrell T. Caffe: convolutional architecture for fast feature embedding. In: Proceedings of the 22nd ACM international conference on multimedia. New York: ACM; 2014. p. 675–8.
37. Cenggoro TW, Tanzil F, Aslamiah AH, Karuppiah EK, Pardamean B. Crowdsourcing annotation system of object counting dataset for deep learning algorithm. In: IOP conference series: earth and environmental science, vol. 195. Bristol: IOP Publishing. 2018. p. 012063.
38. Pardamean B, Cenggoro TW, Chandra BJ. Rahutomo: a user interface for rapid data annotation of room activity level detection system. In: To appear: 2019 international conference on eco engineering development (ICEED). Bristol: IOP Publishing; 2019.
39. Johnson JM, Khoshgoftaar TM. Survey on deep learning with class imbalance. J Big Data. 2019;6(1):27.
40. Cenggoro TW, Isa SM, Kusuma GP, Pardamean B. Classification of imbalanced land-use/land-cover data using variational semi-supervised learning. In: 2017 international conference on innovative and creative information technology (ICITech). New York: IEEE; 2017. p. 1–6.
41. Cenggoro TW. Deep learning for imbalance data classification using class expert generative adversarial network. Procedia Comput Sci. 2018;135:60–7.
42. Kingma DP, Ba J. Adam: a method for stochastic optimization. In: The international conference on learning representations 2015, San Diego, CA. 2015.
The raw videos were captured using a CCTV camera in the NVIDIA-BINUS AI R&D Center room. The experiments were run using NVIDIA Tesla P100 and P4 GPUs from the NVIDIA-BINUS AI R&D Center.
This study is funded by Directorate of Research and Community Service, Directorate General of Research and Development, Indonesian Ministry of Research, Technology and Higher Education (Grant No. 23/AKM/MONOPNT/2019) as a part of 2019 Penelitian Terapan Unggulan Perguruan Tinggi Research Grant.
The authors declare that they have no competing interests.
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
About this article
Cite this article
Pardamean, B., Muljo, H.H., Cenggoro, T.W. et al. Using transfer learning for smart building management system. J Big Data 6, 110 (2019) doi:10.1186/s40537-019-0272-6
- Transfer learning
- Deep learning
- Human counting
- Smart building
- Building management system