The popularity of social media and computer-mediated communication has resulted in high-volume and highly semantic data about digital social interactions. This constantly accumulating data has been termed Big Social Data or Social Big Data, and various visions of how to utilize it have been presented. However, as these are relatively new concepts, there are no solid and commonly agreed definitions of them. We argue that the emerging research field around these concepts would benefit from a clearer understanding of the very substance of the concept and the different viewpoints on it. With our review of earlier research, we highlight various perspectives on this multi-disciplinary field and point out conceptual gaps, the diversity of perspectives and the lack of consensus on what Big Social Data means. Based on a detailed analysis of related work and earlier conceptualizations, we propose a synthesized definition of the term and outline the types of data that Big Social Data covers. With this, we aim to foster future research activities around this intriguing, yet untapped type of Big Data.
We live in an “always-on society” [1–3], meaning that people constantly interact with each other. Due to the rapid development of social computing and the mushrooming of social media services, much social interaction is nowadays mediated by information technology and takes place in the digital realm. An average Internet user consumes and shares large amounts of digital content every day through popular social online services, such as Facebook, Twitter, YouTube, Instagram and Snapchat.
From a data perspective, this has led to the emergence of extensive amounts of human-generated data [4, 5] with diverse social uses and rich meanings (for example, communication text, videos for entertainment and self-representation, and sharing of news and other third-party content in social media). Such unstructured or semi-structured, yet semantically rich data has been argued to constitute 95% of all Big Data. This Social Data explosion has resulted in theorizations and studies about the emerging topic of Big Social Data (BSD).
Broadly speaking, BSD refers to large data volumes that relate to people or describe their behavior and technology-mediated social interactions in the digital realm. The sheer volume and semantic richness of such data open enormous possibilities for utilizing and analyzing it for personal [7, 8], commercial [9, 10] as well as societal purposes [11–13]. For example, users of today’s scattered social media would benefit from meta-services that bring together all the content from a user. Commercial use could include even more targeted advertising, matchmaking services, or many as yet unimagined data-centered business models [14, 15]. The search for beneficial applications and services in regard to BSD has only just begun.
Central concepts and goals of the research
In the research literature, the concept of Big Social Data has been defined and interpreted in many ways for various purposes; for example, the viewpoints from which it has been explored include social media, online social networks, social computing, and computational social science (CSS). The role of these fields in the scope of BSD is discussed in detail in the following sections.
As a rule, BSD is mainly utilized to extract insights from social media data and online social interactions of people for descriptive or predictive purposes to influence human decision-making in various application domains [16–18]. In general, researchers have focused on analytics and utilization, paying little attention to clarifying the very concept of BSD and understanding the related phenomena (for example, [19–21]).
In fact, there seems to be a lack of consensus about the definition of BSD and the related terms, as we will analyze in the upcoming sections. A lack of proper conceptualization may bring researchers methodological challenges in their studies, especially in such an inherently broad and multi-disciplinary field as BSD.
Therefore, we argue for conceptual and theoretical work on the concept of BSD in order to inform future research activities as well as to foster the practical utilization of the data, which may yield social insight. There is a timely need to describe, review, and reflect on BSD literature in order to bring clarity to the concept and an understanding of the opportunities it offers to the practitioners of computational social science and other related research fields.
The potential value of this paper for readers is as follows:
Firstly, through the literature review we aim to bring clarity to the various existing BSD concepts and their definitions. We discuss relations between BSD and related fields of science in order to inform readers about the domains where this concept is currently applied. We consider that these aspects will help researchers to properly identify the scope and directions of their investigations on the topic;
Secondly, by providing a synthesized concept and definition of BSD we want to motivate researchers to develop better conceptualizations and clarifications of the meaning of BSD in regard to their research. Currently, the majority of papers related to the topic focus on analytical tasks and methods without explaining what the researchers consider as BSD and why. As a step towards a holistic approach to this emerging field, BSD practitioners can utilize the definition presented in this work by revising it according to their research objectives;
Thirdly, by providing a comprehensive list of BSD types we aim to inform researchers about the categories of data that are currently available for research and analysis. This serves as a starting point to identify research opportunities and practical means towards data-driven research. It is worth noting that there is no extensive taxonomy of BSD in the related literature, nor do we aim to design one; however, our classification of such data serves as an inducement to the research community to collaboratively create this taxonomy;
Moreover, by describing the key characteristics of BSD we differentiate it from the concept of Big Data. By doing so, we anticipate that the emphasis on its unique qualities will open new opportunities for multi-disciplinary research ventures.
In general, we expect this work to draw researchers’ attention to a holistic view of the BSD concept and to help them identify relevant sources of data to utilize in BSD studies.
Related concepts and literature
Due to the rapid development of online social services and the tremendous growth of data therein, various concepts have emerged in different research fields to help understand digital environments and their social effects. This section reviews concepts relevant to BSD and their interrelations, and outlines existing literature on the topic (see Fig. 1).
There are many interpretations and terms to refer to the “social” aspect in Big Data. The most widespread terms so far are Social Big Data (SBD) and Big Social Data (BSD). Various definitions and approaches are presented and compared in the following, in order to outline the existing research directions.
Big Social Data as science: Ishikawa’s and Pentland’s concepts
Hiroshi Ishikawa is a central proponent of the Social Big Data concept, which he described and defined in his book as the science of analyzing interconnections between physical world data and social data for the public good:
“Analyzing both physical real world data (heterogeneous data with implicit semantics such as science data, event data, and transportation data) and social data (social media data with explicit semantics) by relating them to each other, is called Social Big Data science or Social Big Data for short”.
It is worth noting that Ishikawa is one of the few who provide a proper conceptualization of their ideas and views on the social phenomenon in Big Data. Accordingly, he clarifies the relevant terms, data sources and analytical approaches and supports them with arguments.
Thus, he defines social data as social media data, which, in his opinion, is one kind of Big Data with four V characteristics: volume, variety, velocity and vagueness. While the first three characteristics, together with veracity, have already been discussed in multiple studies on Big Data [23–26], vagueness first appears in this book as an essential characteristic of social data. It should not be confused with the vagueness proposed by Venkat Krishnamurthy at the Big Data Innovation Summit in Silicon Valley in 2014, which refers to the confusion over the meaning of Big Data [27–29]. According to Ishikawa, the vagueness characteristic results from combining various types of data for analysis, which leads to inconsistency and deficiency. It also relates to the issues of privacy and data management, as social data involves individuals’ personal information.
Additionally, Ishikawa classifies the sources of social media data as follows: blogging, micro blogging, social network services, sharing and video communication services, social news and gaming, social search and crowd sourcing services, and collaboration services. All data in such services would therefore be regarded as Big Social Data.
Ishikawa is interested in the relationships between the physical and cyber worlds. He considers that SBD should follow a bidirectional analysis that includes influences from the physical real world on social media, and vice versa, in order to develop a complete model (theory). Such a theory may explain interactions between both realms and enable potential prediction, recommendation and problem solving. In other words, he suggests tracking social media data and physical world data in order to reveal mutual interdependencies that in turn would result in actual insight. Ishikawa provides an example of traffic authorities predicting public transportation issues in the context of massive social events that are actively discussed in social networks, blogs, news, etc. Thus, the data from social media could be analyzed to prevent traffic jams or to increase the amount of public transportation next to the event location.
Ishikawa’s thinking is in line with Pentland’s concept of social physics. According to Pentland, social physics is the “quantitative social science that describes reliable, mathematical connections between information and idea flow on the one hand and people’s behavior on the other”. While Ishikawa aims to bring clarity to analytical techniques for SBD (for example, modeling, data mining, multivariate analysis), Pentland envisions a data-driven society. Even though Pentland does not use the terms SBD or BSD directly in his conceptualization, he defines Big Data as the engine of social physics. The author refers to data about human behavior, which consists of both human-generated content (from social media platforms) and data from the physical world (for instance, transactions, locations, call records), which is similar to Ishikawa’s vision of social data sources. The main goal of Pentland’s research is to show how this data together with social science theories could be applied in practical settings.
Data-driven approaches to Big Social Data
Guellil and Boukhalfa consider SBD as a part of social computing. To differentiate their view of SBD from general Big Data, the authors provide certain characteristics, referring to the research of Tang et al.: “the set of links (due to relationships between users), a nonstructural nature (due to the length of messages required by some microblogging, the presence of spelling mistakes or other) and the lack of completeness (due to certain user requirements for data privacy)”. The authors provide a classification of the research works on SBD and discuss various analytical approaches and related challenges.
Guellil and Boukhalfa compile their vision of SBD based on the works of Barbier, Mukkamala and Nguyen. Notably, Mukkamala and Nguyen use the terms SBD and BSD interchangeably and mention only social media data as a major data source. Even though Guellil and Boukhalfa point out the inconsistent use of terms in the related literature, they do not provide a clear conceptualization of SBD in their own research. In fact, the term SBD from the perspective of Guellil and Boukhalfa might be interpreted as a synonym for social media data with qualities such as large volume, noisiness and dynamism, which were already identified in Barbier’s work.
From another perspective, Mark Coté attempts to distinguish the BSD concept from the broader category of Big Data. In his view, Big Data is any data produced as the result of the quantification of the world, which may include data from sensors, multiple industrial and domestic networks as well as financial markets, whereas BSD “comes from the mediated communicative practices of our everyday lives, whenever we go online, use our smartphone, use an app or make a purchase.” Moreover, Coté provides reasoning for the importance of BSD. According to him, the concept is not novel, but may significantly affect media theory. Among the reasons are: the enormous size of human-generated data, which enables endless future analysis; the symbolic nature of social data, which is challenging to process even though it is produced in structured platform spaces; the highly distributed infrastructure of BSD, which requires scalable computer architecture and network capacity; and challenges related to processing, storage, costs and data regulations.
Purpose-driven approaches: Big Social Data for society
Jean Burgess and Axel Bruns discuss Big Data in terms of social media and use the BSD term to refer to this research area. Their vision is based on Manovich’s ideology, which is focused on bringing the potential of social or cultural data into the humanities and social sciences. Thus, Burgess and Bruns present the BSD concept by mentioning the shift of Big Data towards media, communication, cultural and computational social science, which has led to a wave of research on digital humanities [39–41]. According to Burgess and Bruns, such changes were “...provoked in large part by the dramatic quantitative growth and apparently increased cultural importance of social media – hence, ‘big social data’”. Their research aims to clarify the role of social media in the context of the contemporary media ecology, with a focus on communication, societal events and the nature of human engagement, by applying computational methods to Twitter archives. Inspired by Manovich’s concept of BSD, they trialled the feasibility of research on the phenomenon in order to reveal potential technical, political and epistemological issues. They identified ethical concerns as well as data accessibility, authenticity and reliability challenges. Based on the results, they stated that research on BSD requires the elaboration of mature conceptual models and methodological priorities.
Housley et al. also take a society-oriented view of Big Data. The authors have been conducting observatory research on the opportunities and challenges of open source social media data in the context of the social sciences. They seek improvements in governance and organization through the sense of civil society by means of ‘big and broad’ social data. According to the authors, the term “big and broad” social data refers to the three V’s (volume, variety, velocity), already well-known dimensions of such data, which might also be real-time and dynamic. Accordingly, social media could be used to empower people’s engagement in civil society through the methodological approach for generating sociological insight proposed in the paper. Housley et al. characterize digital innovations with qualities such as interaction, participation and “social”, which affect the complicated relationships between data and analytical capacity, thus enabling a participatory infrastructure for public sociology. Consequently, the authors point to “citizen social science”, which aims to assist social scientists by reducing the challenges of social media data with the help of volunteer citizens. Such members of the public may contribute to research by recording their knowledge, opinions and beliefs, thus connecting the social science academy and society [44, 45].
Big Social Data as method
Bello-Orgaz et al. consider SBD to be a combination of Big Data and social media. According to the authors, SBD is needed for the analysis of large amounts of data from diverse social media sources. They theorize the concept as follows: “Those processes and methods that are designed to provide sensitive and relevant knowledge to any user or company from social media data sources when data sources can be characterized by their different formats and contents, their very large size, and the online or streamed generation of information”.
Thus, the conceptual map of SBD from Bello-Orgaz et al. incorporates Big Data as the processing paradigm, social media as the main source of data, and Data Analysis as the method for gaining and analyzing knowledge. The authors review analytical methodologies for social media as well as new related applications and frameworks.
Summary of the related literature
Even though not all of the above-mentioned papers explicitly use BSD as a term, we consider these works relevant to the topic. Researchers try to clarify the phenomenon of the rapidly growing amount of human-related social data and seek ways to apply it for the good of society, data analytics and various fields of science. The key content of the approaches under discussion and theorizations about BSD is summarized in Table 1.
One central commonality among existing research directions is the presence of social media as the major data source and an orientation towards analytics. The conceptualizations in these scientific articles vary from fundamentally broad (e.g., Ishikawa and Pentland) to vaguely described (e.g., Guellil and Boukhalfa). Additionally, there are only a few attempts to distinguish the concepts from mere Big Data. Importantly, there is also a lack of clarity regarding the relations between researchers’ concepts and related fields: it is hard to outline how other sciences affect the scope of BSD/SBD and the directions of studies. Moreover, it is often unclear which data types are considered relevant and valuable for research, and it is hard to understand which data was utilized in the reported research.
We conclude that there are research gaps that researchers of BSD should bridge in order to achieve a holistic understanding of the concept of BSD and its characteristics. For example, it is essential to identify the data types that can be explored and studied in this domain. A sophisticated conceptualization and definition of BSD would help researchers build proper methods to process and analyze it. This is also essential because the growth in human-generated data engenders new challenges to solve, requiring novel tools, frameworks and methodological approaches as well as multidisciplinary expertise.
Theoretical foundations of Big Social Data
Based on the literature overview we perceive the concept of BSD as a combination of four fields of science: social computing (including social media and social networks), Big Data science and data analytics as fields that enable and contribute to the existence of the data, and CSS as a field that primarily utilizes the data to gain insight and conduct research (see Fig. 2).
We emphasize that the concept should be understood in an interdisciplinary way in order to open new research avenues. The current and possible roles of each field of science in the context of BSD are discussed in the following.
Social computing
Social computing is a research and application field that integrates social and computational sciences. According to Wang, the theoretical foundations of social computing incorporate Social Psychology, Sociology, Social Network Analysis, Anthropology as well as theories of organization, communication, human–computer interaction and computing theory. In his work, Kling addresses the idea of a mutual interference between communication technologies and society. Therefore, social computing favorably affects both society and technology development: on the one hand, it enables smooth socialization and social interactions through various computational systems, and on the other hand, it introduces social practices and theories into the development of computational systems and applications. In terms of BSD, social computing enables services for technology-mediated self-representation and communication and supports the building and maintaining of digital relationships through multiple technological infrastructures (for example, Web, database, multimedia and wireless technologies). In summary, social computing approaches the topic from the perspectives of applications, communication and business.
Big data science
Big data science refers to a field that processes and manages high-volume, high-velocity and high-variety data in order to extract reliable and valuable insights. Big Data is aimed at serving large-scale digital applications and computational systems. Therefore, from the BSD perspective, Big Data science provides solutions to process and manage data originating from technology-mediated social interactions in the context of numerous social services and applications in the digital environment. There are both optimistic and realistic approaches in regard to the recent interest in Big Data technology. One group of researchers (as a rule business-oriented) discusses the potential benefits of utilizing Big Data [51, 52] to study massive data about people, things and interactions, while other researchers raise critical questions, assumptions and issues that may occur when accessing such data [53–55]. It is crucial to take a critical view of the BSD concept, because data that is primarily related to digital human interactions will inevitably raise contentious challenges (for example, data availability, regulations on accessing data, ethical issues, and privacy). In summary, originating from computer science and information systems, Big Data is a broader category than BSD and has a mostly data- and infrastructure-centric perspective, for instance, with a focus on Hadoop, Spark, clusters, and related infrastructural work.
Data analytics
Data analytics allows the extraction of insight or conclusions from existing massive data sets. Generally, it includes descriptive (describing data), exploratory (discovering unknown correlations in data), predictive (predicting events and trends) and prescriptive (suggesting actions) methods to gain meaningful insight for different domains [56, 57]. Social Network Analysis (SNA) is one of the most established fields of data analytics [58, 59], providing tools, methods and theories for the research of social networks in the digital realm. Other central areas that can be relevant for BSD include Business Analytics [60, 61] and Sentiment Analytics [62, 63]. Regardless of the intention and application area of the analysis, data analytics can be said to approach BSD from the perspective of the utilization of data (for example, service development, gaining insight, decision making).
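As a minimal illustration of the descriptive end of this spectrum, the following Python sketch computes degree centrality, a basic SNA measure, over a toy set of interactions. The user names, the interaction list and the function name `degree_centrality` are assumptions made for this example only; they do not come from the reviewed literature.

```python
from collections import defaultdict


def degree_centrality(edges):
    """Return normalized degree centrality for an undirected interaction graph.

    `edges` is an iterable of (user_a, user_b) pairs; repeated pairs count once.
    """
    neighbors = defaultdict(set)
    for a, b in edges:
        if a == b:
            continue  # ignore self-interactions
        neighbors[a].add(b)
        neighbors[b].add(a)
    n = len(neighbors)
    if n < 2:
        return {node: 0.0 for node in neighbors}
    # A node connected to every other node gets the maximum score of 1.0.
    return {node: len(adj) / (n - 1) for node, adj in neighbors.items()}


# Hypothetical interactions, e.g. replies or mentions between four users.
interactions = [("ann", "bob"), ("ann", "cat"), ("ann", "dan"), ("bob", "cat")]
centrality = degree_centrality(interactions)
most_central = max(centrality, key=centrality.get)
```

In this toy network the user who interacts with everyone else receives the highest score. Exploratory, predictive and prescriptive methods would build on such descriptive measures rather than replace them.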
Computational social science
Definition of the concept is only one step towards a proper understanding of BSD. Duncan Watts pointed to the potential of Big Data in the social domain by claiming that “we finally have our telescope”. However, Macy challenges this statement by referring to Gintis and Helbing, who point out that just having a telescope is not enough: “We also need to know where to point it, and for that we need the core analytical toolkit... Big data needs big theory”. In terms of BSD, such a pointer or guide toward theory and meaningful applications is CSS. This multidisciplinary field seeks theory-grounded models of social phenomena at the intersection of social and computational sciences. CSS involves collaboration between social, behavioral, cognitive and computer scientists as well as agent theorists, mathematicians and physicists. According to Conte, CSS goes beyond the traditional tools of social science to unravel social complexity more deeply and from new perspectives. The author highlights that CSS is not only about variables and equations; the major elements of this science are “people, ideas, human-made artifacts, and their relations within ecosystems”. The theorization and modeling of society by means of computational approaches aims to bring comprehension of social complexity and of the way social systems operate. Thus, we argue that CSS utilizes BSD in order to “serve the public good and examine the public agenda”. In other words, CSS can reveal the meaningful and relevant areas for the utilization of BSD, thus pointing out directions for the analysis, making sense of the findings and enabling predictions as well as sensible explanations.
In summary, the aforementioned areas are the central conceptual and theoretical foundations of BSD that contribute to this inter-disciplinary concept. Social computing enables and serves technology-mediated social services and applications that in turn generate vast amounts of complex social data; such data are managed and processed through Big Data tools; insights and prescriptions are then derived with data analytics methods and algorithms. CSS is one of the key fields for defining the targets of and reasons for the analysis and for explaining its results.
Our synthesis and definition of Big Social Data
Drawing from our overview of the related literature and our observation of the contributing fields of science, we provide a meta-level definition of the synthesized BSD concept as follows:
Big Social Data is any high-volume, high-velocity, high-variety and/or highly semantic data that is generated from technology-mediated social interactions and actions in the digital realm, and which can be collected and analyzed to model social interactions and behavior.
This definition approaches the concept from the synthesized perspective including the description of social data characteristics, its sources and origins as well as purpose of use:
Characteristics. In short, in this context, volume refers to the exponential growth of social data. Variety relates to the various types and forms of social data sources: the data might be structured, semi-structured or unstructured. Variety can also mean differences in format (for instance, text, image, video). Velocity refers to the fact that social data is generated and distributed with tremendous speed. One can simply count one’s own activity in online services per hour to imagine the frequency with which billions of people create or share something online at this very moment. These characteristics define the size of the social data available for analysis as well as the real-time and dynamic nature of BSD. Volume, velocity and variety are traditional characteristics of any Big Data, while semantics is a characteristic more specific to BSD. It refers to the fact that all manually created content is highly symbolic, with various, often subjective meanings that require intelligent solutions to be analyzed. There have been studies on mining and analyzing such multimedia data [73–76]; however, we are still far from the degree of intelligence that could turn immense pools of user-generated content into meaningful insights.
Data sources and origins. In the context of BSD, we consider technology-mediated social interactions as the origins of social data types. This refers to digital self-representation, technology-mediated communication and digital relationships data, which may appear not only in social network services but also in a variety of discussion forums, blogs, web and mobile chat applications, multi-player games as well as different web sites that are not for social purposes per se.
Purpose. Analyzing and modeling social interactions and behavior means that researchers may use the data to describe, understand, and build models of the digital interactions taking place between people and of how people act (online) around these interactions (for example, profile building, self-expression and other activities that are not directly seen as interaction but, rather, as necessary prerequisites for it). The knowledge gained from the analysis may then be utilized in a variety of applications, meaning that BSD practitioners are free to choose which domain or research question to address. For instance, researchers may aim to solve fundamental societal issues or just explore tweets for the sake of testing new semantic algorithms.
The definition is further explicated in the following subsection with the classification of data types that relate to technology-mediated social interactions.
Types of Big Social Data
We emphasize that a central element of the BSD concept is the “digital human”, who uses Information and Communications Technology (ICT) for digital social interactions. The rapid evolution of ICT has shifted the role of the user from a consumer to an active producer and mediator of information, thus allowing people to control, personalize and apply the digital realm according to their values, social needs and preferences. We adopt the term “digital human” to underline the shift towards a new sociality that lives in a hybrid reality, where the dynamism and constant availability of technology-mediated communication blur the boundaries between reality and virtuality. Thus, people do not distinguish between their activity in online and physical environments, because of “always-on” social networking. Similarly, Woolgar suggests the term “virtual community” and states that it is just a matter of choosing words: “In this usage, ‘virtual’, like ‘interactive’, ‘information’, ‘global’, ‘remote’, ‘distance’, ‘digital’, ‘electronic’ (or ‘e-’), ‘cyber-’, ‘network’, ‘tele-’, and so on, appears as an epithet applied to various existing activities and social institutions”.
Around digital human interactions, there are both machine-generated and human-generated data that might potentially turn into social insight. However, in this paper we argue that it is precisely human-generated data that makes the BSD concept unique and distinguishes it from the general field of Big Data. While machine-generated data could be analyzed through mere Big Data tools and applications, human-generated content requires more intelligent solutions to decode the semantics of people’s beliefs, opinions and behavior. Undoubtedly, Big Data may show what is changing in social interactions and how; however, it does not answer the question of why those changes and processes are happening. Therefore, we consider BSD to be the means to properly investigate the semantics of human-generated content. From our perspective, it may provide practitioners in many research fields with both facts and reasoning.
When discussing human-generated data, we mean content that is produced through technology-mediated social interactions of people in social media platforms. This category may contain digital self-representation data, technology-mediated communication data and digital relationships data (see Fig. 3). These three categories define the types of data that could be interpreted and utilized as social data in the current digital environment (see Table 2). In other words, Table 2 serves as a simplified taxonomy of BSD; however, it is not meant as an extensive index of what data is BSD but, rather, as a list of currently existing BSD examples that could be available for research and analysis.
The first category to be discussed is digital self-representation. This is the initial step for “digital humans” to socialize and communicate themselves in the digital realm. These data types relate to numerous virtual profiles that have functions of identity depiction and communicative body . In other words, the data is meant to reveal some information (a “face”) for other users in the particular digital service. Albrechstlund proposes a concept of “sharing yourself”, which is related to the way constructed identity is participating in social networks creating relations with others . In digital environment people are limited in verbal and non-verbal impressions compensating it by means of text, pictures, videos and music that could be placed in the following data categories:
Profile data It includes login data (usually a name, nickname or e-mail address with which other people identify the user); identity data (depending on the digital environment, e.g., some services require a real first and last name, mobile phone number, country, education and birthday); and personality data (e.g., profile pictures, tags of interest, a slogan, a personal signature in discussion forums). In many social media services, it is the personality data that other users particularly focus on and analyze to assess how interesting the user is.
Self-published content It incorporates publicly disclosed or socially restricted data (visible only to trusted users or specific communities), such as most status updates in social media, pictures, videos, and other content that people add to services to represent themselves.
Data published by the community Self-representation can be complemented by person-related content shared by other users. This refers to collaboratively created pictures, narrations, videos, etc.
Technology-mediated communication data
Technology-mediated communication data refers to the data generated in two-way communication, collaborative knowledge creation and information distribution in a digital environment, that is, the content and subjects of the communication. Technology mediates the constructed digital self-representation, allowing users to contribute information, edit existing contributions, comment on entries and discuss related matters. Fundamentally, digital environments allow people to contribute to knowledge creation and distribution through various digital devices. The digital environment facilitates communication through channels that result in private communication (one-to-one), public communication (one-to-many) and collaborative communication (many-to-many) data. Depending on the context, public and collaborative communication can also be private within a group of participants, for example on an organization’s private channel.
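The three communication modes can be illustrated with a minimal sketch. The `Exchange` record, its fields and the classification rule are our own hypothetical simplifications for illustration, not part of the taxonomy itself; in particular, treating any exchange with more than one contributor as collaborative is an assumption.

```python
from dataclasses import dataclass
from enum import Enum


class Mode(Enum):
    PRIVATE = "one-to-one"
    PUBLIC = "one-to-many"
    COLLABORATIVE = "many-to-many"


@dataclass(frozen=True)
class Exchange:
    """Hypothetical summary of one communication exchange."""
    contributors: int         # number of users who authored content
    audience: int             # number of users who can see the content
    restricted: bool = False  # True if visible only within a closed group


def classify(exchange: Exchange) -> Mode:
    """Map an exchange to a communication mode (assumed rule, see lead-in)."""
    if exchange.contributors < 1 or exchange.audience < 1:
        raise ValueError("an exchange needs at least one contributor and one reader")
    if exchange.contributors > 1:
        return Mode.COLLABORATIVE
    if exchange.audience == 1:
        return Mode.PRIVATE
    return Mode.PUBLIC


direct_message = Exchange(contributors=1, audience=1)
status_update = Exchange(contributors=1, audience=500)
team_wiki_page = Exchange(contributors=12, audience=12, restricted=True)
```

The `restricted` flag is kept separate from the mode because, as noted above, public and collaborative communication can still be private within a group of participants.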
Digital relationships data
Digital relationships data describes the explicit connections and ties between users in the services. Analysis of this data can reveal social relationship patterns, social network structures and various other sociological and network-level phenomena in the digital realm. The digital relationships category firstly contains explicit data, which refers to the digital friendships and followership that a user has intentionally and explicitly defined. Technology-mediated social services make it possible to build virtual communities based on both physical and online activities (creating networks from existing connections in the physical world and/or new networks with people from the digital realm). Users of such services have two roles: followee and follower. One can have followers or friends on various social platforms (Facebook, LinkedIn, Twitter, Instagram, and many others), and in turn can follow someone to maintain friendships or business relationships or to track important content from another relevant user. An interesting question for research is what motivates people to add someone to their friend lists. Such lists obviously include friends and colleagues, but they may also include public figures, interesting strangers or people with weak ties [81–83].
There is also implicit data, which can be revealed through analysis of technology-mediated communication data. For instance, tweets can be analyzed to infer individual connections between people, and from these individual connections we can build network representations of communities at the system level. As another example, two users with multiple common contacts (e.g., friend-of-a-friend) can be predicted to become explicit contacts in the future. When a user has, for example, liked or otherwise interacted with a non-contact user’s content or profile, an implicit tie can be seen to exist between the users. However, such implicit data normally has to be derived through network analysis, and there are few tools or methods that provide it automatically.
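As a minimal sketch of how such implicit data might be derived, the following example counts mention-based ties between users and scores pairs of non-contacts by their number of common contacts. The function names, the `@mention` pattern, the input formats and the toy data are hypothetical assumptions made for illustration; real platforms expose such data in different forms, and common-contact counting is only one simple link-prediction heuristic.

```python
import re
from collections import Counter
from itertools import combinations

# Assumed mention syntax: "@" followed by word characters.
MENTION = re.compile(r"@(\w+)")


def implicit_ties(messages):
    """Count mention-based ties from (author, text) pairs.

    Each mention of another user adds 1 to the weight of the
    undirected tie between the author and the mentioned user.
    """
    ties = Counter()
    for author, text in messages:
        for target in MENTION.findall(text):
            if target != author:
                ties[frozenset((author, target))] += 1
    return ties


def common_contacts(contacts):
    """Score pairs that are not yet explicit contacts by shared contacts.

    `contacts` maps each user to the set of their explicit contacts
    (assumed symmetric in this sketch).
    """
    scores = {}
    for a, b in combinations(sorted(contacts), 2):
        if b in contacts[a]:
            continue  # already explicit contacts
        shared = len(contacts[a] & contacts[b])
        if shared:
            scores[(a, b)] = shared
    return scores


# Toy data, invented for illustration only.
messages = [
    ("alice", "Great talk by @bob today"),
    ("bob", "Thanks @alice and @carol"),
    ("carol", "no mentions here"),
]
contacts = {
    "alice": {"bob", "carol"},
    "bob": {"alice", "dave"},
    "carol": {"alice", "dave"},
    "dave": {"bob", "carol"},
}

ties = implicit_ties(messages)
candidates = common_contacts(contacts)
```

The weighted ties form an edge list from which a community-level network representation can be built, while the candidate scores rank friend-of-a-friend pairs that may become explicit contacts.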
To summarize, we consider that this list of BSD types can help researchers outline the scope of their interests and guide them towards successful outcomes. Nevertheless, the research community has to remember that the accessibility of such data is a crucial challenge of BSD. Lack of access to the data, which is often held by various service providers, hinders utilization and limits the research opportunities related to this emerging concept. Thus, researchers should look for ways to collaborate with social media platforms.
The holistic overview of related concepts, research fields and research communities provides ideas regarding the methodological steps that should be taken to enable further research and utilization activities around BSD. We propose a combination of three activities that should be primarily focused on in order to open new avenues for utilization.
Collecting data The initial step for all researchers who work with BSD is to collect the datasets needed for analysis. This step raises ethical issues and challenges of data accessibility. The data is often held by various service providers, which makes it hard to access and hinders its utilization. Manovich notes this by stating that ‘only social media companies have access to really large social data’. Fortunately, we have recently seen various movements and joint efforts to bring together data that is public in theory but very challenging to collect in volumes high enough for research purposes (for example, the OSoMe project, described in footnote 1, which helps analyze Twitter data). One of the most troubling issues is related to ethics: the majority of people are not aware that their data is being collected and analyzed by different organizations (including governments and social media companies). Moreover, the regulations on accessing and using such data are unclear and not fully unified. There are also practices that may violate privacy: collecting more private data than allowed; accessing data without permission; utilizing data for purposes that differ from the initial purpose of collection; misinterpreting the data; and changing the content. To make the collecting phase feasible, we need to fulfill the next step of our framework.
Collaboration BSD is a multidisciplinary area that requires practitioners to build a proper team. Our suggestion is to build collaboration with social media platforms or companies that have access to truly large data sets. For instance, research outcomes based on thousands of tweets would be questionable in comparison with research based on billions of items of human-generated content from multiple channels. Collaboration with people or companies that have varied expertise and advantages in terms of social data availability will potentially reduce the challenges of collecting data for one’s own study, extend the scale and scope of the work in a positive way, and provide access to multidisciplinary expertise.
Manipulating data We argue that, to gain meaningful insights from BSD, researchers should design virtual environments in which they can access multiple data types and compare and control them. This may bring new opportunities for authentic and reliable research outcomes. In this regard we agree with Watts that we need a ‘social supercollider’, which would obtain diverse social data streams and thus open access to knowledge about people’s behavior on a massive scale. Artificial BSD environments could also provide an opportunity to run virtual experiments and validate results with members of the related research community.
This paper aimed to bring clarity to the BSD topic in general, for any application area. In our future work, we aim to utilize BSD to foster serendipity and, thus, innovativeness in knowledge work organizations. Our objective is to obtain empirical evidence that analysis of BSD can help identify relevant new people to collaborate with.
The multidisciplinary and multi-dimensional nature of Big Social Data makes it challenging to develop a useful conceptualization and definition of the concept. Our literature overview shows that the majority of related work on BSD focuses on the analysis of social data and gives less attention to describing what BSD actually is. This can lead to a lack of consensus, inconsistency, and a vague understanding of what such data could be used for. To bring clarity and a more refined understanding of BSD, we propose a synthesized conceptualization and definition of the concept and of this growing field. We reviewed existing literature that demonstrates a variety of applications and approaches for studying the phenomena around social data. Based on this, we outlined the fields of science that determine the scope of BSD (social computing, Big Data science, data analytics and CSS). We assume that knowledge about the involvement of each field will give researchers an understanding of the expertise required for conducting research in this field. Additionally, we proposed a classification of BSD types that, from our perspective, covers well the spectrum of data that BSD consists of. In summary, with this paper we aim to make researchers better informed about what BSD is and what data to focus on, and to motivate them to elaborate better conceptualizations in order to reach clear and desirable research outcomes.
Footnote 1: Observatory on Social Media (OSoMe), a project to study the diffusion of information online and discriminate among mechanisms that drive the spread of memes on social media: http://truthy.indiana.edu/about/.
Musacchio M, Panizzon R, Zhang X, Zorzi V. A linguistically-driven methodology for detecting impending disasters and unfolding emergencies from social media messages. In: Proceedings of LREC 2016 workshop. EMOT: emotions, metaphors, ontology and terminology during disasters; 2016. p. 26–33.
Guellil I, Boukhalfa K. Social big data mining: a survey focused on opinion mining and sentiments analysis. In: 2015 12th international symposium on programming and systems (ISPS). New York: IEEE; 2015. p. 1–10.
Tang J, Chang Y, Liu H. Mining social media with social theories: a survey. ACM SIGKDD Explor Newsl. 2014;15(2):20–9.
Mukkamala RR, Hussain A, Vatrapu R. Fuzzy-set based sentiment analysis of big social data. In: Enterprise distributed object computing conference (EDOC), 2014 IEEE 18th international. New York: IEEE; 2014. p. 71–80.
Nguyen DT, Hwang D, Jung JJ. Time–frequency social data analytics for understanding social big data. Cham: Springer; 2015.
Housley W, Procter R, Edwards A, Burnap P, Williams M, Sloan L, Rana O, Morgan J, Voss A, Greenhill A. Big and broad social data and the sociological imagination: a collaborative response. Big Data Soc. 2014;1(2):2053951714545135.
Procter R, Housley W, Williams M, Edwards A, Burnap P, Morgan J, Rana O, Klein E, Taylor M, Voss A, Choi C, Mavros P, Hudson Smith A, Thelwall M, Ferne T, Greenhill A. Enabling social media research through citizen social science. In: Korn M, Colombino T, Lewkowicz M, editors. ECSCW 2013 adjunct proceedings, 13th European conference on computer supported cooperative work, 21–25 September 2013, Paphos, Cyprus.
Mossberger K, Tolbert CJ, McNeal RS. Digital citizenship: the internet, society, and participation. London: MIT Press; 2007.
Boyd D, Heer J. Profiles as conversation: networked identity performance on Friendster. In: Proceedings of the 39th annual Hawaii international conference on system sciences (HICSS’06), vol. 3. New York: IEEE; 2006. p. 59.
Demchenko Y, De Laat C, Membrey P. Defining architecture components of the big data ecosystem. In: 2014 international conference on collaboration technologies and systems (CTS). New York: IEEE. 2014. p. 104–12.
Beyer MA, Laney D. The importance of ’big data’: a definition. Stamford: Gartner; 2012. p. 2014–8.
Davenport T. Analytics 3.0: In the new era, big data will power consumers products and services. Brighton, MA: Harvard Business Review. Retrieved from https://hbr.org/2013/12/analytics-30. 2013. Accessed 15 Oct 2016.
Bendoly E. Fit, bias, and enacted sensemaking in data visualization: frameworks for continuous development in operations and supply chain management analytics. J Bus Logist. 2016;37(1):6–17.
Wallach H. Computational social science: toward a collaborative future. In: Alvarez RM, editor. Computational social science: discovery and prediction. USA: Cambridge University Press; 2016. p. 307–16.
Wu P, Hoi SCH, Zhao P, He Y. Mining social images with distance metric learning for automated image tagging. In: Proceedings of the fourth ACM international conference on web search and data mining. New York City: ACM; 2011. p. 197–206.
Hu X, Liu H. Text analytics in social media. New York: Springer; 2012. p. 385–414.
EO performed the primary literature review and analysis for this work and designed the graphics. The manuscript was drafted by EO, TO and JH. EO introduced this topic to the other authors and coordinated the work process to complete the manuscript. EO, TO, JH and HK worked together to develop the article’s framework and focus. All authors read and approved the final manuscript.
We thank all members of the COBWEB project.
The authors declare that they have no competing interests.
This work was supported by the Academy of Finland projects 295893, 295894 and 295895, “Enhancing Knowledge Work and Co-creation with Analysis of Weak Ties in Online Services (COBWEB)”.
Authors and Affiliations
Department of Pervasive Computing, Tampere University of Technology, Korkeakoulunkatu 10, 33720, Tampere, Finland
Ekaterina Olshannikova & Thomas Olsson
Department of Mathematics, Tampere University of Technology, Korkeakoulunkatu 10, 33720, Tampere, Finland
NOVI research group, Department of Information Management and Logistics, Tampere University of Technology, Korkeakoulunkatu 10, 33720, Tampere, Finland
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.