Exploring big data traits and data quality dimensions for big data analytics application using partial least squares structural equation modelling

The popularity of big data analytics (BDA) has boosted the interest of organisations in exploiting their large-scale data. This technology can become a strategic stimulus for organisations to achieve competitive advantage and sustainable growth. Previous BDA research, however, has focused more on introducing additional traits, known as the Vs of big data, while ignoring the quality of data when examining the application of BDA. Therefore, this study aims to explore the effect of big data traits and data quality dimensions on BDA application. This study has formulated 10 hypotheses that comprise the relationships among the big data traits, accuracy, believability, completeness, timeliness, ease of operation, and BDA application constructs. This study conducted a survey using a questionnaire as the data collection instrument. Then, the partial least squares structural equation modelling technique was used to analyse the hypothesised relationships between the constructs. The findings revealed that big data traits can significantly affect all constructs for data quality dimensions and that the ease of operation construct has a significant effect on BDA application. This study contributes to the literature by bringing new insights to the field of BDA and may serve as a guideline for future researchers and practitioners when studying BDA application.

can be linked to the inability of traditional database management tools to handle structured and unstructured data simultaneously [6]. Structured data refers to data that have a schema, metadata, rules, and constraints to follow, whilst unstructured data have no structure at all or an unknown structure [7]. These types of data are collected or received from diverse platforms, such as network sensors, social media, and the Internet of Things.
Although it is vital to exploit structured and unstructured data for BDA, they are usually incomplete, inaccurate, inconsistent, and vague or ambiguous, which could lead to false decisions [8][9][10][11]. Salih et al. [12] and Wamba et al. [13] have highlighted the lack of data quality mechanisms being applied in BDA prior to data usage. Several studies have considered the potential of data quality for BDA application [14][15][16][17][18], yet specific questions about what drives the dimensions of data quality remain unanswered. Moreover, studies on data quality and BDA are still underway and have not reached a good level of maturity [7]. Thus, there is an urgent need to conduct an in-depth study on data quality to determine the most important dimensions for BDA application.
Several theories or models for understanding data quality problems have been suggested, such as resource-based theory (RBT), organisational learning theory (OLT), firm performance (FPER), and data quality framework (DQF). However, these theories or models do not fit into BDA application since they concentrate primarily on service quality as opposed to data quality [19]. Moreover, most studies related to BDA are focused on the perspective held at the organisational or firm level [8,10,20,21] and studies focusing on the individual perspective are lacking. Since academics are encouraged to participate in research on pedagogical support for teaching about BDA [22], this study has determined that university students can represent the perspectives at the individual level. Students were chosen because it is crucial to prepare and expose them to BDA, especially in the mandatory setting [23].
Meanwhile, numerous traits have been studied to explain the characteristics of big data, such as 3Vs [24], 4Vs [25], 5Vs [26,27], 7Vs [28], 9Vs [29], 10Vs [30], 10Bigs [31], and 17Vs [32]. These attempts to assign the maximum number of characteristics to big data show the lack of uniform consensus regarding the core of big data characteristics [33]. Although big data characteristics and data quality are viewed as distinct domains, several studies have found that these two domains are interconnected and closely related [9,14,17]. A better understanding of the core characteristics of big data and the dimensions of data quality is needed. Hence, this study seeks to expand the knowledge on big data characteristics, hereafter known as big data traits (BDT) and data quality dimensions (DQD), as well as to explore how they could affect the application of BDA.

Literature review
Big data and analytics are two different fields that are widely used to exploit the exponential growth of data in recent years. The term 'big data' represents a large volume of data, while the term 'analytics' indicates the application of mathematical and statistical tools to a collection of data [34]. These two terms have been merged into 'big data analytics' to represent various advanced digital techniques that are formulated to identify hidden patterns of information within gigantic data sets [35,36]. Scholars have suggested varying definitions for BDA. For instance, Verma et al. [23] defined BDA as a suite of data management and analytical techniques for handling complex data sets, which in turn leads to a better understanding of the underlying process. Faroukhi et al. [37] defined BDA as a process of analysing raw data in order to obtain information that is understandable to humans and that is hard to observe using direct analysis. Davenport [38] simply defined BDA as a "focus on very large, unstructured and fast moving data".
Nowadays, BDA application has helped numerous organisations improve their performance because it can handle problems instantly and assist organisations in making better and smarter decisions [35,39]. The advantages of BDA application for organisational performance have been demonstrated by numerous studies. For instance, Mikalef et al. [20] found four alternative solutions surrounding BDA that can lead to higher performance, whereby different combinations of BDA resources are of greater or lesser importance to organisational performance. Similarly, Wamba et al. [40] applied the RBT and sociomaterialism theory to examine organisational performance. Their empirical work showed that the hierarchical BDA has both direct and indirect impacts on organisational performance. Based on this same set of views, Wamba et al. [13] highlighted the importance of capturing the quality dimensions of BDA. Their findings showed the existence of a significant relationship between the quality of data in BDA and organisational performance.
Some scholars perceive data quality as equivalent to information quality [41][42][43][44]. Data quality generally refers to the degree to which the data are fit for use [45]. Meanwhile, the concept of information quality is defined as how well the information supports the task [46]. Haryadi et al. [14] asserted that data quality is focused on data that have not been analysed, while information quality is focused on the analysis that has been done on the data. This study, however, takes the view that data quality should focus on the soundness and appropriateness of data, whether before or after they have been analysed, such that they meet the requirements of organisations [12].
The notion of quality represents a multidimensional construct, whereby it is essential to combine its dimensions and express them in a solid structure [46]. Initially, Wang and Strong [45] used factor analysis to identify DQD and found 179 dimensions that were eventually reduced to 20. Then, they organised these dimensions into four primary categories, namely intrinsic, contextual, representational, and accessibility. The intrinsic category denotes datasets that have quality in their own right, while the contextual category highlights the requirement that data quality must be considered within the context of the task. The representational category describes data quality in relation to the presentation of the data, and the accessibility category emphasises the importance of computer systems that provide access to data [18]. Each category has several dimensions that are used as specific data quality measurements. For instance, accuracy and objectivity are the dimensions in the intrinsic category, while relevance and timeliness are the dimensions in the contextual category. Interpretability and understandability are the dimensions in the representational category, and access security and ease of operation are the dimensions in the accessibility category. Table 1 presents all DQD according to their categories.
Various studies have been conducted to analyse the relationships between DQD and BDA application. For instance, Côrte-Real et al. [8] analysed the direct and indirect effects of DQD on BDA capabilities in a multi-regional survey (European and American firms). Their findings showed that the DQD, primarily completeness, accuracy, and currency, have significant effects on BDA capabilities when process complexity was low. Thus, these authors have demonstrated the emergent need for firms to have effective data quality mechanisms to be able to derive sufficient value from BDA application. Ghasemaghaei and Calic [47] used OLT and the DQD compiled by Wang and Strong [45] to explain the effect of BDA on data quality categories. They found that while many organisations have invested in BDA application, they need to pay more attention to the quality of their data in order to enhance the quality of the solutions. Meanwhile, Ji-fan Ren et al. [48] examined the quality dynamics (system quality and information quality) in BDA using business value and FPER theories. Their study revealed that system quality can enhance information quality, which in turn, would affect organisational values and performance in the BDA environment. While these studies offer insights into the relationship between DQD and BDA, they have not highlighted the critical DQD that could impact BDA application.
DQD are also associated with the characteristics of big data, which are commonly known as big data traits (BDT). The BDT were originally defined by 3Vs (volume, velocity, and variety) [24]. These traits have been extended over the years to include 4Vs (volume, velocity, variety, and value) [25], 5Vs (volume, velocity, variety, value, and veracity) [26,27], 7Vs (volume, velocity, variety, veracity, validity, volatility, and value) [28], 9Vs (veracity, variety, velocity, volume, validity, variability, volatility, visualisation, and value) [29], 10Vs (volume, value, velocity, veracity, viscosity, variability, volatility, viability, validity, and variety) [30], 10Bigs (big volume, big velocity, big variety, big veracity, big intelligence, big infrastructure, big service, big value, and big market) [31], and 17Vs (volume, velocity, value, variety, veracity, validity, volatility, visualisation, virality, viscosity, variability, venue, vocabulary, vagueness, verbosity, voluntariness, and versatility) [32]. Several studies have investigated the influence of BDT on DQD. Noorwali et al. [49] argued that there is a lack of scientific understanding of the general and specific requirements of BDT and DQD. They suggested that a more systematic analysis of both BDT and DQD is essential for reducing the number of missing quality requirements while accounting for BDT. Likewise, Lakshen et al. [50] discussed the various technical challenges that must be addressed before the potential of BDT and DQD can be fully realised. Haryadi et al. [14] confirmed that BDT and DQD are the central issues for implementing BDA.

Conceptual model and hypotheses
Based on the literature review in the previous section, this study proposes a new model for BDA application based on the integration of BDT and DQD, as depicted in Fig. 1. It should be noted that various applications can have different requirements, as not all dimensions and constructs are always applicable. Nevertheless, most studies on BDA application are focused on the organisation or firm levels and not on the individual level. Hence, this study is based on an individual's perception.

Big data traits
According to Sun [31], various Vs are used to define BDT, while conventional data quality is defined by a number of DQD [17]. Hence, this study considered BDT as a single construct because different Vs overlap with the DQD. DQD categories that are generally accepted and frequently used in the application of BDA were also included in this study, namely the intrinsic, contextual, and accessibility categories. The intrinsic category was chosen because of the importance of data correctness in BDA application, and it is composed of two constructs, namely, accuracy and believability. Meanwhile, the contextual category was chosen because the application of BDA commonly depends on the context in which the data are used. This study considered two constructs in the contextual category, namely, completeness and timeliness. Finally, the accessibility category was chosen because the computer system needs to facilitate the accessing and storing of data in BDA application. Thus, ease of operation is considered as a construct in the accessibility category for this study. The significant influence of BDT on the constructs of DQD, namely, accuracy, believability, completeness, timeliness, and ease of operation, was explored through the following hypotheses:
H1: Big data traits have a significant influence on accuracy.
H2: Big data traits have a significant influence on believability.
H3: Big data traits have a significant influence on completeness.
H4: Big data traits have a significant influence on timeliness.
H5: Big data traits have a significant influence on ease of operation.

Accuracy
Accuracy means that the data must depict facts accurately and the data must come from a valid source [45,51]. The effective use of BDA relies on the accuracy of data, which is necessary to produce reliable information [8]. As higher data accuracy may facilitate the routines and activities of BDA, this study proposes that accuracy is included as an enabler in BDA application. Hence, this study proposes the following hypothesis:
H6: Accuracy has a significant influence on big data analytics application.

Believability
Believability represents the degree to which the data are considered valid and reliable [44]. There are concerns regarding the credibility of BDA findings due to insufficient insight into the trustworthiness of the data source [52]. The believability of data sources might be difficult to judge, as people may alter facts or even publish false information. Therefore, data sources need to be treated as believable in BDA application. Hence, this study proposes the following hypothesis:
H7: Believability has a significant influence on big data analytics application.

Completeness
Completeness refers to the degree to which there is no lack of data and the data are largely appropriate for the task at hand [45]. It also refers to the validity of the values of all components in the data [53]. As big data sources are rather large and the architectures are complicated, the completeness of data is crucial to avoid errors and inconsistencies in the outcome of BDA application. Hence, this study proposes the following hypothesis:
H8: Completeness has a significant influence on big data analytics application.

Timeliness
Timeliness refers to the degree to which data from the appropriate point in time reflect the truth [50]. Timeliness is identified as one of the most significant dimensions of data quality, since making decisions based on outdated data will ultimately lead to incorrect insights [54]. Additionally, the more rapidly data are generated and processed, the more timely their use in BDA application will be [17]. Hence, this study proposes the following hypothesis:
H9: Timeliness has a significant influence on big data analytics application.

Ease of operation
Ease of operation refers to the degree to which data can be easily merged, changed, updated, downloaded or uploaded, aggregated, reproduced, integrated, customised, and manipulated, as well as used for multiple purposes [45]. Users will undeniably face challenges and complexity in utilising BDA, depending on the technical approaches used for handling this technology [35]. If the BDA application is relatively easy to operate, the user would be willing to use it in the long term. Hence, this study proposes the following hypothesis:
H10: Ease of operation has a significant influence on big data analytics application.
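The ten hypotheses stated above form a small directed graph, and writing them down as an edge list can make the structure of the conceptual model easier to check. The following is only an illustrative sketch: the dictionary name `HYPOTHESES` and the helper `predictors_of` are hypothetical and are not part of the original study.

```python
# Hypothesised paths of the conceptual model, as (predictor, outcome) pairs.
# The construct labels follow the text; the names HYPOTHESES and
# predictors_of are illustrative assumptions, not taken from the study.
HYPOTHESES = {
    "H1": ("BDT", "Accuracy"),
    "H2": ("BDT", "Believability"),
    "H3": ("BDT", "Completeness"),
    "H4": ("BDT", "Timeliness"),
    "H5": ("BDT", "Ease of operation"),
    "H6": ("Accuracy", "BDA application"),
    "H7": ("Believability", "BDA application"),
    "H8": ("Completeness", "BDA application"),
    "H9": ("Timeliness", "BDA application"),
    "H10": ("Ease of operation", "BDA application"),
}


def predictors_of(target):
    """Return the constructs hypothesised to influence `target`, sorted by name."""
    return sorted(src for src, dst in HYPOTHESES.values() if dst == target)


# The five DQD constructs are each predicted by BDT alone,
# while BDA application is predicted by all five DQD constructs.
for construct in ("Accuracy", "BDA application"):
    print(construct, "<-", predictors_of(construct))
```

Listing the paths this way shows that BDT is the only exogenous construct and that the model contains no direct path from BDT to BDA application.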

Research methodology
The methodological procedures in this study were conducted in two phases, namely, research instrument, and data collection and analysis. Figure 2 shows the methodology in sequence.

Research instrument
This study used a survey questionnaire with two sections to explore the hypothesised relationships in the proposed conceptual model. The first section included questions related to the respondents' profiles, such as gender, year of study, and area of study, while the second section contained the measurement of constructs with 28 indicators. These constructs were BDT, accuracy, believability, completeness, timeliness, ease of operation, and BDA application. The indicators to measure BDT (velocity, veracity, value, and variability) were self-developed based on the definitions proposed by Arockia et al. [32]. The accuracy, believability, completeness, timeliness, and ease of operation constructs, each with four indicators, were adapted from [8,47], and [48]. BDA application, with four indicators, was adapted from [23] and [55]. All indicators were measured using a 7-point Likert scale, ranging from 1 (strongly disagree) to 7 (strongly agree). The questionnaire was pretested among academics and several items were reworded to improve the clarity of the questions. The questionnaire was then used in a pilot test to confirm the reliability of all shortlisted constructs. This test involved 30 respondents, and the Cronbach's alpha values for all seven constructs were found to be appropriate and acceptable.
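For readers unfamiliar with the reliability check used in the pilot test, Cronbach's alpha for a construct with k indicators is k/(k - 1) multiplied by one minus the ratio of the summed indicator variances to the variance of the summed score. The sketch below shows one way to compute it with the Python standard library. The function name and the five rows of Likert responses are hypothetical and are not the study's pilot data.

```python
from statistics import variance


def cronbach_alpha(responses):
    """Cronbach's alpha for one construct.

    responses: list of rows, one per respondent, each row holding that
    respondent's scores on the construct's indicators.
    """
    k = len(responses[0])                       # number of indicators
    items = list(zip(*responses))               # per-indicator columns
    item_var_sum = sum(variance(col) for col in items)
    total_var = variance([sum(row) for row in responses])
    return (k / (k - 1)) * (1 - item_var_sum / total_var)


# Hypothetical 7-point Likert answers: five respondents, four indicators.
pilot = [
    [6, 6, 5, 6],
    [4, 5, 4, 4],
    [7, 6, 7, 7],
    [3, 3, 4, 3],
    [5, 5, 5, 6],
]
alpha = cronbach_alpha(pilot)
print(round(alpha, 3))
```

A common rule of thumb treats alpha values of about 0.7 or higher as acceptable, although the exact cut-off applied in the pilot test is not stated in the text.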

Data collection and analysis
This study used random sampling to select respondents who have knowledge of BDA. As a preliminary study, 200 survey invitations were sent to Computer Science students at the National Defence University of Malaysia. These students were chosen because of the knowledge they had gained during the Big Data Analytics or Data Mining course that they had attended previously. Data were collected through a web survey, which was conducted from July to August 2020. A total of 108 complete responses were received, resulting in a 54% response rate. There were 84 male (77.78%) and 24 female (22.22%) respondents involved in this study. Most of the respondents were in their second year of study (52.78%) and in the area of artificial intelligence (50.93%). The key profiles of these respondents are shown in Table 2. Subsequently, partial least squares structural equation modelling (PLS-SEM) was applied to analyse the survey-based cross-sectional data, since this technique is able to explain the variance in key target constructs [56]. This technique amalgamates the concepts of factor analysis and multiple regression in order to validate the measurement instruments and test the research hypotheses. Since PLS-SEM is a modest and practical technique for bringing rigour to complex modelling [57], this study also utilised this technique for analysing and validating the complex hypothesised relationships of the proposed model.

Analysis and results
The SmartPLS 3.2 package was used to perform the PLS-SEM analysis. Evaluation of the analysis began with the measurement model, followed by the structural model.

Measurement model
A measurement model was used to assess the reliability and validity of the constructs. The standard steps for assessing a measurement model are convergent validity and discriminant validity. Convergent validity was analysed by calculating the factor loadings of the indicators, composite reliability (CR), and average variance extracted (AVE) [58]. The convergent validity results in Table 3 show that the factor loadings for all indicators are higher than 0.708, as suggested by Hair et al. [59], after the elimination of three indicators (AC3, BE2, and EO4) from the original 28 indicators. Meanwhile, the CR values ranged from 0.822 to 0.917, which exceeded the suggested threshold of 0.7 [59]. An adequate AVE is 0.50 or greater, meaning that at least 50% of the variance of each construct can be explained by its indicators [56]. As shown in Table 3, all AVE values range from 0.536 to 0.786, indicating that convergent validity of the measurement model is achieved.
Once convergent validity had been successfully established, discriminant validity was examined using the Fornell-Larcker criterion. The square root of the AVE of each construct should be greater than its correlations with the other constructs [60]. Table 4 demonstrates that the square roots of the AVEs are greater in all cases than the off-diagonal elements in their corresponding rows and columns. Therefore, discriminant validity has been achieved.
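To make the convergent validity measures concrete, the sketch below computes AVE and CR from standardised factor loadings. It uses the commonly cited formulas, in which AVE is the mean of the squared loadings and CR is the squared sum of the loadings divided by that quantity plus the summed error variances (1 minus each squared loading). The loadings shown are hypothetical and are not the values reported in Table 3.

```python
def average_variance_extracted(loadings):
    """AVE: mean of the squared standardised loadings of a construct."""
    return sum(l * l for l in loadings) / len(loadings)


def composite_reliability(loadings):
    """CR: (sum of loadings)^2 / ((sum of loadings)^2 + sum of error variances)."""
    squared_sum = sum(loadings) ** 2
    error_var = sum(1 - l * l for l in loadings)
    return squared_sum / (squared_sum + error_var)


# Hypothetical standardised loadings for a three-indicator construct.
loadings = [0.75, 0.80, 0.85]
ave = average_variance_extracted(loadings)
cr = composite_reliability(loadings)

# Thresholds as stated in the text: loadings > 0.708, CR > 0.7, AVE >= 0.50.
print(all(l > 0.708 for l in loadings), cr > 0.7, ave >= 0.50)
```

The 0.708 loading threshold corresponds to a squared loading of about 0.50, which is why a construct whose indicators all clear it will also meet the AVE requirement.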
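The Fornell-Larcker comparison described above can be expressed as a simple check. The following is a minimal sketch under stated assumptions: the function name is hypothetical, and the AVE values and inter-construct correlations are invented for illustration and are not those in Table 4.

```python
from math import sqrt


def fornell_larcker_ok(ave_by_construct, correlations):
    """Return True if, for every pair of constructs, the absolute correlation
    is smaller than the square root of the AVE of both constructs."""
    for (a, b), r in correlations.items():
        if abs(r) >= sqrt(ave_by_construct[a]) or abs(r) >= sqrt(ave_by_construct[b]):
            return False
    return True


# Hypothetical AVE values and inter-construct correlations.
ave_values = {"BDT": 0.54, "Accuracy": 0.70, "Ease of operation": 0.65}
corr = {
    ("BDT", "Accuracy"): 0.65,
    ("BDT", "Ease of operation"): 0.56,
    ("Accuracy", "Ease of operation"): 0.48,
}
print(fornell_larcker_ok(ave_values, corr))
```

In a table such as Table 4, the square roots of the AVEs sit on the diagonal, so the check amounts to confirming that each diagonal entry is the largest value in its row and column.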

Structural model
The structural model was used to examine the magnitude of the relationships among the constructs. The goodness of fit of the structural model can be assessed by examining the R² measure (the coefficient of determination) and the significance level of the path coefficients (β values) [56]. The results of the research model were satisfactory, with an R² value for BDA application of 0.444, which suggested that 44.4% of the variance in BDA application can be explained by DQD. Furthermore, the R² values for the constructs of accuracy (42.6%), believability (45.5%), and completeness (33.2%) were also satisfactory, whereas timeliness (26.8%) and ease of operation (31.5%) were moderately explained by BDT. Figure 3 illustrates the results of the R² values from the SmartPLS 3.2 software. The statistical significance of the path coefficients of the structural model was assessed using bootstrap analysis (resampling = 5000). Table 5 shows the results of the hypothesis testing. Overall, H1 to H5, which concern the influence of BDT on DQD, were supported, whereas only one DQD hypothesis, H10, which concerns the influence of ease of operation on BDA application, was identified as significant. The results have also shown that accuracy, believability, completeness, and timeliness have no significant influence on BDA application.
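The bootstrap logic behind the significance tests can be illustrated with a deliberately simplified sketch. It does not reproduce the PLS-SEM algorithm of SmartPLS. It assumes a single predictor and a single outcome, for which the standardised path coefficient equals the Pearson correlation, and it uses synthetic data rather than the survey responses. The function names, the seed, and the reduced number of resamples in the example are all assumptions made for illustration.

```python
import random
from math import sqrt
from statistics import mean, stdev


def pearson(x, y):
    """Pearson correlation, equal to the standardised slope for one predictor."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def bootstrap_path(x, y, n_boot=5000, seed=0):
    """Return (estimate, bootstrap standard error, t-value) for one path."""
    rng = random.Random(seed)
    n = len(x)
    estimate = pearson(x, y)
    draws = []
    for _ in range(n_boot):
        # Resample respondents with replacement, keeping x and y paired.
        idx = [rng.randrange(n) for _ in range(n)]
        draws.append(pearson([x[i] for i in idx], [y[i] for i in idx]))
    se = stdev(draws)
    return estimate, se, estimate / se


# Synthetic construct scores for 108 hypothetical respondents.
rng = random.Random(42)
bdt = [rng.gauss(0, 1) for _ in range(108)]
ease = [0.6 * v + rng.gauss(0, 0.8) for v in bdt]

# 1000 resamples keep the example fast; the study reports 5000.
beta, se, t_value = bootstrap_path(bdt, ease, n_boot=1000)
print(f"beta={beta:.3f} se={se:.3f} t={t_value:.2f}")
```

Under the usual two-tailed convention, a path is treated as significant at the 5% level when the absolute t-value exceeds roughly 1.96.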

Discussion
The present study has explored the BDT and DQD constructs for BDA application. The findings showed that the accessibility category of DQD (ease of operation) can significantly influence BDA application. This result showed that the ease of obtaining data plays an important role in providing users with effective access and in reducing the digital divide in BDA application endeavours. This result is corroborated by the findings of Zhang et al. [61], who considered that ease of functional properties would ensure the quality of BDA application. Janssen et al. [9] similarly proposed that the easier it is to operate BDA, the more application systems would be integrated and be sufficient for handling this technology. Akter et al. [57] found a significant influence of DQD (completeness, accuracy, format, and currency) on BDA application. In contrast, the results of this study showed that accuracy, believability, completeness, and timeliness have no significant influence on the decision to apply BDA. These results were unexpected. A possible explanation is that the respondents were novice users, who assumed the availability of technical teams to solve any accuracy, believability, completeness, and timeliness problems in BDA application.
Meanwhile, the four indicators of BDT (velocity, veracity, value, and variability) have shown a significantly high impact on all constructs of DQD (accuracy, believability, completeness, timeliness, and ease of operation). These findings are in agreement with the results obtained by Wahyudi et al. [17], who found a high correlation between BDT and both timeliness and ease of operation. The significant influence of BDT on DQD is an interesting result, as it demonstrates how users recognise the importance of BDT for assessing data quality. This observation is in agreement with Taleb et al. [62], who claimed that BDT could enforce quality evaluation management to achieve quality improvements. The findings also showed that while many researchers have proposed numerous BDT, in this context, velocity, veracity, value, and variability are more critical for assessing data quality in BDA application.

Conclusion
This study offers practical implications based on perspectives at the individual level. Individual perspectives are imperative since the resistance to using technology commonly originates from this level of users. Hence, the results of this study may be beneficial for organisations that have not yet agreed to implement BDA. They could use the results to gain a sense of the possibilities of embracing this technology. This study also has theoretical implications, based on the incorporation of BDT as a single construct and DQD as an underpinning theory for the development of a new BDA application model. This study is the first to investigate the influence of BDT and DQD on BDA application among individual-level users.
Several limitations apply to the interpretation of the results in this study. First, the intrinsic and contextual data quality categories are inadequate to specify the DQD included in the proposed model. Future studies may include other DQD, such as objectivity and reputation to represent the intrinsic category. Meanwhile, value-added, relevancy, and appropriate amount of data can be used for measuring the contextual category. Second, the chosen undergraduate students who have knowledge on BDA were insufficient to generalise the individual level perceptions towards BDA application. Hence, future studies could include more experienced respondents, such as lecturers or practitioners. Third, although the sample size was statistically sufficient, a larger sample may be useful to reinforce the results of this study. Finally, although this study has attempted to bridge the gaps between BDT and DQD, future studies are encouraged to explore other constructs for better understanding of BDA application. For instance, future studies could explore the role of security and privacy concerns in BDA application since data protection is becoming more crucial due to recent big open data initiatives. Therefore, a novel BDA application model that can address security and privacy concerns may be worth exploring. Overall, the findings of this study have contributed to the body of knowledge in the BDA area and offered greater insights for BDA application initiators.
Abbreviations
BDA: Big data analytics; DQD: Data quality dimensions; BDT: Big data traits; PLS-SEM: Partial least squares structural equation modelling.