A noisy dataset can contain contradictory data. Contradictory data is synonymous with incorrect data, and it is important that such data be investigated and evaluated when analysing a noisy dataset. Researchers have proposed different approaches to dealing with contradictory data. For example, the authors of [1, 2] proposed methods for identifying and removing contradictory data in noisy datasets. However, removing contradictory data from a noisy dataset increases the incompleteness of the dataset, thereby reducing the soundness of any information derived from it. It is therefore important to identify and evaluate contradictory instances when analysing a large and noisy dataset. This will improve the soundness of the analysis of such a dataset. Moreover, the analysis of big data has been identified as the next frontier for innovation and the advancement of technology [3, 4]. There is therefore a need to identify appropriate approaches to dealing with contradictions in a large and noisy dataset.
There are different forms of contradictions. For example, in Natural Language Processing (NLP), contradictions can arise from the use of modal words, sentence structure, subtle lexical contrasts, and world knowledge (WK). Some contradictions in NLP occur where there are antonyms, negations, and date/number mismatches [1, 2]. Contradictory data can exist in a single-source dataset where there are systematic errors, arbitrary errors or different value representations [5, 6]. A more common source of contradictions is federated data from multiple sources, as in data exchange [7], data fusion [8, 9] and data warehousing [10, 11]. In addition, data retrieved from internet sources such as blogs and social networking sites are likely to be contradictory.
A set of data consists of information about real world objects. Some examples of real world objects are your dog, house, or car. Real world objects ‘G’ can be associated with different attributes ‘M’, which may have many values ‘V’. For example, dogs have a colour (attribute), which can be black, white or brown (values). Contradictory data can exist in any dataset when the data contain conflicting information. Consequently, an object (g ∈ G) that is associated with an attribute (m ∈ M) can contain contradictory values such that m is associated with both A and ¬A. For example, the grade associated with a student’s score in a module can be said to be contradictory if it is both a ‘pass’ and a ‘fail’. However, whether such values are truly contradictory depends on the metadata of the dataset, which specifies what kind of dependencies exist between the different values of the attributes in the dataset. Metadata provides descriptive information about the characteristics of a given item in a dataset [12]. As a result, contradictory data can be said to be evident in a noisy dataset where the data do not conform to the metadata of the dataset. For example, a dataset about students can contain the information that a student passed mathematics in the result from one examination body and failed it in the result from another examination body in the same year. Such data is not contradictory where the metadata states that a student can be assessed on multiple results from different examination bodies in the same year. But the same data will be regarded as contradictory where the metadata states that a student cannot be assessed on multiple results from different examination bodies in the same year. Accordingly, an object (g ∈ G) that is associated with an attribute (m ∈ M), where m is associated with A and ¬A, can be described as contradictory if the dependencies existing between the different values of the attribute do not conform to the metadata of its dataset.
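The definition above can be illustrated with a minimal sketch. It assumes, purely for illustration, that the dataset is a list of (object, attribute, value) triples and that the metadata is reduced to a set of attributes allowed only one value per object. The function name `find_contradictions`, the parameter `single_valued` and the sample records are hypothetical and do not come from any particular tool.

```python
from collections import defaultdict

def find_contradictions(records, single_valued):
    """Return {(object, attribute): set_of_values} for contradictory pairs.

    records: iterable of (g, m, v) triples (object, attribute, value).
    single_valued: attributes that the (assumed) metadata restricts to
        one value per object; other attributes may hold several values.
    """
    observed = defaultdict(set)
    for g, m, v in records:
        observed[(g, m)].add(v)
    # A pair is contradictory only if the metadata forbids multiple values.
    return {
        key: values
        for key, values in observed.items()
        if key[1] in single_valued and len(values) > 1
    }

# Hypothetical sample data mirroring the student example in the text.
records = [
    ("student_1", "maths_grade", "pass"),  # result from one examination body
    ("student_1", "maths_grade", "fail"),  # result from another body, same year
    ("student_2", "maths_grade", "pass"),
]

# Metadata forbids multiple results: student_1 is flagged as contradictory.
strict = find_contradictions(records, single_valued={"maths_grade"})
# Metadata allows multiple results: nothing is flagged.
lenient = find_contradictions(records, single_valued=set())
```

The same records are flagged under one metadata description and accepted under the other, which is the point of the definition: contradiction is judged relative to the metadata, not from the values alone.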
The importance of identifying and evaluating contradictory data in a noisy dataset cannot be overstated. It is stated in [4] that “noisy Big Data could be more valuable than tiny samples because general statistics obtained from frequent patterns and correlation analysis usually overpower individual fluctuations and often disclose more reliable hidden patterns and knowledge”. Ennals et al. [13] note that the analysis of contradictions enables data users to recognise when the information they read online is disputed, and by what source. Marneffe, Rafferty and Manning explain in [14] how contradiction detection systems can be applied to intelligence reports, bioinformatics, and political candidate debates. Tsytsarau et al. [15] state that aggregating and analysing sentiment-based contradictions on the web makes it possible to track the evolution of contradictory opinions or discussions in the blogosphere. Kim and Zhai [16] outline the importance of generating a comparative summary of contradictory opinions. Leser and Freytag explain in [5] that identifying the patterns in contradictory data helps to answer questions like “Which are the conflict-causing attributes, values, or value pairs?” and “What kind of dependencies exists between the occurrence of contradictions in different attributes?”.
On the other hand, contradictory data in a large dataset can be difficult to visualise, especially when traditional data analysis and visualisation tools are employed. As explained by Keim et al. and Keim [17, 18], traditional data processing tools such as (x, y) plots, line and bar charts, histograms, and pie charts are rendered ineffective when a dataset contains tens, hundreds or thousands of dimensions and when the dataset does not have a natural mapping to the display space. This work explains how to visually identify contradictory values which are associated with mutually exclusive attributes in a large and noisy comma-separated values (CSV) dataset. It addresses the challenge of using traditional data processing tools to visually identify contradictions. It answers the research question “how can contradictions in mutually exclusive data of a large and noisy dataset be visually identified?”
This paper presents the importance of identifying contradictions in a noisy dataset and shows how to apply the mutual exclusion rule to identify contradictory data. It presents novel approaches for visually identifying contradictory data in a large and noisy dataset. The authors explain how contradictory data can be mined and visually analysed using ConTra, an application developed by the authors of this work. ConTra applies the mutual exclusion rule in mining contradictory data from a noisy CSV dataset. The authors also evaluated ConTra’s capability to identify contradictory data in datasets of different sizes.
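As a rough illustration of what applying a mutual exclusion rule to a CSV dataset might look like, the sketch below flags rows in which more than one attribute of a mutually exclusive group is set. It is not ConTra's implementation, which is not described in this section. The function `mine_contradictions`, the `exclusive_groups` parameter, the `"1"` marker convention and the sample data are all assumptions made for the example.

```python
import csv
import io

def mine_contradictions(csv_text, exclusive_groups, true_marker="1"):
    """Return (row_number, attributes) pairs that violate mutual exclusion.

    exclusive_groups: tuples of column names of which at most one may be
        set in any row (an assumed encoding of the rule).
    true_marker: the cell value that counts as "set" (assumed convention).
    """
    reader = csv.DictReader(io.StringIO(csv_text))
    hits = []
    for row_number, row in enumerate(reader, start=1):
        for group in exclusive_groups:
            # Collect the attributes of this group that are set in the row.
            set_attrs = [
                a for a in group
                if (row.get(a) or "").strip() == true_marker
            ]
            if len(set_attrs) > 1:
                hits.append((row_number, tuple(set_attrs)))
    return hits

# Hypothetical sample: 'pass' and 'fail' are mutually exclusive columns.
sample = """id,pass,fail
s1,1,0
s2,1,1
s3,0,1
"""

contradictions = mine_contradictions(sample, [("pass", "fail")])
```

Here only the second data row, which is marked as both a pass and a fail, is reported. Reading the file row by row keeps memory use independent of dataset size, which matters for the large datasets discussed here.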
Section two of this paper explains how the mutual exclusion rule is applied in identifying contradictions. ConTra’s mutual exclusion approach to mining and visualising contradictory data is presented in section three. A description of a real-life noisy dataset and the results of its analysis using ConTra are presented in section four. The performance evaluation of ConTra is presented in section five. A conclusion and a discussion of the way forward are presented in section six.