Fig. 22 | Journal of Big Data

Fig. 22

From: Cumulative deviation of a subpopulation from the full population

Humboldt County, reporting whether no one in the household aged 14 or over speaks English very well, with scores being \(\log _{10}\) of the adjusted household income; \(n = 583\); Kuiper’s statistic is \(0.1028 / \sigma = 6.062\), and Kolmogorov’s and Smirnov’s statistic is also \(0.1028 / \sigma = 6.062\). A single, highly weighted outlying observation from the subpopulation corrupts a bin at the highest scores in each reliability diagram. The cumulative plot includes this outlying observation, too, but displays it as an unmistakable steep jump in the plotted curve; the constant slope of that jump shows that the corresponding high deviation between the subpopulation and the full population is due to a single highly weighted observation. This single observation has no effect on the slopes in the rest of the cumulative plot, so it corrupts only the reliability diagrams and not the plot of cumulative differences. In other words, an analogue of Simpson’s Paradox of [9] afflicts the reliability diagrams but not the cumulative plot. The scalar summary statistics report a highly statistically significant deviation commensurate with all the plots, albeit slightly less so owing to the steep jump in the cumulative plot. These behaviors are analogous to those displayed in Fig. 7 above (see also Remark 6 above).
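The behavior described in this caption can be illustrated with a minimal sketch. It is a simplified illustration, not the article's exact procedure: the function names, the toy data, and the use of a given `baselines` list as a stand-in for the full-population average at each score are all assumptions, and the normalization by \(\sigma\) is omitted. The sketch assumes the Kuiper statistic is the range of the cumulative differences (with the starting value 0 included) and the Kolmogorov-Smirnov statistic is their maximum absolute value.

```python
from itertools import accumulate


def cumulative_differences(responses, baselines, weights):
    """Return (abscissae, curve) for weighted cumulative differences.

    Inputs must already be sorted by score. `baselines` is a hypothetical
    stand-in for the full-population average response at each score.
    Both returned lists start at 0 and have len(responses) + 1 entries.
    """
    total = sum(weights)
    fractions = [w / total for w in weights]
    steps = [f * (r - b) for r, b, f in zip(responses, baselines, fractions)]
    abscissae = [0.0] + list(accumulate(fractions))
    curve = [0.0] + list(accumulate(steps))
    return abscissae, curve


def kolmogorov_smirnov(curve):
    # Largest absolute excursion of the curve from zero.
    return max(abs(c) for c in curve)


def kuiper(curve):
    # Range of the curve; the leading 0 is part of the curve.
    return max(curve) - min(curve)


# Hypothetical toy data ordered by score; the last observation carries
# most of the weight and deviates strongly from its baseline.
responses = [1, 0, 1, 0, 1, 1]
baselines = [0.5, 0.5, 0.5, 0.5, 0.5, 0.0]
weights = [1.0, 1.0, 1.0, 1.0, 1.0, 15.0]

abscissae, curve = cumulative_differences(responses, baselines, weights)

# Slope of each segment: rise over run. For a single observation this
# equals (response - baseline), constant across the segment's width.
slopes = [
    (c1 - c0) / (a1 - a0)
    for a0, a1, c0, c1 in zip(abscissae, abscissae[1:], curve, curve[1:])
]

ks_stat = kolmogorov_smirnov(curve)
kuiper_stat = kuiper(curve)
print("curve:", curve)
print("slopes:", slopes)
print("Kolmogorov-Smirnov:", ks_stat, "Kuiper:", kuiper_stat)
```

In this toy example the heavily weighted final observation contributes one long segment of constant slope, while the slopes of the earlier segments are unaffected by it. Because the curve never drops below zero, its range equals its maximum absolute value, so the two statistics coincide, which is consistent with the identical values reported in the caption.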
