Cumulative deviation of a subpopulation from the full population

Assessing equity in treatment of a subpopulation often involves assigning numerical"scores"to all individuals in the full population such that similar individuals get similar scores; matching via propensity scores or appropriate covariates is common, for example. Given such scores, individuals with similar scores may or may not attain similar outcomes independent of the individuals' memberships in the subpopulation. The traditional graphical methods for visualizing inequities are known as"reliability diagrams"or"calibrations plots,"which bin the scores into a partition of all possible values, and for each bin plot both the average outcomes for only individuals in the subpopulation as well as the average outcomes for all individuals; comparing the graph for the subpopulation with that for the full population gives some sense of how the averages for the subpopulation deviate from the averages for the full population. Unfortunately, real data sets contain only finitely many observations, limiting the usable resolution of the bins, and so the conventional methods can obscure important variations due to the binning. Fortunately, plotting cumulative deviation of the subpopulation from the full population as proposed in this paper sidesteps the problematic coarse binning. The cumulative plots encode subpopulation deviation directly as the slopes of secant lines for the graphs. Slope is easy to perceive even when the constant offsets of the secant lines are irrelevant. The cumulative approach avoids binning that smooths over deviations of the subpopulation from the full population. Such cumulative aggregation furnishes both high-resolution graphical methods and simple scalar summary statistics (analogous to those of Kuiper and of Kolmogorov and Smirnov used in statistical significance testing for comparing probability distributions).


Introduction
Analysis of whether a subpopulation has had equitable treatment often assesses whether similar individuals attain similar outcomes irrespective of the individuals' memberships in the subpopulation. To effect the comparison, each individual gets a real-valued "score" used to match that individual with other similar individuals (such matching could use propensity scores or a suitable covariate, for instance). Each individual also gets a real-valued outcome of the treatment. A formal assessment could then consider how much the average outcome for individuals with a given score that belong to the subpopulation deviates from the average outcome for all individuals from the full population that have that same score.
When there are only finitely many observations, questions of statistical significance arise; for example, if the scores are predicted probabilities and the outcomes are drawn from independent (but not necessarily identically distributed) Bernoulli distributions with parameters given by the predicted probabilities, then the average outcome for individuals with a given score fluctuates across different random samples defining the subpopulation being analyzed. In such scenarios, the average outcome for the observed subpopulation would be expected to deviate stochastically from the average for the full population. Furthermore, each individual in the sampled subpopulation may very well have a different score from all the others, requiring some aggregation of scores in order to average away the statistical noise. The conventional approach is to partition the scores into some number of bins and calculate averages separately for every bin, trading off resolution in the scores for increased confidence in the statistical estimates. The present paper proposes an alternative based on cumulative statistics that avoids the necessarily arbitrary binning (binning that typically follows heuristics discussed shortly); the cumulative approach yields graphical methods as well as scalar summary statistics that are similar to the Kolmogorov-Smirnov and Kuiper metrics familiar from statistical significance testing for comparing probability distributions.
More concretely, subpopulations commonly considered include those associated with protected classes (such as those defined by race, color, religion, gender, national origin, age, disability, veteran status, or genetic information) and those associated with biomedicine (such as diseased, infected, treated, or recovered). The present paper discusses generic methodology applicable to all cases, illustrating the methods via interesting subpopulations that are unlikely to court controversy via their consideration; this paper avoids giving examples based on subpopulations defined by sensitive classes of direct interest to hot-button issues. The illustrative examples presented below are deliberately anodyne (but hopefully still sufficiently engaging) to avoid distracting from the focus on the statistical methodology being proposed.
Mathematically speaking, we consider m real-valued observations R 1 , R 2 , . . . , R m of the outcomes of independent trials with corresponding real-valued scores S 1 , S 2 , . . . , S m (where the scores very well may determine the probability distributions of the trials; for example, if 0 ≤ S i ≤ 1, then R i could be the outcome of a Bernoulli trial whose probability of success is S i ). We view R 1 , R 2 , . . . , R m (but not S 1 , S 2 , . . . , S m ) as random. Without loss of generality, we order the scores (preserving the pairing of R i with S i for every i) such that S 1 ≤ S 2 ≤ · · · ≤ S m , ordering any ties at random, perturbed so that S 1 < S 2 < · · · < S m . We consider a subset of indices corresponding to members of a subpopulation of interest, say i 1 , i 2 , . . . , i n , with n < m; without loss of generality, we order the indices such that 1 ≤ i 1 < i 2 < · · · < i n ≤ m. Each observation R i and score S i may also come with a positive weight W i ; however, we focus first on the simpler, more common case of uniform weighting, in which W 1 = W 2 = · · · = W m , and generalize to arbitrary weights later (in Subsection 2.5 below).
The classical methods require choosing some partitions of the real line into disjoint intervals with endpoints B 1 , B 2 , . . . , B −1 and another (possibly the same) disjoint intervals with endpointsB 1 ,B 2 , . . . ,B −1 such that B 1 < B 2 < · · · < B −1 andB 1 <B 2 < · · · <B −1 . We can then form the averages for the subpopulation and for the full populationỸ for k = 1, 2, . . . , , under the convention that B 0 =B 0 = −∞ and B =B = ∞. We also calculate the average scores in the bins for the subpopulation and for the full populationX for k = 1, 2, . . . , , under the same convention that B 0 =B 0 = −∞ and B =B = ∞. A graphical method for assessing the deviation of the subpopulation from the full population is then to scatterplot the pairs (X 1 , Y 1 ), (X 2 , Y 2 ), . . . , (X , Y ) in black and the pairs (X 1 ,Ỹ 1 ), (X 2 ,Ỹ 2 ), . . . , (X ,Ỹ ) in gray. Comparing the black plotted points (possibly connected with black lines) to the gray plotted points (possibly connected with gray lines) then indicates how much the subpopulation deviates from the full population. Especially when assessing the calibration or reliability of probabilistic predictions, this graphical method is known as a "reliability diagram" or "calibration plot," as reviewed, for example, by [1] and [2]. A full review of the literature is available in Subsection A.1 of Appendix A. There are at least two common choices of the bins whose endpoints are B 1 , B 2 , . . . , B −1 (and similarly for the bins whose endpoints areB 1 ,B 2 , . . . ,B −1 ). The first is to make B 1 , B 2 , . . . , B −1 be equispaced. The second is to select B 1 , B 2 , . . . , B −1 such that the number of scores from the subpopulation that fall in the kth bin, that is, #{j : B k−1 < S ij ≤ B k }, is the same for all k (aside from the rightmost bin, that for k = , if n is not perfectly divisible by ). In both cases, the number of bins is a parameter that we can vary to trade-off higher-confidence estimates for finer resolution in detecting deviation as a function of score (and vice versa). Unfortunately, no choice can fully offset how the difference between the subpopulation and the full population is typically the primary interest, whereas the standard plot bins the subpopulation and the full population separately (potentially smoothing away information due to the discretization). Plotting the difference directly would solve this particular problem. Even then, however, no choice of bins can be optimal for all possible distributions of scores or for all possible distributions of deviations between the subpopulation and the full population. Binning will always discretize the distributions, smoothing away potentially important information.
Fortunately, binning so coarsely is unnecessary in the methods proposed below. The methods discussed below employ exactly one bin per score S ij , j = 1, 2, . . . , n, thus viewing the subpopulation on its own terms, dictated by the observations from the subpopulation. Employing only one bin per score would be nonsensical in the classical plots, as then the classical plots would trade-off all statistical confidence for the finest resolution possible. The cumulative methods sacrifice no resolution that is available in the observed data for the subpopulation.
For a simple illustrative example, Figure 1 displays both the conventional reliability diagrams as well as the cumulative plot proposed below. A detailed description is available in Subsection 3.1 below. The lowermost two rows of Figure 1 are the classical diagrams, with m = 50,000 and n = 5,000; there are = 10 bins for each diagram in the second row and = 50 for each diagram in the third row. In the lowermost two rows, the bins are equispaced along the scores in the leftmost plots, whereas each bin contains the same number of scores from S i1 , S i2 , . . . , S in (or from S 1 , S 2 , . . . , S m ) in the rightmost plots. The diagram of Figure 2 plots the ordered pairs (S 1 , P 1 ), (S 2 , P 2 ), . . . , (S m , P m ) in gray and (S i1 , P i1 ), (S i2 , P i2 ), . . . , (S in , P in ) in black, where P 1 , P 2 , . . . , P m are the expected values of R 1 , R 2 , . . . , R m , respectively; for this example, R 1 , R 2 , . . . , R m are drawn independently from Bernoulli distributions with probabilities of success P 1 , P 2 , . . . , P m . Thus, Figure 2 depicts the "ground-truth" expectations that the lowermost two rows of classical plots in Figure 1 are trying to characterize. The gray points correspond to the full population, while the solid black points correspond to the subpopulation.
The topmost row of Figure 1 displays both the cumulative plot introduced below as well as its ideal noiseless "ground-truth" constructed using the expected values P 1 , P 2 , . . . , P m of the random observations R 1 , R 2 , . . . , R m . Leaving elucidation of the cumulative plots and their construction to Section 2 below, we just point out here that the deviation of the subpopulation from the averages over the full population across an interval is equal to the slope of the secant line for the graph over that interval, aside from expected stochastic fluctuations detailed in Subsection 2.4 below. Steep slopes correspond to substantial deviations across the ranges of scores where the slopes are steep, with the deviation over an interval exactly equal to the expected value of the slope of the secant line for the graph over that interval. The cumulative plot closely resembles its ideal ground-truth in Figure 1, and the Kolmogorov-Smirnov and Kuiper metrics conveniently summarize the statistical significance of the overall deviation across all scores, in accord with Subsections 2.3 and 2.4 below.
The structure of the remainder of the present paper is as follows: Section 2 details the statistical methodology that Section 3 illustrates via numerical examples. 1 Section 3 also proposes avenues for further development. Section 4 then briefly summarizes the methods and numerical results. Appendix A describes methods for calibrating probabilistic predictions that are analogous to the cumulative methodology introduced in Section 2 for assessing deviation of a subpopulation from the full population. Appendix A also contains a literature review, in Subsection A.1; please consult Subsection A.1 for a discussion of related work. Appendix B warns about potentially tempting overinterpretations of the various plots. Table 1 gives a glossary of the notation used throughout except in the appendices, while Table 2 gives a glossary of the notation used in the appendices.

Methods
This section formulates the cumulative statistics mathematically, with Subsection 2.1 proposing a workflow for largescale analysis. Subsection 2.2 details the graphical methods. Subsection 2.3 details the scalar summary metrics. Subsection 2.4 discusses statistical significance for both the graphical methods and the summary statistics. Subsection 2.5 presents a generalization of these statistical methodologies to the case of weighted samples (beyond just equally or uniformly weighted).

High-level strategy
This subsection suggests a hybrid method for large-scale data analysis. When there are many data sets and subpopulations to assess, a two-step approach may be the most practical: 1. A screening stage assigning a single scalar summary statistic to each pair of data set and subpopulation (where the size of the statistic measures the deviation of the subpopulation from the full population).
2. A detailed drill-down for each pair of data set and subpopulation whose scalar summary statistic is large, drilling down into how the deviation of the subpopulation from the full population varies as a function of score.
The drill-down relies on graphically displaying the deviation of the subpopulation from the full population as a function of score; the scalar statistic for the first stage simply summarizes the overall deviation across all scores, as   Figure 2 displays the ground-truth reliability diagram. The reliability diagram with 50 bins that each contain the same number of scores from the subpopulation is able to detect the notch around scores of 0.25; however, the oscillation of the bin frequencies for the subpopulation complicates disentangling real variations from statistical noise. The reliability diagrams that each have only 10 bins exhibit fewer random oscillations, but smear out the notch. In the reliability diagrams, the averages for the subpopulation are black, while the averages for the full population are gray. In the top row, the plot of cumulative deviation resolves the notch nicely while displaying minimal random fluctuations across the full range of scores. The scalar summary statistics of Kuiper and of Kolmogorov and Smirnov very successfully detect the statistically significant deviation of the subpopulation from the full population.
either the maximum absolute deviation of the graph or the size of the range of deviations in the graphical display. Thus, for each pair of data set and subpopulation, both stages leverage a graph; the first stage collapses the graph into a single scalar summary statistic. The following subsection constructs the graph.

Graphical method
This subsection details the construction of cumulative plots. Two sequences will define the cumulative plot based on the notation set in the introduction. The cumulative sequence for the subpopulation is for k = 1, 2, . . . , n.
Facilitating comparison of the subpopulation with the full population, the average result for the full population in a bin around S i k isR for k = 1, 2, . . . , n, where the thresholds for the bins are for k = 0, 1, 2, . . . , n, under the convention that S i0 = −∞ and S in+1 = ∞ (so B 0 = −∞ and B n = ∞). The cumulative sequence for the full population at the subpopulation's subset of scores is for k = 1, 2, . . . , n, whereR ij is defined in (6). Although the accumulation from lower scores might at first glance appear to overwhelm the contributions from higher scores, a plot of F k −F k as a function of k will display deviation of the subpopulation from the full population for any score solely in slopes that deviate significantly from 0; any problems accumulated from earlier, lower scores pertain only to the constant offset from 0, not to the slope deviating from 0. In fact, the increment in the expected thus, on a plot with the values for k spaced 1/n apart, the slope from j = k − 1 to j = k is where the expected value of the average resultR i k in the bin around S i k is equal to the average of the expected results, that is, for k = 1, 2, . . . , n, andR i k is defined in (6). The subpopulation deviates from the full population for the scores near S i k when ∆ k is significantly nonzero, that is, when the slope of the plot of F k −F k deviates significantly from horizontal over a significantly long range.
To emphasize: the deviation of the subpopulation from the full population over a contiguous range of S i k is the slope of the secant line for the plot of F k −F k as a function of k n over that range, aside from expected random fluctuations. Figure 1 presents a simple illustrative example, and many examples analyzing data from a popular data set in computer vision, "ImageNet," are available in Subsection 3.2. The leftmost plot in the topmost row of Figure 1 graphs F k −F k versus k/n; the rightmost plot in the top row is the ideal noiseless "ground-truth" constructed using the precise expected values of the random observations R 1 , R 2 , . . . , R m (as detailed in Subsection 3.1, the exact expectations are available for constructing the ground-truth in the synthetic example corresponding to Figure 1).
Steep slopes correspond to substantial deviations across the ranges of scores where the slopes are steep, with the deviation over an interval exactly equal to the expected value of the slope of the secant line for the graph over that interval. The cumulative plot nicely matches its ideal ground-truth in Figure 1.
The following subsection discusses metrics summarizing how much the graph deviates from 0 (needless to say, if the slopes of the secant lines are all nearly 0, then the whole graph cannot deviate much from 0). Subsection 2.4 then leverages these metrics in a discussion of the expected stochastic fluctuations mentioned above.

Scalar summary statistics
This subsection details the construction of scalar statistics which provide broad-brush summaries of the plots introduced in the previous subsection.
Two standard metrics for the overall deviation of the subpopulation from the full population over the full range of scores that account for expected random fluctuations are that due to Kolmogorov and Smirnov, the maximum absolute deviation and that due to Kuiper, the size of the range of the deviations where F 0 = 0 =F 0 ; the following remark explains the reason for including F 0 andF 0 (a reason that often makes D modestly preferable to G). Under appropriate statistical models, G and D can form the basis for tests of statistical significance, the context in which they originally appeared; see, for example, Section 14.3.4 of [3]. For assessing statistical significance (rather than overall effect size), G and D should be rescaled larger by a factor proportional to √ n; further discussion of the rescaling is available in the next subsection. The captions of the figures report the values of these summary statistics for all examples. Remark 1. The statistic D from (13) has the same statistical power across all indices. Indeed, shifting the index in the definitions of F k andF k where the summation starts has no effect on the value of D, for the following reasons: Recall that an integral whose lower limit of integration is greater than the upper limit is simply the negative of the integral with the lower and upper limits of integration interchanged. Thus, the natural generalization when starting from arbitrary values of j the summations defining F k in (5) andF k in (8) is andF for k = 0, 1, 2, . . . , n; = 0, 1, 2, . . . , n, with the case = 0 reducing to (5) and (8). Of course, the first summation in the right-hand side of (14) vanishes when k ≤ and the second summation vanishes when k ≥ ; similarly, the first summation in the right-hand side of (15) vanishes when k ≤ and the second summation vanishes when k ≥ . Consideration of each case, for k < , for k = , and for k > , yields that for k = 0, 1, 2, . . . , n; = 0, 1, 2, . . . , n. The definition when combined with (16) and (17), then yields that for = 0, 1, 2, . . . , n, where D is from (13). This shows that the statistic D has the same statistical power for any index, as the statistic is invariant to shifts in where the summation for the cumulative differences starts.

Significance of stochastic fluctuations
This subsection discusses statistical significance both for the graphical methods of Subsection 2.2 and for the summary statistics of Subsection 2.3. The plot of F k −F k as a function of k/n automatically includes some "confidence bands" courtesy of the discrepancy F k −F k fluctuating randomly as the index k increments -at the very least, the "thickness" of the plot coming from the random fluctuations gives a sense of "error bars." To give a rough indication of the size of the fluctuations of the maximum deviation expected under the hypothesis that the subpopulation does not deviate from the full population in the actual underlying distributions, the plots should include a triangle centered at the origin whose height above the origin is proportional to 1/ √ n. Such a triangle can be a proxy for the classic confidence bands around an empirical cumulative distribution function introduced by Kolmogorov and Smirnov, as reviewed by [4]. Indeed, a driftless, purely random walk deviates from zero by roughly √ n after n steps, so a random walk scaled by 1/n deviates from zero by roughly 1/ √ n. Identification of deviation between the subpopulation and the full population is reliable when focusing on long ranges (as a function of k/n) of steep slopes for F k −F k ; the triangle gives a sense of the length scale for variations that arise solely due to randomness even in the absence of any actual underlying deviation between the subpopulation and the full population. A simple illustrative example is available in Figure 1, and many examples analyzing data from a popular data set in computer vision, "ImageNet," are available in Subsection 3.2.
In cases for which either R i = 0 or R i = 1 for each i = 1, 2, . . . , m, and for which the scores are nothing but the probabilities of success, that is, S i is the probability that R i = 1 (where i = 1, 2, . . . , m), the tip-to-tip height of the triangle centered at the origin should be 4/n times the standard deviation of the sum of independent Bernoulli variates with success probabilities S i1 , S i2 , . . . , S in , that is, 4 n k=1 S i k (1 − S i k )/n. This height will be representative to within a factor of √ 2 or so provided that the subpopulation is a minority of the full populationsee Remark 7 and Subsubsection A.2.4 of Appendix A below. Needless to say, similar remarks pertain whenever the variance of R i is a known function of S i for each i = 1, 2, . . . , m.
In cases for which either R i = 0 or R i = 1 for each i = 1, 2, . . . , m, and for which there are many scores from the full population in the bin for each score from the subpopulation, that is, #{i : B k−1 < S i ≤ B k } is large for every k = 1, 2, . . . , n, the average of the outcomes for each bin will be a good approximation to the average of the underlying probabilities of success for that bin, that is, for k = 1, 2, . . . , n, where B k is from (7), E[R i k ] is from (11), and E[R i ] is the probability that the outcome is a success, that is, the probability that R i = 1. In such cases, the tip-to-tip height of the triangle at the origin should be 4/n times the standard deviation of the sum of independent Bernoulli variates with success probabilities E[R i1 ], /n -and we may use (20) to approximate this height as four times whereR i k is the left-hand side of (20), as seen from (6). The triangles in the figures all have this height when each R i is either 0 or 1, since the numerical results reported in the following section pertain to the case in which there are quite a few scores from the full population in the bin for each score from the subpopulation. When R i can take on more than two possible values, the figures instead use the empirical variance of the members from the full population for each bin from B k−1 to B k , that is, we replace (21) with where the empirical variance is for k = 1, 2, . . . , n, withR i k defined in (6);R i k is also the left-hand side of (20).
[Rigorous justification of (20) is straightforward: the expected value of the left-hand side of (20) is the right-hand side of (20), and 0 ≤ E[R i ] ≤ 1 implies that E[R i ] (1 − E[R i ]) ≤ 1/4, so the standard deviation of the left-hand side of (20) is which converges to 0 as #{i : Remark 2. Interpreting the scalar summary statistics G and D from (12) and (13) is straightforward in these latter cases, using σ defined in (21) or (22). Indeed, under the null hypothesis that the subpopulation has no deviation from the average values of the full population at the corresponding scores, the expected value of G/σ is less than or equal to the expected value of the maximum (over a subset of the unit interval [0, 1]) of the absolute value of the standard Brownian motion over [0, 1], in the limit n → ∞ and #{i : B k−1 < S i ≤ B k } → ∞ for all k = 1, 2, . . . , n.
As reviewed below in Remark 7, the expected value of the maximum of the absolute value of the standard Brownian motion over the unit interval [0, 1] is π/2 ≈ 1.25; and the discussion by [5] immediately following Formula 44 of the associated arXiv publication 2 shows that the probability distribution of the maximum of the absolute value of the standard Brownian motion over [0, 1] is sub-Gaussian, decaying past its mean π/2 ≈ 1.25. So, values of G/σ much greater than 1.25 imply that the subpopulation's deviation from the full population is significant, while values of G/σ close to 0 imply that G did not detect any statistically significant deviation. Similar remarks pertain to D, since G ≤ D ≤ 2G.

Weighted sampling
This subsection generalizes the methods of the preceding subsections to the case of weighted samples. Specifically, some data sets include a weight for how much each observation should contribute to the data analysis. In the setting of Subsection 2.2 above, such a data set would supplement the results R 1 , R 2 , . . . , R m and scores S 1 , S 2 , . . . , S m with positive weights W 1 , W 2 , . . . , W m . Subsection 2.2 will correspond to the special case that W 1 = W 2 = · · · = W m (admittedly, the "special" case is the standard in practice).
In a plot of the weighted cumulative differences F 1 −F 1 , F 2 −F 2 , . . . , F n −F n from (25) and (27) versus the cumulative weights A 1 , A 2 , . . . , A n from (28), that is, in a plot where F 1 −F 1 , F 2 −F 2 , . . . , F n −F n are the ordinates (vertical coordinates), and A 1 , A 2 , . . . , A n are the corresponding abscissae (horizontal coordinates), the expected value of the slope from A k−1 to A k is where the expected value of the weighted average resultR i k in the bin around S i k is equal to the weighted average of the expected results, that is, for k = 1, 2, . . . , n, andR i k is defined in (26). Thus, ∆ k defined in (29) is equal to the expected value of the deviation of the subpopulation from the full population for the scores near S i k , with W i k canceling in the rightmost identity of (29). Hence, the subpopulation deviates from the full population for the scores near S i k when ∆ k is significantly nonzero, that is, when the slope of the plot of F k −F k versus A k deviates significantly from horizontal over a significantly long range.
To emphasize: the deviation of the subpopulation from the full population over a contiguous range of S i k is the slope of the secant line for the plot of F k −F k as a function of A k over that range, aside from expected random fluctuations.
A simple illustrative example is available in Figure 7 of Subsection 3.1, and many examples analyzing data from the U.S. Census Bureau are available in Subsection 3.3.
The slope of line segments connecting the points in the plot of F k −F k versus A k is constant between successive values of k, and those successive values are spaced further apart on the horizontal axis when the weight W i k is larger. A plotted line that is straight for a wide horizontal range is therefore indicative of a large weight. Moreover, setting the (major) ticks on the upper horizontal axis at the positions corresponding to equispaced values for k visually depicts the distribution of weights; including equispaced minor ticks on the same upper horizontal axis provides a comparison to the case of uniform weighting.
In cases for which either R i = 0 or R i = 1 for each i = 1, 2, . . . , m, and for which there are many scores from the full population in the bin for each score from the subpopulation, that is, #{i : B k−1 < S i ≤ B k } is large for every k = 1, 2, . . . , n, the tip-to-tip height of the triangle at the origin should be four times whereR i k is defined in (26); that is, we replace (21) with (31). When R i can take on more than two possible values, the figures instead use the empirical variance of the members from the full population for each bin from B k−1 to B k , that is, we replace (22) and (31) with where the empirical variance is for k = 1, 2, . . . , n, withR i k defined in (26). The numerators of (31) and (32) include the square (W i k ) 2 , unlike the other formulae. The scalar summary statistics due to Kuiper and to Kolmogorov and Smirnov of course use the same formulae (12) and (13) as in the unweighted (or uniformly weighted) case, except for replacing the definition of F k from (5) with the definition from (25) and the definition ofF k from (8) with the definition from (27).

Remark 4.
We can adapt to the case of weighted sampling the classical methods discussed in the introduction. As in the introduction, we choose some partitions of the real line into disjoint intervals with endpoints B 1 , B 2 , . . . , B −1 and another (possibly the same) disjoint intervals with endpointsB 1 ,B 2 , . . . ,B −1 such that B 1 < B 2 < · · · < B −1 andB 1 <B 2 < · · · <B −1 , and then replace (1) with the weighted averages for the subpopulation and replace (2) with the weighted averages for the full populatioñ for k = 1, 2, . . . , , under the convention that B 0 =B 0 = −∞ and B =B = ∞. We also replace (3) with the weighted averages of the scores in the bins for the subpopulation and replace (4) with the weighted averages for the full populatioñ for k = 1, 2, . . . , , under the same convention that B 0 =B 0 = −∞ and B =B = ∞. The reliability diagram for assessing the deviation of the subpopulation from the full population is then the scatterplot of the pairs (X 1 , Y 1 ), (X 2 , Y 2 ), . . . , (X , Y ) in black and the pairs (X 1 ,Ỹ 1 ), (X 2 ,Ỹ 2 ), . . . , (X ,Ỹ ) in gray. Comparing the black plotted points (possibly connected with black lines) to the gray plotted points (possibly connected with gray lines) gives an indication of deviation of the subpopulation from the full population. Two natural choices of the bins whose endpoints are B 1 , B 2 , . . . , B −1 (similar choices pertain to the bins whose endpoints areB 1 ,B 2 , . . . , has a similar value for all k = 1, 2, . . . , . Remark 5 below details the procedure we followed in the second case; the plots entitled, "reliability diagram ( W 2 / W 1 is similar for every bin)," display this second possible choice of bins. The plots entitled simply, "reliability diagram," display the first possible choice of bins. In the special case that the weights are uniform, that is, W 1 = W 2 = · · · = W m , the second choice of bins results in every bin containing about the same number of scores, with U 1 ≈ U 2 ≈ · · · ≈ U ≈ /n, and similarlyŨ 1 ≈Ũ 2 ≈ · · · ≈Ũ ≈ /m, wherẽ for k = 1, 2, . . . , .
Remark 5. In the case of weighted sampling, the most useful reliability diagrams are usually those entitled, "reliability diagram ( W 2 / W 1 is similar for every bin)." These diagrams construct bins with endpoints B 0 , B 1 , . . . , . These diagrams also construct˜ bins with endpoints B 0 ,B 1 , . . . ,B˜ such thatŨ 1 ≈Ũ 2 ≈ · · · ≈Ũ˜ , whereŨ k is defined in (39). The algorithmic details are as follows: Given a value U for which hopefully U k ≈ U for all k = 1, 2, . . . , , we set B 0 = −∞ and, iterating from k = 1 to k = , incrementally increase B k to the least value greater than B k−1 such that U k ≤ U . If this causes the bin (the bin for the highest scores) to contain less than half as many subpopulation observations as bin − 1, that is, #{j : , then we merge bin with bin − 1. In the aforementioned algorithm, we computed U via the formula where p 1 , p 2 , . . . , p n are a uniformly random permutation of the integers i 1 , i 2 , . . . , i n , and¯ is the desired number of bins. Calculating U via the heuristic (40) worked well for all examples reported below, producing U 1 ≈ U 2 ≈ · · · ≈ U , with close to¯ ; the procedure yieldingŨ 1 ≈Ũ 2 ≈ · · · ≈Ũ˜ is similar (in fact, the implementation is identical, viewing the full population as a subpopulation of itself).

Results and discussion
This section illustrates via numerous examples the previous section's methods, together with the traditional plots -so-called "reliability diagrams" -discussed in the introduction. 3  The figures display the classical calibration plots ("reliability diagrams") as well as both the plots of cumulative differences and the exact expectations in the absence of noise from random sampling when the exact expectations are known (as with synthetically generated data). The captions of the figures discuss the numerical results depicted.
The title, "subpopulation deviation is the slope as a function of k/n," labels a plot of F k −F k from (5) and (8) as a function of k/n. In each such plot, the upper axis specifies k/n, while the lower axis specifies S i k for the corresponding value of k. The title, "subpopulation deviation is the slope as a function of A k ," labels a plot of F k −F k from (25) and (27) versus the cumulative weight A k from (28). In each such plot, the major ticks on the upper axis specify k/n, while the major ticks on the lower axis specify S i k for the corresponding value of k; the points in the plot are the ordered pairs (A k , F k −F k ) for k = 1, 2, . . . , n, with A k being the abscissa and F k −F k being the ordinate.
The titles, "reliability diagram," "reliability diagram (equal number of subpopulation scores per bin)," and "reliability diagram ( W 2 / W 1 is similar for every bin)," label a plot of the pairs ( (1) To give a sense of the uncertainties in the classical, binned plots, we vary the number of bins and observe how the plotted values vary. Displaying the bin frequencies is another way to indicate uncertainties, as suggested, for example, by [6]. Still other possibilities could use kernel density estimation, as suggested, for example, by [7] and [8]. Such uncertainty estimates require selecting widths for the bins or kernel smoothing; varying the widths (as done in the present paper) avoids having to make what would otherwise be a rather arbitrary choice. A thorough survey of the various possibilities is available in Chapter 8 of [8].
As noted in the introduction, there are two canonical choices for the bins in the case of unweighted (or uniformly weighted) sampling: {1} make the average of S i k (or S i ) in each bin be approximately equidistant from the average of S i k (or S i ) in each neighboring bin or {2} make the number of S i k (or S i ) in every bin (except perhaps for the last) be the same. The figures label the first, more conventional possibility with the short title, "reliability diagram," and the second possibility with the longer title, "reliability diagram (equal number of subpopulation scores per bin)." As noted in Remark 4, there are two natural choices for the bins in the case of weighted sampling: {1} make the weighted average of S i k (or S i ) in each bin be approximately equidistant from the weighted average of S i k (or S i ) in each neighboring bin or {2} follow Remark 5 above. The figures label the first possibility with the short title, "reliability diagram," and the second possibility with the longer title, "reliability diagram ( W 2 / W 1 is similar for every bin)." Setting the number of bins together with any of these choices fully specifies the bins. As discussed earlier, we vary the number of bins, since there is no perfect setting -using fewer bins offers higher-confidence estimates, yet limits the resolution for detecting deviations and for assessing how the deviations vary as a function of S i k .

Synthetic
In this subsection, the examples draw observations at random from various statistical models so that the underlying "ground-truth" is available. To generate the corresponding figures, Figures 1-8, we specify values for the scores S 1 , S 2 , . . . , S m and for the indices i 1 , i 2 , . . . , i n . We also select probabilities P 1 , P 2 , . . . , P m and then independently draw the outcomes R 1 , R 2 , . . . , R m from the Bernoulli distributions with parameters P 1 , P 2 , . . . , P m , respectively.  (5) and (8) as a function of k/n, with the rightmost plot displaying its noiseless expected value rather than using the random observations R 1 , R 2 , . . . , R m . The top row of Figure 7 plots F k −F k from (25) and (27) versus the cumulative weight A k from (28), with the rightmost plot displaying its noiseless expected value rather than using the random observations R 1 , R 2 , . . . , R m . Figures 2, 4, 6, and 8 plot the pairs (S 1 , P 1 ), (S 2 , P 2 ), . . . , (S m , P m ) in gray, and plot the pairs (S i1 , P i1 ), (S i2 , P i2 ), . . . , (S in , P in ) in black, producing ground-truth diagrams that the lowermost two rows of plots from the associated Figures 1, 3, 5, and 7 are trying to estimate using only the observations R 1 , R 2 , . . . , R m , without access to the underlying probabilities P 1 , P 2 , . . . , P m .
For the examples, we consider full populations which are mixtures of many subpopulations, including for each example one specific subpopulation that we analyze for deviations from the average over the full population. To make the models realistic, we assign expected outcomes to the members of the full population such that the expected outcomes span a continuous range which includes the expected outcomes attained by members of the subpopulation being analyzed for deviations. The reliability diagrams use gray points and gray connecting-lines to indicate the full population, and solid black points and black connecting-lines to indicate the specific subpopulation under consideration.
For the first example, corresponding to Figures 1 and 2, we consider a mixture of subpopulations with a reasonably wide range of expected outcomes for most ranges of scores, except for a narrow notch around scores of 0.25 which lacks the significant subpopulation deviation. We select for the specific subpopulation analyzed for deviations a subpopulation with among the highest expected outcomes around every score. The scores are S j = ((j − 0.5)/m) 2 for j = 1, 2, . . . , m, thus more concentrated near 0 than near 1. For the indices i 1 , i 2 , . . . , i n of the subpopulation, we start with the integer multiples of 20 and then add three levels of refinement increasingly focused around m/2, where m/2 corresponds to the middle of the notch. The total number of scores considered is m = 50,000, and (following the refinement) the number of subpopulation indices is n = 5,000.
For the second example, corresponding to Figures 3 and 4, we consider a mixture of subpopulations that includes one whose expected outcomes oscillate smoothly between the minimum and the maximum of the expected outcomes of all subpopulations as a function of the score; we analyze that particular one subpopulation for deviations from the full population. The scores are S j = (j − 0.5)/m for j = 1, 2, . . . , m, hence equispaced between 0 and 1. The total number of scores considered is m = 50,000. For the indices i 1 , i 2 , . . . , i n of the subpopulation, we raise to the power 4/3 each positive integer, then round to the nearest integer, and finally retain the lowest n = 3,300 unique resulting integers.
For the third example, corresponding to Figures 5 and 6, we consider a mixture of subpopulations similar to that for the second example, but this time selecting for analysis a subpopulation whose expected outcomes oscillate in discrete steps between the minimum and the maximum of the expected outcomes of all subpopulations as a function of the score. The scores are S j = (j − 0.5)/m) for j = 1, 2, . . . , m, thus less concentrated near 0 than near 1. For the indices i 1 , i 2 , . . . , i n of the subpopulation, we generate a random permutation of the integers 1, 2, . . . , m, retain the first n, and then sort them. The total number of scores considered is m = 50,000, and the number of subpopulation indices is n = 2,500.
For the fourth example, corresponding to Figures 7 and 8, we consider weighted samples (whereas the three previous examples all used unweighted or uniformly weighted data, in which all weights would be equal). We start by constructing the full population as a mixture of subpopulations whose probabilities of success oscillate between 0 and 0.2, and with the subpopulation being analyzed selected uniformly at random such that the probability of success for each observation is 0. The scores are S j = (j − 0.5)/m for j = 1, 2, . . . , m, hence equispaced between 0 and 1. The total number of scores considered is m = 50,000, and the number of subpopulation indices is n = 2,500. All observations just mentioned we weight equally with weight 1, and then introduce three outliers near the score 0.75: one being an observation from the subpopulation, altering its probability of success to 1 and its weight to 0.02n, and the other two being the two members from the full population not belonging to the subpopulation whose scores are adjacent to the score for the observation from the subpopulation, setting the probabilities of success for these two members to 0 and 1, both with the same weight 0.002m.
The captions of the figures comment on the numerical results displayed.
Remark 6. If (as in Figures 7 and 8) the weight W i for some single observation is extraordinarily disproportionately high, then any bin containing that observation will be strongly biased toward only that individual observation, blind to the other observations in the bin. Such a phenomenon can lead to misleading behavior akin to Simpson's Paradox of [9] in the canonical binned reliability diagrams; for instance, the subpopulation may very well attain results R i1 , R i2 , . . . , R in that are less than all others in the full population except for the one disproportionately heavily weighted one, yet if the results R i1 , R i2 , . . . , R in for the subpopulation are greater than the result R i corresponding to the one disproportionately large weight W i , then any bin in reliability diagrams containing that heavily weighted observation will show that the subpopulation attains greater results than the weighted average in the full population. In contrast, a single heavily weighted observation affects only one subpopulation observation in the plots of cumulative differences, merely introducing a jump in the constant offset at the score corresponding to the one observation (and the jump will be disproportionately large only in proportion to the weight of the corresponding observation from the subpopulation). Figure 7 illustrates related behavior, and Figure 22 below exhibits similar behavior for the data from the U.S. Census Bureau analyzed there. Comparing the subpopulation against the full population via cumulative differences is thus similar to subtracting off baseline rates for calibration (that is, similar to "de-trending").

ImageNet
This subsection analyzes the standard training data set "ImageNet-1000" of [10]. Each of the thousand labeled classes in the data set forms a natural subpopulation to consider; each class considered consists of n = 1,300 images of a particular noun (such as a "sidewinder/horned rattlesnake," a "night snake," or an "Eskimo Dog or Husky"). Some classes in the data set contain fewer than 1,300 images, so that the total number of members of the data set is m = 1,281,167, but each subpopulation we analyze below corresponds to a class with 1,300 images. The particular classes reported below are cherry-picked to illustrate a variety of problems that can afflict the classical reliability diagrams; not all classes exhibit such problems, nor are these all the classes that exhibit problems (many, many others have similar issues). The images are unweighted (or, equivalently, uniformly weighted), not requiring the methods of Subsection 2.5 above. To generate the corresponding figures, we calculate the scores S 1 , S 2 , . . . , S m using the pretrained ResNet18 classifier of [11] from the computer-vision module, "torchvision," in the PyTorch software library of [12]; the score for an image can be either {1} the probability assigned by the classifier to the class predicted to be most likely or {2} the corresponding negative log-likelihood (that is, the negative of the natural logarithm of the probability), with the scores randomly perturbed by about one part in 10 8 to guarantee their uniqueness. For i = 1, 2, . . . , m, the result R i corresponding to a score S i is R i = 1 when the class predicted to be most likely is the correct class, and R i = 0 otherwise. The figures below omit display of reliability diagrams whose bins are equispaced with respect to the scores in the typical case that these reliability diagrams are so noisy as to be useless (the figures do display the diagrams when they are not too noisy). Figures 9-19 present several examples, listing in the captions the associated names of the classes for the subpopulations. Figure 20 illustrates the utility of the zooming proposed in Remark 3 above. In all figures, the plots of cumulative differences depict all observed phenomena clearly.

American Community Survey of the U.S. Census Bureau
This subsection analyzes the latest (year 2019) microdata from the American Community Survey of the U.S. Census Bureau; 4 specifically, we consider the full population to be the members from all counties in California put together, and consider the subpopulation to be the observations from an individual county. The sampling in this survey is weighted; we retain only those members whose weights ("WGTP" in the microdata) are nonzero, and discard any member whose household personal income ("HINCP") is zero or for which the adjustment factor to income ("ADJINC") is reported as missing. For the scores, we use the logarithm to base 10 of the adjusted household personal income (the adjusted income is "HINCP" times "ADJINC," divided by one million when "ADJINC" omits its decimal point in the integer-valued microdata), randomly perturbing the scores by about one part in 10 8 to guarantee their uniqueness.

Future outlook
This subsection suggests generalizations suitable for future research. The present paper plots the cumulative differences between the results of a subpopulation and the results of the full population, accumulated as a function of a scalar real-valued score. If instead the scores to be accumulated are from a Euclidean or Hilbert space of more than one dimension, a natural generalization is to impose a total order on the scores via a space-filling curve such as the Peano or Hilbert curves, effectively replacing the space of more than one dimension with a one-dimensional ordering. Space-filling curves constructed via quad-trees in two-dimensional space or oct-trees in three-dimensional space can be especially efficient. Google's S2 Geometry Library, which drives Google Maps, Foursquare, MongoDB, et al., employs a Hilbert curve. 5 The present paper focuses on deviations of a subpopulation from the full population; comparing the subpopulation to the full population is always legitimate, as binning or interpolating from the scores of the full population to those of the subpopulation happens at a scale finer than the subpopulation's sampling, and there is always at least one   Figure 4 displays the ground-truth reliability diagram. Distinguishing random fluctuations from real variations is difficult in the reliability diagrams with 50 bins each. The reliability diagrams that each have only 10 bins could be misleading, as the depicted variations in the subpopulation's outcomes are grossly lower than the actual underlying variations as a function of score. The plot of cumulative deviation is far from perfect, yet captures the exact expectations quite well qualitatively and tolerably well quantitatively. The scalar summary statistics have trouble detecting the significant deviation of the subpopulation from the full population. This illustrates a blind spot in the Kolmogorov-Smirnov and Kuiper statistics, namely, they have a hard time detecting oscillatory discrepancies that average away upon summation. Neither the Kolmogorov-Smirnov metric nor the Kuiper metric is very sensitive to high-frequency deviations whose mean is small. score in the full population corresponding to each score in the subpopulation (namely, the same score). In contrast, assessing deviations between two different subpopulations that do not have an explicit pairing for each score can be problematic. Indeed, the ranges of scores for the subpopulations may not even overlap -the scores for one subpopulation can all be significantly less than all the scores for another subpopulation, for instance. Binning or interpolating from one subpopulation to another can be ill-posed. And even when binning or interpolating from one subpopulation to a second subpopulation is well-posed, binning or interpolating from that second subpopulation to the first subpopulation can be ill-posed -the comparison may not be reflexively well-posed. In general, a partial ordering governs the comparisons -we can always analyze deviations of any subpopulation from any larger population containing the subpopulation, but cannot always reliably analyze deviations between two different subpopulations of the same larger population. Applications demanding direct comparison between different subpopulations are currently the subject of intensive investigation, far beyond the scope of the present paper except in the special case that the different subpopulations are paired for each score. One possibility for the general case is to bin both subpopulations being compared to the same finest binning such that both subpopulations have at least one score in each bin; however, interpreting significance in this general case can be tricky. Forthcoming work will treat the special case for which no score associated with any observation from either subpopulation being compared is exactly equal to the score for any other observation from the two subpopulations.

Conclusion
Plotting the cumulative differences between the outcomes for the subpopulation and those for the full population binned to the subpopulation's scores avoids the arbitrary selection of widths for bins or smoothing kernels that the conventional reliability diagrams and calibration plots require. The plot of cumulative differences displays deviation directly as the slope of secant lines for the graph; such slope is easy to perceive independent of any irrelevant constant offset of a secant line. The graph of cumulative differences very directly enables detection and quantification of deviation between the subpopulation and full population, along with identification of the corresponding ranges of scores. The cumulative differences estimate the distribution of deviation fully nonparametrically, letting the data observations speak for themselves (or nearly for themselves -the triangle at the origin helps convey the scale of a driftless random walk's expected random fluctuations). As seen in the figures, the graph of cumulative differences automatically adapts its resolving power to the distributions of deviations and scores for the subpopulation, not imposing any artificial grid of bins or set-width smoothing kernel, unlike the traditional reliability diagrams and calibration plots. The scalar metrics of Kuiper and of Kolmogorov and Smirnov conveniently summarize the overall statistical significance of deviations displayed in the graph.

Declarations
Ethics approval and consent to participate Not applicable.   Figure 6 displays the ground-truth reliability diagram. The observed reliability diagrams fail to depict the underlying discontinuous jumps in the subpopulation's expected outcomes as a function of the score. The plot of cumulative deviation succeeds in resolving some of the corresponding cusps, but does exhibit significant random fluctuations nearly as high as a quarter of the height of the triangle at the origin. The scalar summary statistics successfully detect the statistically significant deviation of the subpopulation from the full population.

Availability of data and materials
The data sets generated during and/or analyzed during the current study are available in the following repositories: • https://github.com/facebookresearch/fbcdgraph (for all synthetic data sets) • http://image-net.org/download-images (for ImageNet)    Figure 8 displays the ground-truth reliability diagram. The cumulative plot displays the distinguished observation from the subpopulation as a straight, steep jump at its score around 0.75; the constant slope of that steep jump shows that the corresponding high deviation between the subpopulation and the full population is due to a single highly weighted observation. This single observation has no effect on the slopes in the rest of the cumulative plot, whereas the few highly weighted observations dramatically (perhaps misleadingly?) alter the bins around scores of 0.75 in the observed reliability diagrams. The scalar summary statistics report statistically significant deviation of the subpopulation from the full population, though the steep jump in the cumulative plot reduces the effectiveness of the scalar statistics.      Figure 9, none of the reliability diagrams with the same number of subpopulation scores per bin is able to smooth away the irrelevant variations while simultaneously capturing the severe deviation at the highest probabilities (however, the reliability diagram with 10 bins that are equispaced in probabilities captures everything nicely). The scalar summary statistics very successfully detect the statistically highly significant deviation of the subpopulation from the full population.     Figure 11, uniform resolution in the binned plots (whether with respect to observations or to scores) is insufficient. The scalar summary statistics extremely successfully detect the statistically highly significant deviation of the subpopulation from the full population.  Figure 13: Sunglasses, with scores being the probabilities; n = 1,300; Kuiper's statistic is 0.03597/σ = 2.935, Kolmogorov's and Smirnov's is 0.03365/σ = 2.745. The reliability diagrams with only 10 or 30 bins each fail to resolve the very high deviation for probabilities greater than 0.9 (unlike the diagram with 50 bins) -the smaller numbers of bins average away interesting behavior, without warning. The scalar summary statistics detect some statistically significant deviation, yet both are blind to the serious deviation for the highest scores that the plot of cumulative differences displays prominently; the steep drop at the highest scores in the cumulative plot has little to no effect on the Kolmogorov-Smirnov or Kuiper metrics, unfortunately.   Figure 13 -the reliability diagrams with only 10 or 30 bins each underplay the very high deviation for the highest probabilities, smoothing away big deviation without warning. As in Figure 13, the scalar summary statistics detect statistically significant deviation, while both are blind to the significant deviation for the highest scores that the plot of cumulative differences highlights; the steep drop at the highest scores in the cumulative plot has essentially no impact on the Kolmogorov-Smirnov or Kuiper metrics, unfortunately.   , with scores being the negative log-likelihoods; n = 1,300; Kuiper's statistic is 0.01144/σ = 2.856, Kolmogorov's and Smirnov's is 0.01144/σ = 2.856. The deviation of the subpopulation below the full population that occurs for scores greater than 1.5 is apparent only in the cumulative plot or in the (very noisy) reliability diagrams whose bins are roughly equispaced along the scores. The scalar summary statistics detect somewhat significant deviation, but not as much as would be possible if the Kolmogorov-Smirnov and Kuiper metrics were to take fully into account the steep drop that occurs in the plot of cumulative differences.   Figure 16, the deviation of the subpopulation below the full population that occurs for probabilities less than 0.1 is apparent only in the cumulative plot or in the very noisy reliability diagrams whose bins are roughly equispaced along the probabilities. The scalar summary statistics again detect somewhat significant deviation, though not as much as if the Kolmogorov-Smirnov and Kuiper metrics could better account for the steep drop that happens in the plot of cumulative differences.     Figure 18, the reliability diagrams whose bins each contain roughly the same number of subpopulation scores either miss the severest deviations or are noisier compared to the other reliability diagrams for this example; so sometimes choosing bins that are approximately equispaced along the scores works better than choosing bins that each contain a similar number of subpopulation scores. The scalar summary statistics extremely successfully detect the statistically highly significant deviation of the subpopulation from the full population.  Figure 9, the middlemost plots are for the night snake (Hypsiglena torquata) of Figure 11, and the lowermost plots are for the sidewinder/horned rattlesnake (Crotalus cerastes) of Figure 18; each plot on the right zooms in on the origin of the plot to its left, adjusting the height of the triangle at the origin, the number n of observations, and the Kuiper and Kolmogorov-Smirnov statistics to reflect the smaller range of scores on the horizontal axis.   Figure 22: Humboldt County, reporting whether no one in the household 14 or over speaks English very well, with scores being log 10 of the adjusted household income; n = 583; Kuiper's statistic is 0.1028/σ = 6.062, Kolmogorov's and Smirnov's is 0.1028/σ = 6.062. A single, highly weighted outlying observation from the subpopulation corrupts a bin at the highest scores in each reliability diagram. The cumulative plot includes this outlying observation, too, but displays the observation as an unmistakable steep jump in the plotted curve; the constant slope of that steep jump shows that the corresponding high deviation between the subpopulation and the full population is due to a single highly weighted observation. This single observation has no effect on the slopes in the rest of the cumulative plot, so this problematic observation corrupts only the reliability diagrams and not the plot of cumulative differences -an analogue of Simpson's Paradox of [9] afflicts the reliability diagrams but not the cumulative plot. The scalar summary statistics report highly statistically significant deviation commensurate with all the plots, albeit slightly less due to the steep jump in the cumulative plot. These behaviors are analogous to those displayed in Figure 7 above (see also Remark 6 above).  : Orange County, reporting whether the household has high-speed access to the internet, with scores being log 10 of the adjusted household income; n = 10,680; Kuiper's statistic is 0.02499/σ = 6.013, Kolmogorov's and Smirnov's is 0.02484/σ = 5.979. The severe deviation at the very lowest scores is misleadingly underestimated in the reliability diagrams with 10 or 20 bins each, even as compared to the diagrams with around 100 bins each. However, the diagrams with around 100 bins each are far too noisy for other scores. Only the plot of cumulative differences resolves the severe deviation at the lowest scores while being informative at the other scores, too. The scalar summary statistics indicate that the overall deviation is highly statistically significant.  is possible yet difficult in the reliability diagrams, while being unmistakable in the cumulative plot; the reliability diagrams with enough bins do reflect the corresponding deviation at the highest scores, but are hard to interpret without the accompanying cumulative plot. The scalar summary statistics report very highly statistically significant deviation, largely since the number of observations from this largest county in California is so large.  : Napa County, reporting whether the household has access via a satellite dish to the internet, with scores being log 10 of the adjusted household income; n = 679; Kuiper's statistic is 0.02761/σ = 2.259, Kolmogorov's and Smirnov's is 0.02695/σ = 2.205. The intense deviation around scores of 4.6 is apparent in the reliability diagrams with 10 and 20 bins each, but the latter resolves the spike much better while being unfortunately too noisy for many other scores. The plot of cumulative differences resolves the sharp jump around scores of 4.6 without detracting from the display at other scores. The scalar summary statistics report only very mildly statistically significant deviation, unable to fully account for the high deviation around scores of 4.6.

A.1 Introduction to assessing calibration
This appendix treats calibration, specifically, we consider n observations R 1 , R 2 , . . . , R n of the outcomes of independent Bernoulli trials, and would like to test the hypothesis that R j is drawn from a Bernoulli distribution whose probability of success is S j , that is, for all j = 1, 2, . . . , n, where the corresponding probabilities of success are S 1 , S 2 , . . . , S n ; following the usual conventions, R j = 1 when the outcome is a success and R j = 0 when the outcome is a failure. We view R 1 , R 2 , . . . , R n (but not S 1 , S 2 , . . . , S n ) as random. Without loss of generality, we order the success probabilities (preserving the pairing of R j with S j for every j) such that S 1 ≤ S 2 ≤ · · · ≤ S n , ordering any ties at random, perturbed so that S 1 < S 2 < · · · < S n . As in Section 1, the classical methods require partitioning the unit interval [0, 1] into disjoint intervals with endpoints B 1 , B 2 , . . . , B such that 0 < B 1 < B 2 < · · · < B −1 < B = 1. We can then form the averages for k = 1, 2, . . . , , under the convention that B 0 < 0. We also calculate the average success probabilities in the bins for k = 1, 2, . . . , , under the same convention that B 0 < 0. A graphical method for assessing calibration is then to scatterplot the pairs (X 1 , Y 1 ), (X 2 , Y 2 ), . . . , (X , Y ), along with the line connecting the origin (0, 0) to the point (1, 1); a pair (X k , Y k ) lies on that line when its calibration is ideal, as then Y k = X k . This graphical method for assessing the calibration or reliability of probabilistic predictions is known as a "reliability diagram" or "calibration plot," as reviewed, for example, by [13]. Copious examples of such reliability diagrams are available in the figures below, as well as in the works of [1], [2], [6], [7], [8], [14], [15], [16], [17], and many others; those works consider applications ranging from weather forecasting to medical prognosis to fairness in criminal justice to quantifying the uncertainty in predictions of artificial neural networks. An approach closely related to reliability diagrams is to smooth over the binning using kernel density estimation, as discussed by [7], [8], and others. As in Section 1, two common choices of the bins whose endpoints are B 1 , B 2 , . . . , B are {1} to make B 1 , B 2 , . . . , B be equispaced, and {2} to make the number of success probabilities that fall in the kth bin, that is, #{j : B k−1 < S j ≤ B k }, be the same for all k (aside from the rightmost bin, that for k = , if n is not perfectly divisible by ). In contrast to the classical approach, the methods of the following subsections avoid binning. Recently, [18] and [19] have forcefully reiterated serious problems with existing binned methodologies.
There have been a number of proposals to measure calibration via Kolmogorov-Smirnov statistics, without binning, including by [18] and [13]. Section 3.2 of [15], Chapter 8 of [8], and Figure 1 of [18] also point to the utility of cumulative reliability diagrams and plots somewhat similar to those in the present paper. The particular plots proposed below focus on calibration specifically, encoding miscalibration directly as the slopes of secant lines for the graphs. Such plots lucidly depict miscalibration with significant quantitative precision. Popular graphical methods for assessing calibration appear not to leverage the key to the approach advocated below, namely that slope is easy to assess visually even when the constant offset of the graph (or portion of the graph under consideration) is arbitrary and meaningless.
To illustrate with an example, Figure 27 displays both the classical reliability diagrams as well as the plot of cumulative differences proposed below. Subsubsection A.3.2 below describes the figure in detail. The lowermost two rows of Figure 27 are the classical diagrams, with n = 1,281,167; there are = 1,000 bins for each diagram in the middlemost row and = 100 for each diagram in the lowermost row. The bins are equispaced along the probabilities in the leftmost reliability diagrams, whereas each bin contains the same number of probabilities from S 1 , S 2 , . . . , S n in the rightmost reliability diagrams. The light gray lines indicate "error bars" constructed via bootstrapping, as elaborated in Subsection A.3 below.
The topmost row of Figure 27 displays the cumulative plot. Leaving elucidation of the cumulative plots and their construction to Subsection A.2 below, we just point out here that miscalibration across an interval is equal to the slope of the secant line for the graph over that interval, aside from expected stochastic fluctuations detailed in Subsubsection A.2.4 below. Steep slopes correspond to substantial miscalibration across the ranges of probabilities where the slopes are steep, with the miscalibration over an interval exactly equal to the expected value of the slope of the secant line for the graph over that interval. In Figure 27, the cumulative plot reveals a curious change-point in calibration that is difficult to notice in the conventional reliability diagrams, and the Kolmogorov-Smirnov and Kuiper metrics conveniently summarize the statistical significance of the overall deviation across all scores, in accordance with Subsubsections A.2.3 and A.2.4 below. The structure of the remainder of this appendix is as follows: Subsection A.2 details the cumulative methods, and Subsection A.3 then illustrates them via several examples. Table 2 gives a glossary of the notation used in the appendices, while Table 1 gives a glossary of the notation used prior to the appendices.

A.2 Methods for assessing calibration
This subsection provides a detailed mathematical formulation of the cumulative methods, with Subsubsection A.2.1 outlining an approach to large-scale analysis. Subsubsection A.2.2 details the graphical methods. Subsubsection A.2.3 details the scalar summary metrics. Subsubsection A.2.4 treats statistical significance for both the graphical methods and the summary statistics.

A.2.1 High-level strategy for assessing calibration
This subsubsection proposes a process for large-scale data analysis.
As in Subsection 2.1, we can take a two-step approach when there are many data sets to analyze: 1. A screening stage assigning a single scalar summary statistic to each data set (where the size of the statistic measures miscalibration).

2.
A detailed drill-down for each data set whose scalar summary statistic is large, drilling down into the miscalibration's variation as a function of the probability of success.
The drill-down relies on graphical display of miscalibration; the scalar statistic for the first stage simply summarizes the overall miscalibration across all success probabilities, as either the maximum absolute deviation of the graph from the ideal or the size of the range of deviations. Thus, for each data set, both stages are based on a graph; the first stage collapses the graph into a single scalar summary statistic. The following subsubsection constructs the graph.

A.2.2 Graphical method for assessing calibration
This subsubsection details the construction of cumulative plots for calibration.
The cumulative response is for k = 1, 2, . . . , n. Kuiper's statistic is 0.03306/σ = 111.7, Kolmogorov's and Smirnov's is 0.03306/σ = 111.7. The reliability diagram with 100 bins, each with roughly the same number of observations, looks best among the reliability diagrams; the reliability diagrams with 1,000 bins each display unreal stochastic variations. Only the cumulative plot conveniently reveals that a third of all observations (specifically, those with probabilities of at least 0.97) are well-calibrated. The scalar summary statistics report profoundly statistically significant miscalibration, courtesy of the large number of observations (the actual effect size is more modest, as seen by the values without dividing by σ).
A plot of F k −F k as a function of k displays miscalibration directly as slopes that deviate significantly from 0; indeed, the increment in the expected difference thus, on a plot with the values for k spaced 1/n apart, the slope from j = k − 1 to j = k is for k = 1, 2, . . . , n. The miscalibration for success probabilities near S k is substantial when ∆ k is significantly nonzero, that is, when the slope of the plot of F k −F k deviates significantly from horizontal over a significantly long range. To emphasize: miscalibration over a contiguous range of S k is the slope of the secant line for the plot of F k −F k as a function of k n over that range, aside from expected stochastic fluctuations. The following subsubsection reviews two metrics that summarize how much the graph deviates from 0 (needless to say, if the slopes of the secant lines are all nearly 0, then the whole graph cannot deviate much from 0). Considering these metrics, Subsubsection A. Two standard metrics for miscalibration over the full range of success probabilities that account for expected random fluctuations are that due to Kolmogorov and Smirnov, the maximum absolute deviation and that due to Kuiper, the size of the range of the deviations where F 0 = 0 =F 0 ; Remark 1 of Subsection 2.3 explains the reason for including F 0 andF 0 . Under the null hypothesis (41), the distributions of G and D are known, and can form the basis for tests of statistical significance, as described, for example, by Section 14.3.4 of [3]. The distributions are easy to calculate under the null hypothesis (41), as then the sequence F 1 , F 2 , . . . , F n defined in (44) is a random walk with fully specified transition probabilities S 1 , S 2 , . . . , S n . For assessing statistical significance (rather than overall effect size), G and D should be divided by σ, where σ is 1/n times the standard deviation of the sum of independent Bernoulli variates whose probabilities of success are S 1 , S 2 , . . . , S n , that is, the following remark explains why.

Remark 7.
Under the null hypothesis (41) that assumes that the outcomes R 1 , R 2 , . . . , R n are drawn independently from Bernoulli distributions whose probabilities of success are S 1 , S 2 , . . . , S n , the expected value of G/σ is less than or equal to the expected value of the maximum (over a subset of the unit interval [0, 1]) of the absolute value of the standard Brownian motion over the unit interval [0, 1], in the limit n → ∞. As discussed by [5] (with x = 0 and D = 1 in Formula 46 from the associated arXiv publication 6 . . . or see immediately before Remark I on the relevant StackExchange thread 7 ), the expected value of the maximum of the absolute value of the standard Brownian motion over [0, 1] is π/2 ≈ 1.25. The discussion by [5] immediately following Formula 44 from the associated arXiv publication 8 shows that the probability distribution of the maximum of the absolute value of the standard Brownian motion over [0, 1] is sub-Gaussian, decaying past its mean π/2 ≈ 1.25. Values of G/σ much greater than 1.25 imply serious miscalibration, while values of G/σ close to 0 imply that G did not detect any statistically significant miscalibration. Needless to say, similar remarks pertain to D.

A.2.4 Significance of stochastic fluctuations for assessing calibration
This subsubsection discusses statistical significance both for the graphical methods of Subsubsection A.2.2 and for the summary statistics of Subsubsection A.2.3. The plot of F k −F k as a function of k/n automatically includes some "error bars" courtesy of the discrepancy F k −F k fluctuating randomly as the index k increments. Of course, the standard deviation of a Bernoulli variate whose expected value is S j is S j (1 − S j ) -smaller both for S j near 0 and for S j near 1. To indicate the size of the fluctuations, the plots should include a triangle centered at the origin whose height above the origin is 2/n times the standard deviation of the sum of independent Bernoulli variates with success probabilities S 1 , S 2 , . . . , S n ; thus, the height of the triangle above the origin (where the triangle itself is centered at the origin) is 2 n j=1 S j (1 − S j )/n. The expected deviation from 0 of |F k −F k | (at any specified value for k) is no greater than this height, under the assumption that the responses R 1 , R 2 , . . . , R n are draws from independent Bernoulli distributions with the correct success probabilities S 1 , S 2 , . . . , S n , that is, under the null hypothesis (41). The triangle is similar to the classic confidence bands around an empirical cumulative distribution function given by Kolmogorov and Smirnov, as reviewed by [4]. To generate the figures in Subsubsection A.3.1, we specify values for S 1 , S 2 , . . . , S n and for P 1 , P 2 , . . . , P n differing from S 1 , S 2 , . . . , S n , then independently draw R 1 , R 2 , . . . , R n from the Bernoulli distributions with parameters P 1 , P 2 , . . . , P n , respectively. Ideally the plots would show how and where P 1 , P 2 , . . . , P n differs from S 1 , S 2 , . . . , S n . Appendix B considers the case in which P k = S k for all k = 1, 2, . . . , n.

A.3 Results and discussion for assessing calibration
The top rows of the figures with three rows plot F k −F k from (44) and (45) as a function of k/n, with the rightmost plot displaying its noiseless expected value rather than using the observations R 1 , R 2 , . . . , R n ( Figure 27 omits the rightmost plot since the expected values are unknown in that case). In each of these plots, the upper axis specifies k/n, while the lower axis specifies S k for the corresponding value of k. The lowermost two rows of the figures with three rows plot the pairs (X 1 , Y 1 ), (X 2 , Y 2 ), . . . , (X , Y ) from (42) and (43), with the rightmost plots using an equal number of observations per bin. The dark black lines and points of the left and right plots in the lowermost two rows of Figures 28, 30, and 32 are in fact identical, since S 1 , S 2 , . . . , S n are equispaced for those examples (so equally wide bins contain equal numbers of observations). The figures with only a single diagram plot pairs (X 1 , Y 1 ), (X 2 , Y 2 ), . . . , (X n , Y n ) from (42) and (43), but this time using their noiseless expected values (S 1 , P 1 ), (S 2 , P 2 ), . . . , (S n , P n ) instead of using the random observations R 1 , R 2 , . . . , R n .
Perhaps the simplest, most straightforward method to gauge uncertainty in the binned plots is to vary the number of bins and observe how the plotted values vary. All figures displayed employ this method, with the number of bins increased in the second rows of plots beyond the number of bins in the third rows of plots. The figures also include the "error bars" resulting from one of the bootstrap resampling schemes proposed by [14], obtained by drawing n observations independently and uniformly at random with replacement from (S 1 , R 1 ), (S 2 , R 2 ), . . . , (S n , R n ) and then plotting (in light gray) the corresponding reliability diagram, and repeating for a total of 20 times (thus displaying 20 gray lines per plot). The chance that all 20 lines are unrepresentative of the expected statistical variations would be roughly 1/20 = 5%, so plotting these 20 lines corresponds to approximately 95% confidence. An alternative is to display the bin frequencies as suggested, for example, by [6]. Other possibilities often involve kernel density estimation, as suggested, for example, by [7] and [8]. All such methods require selecting widths for the bins or kernel smoothing; avoiding having to make what is a necessarily somewhat arbitrary choice is possible by varying the widths, as done in the plots of the present paper. Chapter 8 of [8] comprehensively reviews the extant literature.
We may set the widths of the bins such that either {1} the average of S k for k in each bin is approximately equidistant from the average of S k for k in each neighboring bin or {2} the range of k for every bin has the same width. Both options are natural; the first is the canonical choice, whereas the second ensures that error bars would be similarly sized for every bin. The figures display both possibilities, with the first on the left and the second on the right. Setting the number of bins together with either of these choices fully specifies the bins. As discussed earlier, we vary the number of bins since there is no perfect setting -using fewer bins offers estimates with higher confidence yet limits the resolution for detecting miscalibration and for assessing the dependence of calibration as a function of S k .

A.3.1 Synthetic examples for assessing calibration
In this subsubsection, the examples are sampled randomly from various statistical models so that the underlying "ground-truth" is known explicitly. In all cases with n = 10,000, that is, in Figures 28, 34, and 40, the scalar summary statistics detect extremely statistically significant miscalibration. Moreover, in all cases with n = 1,000, that is, in Figures 30, 36, and 42, the scalar summary statistics detect highly statistically significant miscalibration. In all cases with n = 100, that is, in Figures 32, 38, and 44, the scalar summary statistics detect some statistically significant miscalibration, though not with nearly as much confidence as when n = 1,000 or (even more starkly) as when n = 10,000. Naturally, the Kolmogorov-Smirnov and Kuiper statistics get larger for miscalibration that is solely overcalibration (and the same would be true of solely undercalibrated data), such as that in Figures 34-39.

A.3.2 ImageNet examples for assessing calibration
This subsubsection analyzes the standard training data set "ImageNet-1000" of [10]. Each of the thousand labeled classes in the data set consists of about 1,300 images of a particular noun, so that the total number of members of the data set is n = 1,281,167. To generate the corresponding plots, displayed in Figure 27, we calculate the scores S 1 , S 2 , . . . , S n using the pretrained ResNet18 classifier of [11] from the computer-vision module, "torchvision," in the PyTorch software library of [12]; the score for an image is the probability assigned by the classifier to the class predicted to be most likely, with the scores randomly perturbed by about one part in 10 8 to guarantee their uniqueness. For j = 1, 2, . . . , n, the result R j corresponding to a score S j is R j = 1 when the class predicted to be most likely is the correct class, and R j = 0 otherwise. The cumulative plot works nicely, as discussed in the caption of Figure 27.  Figure 29 displays the ground-truth reliability diagram. All plots, whether cumulative or conventional, appear to work well.    Figure 31 displays the ground-truth reliability diagram. All plots, whether cumulative or conventional, appear to work well.   Figure 33 displays the ground-truth reliability diagram. The conventional plots become increasingly problematic as n reduces to 100 from 10,000 and 1,000 in Figures 28 and 30, whereas the cumulative plot still detects roughly the right level of miscalibration for 0 S k 0.2 and 0.8 S k 1; the cumulative plot indicates that too little data is available for 0.2 S k 0.8 to detect any statistically significant miscalibration in that range of S k (note the size of the triangle centered at the origin of the plot).   Figure 35 displays the ground-truth reliability diagram. All plots, whether cumulative or conventional, work well enough, though the reliability diagrams might be mistakenly misleading relative to the exact expectations, at least without diligent attention to the significant variation with the number of bins.    Figure 37 displays the ground-truth reliability diagram. As in Figure 34, all plots work sufficiently well, though the observed reliability diagrams may be misleading relative to the exact expectations, at least without careful attention to the variation with the number of bins.   Figure 39 displays the ground-truth reliability diagram. The plots, whether cumulative or conventional, reveal similar information here, though the reliability diagrams with an equal number of observations per bin provide more reliable estimates than the other reliability diagrams. The cumulative plot is perhaps the easiest to interpret: the miscalibration is significant for 0.1 S k 0.23 and 0.27 S k 0.6, with about the correct amount of miscalibration (the amount is correct since the secant lines have the expected slopes).   Figure 41 displays the ground-truth reliability diagram. The cumulative plot captures more of the oscillations in the miscalibration, as do to some extent the reliability diagrams with an equal number of observations per bin; however, the variations in the reliability diagrams could be difficult to interpret without access to the ground-truth exact expectations.    Figure 43 displays the ground-truth reliability diagram. As in Figure 40, the cumulative plot resolves more of the oscillations in the miscalibration, as do somewhat the reliability diagrams with an equal number of observations per bin; that said, the variations in the reliability diagrams could be hard to interpret without access to the exact expectations.

B Cautions
In addition to noting the size of the triangle at the origin, interpreting plots of cumulative differences requires careful attention to one caveat: avoid hallucination of minor deviations! The sample paths of random walks and Brownian motion can look surprisingly non-random (drifting?) quite often for short stints. The most trustworthy detections of deviation are long ranges (as a function of k/n in the unweighted case or of A k in the weighted case) of steep slopes for F k −F k . The triangles centered at the origins of the plots give a sense of the length scale for variations that are statistically significant.
For all plots, whether cumulative or classical, bear in mind that even at 95% confidence, one in twenty detections is likely to be false. So, if there are a hundred bins, each with a 95% confidence interval, the reality is likely to violate around 5 of those confidence intervals. Beware when conducting multiple tests of significance (or be sure to adjust the confidence level accordingly).
Figures 46-51 illustrate these cautions; these figures are analogous to those presented in Appendix A, but with the observations drawn from the same predicted probabilities of success used to generate the graphs, so that the discrepancy from perfect calibration should be statistically insignificant. More precisely, Figures 46-51 all set S k to be proportional to (k − 0.5) 2 and draw R 1 , R 2 , . . . , R n from independent Bernoulli distributions with expected success probabilities S 1 , S 2 , . . . , S n , respectively; this corresponds to setting P k = S k for all k = 1, 2, . . . , n, in the numerical experiments of Appendix A. Figures 46, 48, and 50 consider n = 10,000, n = 1,000, and n = 100, respectively. Please note that the ranges of the vertical axes for the top rows of plots are tiny. The leftmost topmost plots in Figures 46, 48, and 50 look like driftless random walks; in fact, they really are driftless random walks. The variations of the graphs are comparable to the heights of the triangles centered at the origins. Comparing the second rows with the third rows shows that the deviations from perfect calibration are consistent with expected random fluctuations. Indeed, all plots in this appendix depict only small deviations from perfect calibration, as expected (and as desired). None of the scalar summary statistics significantly exceeds its expected value under the null hypothesis of (41).