Large scale analysis of violent death count in daily newspapers to quantify bias and censorship

Casolino, Marco

doi:10.1186/s40537-020-00338-1

Research
Open access
Published: 11 August 2020

Marco Casolino ORCID: orcid.org/0000-0001-6067-5104^1,2,3

Journal of Big Data volume 7, Article number: 60 (2020)

Abstract

In this work we develop a series of techniques and tools to determine and quantify the presence of bias and censorship in newspapers. These algorithms are tested by analyzing the occurrence of the keywords ‘killed’ and ‘suicide’ (‘morti’, ‘suicidio’ in Italian) and their changes over time, gender and reported location on the complete online archives (42 million records) of the major US newspaper (The New York Times) and the three major Italian ones (Il Corriere della Sera, La Repubblica, La Stampa). Using these tools, since the Italian language distinguishes between the female and male cases, we find the presence of gender bias in all Italian newspapers, with reported single female deaths being about one-third of those involving single men. Analyzing the historical trends, we show evidence of censorship in Italian newspapers both during World War 1 and during the Italian Fascist regime. Censorship in all countries during the World Wars and in Italy during the Fascist period is a historically ascertained fact, but so far there has been no estimate of the amount of censorship in newspaper reporting: in this work we estimate that about 75% of domestic deaths and suicides were not reported. This is also confirmed by statistical analysis of the distribution of the least significant digit of the number of reported deaths. We also find that the distribution function of the number of articles vs. the number of deaths reported in articles follows a power law, which is broken (with fewer articles being written) when reporting on few deaths occurring in foreign countries. The lack of articles is found to grow with geographical distance from the nation where the newspaper is being printed.
Whereas the assessment of the truth of a single article or the debunking of what are now called ‘fake news’ requires specific fact-checking and becomes more difficult as time goes by, these methods can be used in historical analysis and to evaluate quantitatively the amount of bias and censorship present in other printed or online publications, and can thus contribute to a quantitative assessment of the freedom of the press in a given country. Furthermore, they can be applied in wider contexts such as the evaluation of bias toward specific ethnic groups or specific accidents.

Introduction

Violent death is a dramatic event that can be objectively quantified in terms of the number of lives lost. Words such as ‘killed’, ‘dead’, ‘casualty’, ‘suicide’, etc. can thus be used to assess the presence of bias or censorship in reporting incidents involving specific locations, groups of people, or time.

Coverage of death in the mass media has been studied for the insight it offers on how societies perceive this dramatic event [1]; at the same time, selective death reporting can be used to shape and distort public opinion. A 6-month study [2] of 1975 issues of the Register Guard and the Standard Times showed that newspapers tend to over-report more newsworthy causes of death and overlook others that are considered less interesting by the public. The work considered the number of occurrences, the number of deaths reported, and the text surface area of the articles dealing with death and compared them with the statistical occurrences, finding that all forms of disease were under-reported whereas violent or catastrophic events were over-represented. An even more biased selection along these lines was found when analyzing the news selection by television and radio [3]. Studies on the presence of geographical bias in reporting accidental deaths have been conducted manually on small data sets: a 51-issue study (two months) of two German (Frankfurter Allgemeine Zeitung and Süddeutsche Zeitung) and two Australian newspapers (The Australian and The Sydney Morning Herald) [4] has shown a weak relationship with social proximity in the number of articles reporting deaths. Similarly, a 6-month study of the German tabloid Die Abendzeitung [5] reported evidence of a correlation between the number of accidental deaths and the length of the article as well as a negative correlation with the distance between the town where the newspaper was printed and the event reported. In this case, the analysis involved the study of the amount of printed space occupied by an article reporting a fatal event.
The authors did not find any significant correlation between the area taken by the article and the number of deaths reported [6] but showed an inverse relationship (linear in logarithmic scale) between the area of printed text normalized to the number of victims and the distance (in km) of the event from Munich, the town where Die Abendzeitung was printed. They also found a minimum threshold of victims, increasing with distance, needed to elicit a report in the newspaper. They thus likened the response of the paper to the logarithmic relationship between stimulus strength and response intensity in human perception [7].

In contrast, [8] analyzed the occurrence of pictures in Time and Newsweek magazines. The study involved ten issues per year over three years for each magazine, for a total of 60 issues. The work involved measuring the area occupied by each picture, determining its country of origin, and classifying its violent or non-violent nature. The study found that, although domestic pictures were more common than foreign ones (Time: $66.4\%$; Newsweek: $71.2\%$), images depicting death were printed more often when the event took place abroad (Time$_{foreign}$: $31\%$; Newsweek$_{foreign}$: $25\%$) than when it occurred in the USA (Time$_{domestic}$: $12\%$; Newsweek$_{domestic}$: $13\%$).

A more recent study [9] showed that direct pictographic representations of death are very rare in western newspapers, amounting only to $4.5\%$ of the articles (997 stories in a two-month time frame, 357 with pictures).

Most of these studies have been performed by manual scanning of the newspapers or magazines, therefore restricting the scope of the possible analysis. With the digitization of printed literature, it is possible to perform systematic analysis on the entire archives of newspapers, magazines, and books. A large-scale analysis of the number of occurrences of notable personalities in books has shown evidence of censorship of various individuals during the Nazi regime [10]. The books have been digitally converted via OCR (Optical Character Recognition) [11] as part of the Google effort to digitize books and the occurrence of individual names has been compared with the lexicon of the various languages [12].

In this work, we have developed several algorithms that allow a systematic study of the occurrence of articles involving fatal accidents and of their gender/location. This allows us to assess the distribution laws and determine the presence of bias and censorship in reporting. We have used daily newspapers since they have the advantage of addressing wider audiences with higher frequency and a longer temporal consistency, often covering events usually ignored by books.

We have applied these algorithms to the full historical archives of the three major Italian newspapers: Il Corriere della Sera (Milan, CDS, 1876–2017), La Repubblica (Rome, REP, 1985–2017), and La Stampa (Turin, STA, 1867–2005) and the major US newspaper: The New York Times (NYT, 1955–2017), searching for the articles containing selected keywords: ‘killed’ (‘morti’ in the Italian language, the mixed-gender plural term, as well as the singular female and male gender, ‘morta’ and ‘morto’ respectively), ‘suicide’ (‘suicidio’). These words have been chosen for their high rate of occurrence (3.1 million articles out of 42 million comprising the newspaper archives) with respect to similar lemmas (dead, casualties...) and for their correspondence of usage, since both are used to indicate violent deaths. Most importantly, in the case of the plural form they contain the number of people involved. Therefore, in addition to a simple word count, it is also possible to study the distribution of the number of people killed, the location, and the category of victims.

The rest of the paper is organized as follows: the “Methods” section describes the archives, the retrieval and data processing steps, and the various algorithms. The “Results and discussion” section presents the results on the time behaviour, gender, censorship, the scaling law, and the last digit analysis. Perspectives and future work are discussed in the “Conclusions” section.

Methods

Figure 1 shows the steps taken to access, parse, and identify the articles, from the access of the online archives to the creation of the various data sets, one for each newspaper and lemma considered. All these tasks are accomplished using Python scripts. The access to the online archive of each newspaper takes a few days to complete over the full time span, since this is mediated by a web-based interface that usually returns ten articles at a time. The storage and handling of the data sets are performed using the Root [13] framework and require a few minutes to complete. This C++ based environment was developed at CERN to deal efficiently with high volumes ($\simeq$ Pbyte) of data produced by accelerator-, space-, and ground/underground-based detectors. As such, it is especially suited to create the ntuples (TTrees in Root nomenclature) containing the data extracted from the newspaper articles. All the selections, histogramming, and fitting have also been performed in this environment.

A scheme of the various types of algorithms used to analyze the data sets and the main information they provide is shown in Fig. 2.

Newspaper archives

Only printed editions have been considered; online articles have been excluded for consistency with pre-internet years. The newspapers have freely accessible online archives that cover a major part of their printing time. Out of the $T\simeq 42$ million articles comprising the four newspaper online archives, all the articles matching the considered lemmas (‘killed’, ‘morti’, ‘morta’, ‘morto’, ‘suicide’, ‘suicidio’; $\simeq 3.2$ million) have been extracted. See Table 1 for details on the data size of each newspaper. Details of the newspapers and archives are as follows:

1. The New York Times (NYT) was founded in September 1851, with the online archive (https://www.nytimes.com/search/) starting in January 1952. Due to strikes, it was not printed during the following periods: December 9, 1962–March 31, 1963 (a western edition is present in the archive); September 17, 1965–October 10, 1965 (an international edition was printed and is present in the archives); August 10, 1978–November 5, 1978 (no editions present in the archives). In 2017, print circulation of the newspaper was 571,500 copies.^{Footnote 1} The newspaper began its publications with about 20,000 articles/year, growing gradually to reach 130,000 articles in 2016.
2. Il Corriere della Sera (CDS). The first issue dates back to March 5th 1876 and is available in the online archive (http://archivio.corriere.it/). Originally an evening newspaper, it became a morning paper in 1888, was issued twice a day from 1892 and up to two to three times a day in the first part of the twentieth century, but has been printed as a daily newspaper for several decades. Its offices were bombed on 14/2/1943. Following the liberation of Italy from Nazi occupation, the Committee of National Liberation (Comitato di Liberazione Nazionale) suspended its publications between 27/4 and 21/5/1945. It resumed publications under the name ‘Corriere d’Informazione’ and, from 1946, as Il Nuovo Corriere della Sera, with a one-page edition [14, 15]. CDS grows from $\simeq$7,000 articles/year in its first years to $\simeq$125,000 articles/year in 2017. The FAQ of the archive reports that it contains about 2.5 million scanned pages. In December 2017 it printed 310,275 copies.^{Footnote 2}
3. La Repubblica (REP) began publications in January 1976, but the online archive (https://ricerca.repubblica.it/) starts on January 4th, 1984, with the first entry of “morti” occurring on March 4th, 1984. In December 2017 it printed 274,745 copies. REP grows from 24,000 articles/year in 1985 to 377,000 articles in 2017.
4. La Stampa (STA). The digitization of this archive (http://www.archiviolastampa.it/) was performed by the Committee for the Journalistic Information Library (Comitato per la Biblioteca dell’Informazione Giornalistica, CBDIG). The archive is released under a Creative Commons license and covers the period from the first issue, February 9th, 1867, when it was called Gazzetta Piemontese (Piedmont Gazette), to December 31st, 2005. Except for a few entries in 1882, all articles up to and including 1909 are referenced in image form with the title ‘Notizia’ (‘News’) and no further information on the contents of the article. Several entries after 1909 are stored in this way, thus reducing the overall time range of the dataset. The Fascist government halted publications of La Stampa in the month of October 1925 as a warning to all publishers. A few days after publications resumed (November 3rd 1925), Alfredo Frassati, Director of the newspaper for 25 years, resigned, to be replaced by directors gradually more aligned with the government. Following the liberation of Italy from Nazi occupation, the Committee of National Liberation (Comitato di Liberazione Nazionale) suspended its publications between 28 April and 17 July 1945 (no entries in the database) [16, 17]. In December 2017 it printed 208,615 copies. STA increases from $\simeq$10,000 articles/year to a maximum of 500,000 articles/year in 2001, decreasing to $\simeq$200,000 in 2005, the last year present in the archive.
Table 1 Size of the online newspaper archives consulted and of the resulting datasets according to the various selection criteria

Lemmas considered

This work concerns the study of the occurrence of the following terms:

‘Killed’ (in English) and ‘morti’ (plural for Italian). Note that the literal translation of killed in the Italian language is uccisi, but this word is less frequent than morti and usually (but not always) employed in association with murder. Morti is more often associated with violent accidental death and thus has a wider correspondence in usage with killed. Articles with the keywords casualties/vittime occur with a lower frequency than killed/morti (10% to 30% less). The English term is more strongly associated with the dead or wounded during armed conflicts and has peaks during major US wars (Civil War, WW1, WW2, Korea, Vietnam, etc.). In the case of the Italian newspapers, the correlation is lower, since vittime is more often used for people killed in accidents, natural disasters, and so on.
‘Suicide’ (in English) and ‘suicidio’ (in Italian).
‘morta’ and ‘morto’ (in Italian), respectively feminine and masculine singular form of ‘morti’.

Some methods discussed below can be applied to any lemma, whereas the distribution law and last digit analysis apply only to ‘quantitative’ keywords. Their application to other keywords of this kind, such as ‘dead’, ‘victims’, and the already mentioned ‘casualties’, will be the subject of future work.

Construction of the data sets

1. Query and pre-processing: Each data set has been constructed by querying the web servers of the four newspapers with Python scripts that emulated manual user input (Fig. 1). The input had to be configured for each archive in order to enter the desired keyword and iterate the requests over the time interval of the newspaper archive. The html pages received in reply to the query were then saved locally. Their overall number depends on the number of articles present in each page: it ranges from $\simeq 58,000$ for NYT to $\simeq 300$ for CDS. As mentioned, this data acquisition/preprocessing phase over the newspaper archives takes 5 to 7 days to complete, depending on the newspaper considered.^{Footnote 3}

We could not access the full articles since they are often present only in image form (especially in the years 1850–1910) and the necessary work and resources would have been almost equivalent to those required for the digitization of the newspaper itself. Furthermore, the work would only have resulted in a marginally higher efficiency in event detection and would not have changed the results.

We define an ‘event’ as a newspaper article retrieved upon the query of the keyword killed/morti. In our data sets, each event contains the newspaper name, the date, and the page (not present in NYT) of the article, the text of the title, and—if present—a part of the text of the article.
2. Parse: The html pages are then parsed with Python regular expressions that extract the relevant article information (title, snippet, page number, date, etc.). This phase takes a few minutes to complete.
3. Filter: Sometimes the query only returns a link with no usable text: for instance, a title can be empty (especially for issues of the nineteenth century), incomplete, or missing the requested word. These events have been discarded (in the case of STA, this restricts the database to the period 1910–2005 for most purposes).
4. Text analysis: The events are then analyzed to find the number k of people killed. This is searched in numeric, text, and hybrid form. To reduce classification errors, the value of k is searched close to the keyword in the forms: ‘k killed’, ‘killed k’, ‘k attribute killed’, ‘killed attribute k’. Title and text are also parsed to search for the type of event (e.g. car, airplane crash, war, illness) and the people involved (e.g. children, women, ethnicity); these will be the subject of a following paper.
5. Geographical analysis: Several databases of world places have been used to associate the location of the event with its country. Cases of homonymy (Florence, Paris, Cairo...) have been resolved by assigning the location of the most famous one at that time. For instance, ‘Cairo’ entries in the years 1861–1865 have been assigned to the US since the location appeared often during the American Civil War.
6. Exclusions: Duplicate entries, that is, articles identical in title, text, and date, are then removed, but different reports on the same event are considered as separate, since this means that the event has been considered worthy of more than one article. See Table 1 for the size of the datasets according to the various selections.

All events reporting deaths of animals (fish, herd, cows, etc.), usually associated with high k, have been discarded.

From foreign events we also excluded all articles where the words ‘Italian’ or ‘US’/‘American’ appear associated with foreign countries, thus removing foreign events where citizens from the corresponding newspaper-printing country were involved. This amounted to less than 1% in peacetime.

Once the parsing of the archives is completed, the remaining processing steps in this phase take a few minutes to execute and produce the database/root files for the subsequent analysis. This represents an increase in speed of several orders of magnitude with respect to any traditional, manual-scanning method, which was necessarily constrained to a limited amount of time and number of newspaper issues.
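As an illustration of the text-analysis step described above, the following minimal Python sketch searches for k next to the keyword in the four forms ‘k killed’, ‘killed k’, ‘k attribute killed’, and ‘killed attribute k’. It is not the code used in this work: the names `NUMBER_WORDS`, `PATTERNS`, and `extract_k` are hypothetical, the word list is deliberately small, hybrid forms (such as ‘3 thousand’) are not handled, and a year preceding the keyword would be misread as k.

```python
import re

# Hypothetical, deliberately small vocabulary of number words (an assumption of this sketch).
NUMBER_WORDS = {
    "two": 2, "three": 3, "four": 4, "five": 5, "six": 6, "seven": 7,
    "eight": 8, "nine": 9, "ten": 10, "eleven": 11, "twelve": 12,
    "twenty": 20, "thirty": 30, "fifty": 50, "hundred": 100,
}

# A number: digits with thousands separators, plain digits, or a known number word.
NUM = r"(?P<num>\d{1,3}(?:,\d{3})+|\d+|" + "|".join(NUMBER_WORDS) + r")"
# At most one attribute word between the number and the keyword.
ATTR = r"(?:\s+[A-Za-z-]+)?"

PATTERNS = [
    # 'k killed' and 'k attribute killed'
    re.compile(r"\b" + NUM + ATTR + r"\s+killed\b", re.IGNORECASE),
    # 'killed k' and 'killed attribute k'
    re.compile(r"\bkilled" + ATTR + r"\s+" + NUM + r"\b", re.IGNORECASE),
]


def extract_k(text):
    """Return the number found next to 'killed', or None if there is none."""
    for pattern in PATTERNS:
        match = pattern.search(text)
        if match:
            token = match.group("num").lower()
            if token in NUMBER_WORDS:
                return NUMBER_WORDS[token]
            return int(token.replace(",", ""))
    return None


for title in ("12 people killed in train crash",
              "Earthquake killed nearly 1,200",
              "Mayor killed in accident"):
    print(title, "->", extract_k(title))
```

Restricting the search to the immediate neighbourhood of the keyword trades efficiency for purity: titles where the number is far from ‘killed’ are left without a value of k rather than being assigned a wrong one.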

Errors

Statistical errors are due to fluctuations in the number of events present in a given bin of a given selection. Fitting algorithms take into account the errors associated with each bin to calculate the errors of the fitted parameters.

Sources of systematic errors can be due to OCR (optical character recognition) misidentification. This is more frequent for old issues where the quality of the scanned pages is lower and can result in a lower efficiency for the first years of the newspapers.

This can be assumed to be independent of the number, type, or location of people killed k so that the temporal behaviour and distribution laws should not be affected. See Additional file 1 for a discussion on systematic errors.

Results and discussion

The analysis of the data sets created above can yield information on how the newspapers consider the various events depending on geographic location, historical period, or gender. In this section, we describe the main algorithms employed and the results they provide. The various algorithms, the processing steps, and the key results derived, are also schematized in Fig. 2.

Historical events

Figure 3 shows the yearly total number of articles, T(t). Major historical events can also affect this value: for instance, it increases in NYT during the US Civil War due to more articles being published and decreases in CDS and STA during the two World Wars due to a shortage of materials resulting in fewer pages being published.

In the same figure the number of articles with the keyword ‘killed’, K(t), is also shown. From it, we can derive the normalized fraction of ‘killed’ events with respect to the total: $R=K/T$ (shown in the same figure), a value more affected by historical events.

For the events with a determined location, it is possible to separate domestic ($K_d$) from foreign ($K_f$) occurrences and assess how their relative importance evolved with time (Fig. 3). NYT reporting on foreign deaths grows over time to become more frequent than domestic at the onset of WW1 and permanently from WW2 on. The Italian newspapers divide the reporting between domestic and foreign cases roughly equally, with CDS covering more foreign events in the more recent years and STA and REP the domestic cases. The other relevant features are the foreign peaks during the World Wars and the drop in the domestic deaths between 1923 and 1945 for STA and CDS, due to censorship from the Fascist government (see below).
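The two ratios used throughout this section, $R=K/T$ and the domestic fraction $K_d/(K_d+K_f)$, can be computed per year as in the sketch below. The function name `ratio_with_error` and the yearly counts are hypothetical, and the binomial error formula is an assumption of this sketch (the numerator is treated as a subset of the denominator); it is not necessarily the error treatment used for the figures.

```python
import math


def ratio_with_error(numerator, denominator):
    """Return (ratio, error), treating the numerator as a binomial subset of the denominator."""
    if denominator <= 0:
        raise ValueError("denominator must be positive")
    p = numerator / denominator
    error = math.sqrt(p * (1.0 - p) / denominator)
    return p, error


# Hypothetical counts for one year of one newspaper (not taken from the archives).
T = 100_000   # all articles published in the year
K = 2_500     # articles matching 'killed'
K_d = 900     # 'killed' articles with a domestic location
K_f = 1_100   # 'killed' articles with a foreign location

R, R_err = ratio_with_error(K, T)
R_dom, R_dom_err = ratio_with_error(K_d, K_d + K_f)

print(f"R  = {R:.4f} +/- {R_err:.4f}")
print(f"R' = {R_dom:.3f} +/- {R_dom_err:.3f}")
```

Here the domestic fraction is normalized to the events with a determined location only, which is one possible reading of the total in the denominator.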

Figure 4 shows the relative contributions of the various continents and their evolution over time, with a gradual reduction of the coverage of European events and a growing importance of Asia after WW2. Some major occurrences are:

American Civil War (12/4/1861-13/5/1865)

During the American Civil War, the total number of NYT articles increases by $18.5\%$ with respect to the interpolated values of T between 1861 and 1865. K increases to a maximum of $K_{1863}/K_{1860}=2.80$ times the pre-conflict value (the keyword ‘casualties’ has an increase of $C_{1863}/C_{1860}=4.23$). Conversely, the number of suicides drops to a minimum of $S_{1864}/S_{1860}= 0.36$.

World War 1 (28/7/1914-11/11/1918, Italy from 23/5/1915, US from 6/4/1917)

Censorship was very strong in all countries involved in both World Wars, with the removal of all information that could be beneficial to the enemy: letters, reporting of battles and defeats, casualties, etc. [18].

In NYT the value of $R=K/T$ increases from $R_{1913}=0.361 \pm 0.001$ to $R_{1914}=0.457\pm 0.001$. The domestic event ratio $R'=K_{dom}/K_{total}$ drops from $R_{1913}^{'\, NYT}=0.56\pm 0.02$ to $R_{1914}^{'\, NYT}=0.39\pm 0.01$, reaching a minimum of $R_{1917}^{'\, NYT}=0.32\pm 0.01$, when the US declared war on Germany.

In Italy, the shortage of resources resulted in a reduction of the number of pages of CDS and STA from 8 (two double sheets) to 4 (one double sheet of paper). Consequently, T decreased to $T^{CDS}_{1918}/T^{CDS}_{1915}=0.48$ and $T^{STA}_{1918}/T^{STA}_{1915}=0.67$. The effect of censorship is evident in STA: its value of R drops from $R_{1915}=0.32\pm 0.02$ to a minimum of $R_{1917}=0.13\pm 0.01$ (for CDS it is more constant). In both newspapers, there is an even larger drop in the fraction of domestic events (not necessarily due only to censorship but also to a lack of interest): for STA, from $R_{1914}^{'\, STA}=K_{dom}/K_{total}= 0.43\pm 0.07$ to $R_{1917}^{'\, STA}=0.26\pm 0.09$; for CDS, $R'$ drops from $R_{1914}^{'\, CDS}=0.51 \pm 0.06$ to $R_{1917}^{'\, CDS}=0.29 \pm 0.04$.

During wars the rate of suicides is known to decrease: this phenomenon is usually explained by the higher sense of purpose during the war effort [19,20,21] and is found to occur both when one’s own country and when other countries are at war. However, the decrease of suicide reporting by newspapers is more prompt and intense (dropping to 1/3 of the pre-war value) than that recorded by statistics. Since in both countries this occurred in 1914, before either country was at war, we can conclude that this was not directly related to censorship. In Italy, the number of articles on suicides passes from $S^{STA}_{1913}/T^{STA}_{1913}=0.043\pm 0.002$ to $S^{STA}_{1914}/T^{STA}_{1914}=0.032\pm 0.002$ and reaches a minimum of $S^{STA}_{1917}/T^{STA}_{1917}=0.012\pm 0.001$ before returning to $S^{STA}_{1919}/T^{STA}_{1919}=0.025 \pm 0.002$. A similar behaviour is found for CDS (and NYT): from $S^{CDS}_{1913}=1.34\pm 0.06$ ($S^{NYT}_{1913}=1.18\pm 0.04$) to $S^{CDS}_{1914}=0.73\pm 0.04$ ($S^{NYT}_{1914}=0.92\pm 0.04$) at the beginning of WW1, down to a minimum of $S^{CDS}_{1916}=0.45\pm 0.03$ ($S^{NYT}_{1916}=0.39 \pm 0.02$). The drop before the US or Italy entered the war suggests that the lack of suicide reporting should be attributed to a reduced interest by the editorial rooms rather than to censorship.

Fascist government in Italy (31/10/1922 - 25/7/1943)

The case of the Fascist government in Italy is different. The Italian government of the time exercised strong censorship on the printed press and radio. On 14/7/1924 a circolare (note) from the then Minister of the Interior Federzoni allowed the sequestering of copies of newspapers to ‘prevent stirring up public opinion’. On 31/12/1924 all newspapers were sequestered and the directors replaced with ones affiliated with the regime. In October 1926, several daily newspapers were closed until the end of WW2, among them L’Unità, L’Avanti! and L’Ora [14, 15].

Government censorship aimed to present an efficient state and thus had to remove all negative news. Censorship involved all media of the time: radio, theater, movies, books, and newspapers. Authors, especially those of Jewish origin but also those who were against the regime for political reasons, fell into disfavour. This ‘targeted’ censorship was similar to what occurred in Germany and was reported in [10], where prominent individuals were mentioned to a greater or lesser extent according to their race or standing with respect to the Nazi government.

On a wider scale, government guidelines [16, 17] to newspapers required that crime reporting had to be compressed in a few lines, and suicides had to be ignored, with the result that articles involving domestic deaths and accidents almost disappeared from newspapers. With the datasets of CDS and STA it is possible to quantify the overall effect of Fascist censorship in the reporting of violent deaths [22, 23].

As a result, even though the value of $R=K/T$ remains more or less constant, in 1923 $R'$ drops from $R^{'\, CDS}_{1922}=0.56 \pm 0.03$ ($R^{'\, STA}_{1922}= 0.64\pm 0.05$) to $R^{'\, CDS}_{1923}=0.44\pm 0.02$ ($R^{'\, STA}_{1923}=0.46\pm 0.05$), reaching a minimum of $R^{'\, CDS}_{1936}=0.119\pm 0.005$ ($R^{'\, STA}_{1937}=0.31\pm 0.01$) (Fig. 3). These values come back to pre-dictatorship values of $R^{'\, CDS}_{1946}=0.60 \pm 0.03$ ($R^{'\, STA}_{1946}=0.68\pm 0.06$) immediately after the war, when both newspapers had their publications halted and their directors were replaced between April and May (July for STA) 1945.

We estimated the amount of censorship for articles with morti interpolating the value of $R'$ between 1922 and 1946: between 1923 and 1943 there were 2,800$\pm 200$ domestic articles with at least two casualties missing for CDS and 2,900$\pm 300$ for STA. In both cases, they amount to $75\%$ of the articles featuring domestic deaths printed in the same period.
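One way to turn the interpolation of $R'$ into a count of missing articles is sketched below. It rests on assumptions that are not stated in the text: $R'$ is interpolated linearly between the two baseline years, foreign reporting $K_f$ is taken as unaffected by censorship, and the expected domestic count is the one that would restore the interpolated $R'$. The function names and the yearly counts are hypothetical; only the two baseline values of $R'$ for CDS are taken from the text.

```python
def interpolate(x, x0, y0, x1, y1):
    """Linear interpolation between the points (x0, y0) and (x1, y1)."""
    return y0 + (y1 - y0) * (x - x0) / (x1 - x0)


def missing_domestic(counts, year0, r0, year1, r1):
    """Estimate the number of missing domestic articles.

    counts maps year -> (observed domestic articles, observed foreign articles).
    r0 and r1 are the domestic fractions R' in the baseline years year0 and year1.
    """
    missing = 0.0
    for year, (k_dom, k_for) in counts.items():
        r_expected = interpolate(year, year0, r0, year1, r1)
        # Domestic count that gives R' = r_expected for the observed foreign count.
        k_dom_expected = k_for * r_expected / (1.0 - r_expected)
        missing += k_dom_expected - k_dom
    return missing


# Hypothetical yearly counts for two censored years (not taken from the archives).
counts = {1930: (40, 160), 1936: (25, 185)}

# Baseline values of R' quoted in the text for CDS in 1922 and 1946.
estimate = missing_domestic(counts, 1922, 0.56, 1946, 0.60)
print(f"estimated missing domestic articles: {estimate:.0f}")
```

Dividing the result by the number of domestic articles actually printed in the same years gives a ratio comparable to the percentage quoted above.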

A similar analysis on the $k=1$ events (morto, morta, see Fig. 5) yields about $22,000\pm 1300$ articles censored by STA ($60\%$ of those printed in the period considered) and $29,000\pm 500$ by CDS ($20\%$ of those printed in the period considered).

In the case of suicides (another forbidden topic during the regime [15]), Fig. 6 gives $S_{missing}=17,000\pm 600$ ($S_{missing}/S_{present}=2.8$) for CDS and 4,100$\pm 300$ ($S_{missing}/S_{present}=1.1$) for STA, in contrast with the growing rate of suicides at the time [20, 24].

Overall we estimate that CDS censored $41,800\pm 1000$ articles and STA $36,000\pm 1,900$ during this period, with an average of $1990\pm 50/year$ for CDS and $1700\pm 90/year$ for STA.
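As a check of the yearly averages, assuming that the averaging interval is the 21 calendar years from 1923 to 1943 used above for the morti estimate:

$$\begin{aligned} \frac{41{,}800}{21}\simeq 1990\ \text{articles/year}, \qquad \frac{36{,}000}{21}\simeq 1714\ \text{articles/year} \end{aligned}$$

The same division applied to the uncertainties gives $1000/21\simeq 48$ and $1900/21\simeq 90$ articles/year, consistent with the rounded values quoted for CDS and STA.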

World War 2 (1/9/1939–2/9/1945, Italy from 10/6/1940 as part of the Axis and from 8/9/1943 as part of the Allies, USA from 7/12/1941)

Censorship in the US during WW2 relied mostly upon the self-censorship of news outlets: forbidden topics included weather and crop reports, correspondence, travel schedules and naturally troop movements [25]. The Office of Censorship released on January 15, 1942, the Code of Wartime Practices for American Broadcasters and the Code of Wartime Practices for the American Press. The publication of any pictures depicting US soldiers killed in combat was forbidden until September 1943, when the capitulation of Italy might have induced in the general public the idea that the war was to end soon [18].

The total number of articles of NYT remains essentially unchanged, with R increasing only in 1940, from $R_{1938}=2.02\pm 0.04\%$ to $R_{1940}=2.50\pm 0.04\%$. This is due to an increased amount of reporting of casualties prior to the entrance into the conflict. This is confirmed by the value of $R'$, which grows from $R'_{1938}=K_{foreign}/K_{domestic}=0.63\pm 0.01$ to a maximum of $R'_{1940}=K_{foreign}/K_{domestic}=0.76\pm 0.01$, again before Pearl Harbor. Afterward, since US soil was not attacked, the amount of reporting of domestic events did not change, so we can conclude that it was not affected by censorship.

The effect of WW2 in Italy is mostly evident in the sharp drop in T and K in the later years of the war. This was due both to paper shortage and to the bombings of Milan and Turin, where the newspapers were printed. After the armistice with the Allied forces (8/9/1943), Italy was split between the South, controlled by the Allies, and La Repubblica di Salò in the North. After a short period free from Fascist government, CDS and STA were aligned to the Nazi-controlled government of the North, so the definition of ‘domestic’ and ‘foreign’ becomes fuzzier and the ratio $R'$ increases.

Recent decades

If we consider the reporting of deadly accidents in the last decades we see that it is constant or decreasing: in the NYT, after doubling between 1960 and 1969, $R^{NYT}$ remains—with fluctuations—constant up to 2017.

The ratio R for STA decreases gradually after 1954, with $dR/dt_{STA\, 1954-2005}=(-\,0.1\pm 0.02) \% /year$.

CDS exhibits a similar but stronger decrease: $dR/dt_{CDS\, 1954-2005}=(-\,0.3\pm 0.02) \%/year$.

For REP, the number of articles reporting violent death K increases with time but its percentage after 1995 decreases by $dR/dt_{1995-2005}=(-\,0.2 \pm 0.04 )\%/year$. We note that this trend of decreasing coverage given to violent events in all three Italian newspapers is opposite to the growth of the perceived threat of violence in Italy [26], so the growing perception cannot be ascribed directly to press coverage.
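The slopes $dR/dt$ quoted above are the result of linear fits of R versus year. A minimal, unweighted version of such a fit is sketched below; it is a simplification, since the fits in this work take the error of each bin into account, and the function name `linear_trend` and the yearly values of R are hypothetical.

```python
def linear_trend(years, values):
    """Unweighted least-squares straight line through (year, value) points.

    Returns (slope, intercept) of value = intercept + slope * year.
    """
    n = len(years)
    mean_x = sum(years) / n
    mean_y = sum(values) / n
    sxx = sum((x - mean_x) ** 2 for x in years)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(years, values))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical yearly values of R (in percent), built to fall by 0.1 %/year.
years = list(range(1954, 2006))
r_percent = [20.0 - 0.1 * (year - 1954) for year in years]

slope, intercept = linear_trend(years, r_percent)
print(f"dR/dt = {slope:.3f} %/year")
```

A weighted fit would divide each term of the sums by the squared error of the corresponding yearly value, which also yields the uncertainty on the slope.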

Gender bias

In Italian newspapers, where the language allows gender to be distinguished, we have also queried the databases for the terms ‘morto’/‘morta’ (M), respectively the singular ($k=1$) masculine and feminine forms in the Italian language. ‘Morte’, the feminine plural, could not be used since it also means ‘death’ in Italian and it would be difficult to semantically separate the two meanings. Besides, if both males and females are involved the term ‘morti’ is used in Italian, making the lemma ‘morte’ of little use.

From Fig. 5 we see that the reporting of female deaths (morta) amounts to only $30 \%$ of all $k=1$ deaths. This has to be compared with the standardized death rate for fatal accidents in Italy (in 2005) of 36.1 (male) and 19.2 (female) per 100,000 inhabitants [27], and with the probability for a 15-year-old individual to die within 45 years, before reaching the age of 60 (45q15) [28], which in 2010 was $7.9\%$ for men and $4.1\%$ for women.

Therefore, even accounting for the fact that male violent deaths are more frequent than female ones^{Footnote 4} and that these events would be more likely to be reported in the news, this still hints at some amount of gender bias in reporting. The female/male ratio of $\simeq 30\%$ is present in all newspapers and roughly constant over the years, with only REP showing an increase in the reporting of female deaths of about $3\%/year$, from $22.6\%$ in 1985 to $37.65\%$ in 2007.
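As a rough cross-check of this comparison, the female share implied by the two mortality indicators quoted above can be computed directly. The sketch below assumes male and female populations of equal size (a simplification introduced here, not a statement from the cited sources) and reads the 30% figure as a fraction of all $k=1$ articles; the function name is illustrative.

```python
def female_share(male_rate: float, female_rate: float) -> float:
    """Fraction of deaths that are female, assuming equally sized
    male and female populations (simplifying assumption)."""
    return female_rate / (male_rate + female_rate)

# Values quoted in the text
share_accidents = female_share(36.1, 19.2)  # standardized fatal-accident rates, 2005
share_45q15 = female_share(7.9, 4.1)        # probability of dying between 15 and 60, 2010
reported_share = 0.30                       # 'morta' fraction of k=1 articles

print(f"fatal accidents: {share_accidents:.1%}")
print(f"45q15:           {share_45q15:.1%}")
print(f"reported:        {reported_share:.1%}")
```

Under these assumptions both indicators give a female share of roughly one third, a few percentage points above the reported 30%.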

Scaling laws

The analysis of keywords associated with the number of persons involved allows us to build the distribution function of the number of articles, $N_k$, reporting k people killed. As shown in Fig. 7, the distributions for all four newspapers considered can be described by a single power law:

$$\begin{aligned} N_k=N(k)=A\cdot k^{-\gamma } \end{aligned}$$

(1)

in the range $2\le k \le 10^6$. The sharp peaks in $N_k$ for values of k that are multiples of 10, 100, 1000... are due to the rounding up of the number of people reported to the nearest multiple of a power of 10 (see below).

The Minuit [30] package (now in its second release, Minuit2) has been used to perform the fits.

The fitting of the power laws is analogous to the methods we used to fit the cosmic ray spectra of the PAMELA space-borne detector [31,32,33] (see also ext. data therein). Tests with varying bin sizes have shown no significant change in the fitting results and in the values of $\gamma$.

See the Appendix for the mathematical details; a discussion of the fitting methods and systematic errors can be found in Additional file 1.
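The fits in this work were performed with Minuit. As a simpler stand-in, the sketch below estimates A and $\gamma$ of Eq. 1 with an ordinary least-squares line in log-log space. This is not the author's procedure; the synthetic counts are invented for illustration and the helper name is hypothetical.

```python
import math

def fit_power_law(ks, counts):
    """Least-squares fit of log N = log A - gamma * log k.
    Returns (A, gamma). Bins with zero counts are skipped."""
    pts = [(math.log(k), math.log(n)) for k, n in zip(ks, counts) if n > 0]
    m = len(pts)
    mean_x = sum(x for x, _ in pts) / m
    mean_y = sum(y for _, y in pts) / m
    sxx = sum((x - mean_x) ** 2 for x, _ in pts)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in pts)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return math.exp(intercept), -slope

# Synthetic, noise-free spectrum with A = 1e5 and gamma = 1.5 (illustrative values)
ks = list(range(2, 1001))
counts = [1e5 * k ** -1.5 for k in ks]
A_fit, gamma_fit = fit_power_law(ks, counts)
print(f"A = {A_fit:.0f}, gamma = {gamma_fit:.3f}")
```

On real counts, a straight log-log regression gives equal weight to sparsely populated bins and is sensitive to the spikes at multiples of 10, so it should be taken only as a first estimate.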

Power-law statistics have been found to describe the distribution of various natural phenomena, e.g. earthquakes [34], forest fires [35, 36] and the cluster size of tropical trees [37], and are usually thought to arise through positive growth feedback [38, 39].

Galactic cosmic ray spectra also follow a power law, as a result of the statistical process of acceleration. Changes in the spectral indexes show the presence of additional sources or production phenomena [32, 33].

Power-law distributions are also encountered in many human-related activities [40], from language distribution (Zipf's law [41, 42]) to the number of casualties in wars [43, 44] and ethnic violence [45]. In these cases too, they have been shown to arise through a ‘winner takes all’ type of competitive network where a few elements grow to acquire a very large size [46, 47].

In newspapers, the distribution $N_k$ can be explained as the result of two main phenomena:

The convolution of various violent events and accidents. Road, train and air accidents, natural disasters and catastrophes each have their own frequency and probability distribution, usually unknown but decreasing as k increases. Articles with $k \gtrsim 1000$ often do not describe a specific accident, but rather summarize global phenomena such as war, deaths from illness (cancer, heart attack...) or automobile accidents per year.

The selection by the newsroom. The publishing criteria can change with time, location or censorship: events with higher k have a higher probability of being selected because of their importance. Conversely, foreign events, occurring in countries physically or socially far from the country where the newspaper is printed, are more likely to be ignored, especially for low k.

Both phenomena can be approximated by a power law, with probability $P\propto k^{-\gamma _1}$ for an event to involve k casualties, and a probability (or efficiency) $\epsilon \propto k^{+\gamma _2}$ for the event to be picked up by the newsroom. The overall probability is then $P_{tot}=\epsilon \cdot P \propto k^{-\gamma _1+\gamma _2}=k^{-\gamma }$, with $\gamma =\gamma _1-\gamma _2$.

The values of $\gamma$ of the four newspapers ($\gamma _{NYT} = 1.44\pm 0.06$, $\gamma _{CDS} = 1.61\pm 0.01$, $\gamma _{REP} =1.42\pm 0.07$, $\gamma _{STA} = 1.52\pm 0.09$, Table 2) show that the editorial processes are similar in Italy and the US and converge to a narrow range of values, although the differences among them reflect the different emphasis given to high and low k events in each newspaper. A ‘steep’ spectrum, with a high $\gamma$, results in a higher number of articles with low k, and vice versa. Defining $W=N_H/N_L=N_{k> 10}/N_{2\le k\le 10}$ we see (Table 2) that it ranges between $W_{CDS}=0.57$ (more articles with low k) and $W_{REP}=1.00$ (fewer articles with low k), with $W_{STA}=0.72$ and $W_{NYT}=0.93$ having intermediate values (see the Appendix and Fig. 12 for a calculation of the value of W as a function of $\gamma$).

The trend of the spectral index can be used to estimate the state of belligerence reported by the newspapers: the running average (current and four preceding years) of $\gamma (t)$ (Fig. 8) decreases in wartime, when the higher abundance of high k events results in a flatter distribution. Conversely, in peacetime the distributions are dominated by low k events and $\gamma$ increases, giving a steeper distribution. Thus, local minima in $\gamma (t)$ are present in NYT during the US Civil War ($\gamma = 1.21$), the two World Wars ($\gamma _{WW1} =1.26$, $\gamma _{WW2} =1.30$) and the Vietnam War ($\gamma =1.31$). CDS and STA also show local minima during WW1 ($\gamma ^{CDS}\simeq 1.2$ and $\gamma ^{STA}=0.8$) and WW2 ($\gamma ^{STA}=1.76$, $\gamma ^{CDS}=1.91$), followed by a sharp increase in STA (more gradual in CDS) after 1945. It is also interesting to note that, notwithstanding the differences in $\gamma$, the trends of the spectral indexes of the newspapers are in good agreement with each other. This suggests that they all tend to react similarly to the conflicts occurring in the world (and, conversely, that a discrepancy in the trend would indicate the presence of censorship).
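The running average described above (current year and four preceding years) can be sketched as a trailing window. The yearly values below are hypothetical, not data from the paper, and the handling of the first years, where fewer than five values exist, is an assumption of this sketch.

```python
def trailing_average(values, window=5):
    """Mean of each value and up to window-1 preceding ones.
    The first entries use the shorter history that is available."""
    out = []
    for i in range(len(values)):
        start = max(0, i - window + 1)
        chunk = values[start:i + 1]
        out.append(sum(chunk) / len(chunk))
    return out

# Hypothetical yearly spectral indexes (not data from the paper)
gamma_by_year = [1.6, 1.5, 1.2, 1.1, 1.3, 1.6, 1.7]
smoothed = trailing_average(gamma_by_year)
print([round(g, 3) for g in smoothed])
```

Because the window only looks backward, a minimum in the smoothed curve lags the minimum in the yearly values, which matters when dating the local minima against historical events.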

Table 2 Spectral index $\gamma$ for a single power-law fit ($2\le k \le 10^6$) on the full datasets of the four newspapers


Table 3 Spectral index for the four newspapers separating domestic and foreign datasets


Geographical bias

Using the articles where a location could be determined, we built distribution laws for domestic and foreign events. Fig. 7 shows that the latter have a strong spectral break for $2\le k\le 10$ (absent or less prominent in the former). Fitting the distribution function with two power laws, $\gamma _H=\gamma (k>10)$ and $\gamma _L=\gamma (2\le k\le 10)$, we see that $\Delta \gamma =\gamma _L-\gamma _H$ is always negative for foreign events, ranging from $\Delta \gamma _{REP}=-0.511\pm 0.005$ to $\Delta \gamma _{CDS}=-\,1.09\pm 0.01$. For Italian newspapers reporting domestic events, the distributions are closer to a pure power law (highest $\Delta \gamma =+\,0.28\pm 0.02$ for CDS); in all three Italian cases $\Delta \gamma$ is positive, implying that a higher emphasis is given to low k events. NYT has a smaller discrepancy between foreign ($\Delta \gamma =-\,0.672\pm 0.003$) and domestic events ($\Delta \gamma =-0.42\pm 0.01$). A plot of the value of $\gamma$ vs $\Delta \gamma$ (Fig. 9) shows that the foreign and domestic categories are clearly separated in all four newspapers (Table 3). This suggests a difference in editorial behaviour: accidents involving a small number of persons in foreign countries receive less press coverage, being considered not interesting enough to be reported.
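The two-range fit can be illustrated with a small sketch that fits $\gamma _L$ and $\gamma _H$ separately and returns $\Delta \gamma$. It uses a plain least-squares line in log-log space rather than the Minuit fit used in this work, and the synthetic spectrum (with its indexes and normalization) is invented for illustration.

```python
import math

def slope_loglog(ks, counts):
    """Return gamma from a least-squares line through (log k, log N)."""
    xs = [math.log(k) for k in ks]
    ys = [math.log(n) for n in counts]
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    return -sxy / sxx

def delta_gamma(ks, counts, k_break=10):
    """gamma_L - gamma_H, fitting 2 <= k <= k_break and k > k_break separately."""
    low = [(k, n) for k, n in zip(ks, counts) if 2 <= k <= k_break]
    high = [(k, n) for k, n in zip(ks, counts) if k > k_break]
    gamma_l = slope_loglog(*zip(*low))
    gamma_h = slope_loglog(*zip(*high))
    return gamma_l - gamma_h

# Synthetic broken power law (illustrative values), continuous at k = 10
GAMMA_L, GAMMA_H = 1.0, 1.6
beta = 1e4
alpha = beta * 10 ** (GAMMA_L - GAMMA_H)
ks = list(range(2, 501))
counts = [alpha * k ** -GAMMA_L if k <= 10 else beta * k ** -GAMMA_H for k in ks]
print(round(delta_gamma(ks, counts), 3))
```

A negative result, as in this synthetic case, corresponds to a low-k branch that is flatter than the extrapolation of the high-k branch, i.e. to missing low-k articles.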

An estimation of the under- or over-reporting of low k events can be provided by the extrapolation of $\gamma _{H}$ to $2\le k \le 10$:

$$\begin{aligned} M=\frac{N_L - \int _{2}^{10}\alpha _{H} k^{-\gamma _{H}} dk }{N_L} \end{aligned}$$

(2)

with $N_L= \Sigma _{2}^{10}N_k$ and $\alpha _{H}$ coming from the power-law fit for $k>10$. Thus, M is the fraction of events with $2\le k\le 10$ missing ($M<0$) or in excess ($M>0$) with respect to the value expected from a pure power-law distribution ($M=0$). All four newspapers have $M_{for}\simeq -100\%$ in the foreign case: the extrapolation predicts about twice the observed number of articles, meaning that the editorial room decides to print only one event out of two if it involves ten or fewer casualties in a foreign country. We also note that Italian newspapers tend to print more news of domestic events (from $M_{dom}=16\%$ for REP to $M_{dom}= 71\%$ for CDS) than expected from a pure power-law distribution. This can be attributed in part to a large domestic and local news section. Overall, CDS has the largest difference between foreign and domestic events ($M_{for}^{CDS}=-122\%$ vs. $M_{dom}^{CDS}=+71\%$) and the NYT the smallest ($M_{for}^{NYT}=-98\%$ vs. $M_{dom}^{NYT}=+4\%$). See the Appendix and Fig. 13 for a calculation of the value of M as a function of $\Delta \gamma$ and $\gamma _H$.
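Eq. 2 can be evaluated with a few lines of code once $\alpha _H$ and $\gamma _H$ are known. The sketch below uses the analytic form of the integral; the fit values are hypothetical and the function names are illustrative.

```python
def expected_low_k(alpha_h, gamma_h, k_min=2.0, k_max=10.0):
    """Integral of alpha_h * k**(-gamma_h) between k_min and k_max (gamma_h != 1)."""
    p = 1.0 - gamma_h
    return alpha_h * (k_max ** p - k_min ** p) / p

def excess_fraction(n_low, alpha_h, gamma_h):
    """M of Eq. 2: (observed - expected) / observed for 2 <= k <= 10."""
    return (n_low - expected_low_k(alpha_h, gamma_h)) / n_low

# Hypothetical result of the high-k fit
alpha_h, gamma_h = 5000.0, 1.5
expected = expected_low_k(alpha_h, gamma_h)
m_equal = excess_fraction(expected, alpha_h, gamma_h)     # observed equals expected
m_half = excess_fraction(expected / 2, alpha_h, gamma_h)  # half of the expected articles
print(m_equal, m_half)
```

Since M is normalized to the observed count $N_L$, observing half of the expected articles gives $M=-100\%$.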

In many nations there are too few reported events to perform an acceptable power-law fit. Therefore, for countries having at least 30 entries in a given newspaper, we used the ratio $W=N_H/N_L=N_{k>10}/N_{2\le k\le 10}$ as an indicator of the intrinsic importance assigned to a given country.

In Fig. 10, $W_i$ is plotted as a function of the distance $D_i$ between the capital of country i and Rome or Washington, depending on the newspaper. The value of W tends to increase with distance (fewer events with low k and more with high k). The geographical bias appears to be stronger in Italian newspapers: a linear fit (Table 4) shows that the slopes are similar among the Italian newspapers and higher by a factor $\simeq 5$ than for NYT, a sign of a higher internationalization of the US paper: $dW/dx_{CDS}=(14\pm 1) \% / 1000\, km$; $dW/dx_{REP}=(11\pm 1) \% / 1000\, km$ and $dW/dx_{STA}=(17 \pm 1) \% / 1000\, km$, $dW/dx_{NYT}=(2.7\pm 0.3)\% / 1000\, km$.
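A sketch of the two ingredients of this fit is given below: the distance between capitals and the slope of W versus distance. The text does not state how $D_i$ was computed, so the great-circle (haversine) distance used here is an assumption, as are the coordinates and the (distance, W) pairs, which are invented for illustration.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius

def great_circle_km(lat1, lon1, lat2, lon2):
    """Haversine distance in km between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))

def slope_per_1000km(distances_km, w_values):
    """Least-squares slope of W versus distance, in units of W per 1000 km."""
    xs = [d / 1000.0 for d in distances_km]
    mx, my = sum(xs) / len(xs), sum(w_values) / len(w_values)
    sxy = sum((x - mx) * (w - my) for x, w in zip(xs, w_values))
    sxx = sum((x - mx) ** 2 for x in xs)
    return sxy / sxx

ROME = (41.9028, 12.4964)         # approximate latitude, longitude in degrees
WASHINGTON = (38.9072, -77.0369)
rome_washington_km = great_circle_km(*ROME, *WASHINGTON)
print(round(rome_washington_km))

# Hypothetical (distance, W) pairs lying on a line with slope 0.14 per 1000 km
distances = [1000.0, 3000.0, 6000.0, 9000.0]
w_values = [0.5 + 0.14 * d / 1000.0 for d in distances]
slope = slope_per_1000km(distances, w_values)
print(round(slope, 3))
```

A slope of 0.14 per 1000 km corresponds to the 14% per 1000 km notation used in the text.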

Table 4 Slope of the linear fit of the ratio $W=N_{k>10}/N_{2\le k\le 10}$ as a function of the distance between Rome/Washington and the various world countries


Social proximity effects also play a role in defining the values of W for the various countries: as also visible in Fig. 10, American countries have lower W than equally distant Asian and African ones.

If we limit the fit to countries in Europe (for Italian newspapers) and in America (for NYT), we find that the slopes are higher by a factor of 2 to 4 than those obtained considering all world nations (Table 4), a sign that geographical distance plays a stronger role for countries that are socially close to either Italy or the US (although the different geography of the American continent plays a role in the different behaviour).

Editorial rounding by excess of casualties as an additional tool to detect censorship

Newspapers often round up the number of casualties reported: this can be due to lack of knowledge, to simplify the headline, or to purposely increase emphasis and attract the attention of the reader. In the absence of tampering, we would expect the least significant digit of k, $l_k$, to follow a uniform distribution ($P(l_k)=1/10$). However, Fig. 11, which plots the distribution of $l_k$ for $10<k\le 100$, shows that the values ‘0’ and ‘5’ are overabundant (‘5’ only in the Italian papers) and the others under-abundant with respect to the flat distribution. The deficit in the digits ‘6’ to ‘9’ is close to the excess of ‘0’ (and similarly for ‘1’ to ‘4’ with ‘5’), confirming the artificial nature of the reported number of casualties. Overall, Italian newspapers have a value of $l_0+l_5= 40-46\%$ and NYT has a value of 32$\%$ (compared with the 20$\%$ expected for the sum of the $l_0$ and $l_5$ bins). All distributions (see Additional file 1: Table S1) are incompatible with the null hypothesis of a flat distribution ($p<0.01$).
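The test against the flat distribution can be sketched as a Pearson $\chi ^2$ on the histogram of the last digit. The casualty numbers below are invented for illustration, and the comparison with a tabulated critical value (9 degrees of freedom, 1% level) stands in for the p-value computation, whose details are given in Additional file 1.

```python
from collections import Counter

CHI2_CRIT_DF9_P01 = 21.666  # chi-square critical value, 9 degrees of freedom, p = 0.01

def last_digit_counts(ks):
    """Histogram of the least significant digit for 10 < k <= 100."""
    digits = Counter(k % 10 for k in ks if 10 < k <= 100)
    return [digits.get(d, 0) for d in range(10)]

def chi2_uniform(counts):
    """Pearson chi-square statistic against a flat distribution."""
    expected = sum(counts) / len(counts)
    return sum((c - expected) ** 2 / expected for c in counts)

# Hypothetical reported casualty numbers: a flat background plus rounded values
reported = [20] * 30 + [25] * 12 + [30] * 18 + list(range(11, 100))
counts = last_digit_counts(reported)
stat = chi2_uniform(counts)
share_0_5 = (counts[0] + counts[5]) / sum(counts)
print(counts, round(stat, 1), f"{share_0_5:.0%}")
print("flat distribution rejected at 1%:", stat > CHI2_CRIT_DF9_P01)
```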

This distribution is found in all newspapers and historical periods with one notable exception: during the Fascist government, the domestic distributions of STA and CDS do not exhibit the peaks at ‘0’ and ‘5’, which are still present in the corresponding foreign distributions of the same period. A $\chi ^2$ test allows us to reject the hypothesis that the foreign and domestic histograms of the Italian newspapers follow the same distribution ($p<0.01$). Furthermore, the domestic distributions of STA and CDS are the only ones that are compatible with the equiprobable one (Additional file 1: Figure S3 and Table S1). This implies that during the Fascist regime the editorial practice was to report the number of domestic casualties more faithfully, without rounding up, as an additional way of suppressing these events, while at the same time exaggerating the number of foreign casualties.
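The comparison between foreign and domestic histograms can be sketched as a $\chi ^2$ test of homogeneity on a 2 x 10 contingency table. The exact test used in this work is not specified beyond being a $\chi ^2$ test, so this form is an assumption, and the two histograms are invented for illustration.

```python
CHI2_CRIT_DF9_P01 = 21.666  # chi-square critical value, 9 degrees of freedom, p = 0.01

def chi2_two_histograms(a, b):
    """Pearson chi-square statistic for the homogeneity of two histograms
    with the same bins. Empty bins are skipped (each one removes a degree
    of freedom, which the caller must account for)."""
    total_a, total_b = sum(a), sum(b)
    grand = total_a + total_b
    stat = 0.0
    for ca, cb in zip(a, b):
        col = ca + cb
        if col == 0:
            continue
        ea = total_a * col / grand  # expected count in histogram a
        eb = total_b * col / grand  # expected count in histogram b
        stat += (ca - ea) ** 2 / ea + (cb - eb) ** 2 / eb
    return stat

# Hypothetical last-digit histograms (digits 0..9)
foreign = [56, 9, 9, 9, 9, 21, 9, 9, 9, 9]            # peaks at '0' and '5'
domestic = [15, 14, 16, 15, 14, 16, 15, 15, 14, 16]   # close to flat
stat = chi2_two_histograms(foreign, domestic)
print(round(stat, 1), stat > CHI2_CRIT_DF9_P01)
```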

This phenomenon is similar to, but the mirror image of, Benford’s law [48, 49], which describes the statistical distribution of the most significant digit in several natural and man-made datasets [50]. Benford’s law has been used to detect accounting [51] or election fraud [52], since values that are artificially altered do not follow it.
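For reference, Benford’s law assigns probability $\log _{10}(1+1/d)$ to the leading digit d. The short sketch below tabulates these values; it only illustrates the law itself and is not part of the analysis in this work.

```python
import math

def benford_probability(d: int) -> float:
    """Benford's law probability of the leading digit d (1..9)."""
    return math.log10(1 + 1 / d)

probs = {d: benford_probability(d) for d in range(1, 10)}
for d, p in probs.items():
    print(d, f"{p:.3f}")
```

In contrast with the flat expectation for the least significant digit, the leading-digit distribution is strongly skewed towards ‘1’.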

Conclusions

In this work we have developed a series of techniques to automatically scan the complete historical databases of daily newspapers for the occurrence of specific keywords related to accidents and death. We also devised various algorithms to analyze the dependence of these keywords on geographical location and historical period. Compared with traditional analysis, which consists of manually scanning the newspaper articles, these tools offer the advantage of being automatic and thus applicable to larger datasets spanning longer time periods. Indeed, these tools are suited to historical analysis, to evaluate quantitatively the presence of bias or censorship in a given publication and its variation over time and political environment. This paper considered printed daily newspapers, but these techniques can also be used on online publications, news outlets, etc. The usage is limited to articles with quantitative keywords, where the event (accident, casualty, death...) is associated with the number of persons involved (the simpler word-occurrence methods can be used on a wider word set). Nevertheless, this type of analysis is complementary to the assessment of ‘fake news’, since it has the advantage of being automatic and of operating on large data sets. These tools can also contribute to the assessment of the freedom of the press in a given country.

Future work will extend the application of these tools to other lemmas such as casualties, wounded, victims. The analysis will be applied to differences in reporting between types of accidents, both of man-made origin (e.g. train/airplane/ship) and natural calamities. The structure of reporting as a function of the day of the week and of the page location will also be considered. On a larger scope, other newspapers, magazines and online publications will be considered, extending the analysis to look for the presence and evolution of ethnic or national bias. A longer-term goal is the use of speech recognition methods to study the occurrence of these lemmas on radio and TV.

Availability of data and materials

The newspaper archives are accessible at these locations: New York Times: https://www.nytimes.com/search/. Il Corriere della Sera: http://archivio.corriere.it/. La Repubblica: https://ricerca.repubblica.it/. La Stampa: http://www.archiviolastampa.it/. ROOT Ntuples with processed data used in the analysis will be made available upon reasonable request.

Notes

New York Times Company form 10-K, 2017.
Data on number of printed copies retrieved from http://www.adsnotizie.it
We used the selenium (https://www.selenium.dev/) package with Python bindings to automate the interaction with the archives.
For instance, in the US in 2016 fatal occupational injuries involved women in $7\%$ and men in $93\%$ of cases, for an almost equal number of worked hours ($43\%$ for women and $57\%$ for men) [29].

Abbreviations

CDS: Il Corriere della Sera
NYT: The New York Times
OCR: Optical character recognition
REP: La Repubblica
STA: La Stampa
WW1: World War 1
WW2: World War 2

References

Ariès P, Ranum P. Western attitudes toward death: from the middle ages to the present. Western attitudes toward death. Johns Hopkins University Press; 1975. https://books.google.it/books?id=3sZJN3wojesC.
Combs B, Slovic P. Newspaper coverage of causes of death. J Q. 1979;61:837–49.
Altheide DL. Creating reality: How TV news distorts events. A SageMark edition. Sage Publications; 1977. https://books.google.it/books?id=ml21yAEACAAJ.
Hanusch F. Valuing those close to us. J Stud. 2008;9(3):341–56. https://doi.org/10.1080/14616700801997281.
Burdach KJ. Reporting on deaths: the perspective coverage of accident news in a German tabloid. Eur J Commun. 1988;3(1):81–9. https://doi.org/10.1177/0267323188003001005.
Stigler SM. Francis Galton’s account of the invention of correlation. Statist Sci. 1989;4(2):73–9. https://doi.org/10.1214/ss/1177012580.
Dember WN, Warm JS. Psychology of perception. New York: Holt, Rinehart and Winston; 1979.
jen Tsang K. News photos in time and newsweek. J Q. 1984;56(723):578–84.
Graphic death in the news media: present or absent? Mortality. 2008;13(4):301–17. https://doi.org/10.1080/13576270802383840.
Michel JB, Shen YK, Aiden AP, Veres A, Gray MK, et al. Quantitative analysis of culture using millions of digitized books. Science. 2011;331(6014):176–182. http://science.sciencemag.org/content/331/6014/176.
Smith R, Antonova D, Lee DS. Adapting the tesseract open source OCR engine for multilingual OCR. In: Proceedings of the international workshop on multilingual OCR. MOCR ’09. New York, NY, USA: Association for Computing Machinery; 2009. https://doi.org/10.1145/1577802.1577804.
Niyogi P. The informational complexity of learning: perspectives on neural networks and generative grammar. Berlin: Springer; 1998.
Brun R, Rademakers F. ROOT—an object oriented data analysis framework. Nuclear instruments and methods in physics research section A: accelerators, spectrometers, detectors and associated equipment. 1997;389(1):81 – 86. New Computing Techniques in Physics Research V. http://www.sciencedirect.com/science/article/pii/S016890029700048X.
Smith DM. Storia di cento anni di vita italiana visti attraverso il Corriere della sera. Rizzoli; 1978.
Melograni P. Il Corriere della sera (1919–1943). Cappelli Editore. 1965.
Nicola Tranfaglia ML Paolo Murialdi. La stampa italiana nell’età fascista. Editori Laterza. 1980.
Murialdi P. Storia del giornalismo italiano. Il Mulino. 2014.
James D Ciment TR. The home front encyclopedia: United States, Britain, and Canada in World Wars I and II. vol. 1–3. 1st ed. ABC-CLIO; 2006.
Rojcewicz SJ. War and suicide. Suicide and life-threatening behavior. 1(1):46–54. https://onlinelibrary.wiley.com/doi/abs/10.1111/j.1943-278X.1971.tb00598.x.
Somogyi S. Il suicidio in Italia 1864–1962. Analisi statistica. Milano: Giuffrè edit; 1967.
O’Malley P. SUICIDE AND WAR: A case study and theoretical appraisal. Br J Criminol. 1975;15(4):348–359. http://www.jstor.org/stable/23636204.
Felice RD. Mussolini e il fascismo. Mussolini il duce. Lo stato totalitario 1936–1940. vol. 5. Einaudi. 1996.
Aquarone A. L’organizzazione dello Stato totalitario. Einaudi. 1995.
Alessandri G. Il suicidio in Italia. Aggiornamenti sociali. 1971;1(22):45–56.
Sweeney MS. Secrets of victory: The office of censorship and the American Press and radio in World War II. 1st ed. The University of North Carolina Press; 2001.
ISTAT. La percezione della sicurezza. Statistical Report. 2018. https://www.istat.it/it/archivio/217502.
Eurostat. 2018. https://ec.europa.eu/eurostat/web/health/causes-death.
Rajaratnam JK, Marcus JR, Levin-Rector A, Chalupka AN, Wang H, Dwyer L, et al. Worldwide mortality in men and women aged 15–59 years from 1970 to 2010: a systematic analysis. Lancet. 2010;375(9727):1704–20. https://doi.org/10.1016/S0140-6736(10)60517-X.
US Bureau of Labor Statistics. 2018. https://www.bls.gov/iif/oshwc/cfoi/cfch0015.pdf.
James F. MINUIT Function minimization and error analysis: reference manual version 94.1. CERN-D-506. 1994.
Adriani O, Barbarino GC, Bazilevskaya GA, Bellotti R, Boezio M, Bogomolov EA, et al. The PAMELA Mission: Heralding a new era in precision cosmic ray physics. Physics Reports. 2014;544(4):323–370.
Adriani O, Barbarino GC, Bazilevskaya GA, Bellotti R, Boezio M, Bogomolov EA, et al. PAMELA measurements of cosmic-ray proton and helium spectra. Science. 2011;332:69.
Adriani O, Barbarino GC, Bazilevskaya GA, Bellotti R, Boezio M, Bogomolov EA, et al. An anomalous positron abundance in cosmic rays with energies 1.5–100 GeV. Nature. 2009;458:607–9.
Bak P, Tang C. Earthquakes as a self-organized critical phenomenon. J Geophys Res Solid Earth;94(B11):15635–15637. https://agupubs.onlinelibrary.wiley.com/doi/abs/10.1029/JB094iB11p15635.
Malamud BD, Morein G, Turcotte DL. Forest fires: an example of self-organized critical behavior. Science. 1998;281(5384):1840–2. http://science.sciencemag.org/content/281/5384/1840.
Telesca L, Amatucci G, Lasaponara R, Lovallo M, Rodrigues MJ. Space time fractal properties of the forest-fire series in central Italy. Commun Nonlinear Sci Numer Simul. 2007;12:1326–33.
Condit R, Ashton PS, Baker P, Bunyavejchewin S, Gunatilleke S, Gunatilleke N, et al. Spatial patterns in the distribution of tropical tree species. Science. 2000;288(5470):1414–1418. http://science.sciencemag.org/content/288/5470/1414.
Scanlon TM, Caylor KK, Levin SA, Rodriguez-Iturbe I. Positive feedbacks promote power-law clustering of Kalahari vegetation. Nature. 2007;449:209. https://doi.org/10.1038/nature06060.
Yakovlev G, Newman WI, Turcotte DL, Gabrielov A. An inverse cascade model for self-organized complexity and natural hazards. Geophys J Int. 2005;163:433–42.
Clauset A, Shalizi CR, Newman MEJ. Power-law distributions in empirical data. SIAM Rev. 2009;51(4):661–703. https://doi.org/10.1137/070710111.
Newman M. Power laws, Pareto distributions and Zipf’s law. Contemp Phys. 2005;46(5):323–51. https://doi.org/10.1080/00107510500052444.
Moreno-Sanchez I, Font-Clos F, Corral A. Large-scale analysis of Zipf’s law in English texts. PLOS ONE. 2016;11(1):1–19. https://doi.org/10.1371/journal.pone.0147073.
Richardson LF. Variation of the frequency of fatal quarrels with magnitude. J Am Stat Assoc. 1948;43(244):523–46. https://www.tandfonline.com/doi/abs/10.1080/01621459.1948.10483278.
Cederman LE. Modeling the size of wars: from billiard balls to sandpiles. Am Polit Sci Rev. 2003;97(1):135–50.
Lim M, Metzler R, Bar-Yam Y. Global pattern formation and ethnic/cultural violence. Science. 2007;317(5844):1540–4. http://science.sciencemag.org/content/317/5844/1540.
Barabási AL. The origin of bursts and heavy tails in human dynamics. Nature. 2005;435:207. https://doi.org/10.1038/nature03459.
Bohorquez JC, Gourley S, Dixon AR, Spagat M, Johnson NF. Common ecology quantifies human insurgency. Nature. 2009;462:911. https://doi.org/10.1038/nature08631.
Newcomb S. Note on the frequency of use of the different digits in natural numbers. Am J Math. 1881;4:39–40.
Benford F. The law of anomalous numbers. Proc Am Philos Soc. 1938;78(4):551–572. http://www.jstor.org/stable/984802.
Morzy M, Kajdanowicz T, Szymanski BK. Benford’s distribution in complex networks. Sci Rep. 2016;6:34917. https://doi.org/10.1038/srep34917.
Durtschi C, Hillison W, Pacini C. The effective use of Benford’s law to assist in detecting fraud in accounting data. J Forensic Acc. 2004;01:5.
Jimenez R, Hidalgo M, Klimek P. Testing for voter rigging in small polling stations. Sci Adv. 2017;3(6). http://advances.sciencemag.org/content/3/6/e1602363.
Dunn HL. Vital Statistics of the United states. Government of the United States of America; 1945.
OECD. Suicide rates (indicator). 2018. doi:10.1787/a82f3459-en. https://www.oecd-ilibrary.org/social-issues-migration-health/suicide-rates/indicator/english_a82f3459-en.


Acknowledgements

The author is grateful to Dr. A. Truzzi for the useful and stimulating discussions during the preparation of this work and Drs U. T. Casolino, W. Husein, O. Larsson, L. Marcelli, L. Sorge and M. Piersanti for reviewing the manuscript. The author also wishes to thank the four newspapers considered (The New York Times, Il Corriere della Sera, La Repubblica, La Stampa) for having their historical archives freely available for consultation: without these resources the paper could not have been written.

Funding

None.

Author information

Authors and Affiliations

INFN Structure of Roma Tor Vergata, Via della Ricerca Scientifica 1, Rome, Italy
Marco Casolino
Riken, Computational Astrophysics lab, Hirosawa, Wako, Japan
Marco Casolino
Dipartimento di Fisica, Università di Roma Tor Vergata, Via della Ricerca Scientifica 1, Rome, Italy
Marco Casolino

Authors

Marco Casolino

Contributions

The author read and approved the final manuscript.

Authors’ information

Marco Casolino’s research is mostly focused on the study of cosmic rays and of the radiation environment, through the construction and launch of space-borne detectors both on satellites and on space stations. Among the satellite-borne detectors, he was involved in the Nina and Mita instruments to study the MeV-GeV component of cosmic rays, and in the Pamela magnetic spectrometer to perform the first detailed measurement of the antiparticle component (positrons, antiprotons) in galactic cosmic rays, of relevance for astrophysics and indirect dark matter search. On the Mir and International Space Stations he was involved in the study of the light flash phenomenon observed by astronauts and in the determination of the radiation environment in low Earth orbit. He has been PI of several instruments onboard the International Space Station, the most recent of which is the Mini-EUSO telescope, observing the night-time ultraviolet emissions of terrestrial, atmospheric, and cosmic origin. He was also involved in a technology transfer project to apply detector technology from space to the realization of a detector for radiation in food after the Fukushima accident. This work utilized many of the data processing techniques developed for the analysis of the data coming from these instruments.

Corresponding author

Correspondence to Marco Casolino.

Ethics declarations

Competing interests

The author declares no competing interests.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary information

Additional file 1.

Additional statistical data on systematic error and the distribution of least significant digit.

Appendix

Estimation of the missing events

For a power-law distribution of the number of articles N(k) reporting that k people have been killed:

$$\begin{aligned} N(k)=\alpha k^{-\gamma } \end{aligned}$$

(3)

the total number of articles with two or more people killed is:

$$\begin{aligned} N_{tot}=\frac{\alpha }{\gamma -1} 2^{1-\gamma } \,\,\,\,\, \gamma >1 \end{aligned}$$

(4)

for a given $N_{tot}$ we have therefore that:

$$\begin{aligned} N(k)=\frac{N_{tot}(\gamma -1)}{2^{1-\gamma }} k^{-\gamma } \end{aligned}$$

(5)

In a single power law, the ratio $W=N_H/N_L$ between the number of articles with more than 10 people dead ($N_H$) and the number with 10 or fewer ($N_L$) is:

$$\begin{aligned} W = \frac{10^{1-\gamma }}{2^{1-\gamma }-10^{1-\gamma }}=\frac{1}{5^{\gamma -1}-1} \end{aligned}$$

(6)

A high value of $\gamma$ implies an emphasis on articles with low k and vice-versa. For $\gamma =1.43$ we have an equal number of articles with $2\le k\le 10$ and $k>10$ (Fig. 12).
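The value $\gamma \simeq 1.43$ quoted above follows from setting $W=1$ in Eq. 6. As a short worked step, in the same continuous approximation (sums over k replaced by integrals) that gives Eq. 4:

$$\begin{aligned} W=\frac{N_H}{N_L}=\frac{\int _{10}^{\infty }\alpha k^{-\gamma }dk}{\int _{2}^{10}\alpha k^{-\gamma }dk}=\frac{1}{5^{\gamma -1}-1}=1 \;\Rightarrow \; 5^{\gamma -1}=2 \;\Rightarrow \; \gamma =1+\frac{\ln 2}{\ln 5}\simeq 1.43 \end{aligned}$$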

For a broken power law distribution:

$$\begin{aligned} N(k) = \left\{ \begin{array}{l l} \alpha k^{-\gamma _L} & \quad 2\le k \le 10 \\ \beta k^{-\gamma _H} & \quad k>10 \\ \end{array} \right. \end{aligned}$$

(7)

so

$$\begin{aligned} N_{L} = & {} \frac{\alpha }{1-\gamma _L}\left( 10^{1-\gamma _L}-2^{1-\gamma _L}\right) \quad 2\le k \le 10 \end{aligned}$$

(8)

$$\begin{aligned} N_{H} = & {} -\frac{\beta }{1-\gamma _H}\, 10^{1-\gamma _H} \quad k > 10 \end{aligned}$$

(9)

$$\begin{aligned} M= & {} 1-\frac{ \int _{2}^{10}\beta k^{-\gamma _{H}} dk }{\int _{2}^{10}\alpha k^{-\gamma _{L}} dk }= 1- \frac{1-\gamma _{L}}{1-\gamma _{H}}10^{\gamma _{H}-\gamma _{L}}\frac{10^{1-\gamma _{H}}-2^{1-\gamma _{H}}}{10^{1-\gamma _{L}}-2^{1-\gamma _{L}}} \end{aligned}$$

(10)

The excess ($M>0$) or deficit ($M<0$) with respect to a pure power law ($M=0$) thus depends on the two values $\gamma _H$ and $\gamma _L$. They are shown in Fig. 13.
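The closed form of Eq. 10 relies on the two branches of Eq. 7 being continuous at $k=10$, which fixes $\beta /\alpha =10^{\gamma _H-\gamma _L}$. A minimal sketch of the resulting function is given below; the function names and the spectral indexes in the examples are hypothetical.

```python
def power_integral(gamma, k_min=2.0, k_max=10.0):
    """Integral of k**(-gamma) between k_min and k_max (gamma != 1)."""
    p = 1.0 - gamma
    return (k_max ** p - k_min ** p) / p

def excess_from_indexes(gamma_l, gamma_h):
    """M of Eq. 10 for a broken power law that is continuous at k = 10."""
    beta_over_alpha = 10.0 ** (gamma_h - gamma_l)
    return 1.0 - beta_over_alpha * power_integral(gamma_h) / power_integral(gamma_l)

m_pure = excess_from_indexes(1.5, 1.5)     # pure power law
m_flat = excess_from_indexes(0.8, 1.6)     # flatter low-k branch (hypothetical values)
m_steep = excess_from_indexes(1.8, 1.6)    # steeper low-k branch (hypothetical values)
print(m_pure, round(m_flat, 3), round(m_steep, 3))
```

A flatter low-k branch ($\gamma _L<\gamma _H$) gives a negative M, and a steeper one gives a positive M.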

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.


About this article

Cite this article

Casolino, M. Large scale analysis of violent death count in daily newspapers to quantify bias and censorship. J Big Data 7, 60 (2020). https://doi.org/10.1186/s40537-020-00338-1


Received: 10 March 2020
Accepted: 30 July 2020
Published: 11 August 2020
DOI: https://doi.org/10.1186/s40537-020-00338-1
