Appendices

The State of the Union Historically
Flesch-Kincaid Scores Comparatively
Statistical Methods Used in this Project
Source Code

The State of the Union Historically

The following chart shows when each message was delivered, whether it was delivered orally or in writing, and how the address became increasingly public beginning in the 20th century.

Key:

Oral deliveries in Red
Written deliveries in Black
*Radio delivery after written message marked with asterisk

Increase in publicity:

No Broadcast
Radio	1923
Television	1947
Evening	1965
Web	2002

		Time of Message
President	Term	1st	2nd	3rd	4th	End
George Washington	1789-1793	1790	1790	1791	1792
	1793-1797	1793	1794	1795	1796
John Adams	1797-1801	1797	1798	1799	1800
Thomas Jefferson	1801-1805	1801	1802	1803	1804
	1805-1809	1805	1806	1807	1808
James Madison	1809-1813	1809	1810	1811	1812
	1813-1817	1813	1814	1815	1816
James Monroe	1817-1821	1817	1818	1819	1820
	1821-1825	1821	1822	1823	1824
John Quincy Adams	1825-1829	1825	1826	1827	1828
Andrew Jackson	1829-1833	1829	1830	1831	1832
	1833-1837	1833	1834	1835	1836
Martin Van Buren	1837-1841	1837	1838	1839	1840
William Henry Harrison	1841
John Tyler	1841-1845	1841	1842	1843	1844
James K. Polk	1845-1849	1845	1846	1847	1848
Zachary Taylor	1849-1850	1849
Millard Fillmore	1850-1853		1850	1851	1852
Franklin Pierce	1853-1857	1853	1854	1855	1856
James Buchanan	1857-1861	1857	1858	1859	1860
Abraham Lincoln	1861-1865	1861	1862	1863	1864
Andrew Johnson	1865-1869	1865	1866	1867	1868
Ulysses S. Grant	1869-1873	1869	1870	1871	1872
	1873-1877	1873	1874	1875	1876
Rutherford B. Hayes	1877-1881	1877	1878	1879	1880
James A. Garfield	1881
Chester A. Arthur	1881-1885	1881	1882	1883	1884
Grover Cleveland	1885-1889	1885	1886	1887	1888
Benjamin Harrison	1889-1893	1889	1890	1891	1892
Grover Cleveland	1893-1897	1893	1894	1895	1896
William McKinley	1897-1901	1897	1898	1899	1900
Theodore Roosevelt	1901-1905	1901	1902	1903	1904
	1905-1909	1905	1906	1907	1908
William Howard Taft	1909-1913	1909	1910	1911	1912
Woodrow Wilson	1913-1917	1913	1914	1915	1916
	1917-1921	1917	1918	1919	1920
Warren G. Harding	1921-1923	1921	1922
Calvin Coolidge	1923-1925			1923	1924
	1925-1929	1925	1926	1927	1928
Herbert Hoover	1929-1933	1929	1930	1931	1932
Franklin D. Roosevelt	1933-1937		1934	1935	1936
	1937-1941	1937	1938	1939	1940
	1941-1945	1941	1942	1943	1944
	1945	1945*
Harry S Truman	1945-1949		1946	1947	1948
	1949-1953	1949	1950	1951	1952	1953
Dwight D. Eisenhower	1953-1957	1953	1954	1955	1956*
	1957-1961	1957	1958	1959	1960	1961
John F. Kennedy	1961-1963	1961	1962	1963
Lyndon B. Johnson	1963-1965				1964
	1965-1969	1965	1966	1967	1968	1969
Richard M. Nixon	1969-1973		1970	1971	1972
	1973-1974	1973	1974
Gerald R. Ford	1974-1977			1975	1976	1977
Jimmy Carter	1977-1981		1978	1979	1980	1981
Ronald Reagan	1981-1985		1982	1983	1984
	1985-1989	1985	1986	1987	1988
George Bush	1989-1993		1990	1991	1992
William J. Clinton	1993-1997		1994	1995	1996
	1997-2001	1997	1998	1999	2000
George W. Bush	2001-2005		2002	2003	2004
	2005-2009	2005	2006	2007	2008
Barack Obama	2009-2013	2009	2010	2011	2012
	2013-2017	2013	2014	2015	2016
Donald J. Trump	2017-2021	2017	2018	2019	2020
Joseph R. Biden	2021-2025	2021	2022	2023	2024
Donald J. Trump	2025-	2025

(Source: http://www.presidency.ucsb.edu/sou.php)

Flesch-Kincaid Scores Comparatively

The Flesch-Kincaid score is commonly used for determining the age-appropriateness of reading material. It was developed originally in the 1940s by Rudolf Flesch, who wrote Why Johnny Can't Read. The current version of the formula was developed together with J.P. Kincaid for the Navy in the 1970s:

(0.39 * average_words_per_sentence) + (11.8 * average_syllables_per_word) - 15.59

It is a United States Government Department of Defense standard (DOD MIL-M-38784B). The score indicates that the text would be at the limit of comprehension for a person with the equivalent of that number of years of education; in a comprehension test, that person would answer 50 per cent of the questions correctly. It is estimated that the population of the U.S. has an average reading ability at the eighth-grade level.
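
As an illustration of the formula above, here is a minimal Python sketch. It is not the code used in this project: the function names, the sentence and word splitting, and the vowel-group syllable heuristic are all assumptions chosen for brevity, and the syllable count in particular is only approximate.

```python
import re


def count_syllables(word: str) -> int:
    """Approximate syllables as runs of vowels (an assumed heuristic, not exact)."""
    word = word.lower()
    count = len(re.findall(r"[aeiouy]+", word))
    # Treat a final 'e' as silent ("state"), except in "-le" endings ("people").
    if word.endswith("e") and not word.endswith("le") and count > 1:
        count -= 1
    return max(count, 1)


def flesch_kincaid_grade(text: str) -> float:
    """Apply (0.39 * words per sentence) + (11.8 * syllables per word) - 15.59."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z]+(?:'[A-Za-z]+)?", text)
    if not sentences or not words:
        raise ValueError("text must contain at least one sentence and one word")
    syllables = sum(count_syllables(w) for w in words)
    return 0.39 * len(words) / len(sentences) + 11.8 * syllables / len(words) - 15.59
```

Very short, simple sentences can produce scores below zero; the formula is not bounded to the range of school grades.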

The use of this metric is controversial. It does not account for sentence structure, vocabulary, style or context. This implementation is fraught with potential inaccuracies in the determination of the number of syllables in a word (possibly as much as ± 10%). It is probably best at comparing similar types of text. There are several newer, potentially more accurate measures of readability. The accuracy and applicability of the metric over 200 years is doubtful given changes in both language and education. So the use here of Flesch-Kincaid requires some explanation.

Flesch-Kincaid remains one of the best-known and most frequently used metrics of readability. Its correlation with grade levels gives a simple and accessible sense of the metric's meaning. It is a convenient quantitative marker of style that has broad current acceptance. In examining the historical corpus of the State of the Union, a set of documents with exceptional continuity over time, it provides a measure of the gradual changes in language use.

One interesting finding of the project is in fact that the trend of the score is so consistent and seemingly uncorrelated with particular presidents. From this study it is not possible to determine whether this is a change in language generally or one specific to political language.

The following chart shows Flesch-Kincaid Scores of a variety of texts:

Date	Title	Author	Type	Score
1611	King James Bible		Literature	11.0
1775	Give Me Liberty or Give Me Death	Patrick Henry	Speech	7.0
1776	Declaration of Independence		Document	15.1
1787	US Constitution		Document	17.8
1788	The Federalist Papers	Alexander Hamilton, et al.	Essays	17.1
1850	The Scarlet Letter	Nathaniel Hawthorne	Literature	11.0
1863	Gettysburg Address	Abraham Lincoln	Speech	11.17
1865	Alice's Adventures in Wonderland	Charles Dodgson	Literature	6.3
1906	What It Means to be Colored	Mary Church Terrell	Speech	15.0
1906	Taxes and Morals	Mark Twain	Speech	9.1
1914	Tarzan of the Apes	Edgar Rice Burroughs	Literature	9.4
1921	The Morality of Birth Control	Margaret Sanger	Speech	11.5
1922	Ulysses	James Joyce	Literature	6.8
1929	A Room of One's Own	Virginia Woolf	Literature	11.8
1932	Brave New World	Aldous Huxley	Literature	7.4
1950	Declaration of Conscience	Margaret Chase Smith	Speech	13.7
1954	Lord of the Flies	William Golding	Literature	4.8
1960	To Kill a Mockingbird	Harper Lee	Literature	6.0
1963	I Have a Dream	Martin Luther King Jr.	Speech	9.4
1964	The Ballot or the Bullet	Malcolm X	Speech	7.8
1966	In Cold Blood	Truman Capote	Literature	7.9
1973	New International Bible		Literature	13.5
1973	Gravity’s Rainbow	Thomas Pynchon	Literature	9.5
1991	Statement to the Senate Judiciary Committee on Clarence Thomas	Anita Faye Hill	Speech	8.9
1993	New York Times		Newspaper	14*
1993	LA Times		Newspaper	14*
1993	Washington Post		Newspaper	14*
1993	Associated Press		Newspaper	13*
1993	Wall Street Journal		Newspaper	11*
1993	Newsweek		Newspaper	11*
2004	Commencement at the U of Penn	Bono	Speech	5.9
2006	The (Sorry) State We Are In	Brad Borevitz	Essay	13.7

* Average Score

(Note: There is a rumor that USA Today is written at an eighth-grade level, but I have found no documentation of that, and spot checks seem to indicate that it is probably in the same range as other papers, around 12.)

(Sources: for Literature, Amazon.com; for Newspapers, Jack Hart, Editor & Publisher, November 6, 1993 (v126 i45 p.5), quoted at http://answers.google.com/answers/threadview?id=301734; for others, original analysis.)

Statistical Methods Used in this Project

The methods used in this project all rely on frequency counts of words. The displayed size of a word is determined simply by the frequency of its occurrence in the document. The x (horizontal) position within the interface is determined by the average position of the word in the document. The y (vertical) position of displayed words is determined by a calculation of the word's Relative Frequency Score, S.
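
The two simpler quantities, frequency and average position, could be computed along the following lines. This is a sketch only: the function name is hypothetical, and scaling the position to the range 0 to 1 is an assumption, since the mapping to screen coordinates is not described here.

```python
from collections import defaultdict


def frequency_and_mean_position(tokens):
    """Return {word: (count, mean_position)}, with position scaled to 0..1 (assumed)."""
    positions = defaultdict(list)
    for index, token in enumerate(tokens):
        positions[token].append(index)
    last = max(len(tokens) - 1, 1)  # avoid dividing by zero for one-word input
    return {
        word: (len(found), sum(found) / len(found) / last)
        for word, found in positions.items()
    }
```

A word that appears only at the start of an address gets a position near 0, and one that appears only at the end gets a position near 1.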

The S score attempts to quantify a word's significance by combining its frequency with a measure of how distinctive its use is within an individual document compared with its use in the corpus as a whole. S is the average of two statistics: the Log Likelihood Statistic (LLS) and the Term Frequency–Inverse Document Frequency (TF-IDF). Each of these scores has been normalized so that the minimum and maximum values of each component over the corpus are comparable (between 0 and 100).

The formula for the LLS is:

LLS = 2 * ((freqInCorp * log(freqInCorp / E1)) + (freqInDoc * log(freqInDoc / E2)))

where

E1 = wordsInCorp * (freqInCorp + freqInDoc) / (wordsInCorp + wordsInDoc)
E2 = wordsInDoc * (freqInCorp + freqInDoc) / (wordsInCorp + wordsInDoc)

To even out the scale of the LLS, the log of LLS + 1 (L1LLS) was used in its place:

L1LLS = log(LLS+1)
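
The LLS and L1LLS formulas translate fairly directly into Python. The sketch below makes several assumptions: the leading factor is read as 2 (as in the standard log-likelihood statistic), the logarithm is natural, and a term with a zero count contributes nothing. The function names are hypothetical.

```python
import math


def log_likelihood(freq_in_corp, freq_in_doc, words_in_corp, words_in_doc):
    """LLS for one word, given its counts and the sizes of corpus and document."""
    total_freq = freq_in_corp + freq_in_doc
    total_words = words_in_corp + words_in_doc
    e1 = words_in_corp * total_freq / total_words  # expected count in corpus
    e2 = words_in_doc * total_freq / total_words   # expected count in document

    def term(observed, expected):
        # Assumption: a zero count contributes 0, since x * log(x) tends to 0.
        return observed * math.log(observed / expected) if observed > 0 else 0.0

    return 2 * (term(freq_in_corp, e1) + term(freq_in_doc, e2))


def l1lls(lls):
    """Compress the scale of the LLS: log(LLS + 1)."""
    return math.log(lls + 1)
```

When a word is used at the same rate in the document as in the corpus, both expected counts equal the observed counts and the LLS is 0.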

The formula for TF:

TF = freqInDoc/wordsInDoc

The formula for IDF:

IDF = log( docsInCorpus / docsWhereWordOccurs)

The formula for TF-IDF:

TF-IDF = TF * IDF
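
The three TF-IDF formulas can be combined into one small function. This is a sketch with a hypothetical function name, and it assumes the natural logarithm, since the base is not stated.

```python
import math


def tf_idf(freq_in_doc, words_in_doc, docs_in_corpus, docs_where_word_occurs):
    """TF-IDF for one word in one document."""
    tf = freq_in_doc / words_in_doc
    idf = math.log(docs_in_corpus / docs_where_word_occurs)
    return tf * idf
```

A word that occurs in every document has an IDF of log(1) = 0, so its TF-IDF is 0 however often it is used.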

The formula for S is as follows:

S = (L1LLS + TF-IDF) / 2 (values are normalized before being averaged)
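
One way to carry out the normalization and averaging is shown below. The exact normalization is not specified beyond the 0 to 100 range, so this sketch assumes simple min-max scaling over the values supplied; the function names are hypothetical.

```python
def min_max_scale(values, low=0.0, high=100.0):
    """Rescale values linearly so the smallest maps to low and the largest to high."""
    smallest, largest = min(values), max(values)
    if largest == smallest:
        return [low for _ in values]  # no spread: map everything to low
    span = largest - smallest
    return [low + (v - smallest) * (high - low) / span for v in values]


def s_scores(l1lls_values, tfidf_values):
    """Average the two normalized components, word by word."""
    if len(l1lls_values) != len(tfidf_values):
        raise ValueError("both lists must describe the same words")
    scaled_lls = min_max_scale(l1lls_values)
    scaled_tfidf = min_max_scale(tfidf_values)
    return [(a + b) / 2 for a, b in zip(scaled_lls, scaled_tfidf)]
```

Each list should hold the component values over the whole corpus, so that the minimum and maximum used for scaling are those of the corpus rather than of a single document.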

Words are filtered to eliminate the most commonly used words.

For each document, the words with the top 40 S scores are selected. Depending on the length of the address, words with too few occurrences are also filtered out.
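
The filtering and selection steps might look like this in Python. The stop-word list and the minimum-occurrence threshold are not given here, so both appear as parameters; the function name and the default values other than 40 are assumptions.

```python
def select_top_words(scores, counts, stop_words, top_n=40, min_count=1):
    """Return up to top_n words ordered by descending S score.

    scores:     {word: S score} for one document
    counts:     {word: occurrences in that document}
    stop_words: set of common words to exclude (list not specified in the text)
    min_count:  hypothetical threshold; the text says it depends on address length
    """
    candidates = [
        word
        for word in scores
        if word not in stop_words and counts.get(word, 0) >= min_count
    ]
    return sorted(candidates, key=lambda word: scores[word], reverse=True)[:top_n]
```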

Source Code

Interface source code (Processing.js):
SotuDisplayJS.pde and jsStuff.js

SotuGraph source code (Processing.js):
SotuGraphJS.pde

Data:
Word frequency data by year for all words, normalized as frequency per 10,000 words (i.e. 10000*count/length, rounded down to an integer), in the tab-delimited plain text file history.txt.gz
Analyzed word data for top words by document, in JSON format: documentsData.json.gz
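
The per-10,000 normalization described for the word frequency data amounts to one line of integer arithmetic. A sketch in Python, with a hypothetical function name:

```python
def per_ten_thousand(count, length):
    """Occurrences per 10,000 words, rounded down: 10000 * count / length."""
    if length <= 0:
        raise ValueError("length must be positive")
    return 10000 * count // length
```

Because of the rounding down, a word that occurs less than once per 10,000 words is recorded as 0.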

Text of addresses:
stateoftheunion1790-2021.txt.zip