Appendices

The State of the Union Historically

The following chart shows when each message was delivered and whether it was delivered orally or in writing. It also indicates the increasing publicity of the address, starting in the 20th century.

Key:
 
Oral deliveries in Red
Written deliveries in Black
*Radio delivery after written message marked with asterisk
Increase in publicity:
No Broadcast  
Radio 1923
Television 1947
Evening 1965
Web 2002
  Time of Message
President Term 1st 2nd 3rd 4th End
George Washington 1789-1793 1790 1790 1791 1792  
  1793-1797 1793 1794 1795 1796  
John Adams 1797-1801 1797 1798 1799 1800  
Thomas Jefferson  1801-1805 1801 1802 1803 1804  
  1805-1809 1805 1806 1807 1808  
James Madison 1809-1813 1809 1810 1811 1812  
  1813-1817 1813 1814 1815 1816  
James Monroe 1817-1821 1817 1818 1819 1820  
  1821-1825 1821 1822 1823 1824  
John Quincy Adams 1825-1829 1825 1826 1827 1828  
Andrew Jackson 1829-1833 1829 1830 1831 1832  
  1833-1837 1833 1834 1835 1836  
Martin Van Buren 1837-1841 1837 1838 1839 1840  
William Henry Harrison 1841          
John Tyler 1841-1845 1841 1842 1843 1844  
James K. Polk 1845-1849 1845 1846 1847 1848  
Zachary Taylor 1849-1850 1849        
Millard Fillmore 1850-1853   1850 1851 1852  
Franklin Pierce 1853-1857 1853 1854 1855 1856  
James Buchanan 1857-1861 1857 1858 1859 1860  
Abraham Lincoln 1861-1865 1861 1862 1863 1864  
Andrew Johnson 1865-1869 1865 1866 1867 1868  
Ulysses S. Grant 1869-1873 1869 1870 1871 1872  
  1873-1877 1873 1874 1875 1876  
Rutherford B. Hayes 1877-1881 1877 1878 1879 1880  
James A. Garfield 1881          
Chester A. Arthur 1881-1885 1881 1882 1883 1884  
Grover Cleveland 1885-1889 1885 1886 1887 1888  
Benjamin Harrison 1889-1893 1889 1890 1891 1892  
Grover Cleveland 1893-1897 1893 1894 1895 1896  
William McKinley 1897-1901 1897 1898 1899 1900  
Theodore Roosevelt 1901-1905 1901 1902 1903 1904  
  1905-1909 1905 1906 1907 1908  
William Howard Taft 1909-1913 1909 1910 1911 1912  
Woodrow Wilson 1913-1917 1913 1914 1915 1916  
  1917-1921 1917 1918 1919 1920  
Warren G. Harding 1921-1923 1921 1922      
Calvin Coolidge 1923-1925     1923 1924  
  1925-1929 1925 1926 1927 1928  
Herbert Hoover 1929-1933 1929 1930 1931 1932  
Franklin D. Roosevelt 1933-1937   1934 1935 1936  
  1937-1941 1937 1938 1939 1940  
  1941-1945 1941 1942 1943 1944  
  1945 1945*        
Harry S Truman 1945-1949   1946 1947 1948  
  1949-1953 1949 1950 1951 1952 1953
Dwight D. Eisenhower 1953-1957 1953 1954 1955 1956*  
  1957-1961 1957 1958 1959 1960 1961
John F. Kennedy 1961-1963 1961 1962 1963    
Lyndon B. Johnson 1963-1965       1964  
  1965-1969 1965 1966 1967 1968 1969
Richard M. Nixon 1969-1973   1970 1971 1972  
  1973-1974 1973 1974      
Gerald R. Ford 1974-1977     1975 1976 1977
Jimmy Carter 1977-1981   1978 1979 1980 1981
Ronald Reagan 1981-1985   1982 1983 1984  
  1985-1989 1985 1986 1987 1988  
George Bush 1989-1993   1990 1991 1992  
William J. Clinton 1993-1997   1994 1995 1996  
  1997-2001 1997 1998 1999 2000  
George W. Bush 2001-2005   2002 2003 2004  
  2005-2009 2005 2006 2007 2008  
Barack Obama 2009-2013 2009 2010 2011 2012  

(Source: http://www.presidency.ucsb.edu/sou.php)

Flesch-Kincaid Scores Comparatively

The Flesch-Kincaid score is commonly used to determine the age-appropriateness of reading material. It was originally developed in the 1940s by Rudolf Flesch, who wrote Why Johnny Can't Read. The current version of the formula was developed together with J.P. Kincaid for the Navy in the 1970s:

(0.39 * average_words_per_sentence)
+ (11.8 * average_syllables_per_word) - 15.59

It is a United States Government Department of Defense standard (DOD MIL-M-38784B). The score indicates that the text would be at the limit of comprehension for a person with the equivalent of that number of years of education; in a comprehension test, that person would answer 50 per cent of the questions correctly. It is estimated that the population of the U.S. has an average reading ability at the eighth-grade level.
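As a minimal sketch, the formula above can be computed in Python as follows. The function names are hypothetical, and the syllable counter is a crude vowel-group heuristic chosen for illustration; it is not the counting method used in this project.

```python
import re


def fk_grade(total_words, total_sentences, total_syllables):
    """Apply the Flesch-Kincaid grade formula to raw counts."""
    words_per_sentence = total_words / total_sentences
    syllables_per_word = total_syllables / total_words
    return 0.39 * words_per_sentence + 11.8 * syllables_per_word - 15.59


def count_syllables(word):
    """Rough estimate (an assumption): count runs of consecutive vowels."""
    lowered = word.lower()
    count = len(re.findall(r"[aeiouy]+", lowered))
    # Treat a trailing "e" as silent ("state"), except in "-le" endings ("table").
    if lowered.endswith("e") and not lowered.endswith("le") and count > 1:
        count -= 1
    return max(count, 1)


def flesch_kincaid_grade(text):
    """Estimate the grade level of a text using naive sentence and word splitting."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    if not sentences or not words:
        raise ValueError("text must contain at least one sentence and one word")
    syllables = sum(count_syllables(w) for w in words)
    return fk_grade(len(words), len(sentences), syllables)
```

Because the score depends directly on the syllable count, any error in a heuristic like this one carries straight through to the grade level.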

The use of this metric is controversial. It does not account for sentence structure, vocabulary, style or context. The implementation used here is prone to inaccuracy in counting the syllables in a word (possibly by as much as ±10%). The metric is probably best at comparing similar types of text. There are several newer, potentially more accurate measures of readability. The accuracy and applicability of the metric over 200 years is doubtful, given changes in both language and education. The use of Flesch-Kincaid here therefore requires some explanation.

Flesch-Kincaid remains one of the best-known and most frequently used metrics of readability. Its correlation with grade levels gives a simple and accessible sense of the metric's meaning. It is a convenient quantitative marker of style that has broad current acceptance. In examining the historical corpus of the State of the Union, a set of documents with exceptional continuity over time, it provides a measure of the gradual changes in language use.

One interesting finding of the project is that the trend of the score is so consistent and seemingly uncorrelated with particular presidents. From this study it is not possible to determine whether this reflects a change in language generally or one specific to political language.

The following chart shows Flesch-Kincaid Scores of a variety of texts:

Date Title Author Type Score
1611 King James Bible   Literature 11.0
1775 Give Me Liberty or Give Me Death Patrick Henry Speech 7.0
1776 Declaration of Independence   Document 15.1
1787 US Constitution   Document 17.8
1788 The Federalist Papers Alexander Hamilton et al. Essays 17.1
1850 The Scarlet Letter Nathaniel Hawthorne Literature 11.0
1863 Gettysburg Address Abraham Lincoln Speech 11.17
1865 Alice's Adventures in Wonderland Charles Dodgson Literature 6.3
1906 What It Means to be Colored Mary Church Terrell Speech 15.0
1906 Taxes and Morals Mark Twain Speech 9.1
1914 Tarzan of the Apes Edgar Rice Burroughs Literature 9.4
1921 The Morality of Birth Control Margaret Sanger Speech 11.5
1922 Ulysses James Joyce Literature 6.8
1929 A Room of One's Own Virginia Woolf Literature 11.8
1932 Brave New World Aldous Huxley Literature 7.4
1950 Declaration of Conscience Margaret Chase Smith Speech 13.7
1954 Lord of the Flies William Golding Literature 4.8
1960 To Kill a Mockingbird Harper Lee Literature 6.0
1963 I Have a Dream Martin Luther King Jr. Speech 9.4
1964 The Ballot or the Bullet Malcolm X Speech 7.8
1966 In Cold Blood Truman Capote Literature 7.9
1973 New International Bible   Literature 13.5
1973 Gravity’s Rainbow Thomas Pynchon Literature 9.5
1991 Statement to the Senate Judiciary Committee on Clarence Thomas Anita Faye Hill Speech 8.9
1993 New York Times   Newspaper 14*
1993 LA Times   Newspaper 14*
1993 Washington Post   Newspaper 14*
1993 Associated Press   Newspaper 13*
1993 Wall Street Journal   Newspaper 11*
1993 Newsweek   Newspaper 11*
2004 Commencement at the U of Penn Bono Speech 5.9
2006 The (Sorry) State We Are In Brad Borevitz Essay 13.7


* Average Score

(Note: There is a rumor that USA Today is written at an 8th-grade level, but I have found no documentation of that, and spot checks seem to indicate that it is probably in the same range as other papers, around 12.)

(Sources: for Literature, Amazon.com; for Newspapers, Jack Hart, Editor & Publisher, November 6, 1993 (v126 i45 p.5), quoted at http://answers.google.com/answers/threadview?id=301734; for others, original analysis.)

Statistical Methods Used in this Project

The methods used in this project all rely on frequency counts of words. The size of a displayed word is determined simply by the frequency of its occurrence in the document. Its x (horizontal) position within the interface is determined by the average position of the word in the document. Its y (vertical) position is determined by the word's Relative Frequency Score, S.
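A minimal sketch of the two simpler quantities, frequency and average position, might look like this in Python. The function name is hypothetical, the input is assumed to be a list of already-tokenized words, and scaling the average position to the range 0 to 1 is an assumption made for illustration rather than a detail of the interface code.

```python
from collections import defaultdict


def word_layout_stats(tokens):
    """Return {word: (count, average_position)} for a list of tokens.

    average_position is the mean token index of the word, scaled so that
    0.0 is the start of the document and 1.0 is the end (an assumption).
    """
    indices = defaultdict(list)
    for i, token in enumerate(tokens):
        indices[token].append(i)

    last_index = max(len(tokens) - 1, 1)  # avoid dividing by zero for 1-token input
    return {
        word: (len(idx), (sum(idx) / len(idx)) / last_index)
        for word, idx in indices.items()
    }
```

Under this sketch, a word that occurs evenly throughout an address lands near the middle of the horizontal axis, while a word used only in the closing paragraphs lands near the right edge.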

The S score is an attempt to quantify significance as a combination of frequency and the uniqueness of a word's use within an individual document compared to its use in the corpus as a whole. S is the average of two statistics: the Log Likelihood Statistic (LLS) and the Term Frequency–Inverse Document Frequency (TF-IDF). Each of these scores has been normalized so that the minimum and maximum values of each component over the corpus are comparable (between 0 and 100).

The formula for the LLS is:

LLS = 2 * ((freqInCorp * log(freqInCorp / E1)) + (freqInDoc * log(freqInDoc / E2)))

where

E1 = wordsInCorp * (freqInCorp + freqInDoc) / (wordsInCorp + wordsInDoc)
E2 = wordsInDoc * (freqInCorp + freqInDoc) / (wordsInCorp + wordsInDoc)

To even out the scale of the LLS, the log of LLS+1 (L1LLS) is used:

L1LLS = log(LLS+1)
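As a rough illustration, the LLS and L1LLS formulas above can be sketched in Python. The function and argument names are hypothetical renderings of the variables in the formulas. The use of the natural logarithm, and the treatment of a zero count as contributing zero, are assumptions rather than details confirmed by the project's source code.

```python
import math


def log_likelihood(freq_in_corp, words_in_corp, freq_in_doc, words_in_doc):
    """Log Likelihood Statistic for one word, following the formula above."""
    total_freq = freq_in_corp + freq_in_doc
    total_words = words_in_corp + words_in_doc

    # Expected counts if the word were spread evenly across corpus and document.
    e1 = words_in_corp * total_freq / total_words
    e2 = words_in_doc * total_freq / total_words

    def term(observed, expected):
        # Assumption: an observed count of 0 contributes nothing (0 * log 0 = 0).
        return observed * math.log(observed / expected) if observed > 0 else 0.0

    return 2 * (term(freq_in_corp, e1) + term(freq_in_doc, e2))


def l1lls(lls):
    """Compress the scale of the LLS: log(LLS + 1)."""
    return math.log1p(lls)
```

A word that occurs at the same rate in the document as in the corpus scores 0; the score grows as the two rates diverge.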

The formula for TF:

TF = freqInDoc/wordsInDoc

The formula for IDF:

IDF = log( docsInCorpus / docsWhereWordOccurs)

The formula for TF-IDF:

TF-IDF = TF * IDF
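The three TF-IDF formulas above translate directly into Python. This is only a sketch: the function and argument names are hypothetical, and the natural logarithm is an assumption, since the base of the log is not stated.

```python
import math


def tf(freq_in_doc, words_in_doc):
    """Term frequency: the share of the document's words that are this word."""
    return freq_in_doc / words_in_doc


def idf(docs_in_corpus, docs_where_word_occurs):
    """Inverse document frequency: higher for words found in fewer documents."""
    return math.log(docs_in_corpus / docs_where_word_occurs)


def tf_idf(freq_in_doc, words_in_doc, docs_in_corpus, docs_where_word_occurs):
    """Product of term frequency and inverse document frequency."""
    return tf(freq_in_doc, words_in_doc) * idf(docs_in_corpus, docs_where_word_occurs)
```

A word that appears in every document has an IDF of log(1) = 0, so its TF-IDF is 0 however often it is used.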

The formula for S is as follows:

S = (L1LLS + TF-IDF) / 2
(values are normalized before being averaged)
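One way to read "normalized before being averaged" is as min-max scaling of each component to the range 0 to 100. The sketch below assumes that reading; the function names are hypothetical, and the project's actual normalization may differ.

```python
def normalize(values, low=0.0, high=100.0):
    """Min-max scale a list of numbers to the range [low, high] (an assumption)."""
    smallest, largest = min(values), max(values)
    if largest == smallest:
        # All values identical: map everything to the bottom of the range.
        return [low for _ in values]
    scale = (high - low) / (largest - smallest)
    return [low + (v - smallest) * scale for v in values]


def s_scores(l1lls_values, tfidf_values):
    """Average the normalized L1LLS and TF-IDF values, word by word."""
    norm_l1lls = normalize(l1lls_values)
    norm_tfidf = normalize(tfidf_values)
    return [(a + b) / 2 for a, b in zip(norm_l1lls, norm_tfidf)]
```

Both input lists are assumed to cover the same words in the same order, and to span the whole corpus so that the minimum and maximum are corpus-wide.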

The most commonly used words are filtered out.

For each document, the words with the top 40 S scores are selected. Depending on the length of the address, words with too few occurrences are also filtered out.
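The selection step could be sketched as follows. The function name, the stopword set and the min_count threshold are all hypothetical: the text does not say how the threshold depends on the length of the address, nor whether filtering happens before or after the top 40 are chosen. This sketch filters first.

```python
def select_top_words(scores, counts, stopwords, top_n=40, min_count=1):
    """Return up to top_n words ranked by S score, highest first.

    scores:    {word: S score} for one document
    counts:    {word: number of occurrences in that document}
    stopwords: set of very common words to exclude
    min_count: hypothetical minimum number of occurrences
    """
    candidates = [
        word
        for word in scores
        if word not in stopwords and counts.get(word, 0) >= min_count
    ]
    candidates.sort(key=lambda word: scores[word], reverse=True)
    return candidates[:top_n]
```

Filtering before ranking means a document always yields as many words as it can, up to 40; filtering afterwards would leave some documents with fewer.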

Source Code

Interface source code (Processing.js):
SotuDisplayJS.pde and jsStuff.js

SotuGraph source code (Processing.js):
SotuGraphJS.pde

Data:
Word frequency data by year for all words, in a tab-delimited plain text file (normalized as frequency per 10,000 words, i.e. 10000*count/length, rounded down to an integer): history.txt.gz
Analyzed word data for top words by document, in JSON format: documentsData.json.gz
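For reference, the per-10,000-word normalization described for history.txt.gz amounts to a single integer division. The function name below is hypothetical.

```python
def per_ten_thousand(count, length):
    """Frequency per 10,000 words, rounded down to an integer.

    count:  occurrences of the word in the address
    length: total number of words in the address
    """
    return (10000 * count) // length
```

Rounding down means that a word used once in an address longer than 10,000 words is recorded with a frequency of 0.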

Text of addresses:
stateoftheunion1790-2014.txt.zip