State of the Union
Political Prose Over Time

Appendices

The State of the Union Historically

The following chart shows the timing of the messages and the distribution of oral and written delivery. It also indicates the increasing publicity of the address starting in the 20th century.

Delivery:
Oral in bold
Written in normal weight
* Radio after written (marked with asterisk)

Increase in publicity:
No broadcast
Radio: 1923
Television: 1947
Evening: 1965
Web: 2002
| President | Term | Messages by year of term (1st, 2nd, 3rd, 4th, End) |
|---|---|---|
| George Washington | 1789-1793 | 1790, 1790, 1791, 1792 |
| | 1793-1797 | 1793, 1794, 1795, 1796 |
| John Adams | 1797-1801 | 1797, 1798, 1799, 1800 |
| Thomas Jefferson | 1801-1805 | 1801, 1802, 1803, 1804 |
| | 1805-1809 | 1805, 1806, 1807, 1808 |
| James Madison | 1809-1813 | 1809, 1810, 1811, 1812 |
| | 1813-1817 | 1813, 1814, 1815, 1816 |
| James Monroe | 1817-1821 | 1817, 1818, 1819, 1820 |
| | 1821-1825 | 1821, 1822, 1823, 1824 |
| John Quincy Adams | 1825-1829 | 1825, 1826, 1827, 1828 |
| Andrew Jackson | 1829-1833 | 1829, 1830, 1831, 1832 |
| | 1833-1837 | 1833, 1834, 1835, 1836 |
| Martin Van Buren | 1837-1841 | 1837, 1838, 1839, 1840 |
| William Henry Harrison | 1841 | |
| John Tyler | 1841-1845 | 1841, 1842, 1843, 1844 |
| James K. Polk | 1845-1849 | 1845, 1846, 1847, 1848 |
| Zachary Taylor | 1849-1850 | 1849 |
| Millard Fillmore | 1850-1853 | 1850, 1851, 1852 |
| Franklin Pierce | 1853-1857 | 1853, 1854, 1855, 1856 |
| James Buchanan | 1857-1861 | 1857, 1858, 1859, 1860 |
| Abraham Lincoln | 1861-1865 | 1861, 1862, 1863, 1864 |
| Andrew Johnson | 1865-1869 | 1865, 1866, 1867, 1868 |
| Ulysses S. Grant | 1869-1873 | 1869, 1870, 1871, 1872 |
| | 1873-1877 | 1873, 1874, 1875, 1876 |
| Rutherford B. Hayes | 1877-1881 | 1877, 1878, 1879, 1880 |
| James A. Garfield | 1881 | |
| Chester A. Arthur | 1881-1885 | 1881, 1882, 1883, 1884 |
| Grover Cleveland | 1885-1889 | 1885, 1886, 1887, 1888 |
| Benjamin Harrison | 1889-1893 | 1889, 1890, 1891, 1892 |
| Grover Cleveland | 1893-1897 | 1893, 1894, 1895, 1896 |
| William McKinley | 1897-1901 | 1897, 1898, 1899, 1900 |
| Theodore Roosevelt | 1901-1905 | 1901, 1902, 1903, 1904 |
| | 1905-1909 | 1905, 1906, 1907, 1908 |
| William Howard Taft | 1909-1913 | 1909, 1910, 1911, 1912 |
| Woodrow Wilson | 1913-1917 | 1913, 1914, 1915, 1916 |
| | 1917-1921 | 1917, 1918, 1919, 1920 |
| Warren G. Harding | 1921-1923 | 1921, 1922 |
| Calvin Coolidge | 1923-1925 | 1923, 1924 |
| | 1925-1929 | 1925, 1926, 1927, 1928 |
| Herbert Hoover | 1929-1933 | 1929, 1930, 1931, 1932 |
| Franklin D. Roosevelt | 1933-1937 | 1934, 1935, 1936 |
| | 1937-1941 | 1937, 1938, 1939, 1940 |
| | 1941-1945 | 1941, 1942, 1943, 1944 |
| | 1945 | 1945* |
| Harry S Truman | 1945-1949 | 1946, 1947, 1948 |
| | 1949-1953 | 1949, 1950, 1951, 1952, 1953 |
| Dwight D. Eisenhower | 1953-1957 | 1953, 1954, 1955, 1956* |
| | 1957-1961 | 1957, 1958, 1959, 1960, 1961 |
| John F. Kennedy | 1961-1963 | 1961, 1962, 1963 |
| Lyndon B. Johnson | 1964-1965 | 1964 |
| | 1965-1969 | 1965, 1966, 1967, 1968, 1969 |
| Richard M. Nixon | 1969-1973 | 1970, 1971, 1972 |
| | 1973-1974 | 1973, 1974 |
| Gerald R. Ford | 1974-1977 | 1975, 1976, 1977 |
| Jimmy Carter | 1977-1981 | 1978, 1979, 1980, 1981 |
| Ronald Reagan | 1981-1985 | 1982, 1983, 1984 |
| | 1985-1989 | 1985, 1986, 1987, 1988 |
| George Bush | 1989-1993 | 1990, 1991, 1992 |
| William J. Clinton | 1993-1997 | 1994, 1995, 1996 |
| | 1997-2001 | 1997, 1998, 1999, 2000 |
| George W. Bush | 2001-2005 | 2002, 2003, 2004 |
| | 2005-2009 | 2005, 2006, 2007, 2008 |
| Barack Obama | 2009-2013 | 2009, 2010, 2011, 2012 |
| | 2013-2017 | 2013, 2014, 2015, 2016 |
| Donald J. Trump | 2017-2021 | 2017, 2018, 2019, 2020 |
| Joseph R. Biden | 2021-2025 | 2021, 2022, 2023, 2024 |
| Donald J. Trump | 2025-2029 | 2025 |

(Source: https://www.presidency.ucsb.edu/documents/presidential-documents-archive-guidebook/annual-messages-congress-the-state-the-union)

Flesch-Kincaid Scores Comparatively

The Flesch-Kincaid score is commonly used for determining the age-appropriateness of reading material. It was originally developed in the 1940s by Rudolf Flesch, who wrote Why Johnny Can't Read. The current version of the formula was developed together with J.P. Kincaid for the Navy in the 1970s:

(0.39 * average_words_per_sentence)
+ (11.8 * average_syllables_per_word) - 15.59

The formula is a United States Department of Defense standard (DOD MIL-M-38784B). A score indicates that the text would be at the limit of comprehension for a person with that number of years of education; in a comprehension test, that person would answer 50 percent of the questions correctly. The U.S. population is estimated to have an average reading ability at the eighth-grade level.
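As a rough illustration of the formula above, the following Python sketch computes a grade level for a short text. The tokenizer and the vowel-group syllable counter are assumptions made for this example, not the method used in this project, and the names `count_syllables` and `flesch_kincaid_grade` are hypothetical. Heuristic syllable counting of this kind is one reason scores can differ between implementations.

```python
import re


def count_syllables(word):
    """Rough heuristic (an assumption): count vowel groups, drop a silent final 'e'."""
    word = word.lower()
    count = len(re.findall(r"[aeiouy]+", word))
    if word.endswith("e") and not word.endswith("le") and count > 1:
        count -= 1
    return max(count, 1)


def flesch_kincaid_grade(text):
    """Apply the grade-level formula to a text using naive sentence and word splitting."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    if not sentences or not words:
        raise ValueError("text must contain at least one sentence and one word")
    average_words_per_sentence = len(words) / len(sentences)
    average_syllables_per_word = sum(count_syllables(w) for w in words) / len(words)
    return (0.39 * average_words_per_sentence) + (11.8 * average_syllables_per_word) - 15.59
```

Very short, simple texts can produce scores below zero, since the formula is a linear fit rather than a count of school years.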

The use of this metric is controversial. It does not account for sentence structure, vocabulary, style, or context. This implementation is also prone to inaccuracy in determining the number of syllables in a word (possibly as much as ± 10%). The metric is probably best at comparing similar types of text, and there are several newer, potentially more accurate measures of readability. Its accuracy and applicability over 200 years are doubtful given changes in both language and education. The use of Flesch-Kincaid here therefore requires some explanation.

Flesch-Kincaid remains one of the best-known and most frequently used metrics of readability. Its correlation with grade levels gives a simple and accessible sense of the metric's meaning. It is a convenient quantitative marker of style that has broad current acceptance. In examining the historical corpus of the State of the Union, a set of documents with exceptional continuity over time, it provides a measure of the gradual changes in language use.

One interesting finding of the project is that the trend in the score is so consistent and seemingly uncorrelated with particular presidents. From this study it is not possible to determine whether this reflects a change in language generally or one specific to political language.

The following chart shows Flesch-Kincaid Scores of a variety of texts:

| Date | Title | Author | Type | Score |
|---|---|---|---|---|
| 1611 | King James Bible | | Literature | 11.0 |
| 1775 | Give Me Liberty or Give Me Death | Patrick Henry | Speech | 7.0 |
| 1776 | Declaration of Independence | | Document | 15.1 |
| 1787 | US Constitution | | Document | 17.8 |
| 1788 | The Federalist Papers | Alexander Hamilton, et al. | Essays | 17.1 |
| 1850 | Scarlet Letter | Nathaniel Hawthorne | Literature | 11.0 |
| 1863 | Gettysburg Address | Abraham Lincoln | Speech | 11.17 |
| 1865 | Alice's Adventures in Wonderland | Charles Dodgson | Literature | 6.3 |
| 1906 | What It Means to be Colored | Mary Church Terrell | Speech | 15.0 |
| 1906 | Taxes and Morals | Mark Twain | Speech | 9.1 |
| 1914 | Tarzan of the Apes | Edgar Rice Burroughs | Literature | 9.4 |
| 1921 | The Morality of Birth Control | Margaret Sanger | Speech | 11.5 |
| 1922 | Ulysses | James Joyce | Literature | 6.8 |
| 1929 | A Room of One's Own | Virginia Woolf | Literature | 11.8 |
| 1932 | Brave New World | Aldous Huxley | Literature | 7.4 |
| 1950 | Declaration of Conscience | Margaret Chase Smith | | 13.7 |
| 1954 | Lord of the Flies | William Golding | Literature | 4.8 |
| 1960 | To Kill a Mockingbird | Harper Lee | Literature | 6.0 |
| 1963 | I Have a Dream | Martin Luther King Jr. | Speech | 9.4 |
| 1964 | The Ballot or the Bullet | Malcolm X | Speech | 7.8 |
| 1966 | In Cold Blood | Truman Capote | Literature | 7.9 |
| 1973 | New International Bible | | Literature | 13.5 |
| 1973 | Gravity's Rainbow | Thomas Pynchon | Literature | 9.5 |
| 1991 | Statement to the Senate Judiciary Committee on Clarence Thomas | Anita Faye Hill | Speech | 8.9 |
| 1993 | New York Times | | Newspaper | 14* |
| 1993 | LA Times | | Newspaper | 14* |
| 1993 | Washington Post | | Newspaper | 14* |
| 1993 | Associated Press | | Newspaper | 13* |
| 1993 | Wall Street Journal | | Newspaper | 11* |
| 1993 | Newsweek | | Newspaper | 11* |
| 2004 | Commencement at the U of Penn | Bono | Speech | 5.9 |
| 2006 | The (Sorry) State We Are In | Brad Borevitz | Essay | 13.7 |

* Average Score

(Note: There is a rumor that USA Today is written at an 8th-grade level, but I have found no documentation of that, and spot checks seem to indicate that it is probably in the same range as other papers, around 12.)

(Sources: for literature, Amazon.com; for newspapers, Jack Hart, Editor & Publisher, November 6, 1993 (v126 i45 p.5), quoted at http://answers.google.com/answers/threadview?id=301734; for others, original analysis.)

Statistical Methods Used in this Project

The methods used in this project all rely on frequency counts of words. The displayed size of a word is determined simply by the frequency of its occurrence in the document. Its x (horizontal) position within the interface is determined by the average position of the word in the document. Its y (vertical) position is determined by the word's Relative Frequency Score, S.
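The two simpler quantities, frequency and average position, might be computed along the following lines. This is a minimal sketch: the regex tokenizer and the scaling of positions to the range 0 to 1 are assumptions made for illustration, and `word_stats` is a hypothetical name, not part of the project's code.

```python
import re
from collections import defaultdict


def word_stats(text):
    """Map each word to (frequency, average relative position in the document)."""
    tokens = re.findall(r"[a-z']+", text.lower())
    positions = defaultdict(list)
    last_index = max(len(tokens) - 1, 1)  # avoid dividing by zero for one-word texts
    for index, token in enumerate(tokens):
        # Assumed scaling: 0.0 is the first token, 1.0 is the last.
        positions[token].append(index / last_index)
    return {word: (len(p), sum(p) / len(p)) for word, p in positions.items()}
```

For example, `word_stats("war peace war")` gives "war" a frequency of 2 and an average position of 0.5, the midpoint of the document, even though the word only appears at the two ends.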

The S score is an attempt to quantify significance as a combination of frequency and the uniqueness of the word's use within an individual document as compared to its use in the corpus as a whole. S is the average of two statistics: the Log Likelihood Statistic (LLS) and the Term Frequency–Inverse Document Frequency (TF-IDF). Each of these scores is normalized so that the minimum and maximum values of each component over the corpus are comparable (between 0 and 100).

The formula for the LLS is:

LLS = 2 * ((freqInCorp * log(freqInCorp / E1)) + (freqInDoc * log(freqInDoc / E2)))

where

E1 = wordsInCorp * (freqInCorp + freqInDoc) / (wordsInCorp + wordsInDoc)
E2 = wordsInDoc * (freqInCorp + freqInDoc) / (wordsInCorp + wordsInDoc)

To even out the scale of the LLS, the log of LLS + 1 (L1LLS) is used:

L1LLS = log(LLS+1)
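The LLS and L1LLS formulas above can be sketched in Python as follows. Two points are assumptions rather than something the text specifies: the logarithm is taken to be the natural log, and a zero frequency is treated as contributing zero to the sum (the usual convention for 0 * log 0). The function names are hypothetical.

```python
import math


def log_likelihood(freq_in_corp, freq_in_doc, words_in_corp, words_in_doc):
    """Log Likelihood Statistic for one word, following the formulas above."""
    total_freq = freq_in_corp + freq_in_doc
    total_words = words_in_corp + words_in_doc
    e1 = words_in_corp * total_freq / total_words  # expected count in the corpus
    e2 = words_in_doc * total_freq / total_words   # expected count in the document

    def term(observed, expected):
        # Assumed convention: a word that never occurs contributes nothing.
        return observed * math.log(observed / expected) if observed > 0 else 0.0

    return 2 * (term(freq_in_corp, e1) + term(freq_in_doc, e2))


def l1lls(lls):
    """Compress the scale of the LLS by taking log(LLS + 1)."""
    return math.log(lls + 1)
```

A word used at the same rate in the document and in the corpus scores 0, and the score grows as the two rates diverge, so `log_likelihood(10, 1, 1000, 100)` is 0 while `log_likelihood(10, 10, 1000, 100)` is clearly positive.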

The formula for term frequency (TF) is:

TF = freqInDoc/wordsInDoc

The formula for inverse document frequency (IDF) is:

IDF = log( docsInCorpus / docsWhereWordOccurs)

The formula for TF-IDF is:

TF-IDF = TF * IDF
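The three formulas above combine into a single small function. This sketch again assumes the natural log, which the text does not specify, and `tf_idf` is a hypothetical name.

```python
import math


def tf_idf(freq_in_doc, words_in_doc, docs_in_corpus, docs_where_word_occurs):
    """TF-IDF for one word in one document, following the formulas above."""
    tf = freq_in_doc / words_in_doc
    idf = math.log(docs_in_corpus / docs_where_word_occurs)
    return tf * idf
```

A word that occurs in every document of the corpus has an IDF of log(1) = 0, so its TF-IDF is 0 however often it appears in a single address.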

The formula for S is as follows:

S = (L1LLS + TF-IDF) / 2
(values are normalized before being averaged)
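One way to read the normalization step is as min-max scaling of each component to the range 0 to 100 before averaging. The sketch below assumes that reading; the text does not say which normalization was used, and `normalize` and `s_scores` are hypothetical names.

```python
def normalize(values, low=0.0, high=100.0):
    """Min-max scale a list of values to the range [low, high] (assumed method)."""
    smallest, largest = min(values), max(values)
    if largest == smallest:
        return [low for _ in values]  # no spread, so every value maps to the floor
    scale = (high - low) / (largest - smallest)
    return [low + (value - smallest) * scale for value in values]


def s_scores(l1lls_values, tfidf_values):
    """Average the normalized L1LLS and TF-IDF values, word by word."""
    normalized_lls = normalize(l1lls_values)
    normalized_tfidf = normalize(tfidf_values)
    return [(a + b) / 2 for a, b in zip(normalized_lls, normalized_tfidf)]
```

Both lists are expected to hold one value per word in the same order, taken over the whole corpus so that scores from different documents share a scale.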

The most commonly used words are filtered out.

For each document, the words with the top 40 S scores are selected. Depending on the length of the address, words with too few occurrences are also filtered out.
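The filtering and selection steps might look like the sketch below. Several details are assumptions: the stop-word set is supplied by the caller, the minimum-occurrence threshold `min_count` is a fixed hypothetical parameter (the text says only that it depends on the length of the address), and filtering is applied before ranking. `select_words` is a hypothetical name.

```python
def select_words(scores, counts, stop_words, top_n=40, min_count=2):
    """Return up to top_n words ranked by S score, after filtering.

    scores: word -> S score for one document
    counts: word -> number of occurrences in that document
    stop_words: set of very common words to exclude
    """
    candidates = [
        word for word in scores
        if word not in stop_words and counts.get(word, 0) >= min_count
    ]
    candidates.sort(key=lambda word: scores[word], reverse=True)
    return candidates[:top_n]
```

With scores of 99 for "the", 90 for "union", 80 for "peace" (one occurrence), and 70 for "tariff", a stop list containing "the", and `top_n=2`, the result is "union" followed by "tariff".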

Data