I'M GOING TO take a look at the data that The New York Times makes available through its developer API. This metadata stretches back to the paper's beginning in 1851, and a single article's metadata looks something like this:
{
  "abstract": "LONDON, May 1.--Roald Amundsen, the Norwegian arctic explorer, has not abandoned his expedition, but after his impending visit to Nome, Alaska, he intends to enter the ice pack around Wrangel Island, off the northern coast of Eastern Siberia, and thence drift across the Polar Sea, says a dispatch to The London Times from Christiania. ",
  "web_url": "https://www.nytimes.com/1920/05/01/archives/amundsen-to-try-again-for-pole-after-visit-to-nome-the-explorer.html",
  "print_page": "14",
  "source": "The New York Times",
  "headline": {
    "main": "AMUNDSEN TO TRY AGAIN FOR POLE; After Visit to Nome, the Explorer Hopes to Drift in Iceas He Had Planned.WILL RETURN TO SIBERIA Christiania Report Gives Facts ofFirst \"Battle\" to Force HisWay In Frozen Sea."
  },
  "pub_date": "1920-05-01T05:00:00+0000",
  "uri": "nyt://article/7d4ee23a-5ac0-5194-958a-c02472911a6f"
}
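Records like the one above can be pulled month by month from the Archive API. Below is a minimal sketch, assuming the `/svc/archive/v1/{year}/{month}.json` endpoint format and a `"response" → "docs"` payload shape; `"YOUR_API_KEY"` is a placeholder.

```python
# Minimal sketch of fetching one month of metadata from the NYT Archive API.
# Assumptions: the /svc/archive/v1/{year}/{month}.json endpoint format and a
# "response" -> "docs" payload shape; "YOUR_API_KEY" is a placeholder.
import json
from urllib.request import urlopen

def archive_url(year: int, month: int, api_key: str) -> str:
    """Build the Archive API URL for a given year and month (month is 1-12)."""
    return (
        f"https://api.nytimes.com/svc/archive/v1/{year}/{month}.json"
        f"?api-key={api_key}"
    )

def fetch_month(year: int, month: int, api_key: str) -> list:
    """Return the list of article-metadata dicts for one month."""
    with urlopen(archive_url(year, month, api_key)) as resp:
        return json.load(resp)["response"]["docs"]

if __name__ == "__main__":
    # e.g. the Amundsen item above appeared in May 1920:
    docs = fetch_month(1920, 5, "YOUR_API_KEY")
    print(len(docs))
```

Iterating this over every month from 1851 onwards yields the full corpus of metadata used below.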
This metadata only includes a short snippet from each news item, not the full text. There is also some information about which part of the paper the piece appeared in, whether it was accompanied by any images, who the author was, etc. As you go further back through the publication's history, there is generally less and less metadata available for each item. The chart below shows the total number of items that appeared in the paper each year. This tally is not limited to news articles; it also includes things like obituary notices, classifieds, and job ads. In 2020, The New York Times published fewer items than in any year since 1896.
A "news desk" is a section of the newspaper responsible for covering certain topics. Unfortunately, the metadata field that records which news desk an item belongs to is not populated prior to ca. 1984. The chart below shows the composition of news items since 1984 across some of the main categories of news. Of note is the radically diminished number of Finance and Business pieces.
For as long as elections have been held, people have tried to predict their outcomes, and newspapers have reported on those predictions. The process of prognostication has been professionalised over time. Below is an early example of such "horse race" coverage:
{
  "abstract": "We translate from the Diario, the following speculations upon the results of the Presidential election and the policy of the Democratic party. From the Diario de la Marina, Nov. 11. General FRANKLIN is now President of the United States. His success we did not hesitate to prophecy from the very first, and when General SCOTT was designated as his rival, regarded it as a settled affair ; and strong in this conviction, ",
  "web_url": "https://www.nytimes.com/1852/11/23/archives/cuba-sentiments-of-the-havana-pressthe-presidential-election.html",
  "print_page": "1",
  "source": "The New York Times",
  "headline": {
    "main": "CUBA.; Sentiments of the Havana Press--The Presidential Election."
  },
  "pub_date": "1852-11-23T05:00:00+0000",
  "uri": "nyt://article/39b8297f-5646-5045-9575-2899de76d36a"
}
We started out with vibe checks like this, then moved towards quantifying the public's sentiment. In 1936 George Gallup demonstrated the importance of rigorous statistical methodology by beating the Literary Digest's survey (n > 2,000,000) with a much smaller poll (n ≈ 50,000) to correctly predict the outcome of the presidential election. The Roper iPoll project shows that in 2020 there were over 400 political opinion polls conducted in the US by over 100 different organisations.
I am using this fine-tuned DeBERTa variant for topic classification. This model was trained on data like this:
{
  "premise": "I first thought that I liked the movie, but upon second thought it was actually disappointing.",
  "hypothesis": "The movie was not good.",
  "entailed": true
}
The model learns to predict whether the hypothesis text is "entailed" by the premise text. In this case I am essentially providing some text from each newspaper item (the title and a snippet from the body of the article) as a premise and then asking the model whether the hypothesis "This article covers the topic of political opinion polls" is entailed.✱
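This setup maps directly onto the Hugging Face zero-shot-classification pipeline, which wraps exactly this premise/hypothesis entailment framing. A sketch, assuming an NLI-fine-tuned DeBERTa checkpoint (the model name below is illustrative, not necessarily the exact one I used):

```python
# Sketch of the zero-shot entailment step. The model name is illustrative;
# any NLI-fine-tuned checkpoint usable with the "zero-shot-classification"
# pipeline should behave the same way.
def build_premise(item: dict) -> str:
    """Combine the headline and the abstract snippet into the premise text."""
    headline = item.get("headline", {}).get("main", "")
    abstract = item.get("abstract", "")
    return f"{headline} {abstract}".strip()

def poll_scores(items: list, model_name: str = "MoritzLaurer/deberta-v3-base-zeroshot-v1"):
    """Entailment score per item for the political-opinion-polls hypothesis."""
    from transformers import pipeline  # imported here: heavy optional dependency
    clf = pipeline("zero-shot-classification", model=model_name)
    premises = [build_premise(it) for it in items]
    results = clf(
        premises,
        candidate_labels=["political opinion polls"],
        # the template turns the label into the full hypothesis sentence
        hypothesis_template="This article covers the topic of {}.",
    )
    return [r["scores"][0] for r in results]
```

Thresholding these scores gives a per-item yes/no label for "is about political opinion polling".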
US presidential elections have been held every four years in November since 1788 without fail (even during a major civil war). The chart below shows how the proportion of items that are about political polling varies across the months (Jan-Nov) of election years.
The chart below shows the proportion of items per year that are about political opinion polling from the first election year the paper covered through to 2020.
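The aggregation behind a chart like this is straightforward; here is a toy pandas sketch, where `is_poll_item` stands in for the classifier's thresholded output:

```python
# Toy sketch of the yearly aggregation behind the chart. `is_poll_item`
# stands in for the boolean output of the topic classifier above.
import pandas as pd

def poll_proportion_by_year(df: pd.DataFrame) -> pd.Series:
    """Fraction of items flagged as polling coverage, per publication year."""
    year = pd.to_datetime(df["pub_date"]).dt.year
    return df["is_poll_item"].groupby(year).mean()

toy = pd.DataFrame({
    "pub_date": ["1852-11-23", "1852-12-01", "1984-10-13", "1984-10-14"],
    "is_poll_item": [True, False, True, True],
})
print(poll_proportion_by_year(toy))  # 1852 -> 0.5, 1984 -> 1.0
```

The monthly breakdown for election years works the same way, grouping by `dt.month` instead of `dt.year`.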
A well designed poll is usually able to predict the outcome of an election. If you can consult multiple polls, then your ability to predict the outcome is further enhanced. That they are not perfect predictors of outcomes is not the most salient criticism of these surveys.
Public polling conducted by organisations like The New York Times or Gallup might not have a sinister agenda, but it can still be problematic. Newspapers have a limited budget for articles; there can only be so many articles in an edition and there can only be so many articles on the front page of that edition. The more horse-race style articles there are, the less space there is for articles on actual political issues. This contributes to a decline in the quality of political discussion—instead of talking about how power ought to be allocated within our society, we have conversations about which candidate is "electable". Instead of talking about what choices a politician should make for the good of people's lives, it is common to talk about what choices that politician should make to ensure the support of some particular group.
Contemporary political campaigns make extensive use of internal polling (i.e., polls whose results are reviewed by campaign strategists and not the public). Candidates can test out which policies and messages are promising through shrewd application of polling and focus group testing before they are released to the general population. If you are a believer in our elected leaders leading us, then there seems to be something topsy-turvy about this. Although the amount spent on such polls is significantly less than a presidential candidate will typically spend on advertising, they are expensive to run, and the proliferation of "poll-driven" or "poll-informed" campaigning contributes to the unfortunately high dollar-value barrier to entry in US politics.
Special interest group polling is particularly pernicious. Groups conducting these polls include super PACs aligned with a particular candidate, lobbyists for business, and grassroots environmental activists. Various techniques can be employed to concoct the right kind of data: questions can be framed in a narrow way or with leading phrasing, answers can be constrained to a pre-defined range (e.g., "Agree" or "Disagree", but not "Agree, but with important caveats"), and multiple polls can be conducted and the preferred results cherry-picked. If, try as you might, you can't get the results you would like from a poll, you always have the option to not publicise it. The article below reports on some devious pollsters hoping to exploit in-group favouritism.
{
  "abstract": " This story was broken on Wednesday by a small newspaper company that publishes two weeklies on Manhattan's West Side - The Westsider and Chelsea Clinton News.And the journalists who did it - Jan Bartelli and Jeff Kisseloff - deserve credit.Now for the story. A professional phone-bank polling firm working for an arm of the Republican National Committee has been calling Jews in New York and California to seek support for the Reagan-Bush ticket.The workers hired to make these calls were using assumed Jewish-sounding names - until Miss Bartelli infiltrated the boiler-room operation and found out.That's when the Republican National Committee expressed surprised horror, and the assumed names stopped.(What hasn't stopped is that the phoners do not tell people they telephone that the call is being made on behalf of the Reagan campaign.)",
  "web_url": "https://www.nytimes.com/1984/10/13/opinion/new-york-the-gop-goes-for-the-jews.html",
  "print_section": "1",
  "print_page": "27",
  "headline": {
    "main": "NEW YORK ; THE G.O.P. GOES FOR THE JEWS"
  },
  "pub_date": "1984-10-13T05:00:00+0000",
  "document_type": "article",
  "news_desk": "Editorial Desk",
  "section_name": "Opinion",
  "byline": {
    "original": "By Sydney H. Schanberg"
  },
  "type_of_material": "Op-Ed",
  "word_count": 783,
  "uri": "nyt://article/1c571401-e4c9-532e-a4bd-ce8f0631b2df"
}
Clearly the commissioners of these polls believe these results can help them pursue their agenda, or they wouldn't be paying for them in the first place. If you can provide quantification as part of your argument, you lend it a certain kind of credibility, regardless of how spurious your methods in gathering that data may have been. If a majority of Americans apparently support something, then maybe you should as well. ✹
✱
My classification approach is fairly inaccurate (generally it produces a lot of false positives). For example, while the article below (which was flagged as being about political opinion polling) is clearly horse-race coverage, political opinion polling is never explicitly mentioned.
{
  "abstract": "HARTFORD, Oct. 31.--The active work of the campaign has closed for the Republicans much more brilliantly than it opened. It is certain that not since 1860 has so much earnestness been exhibited in every part of the State, ... ",
  "web_url": "https://www.nytimes.com/1880/11/01/archives/connecticut-not-doubtful-a-good-majority-for-garfield-assured-the.html",
  "headline": {
    "main": "CONNECTICUT NOT DOUBTFUL.; A GOOD MAJORITY FOR GARFIELD ASSURED --THE BOURBONS FLOODING THE STATE WITH MONEY."
  },
  "pub_date": "1880-11-01T05:00:00+0000",
  "uri": "nyt://article/6e012381-240e-545e-a52a-8852f76fa2f9"
}
Manually labelling a dataset and fine-tuning the model further would improve accuracy somewhat. Discerning which articles are about political opinion polling is a surprisingly difficult task that even GPT-4 cannot perform reliably (some challenges include the historic usage of the word "poll" to refer to the actual vote itself, polls that are conducted for purposes other than making predictions about elections, and the fundamental question of what it means for an article to be "about" something). Prediction accuracy would also be improved by having access to full articles rather than just snippets.