# Basic data and scripts to accompany Hulden, News Diets


*[Warning: the scripts and files are quite messy and hardly documented to a high standard. Yes, one should. No, I didn't.]*

## Set 1: Selected Missouri newspapers 19060601-19070531

The material was downloaded from Chronicling America (all pages returned by a search in the advanced search tab, https://chroniclingamerica.loc.gov/#tab=tab_advanced_search, with only the dates set: all states, no phrases). The original download was performed in early 2017 for 1906-1910 and resulted in about 1.7 million pages. Of these, the paper uses 06/1906-06/1907. Aleksi Vesanto kindly ran these pages through the BLAST reprint detection algorithm, which he developed with Filip Ginter and others at the University of Turku (citations in the paper).

The resulting reprint clusters (in JSON) for the above dates are in **`clusters_1906-07.tar.gz`** (for ALL the papers, not just the Missouri ones).

The full material (i.e., not only the reprints) is in **`full1906-07.tar.gz`**.
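
For readers who want to look inside the archives without unpacking them, the following is a minimal sketch using only the Python standard library. It is not one of the scripts in this repository. The function names, the UTF-8 encoding, and the `.json` file suffix are assumptions made for illustration; the actual layout of the archives may differ.

```python
import json
import tarfile


def iter_archive_texts(archive_path, suffix=""):
    """Yield (member_name, text) for each regular file in a .tar.gz archive."""
    with tarfile.open(archive_path, "r:gz") as archive:
        for member in archive:
            # Skip directories, links, and files without the wanted suffix.
            if not member.isfile() or not member.name.endswith(suffix):
                continue
            handle = archive.extractfile(member)
            # Assumption: files are UTF-8; undecodable bytes are replaced.
            yield member.name, handle.read().decode("utf-8", errors="replace")


def iter_clusters(archive_path):
    """Yield (member_name, parsed_json) for members ending in .json (assumed)."""
    for name, text in iter_archive_texts(archive_path, suffix=".json"):
        yield name, json.loads(text)
```

Something like `for name, data in iter_clusters("clusters_1906-07.tar.gz"): ...` would then walk the cluster files one at a time, provided the assumptions above hold.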

The newspapers used in the research reported in the article are:

Affil | Paper title | Publ. frequency | Location
---- | ---- | ----- | -----|
GOP | Iron County Register | Weekly | Ironton, MO
GOP | The Laclede Blade | Weekly | Laclede, MO
GOP | The Holt County Sentinel | Weekly | Oregon, MO
GOP | The Taney County Republican | Weekly | Forsyth, MO
GOP | The Montgomery tribune | Weekly | Montgomery City, MO
GOP | Potosi journal | Weekly | Potosi, MO
DEM | The Butler Weekly Times | Weekly | Butler, MO
DEM | Ripley County Democrat | Weekly | Doniphan, MO
DEM | Mexico Missouri Message | Weekly | Mexico, MO
DEM | The Farmington Times | Weekly | Farmington, MO
AFRAM | The Rising Son | Weekly | Kansas City, MO
AFRAM | Sedalia Weekly Conservator | Weekly | Sedalia, MO
SOC | Scott County Kicker | Weekly | Benton, MO

The text of those papers for the time period under consideration is in **`missouri190696-190706.tgz`**.

The material was topic modeled with MALLET. For the reprint material, only clusters of texts that appeared in two or more different papers were kept; the script **`processnewsjson.py`** extracts the reprint texts and the script **`tsvcull.py`** keeps only the desired texts. Each cluster was treated as one document for the purposes of topic modeling. The full material, meanwhile, was chunked into shorter texts, splitting at each run of four consecutive capital letters (the script that does this is **`chunkstoriescaps.py`**).
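
The two preprocessing rules described above can be sketched roughly as follows. This is an illustration, not the code in `tsvcull.py` or `chunkstoriescaps.py`. The cluster layout (a mapping from cluster id to a list of dicts with an `sn` field) is hypothetical, and starting a new chunk at the beginning of each run of capitals is only one possible reading of the rule.

```python
import re

# A chunk boundary sits just before a run of four or more capital letters.
# The lookbehind keeps a long run such as "HEADLINE" from being split twice.
BOUNDARY = re.compile(r"(?<![A-Z])(?=[A-Z]{4})")


def chunk_by_caps(text):
    """Split text into chunks that each begin at a run of capital letters."""
    return [chunk.strip() for chunk in BOUNDARY.split(text) if chunk.strip()]


def clusters_in_multiple_papers(clusters, min_papers=2, paper_key="sn"):
    """Keep clusters whose texts come from at least `min_papers` papers.

    Assumption: `clusters` maps a cluster id to a list of dicts, and each
    dict names its newspaper under `paper_key` (a hypothetical field name).
    """
    kept = {}
    for cluster_id, hits in clusters.items():
        papers = {hit[paper_key] for hit in hits}
        if len(papers) >= min_papers:
            kept[cluster_id] = hits
    return kept
```

Note that with this reading, a headline of several capitalized words produces several very short chunks; a real pipeline might merge or discard those.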

After topic modeling with MALLET, the script **`topiccomp-averages-by-sn.py`** aggregates the results in the composition file by newspaper SN (the newspaper identifier), making it possible to compare content across newspaper titles. The various calculations were made in Excel; the Excel file is **`topicssummedbysn-mo70-5-avgs.xlsx`**.
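
As a rough illustration of this aggregation step (not the actual `topiccomp-averages-by-sn.py`), the sketch below averages topic proportions per newspaper. It assumes the dense composition format written by recent MALLET versions, that is, tab-separated lines of document index, document name, and one proportion per topic. It also assumes that each document's file name starts with the SN followed by an underscore. Both the helper names and that naming rule are hypothetical.

```python
import csv
import os


def sn_from_doc_name(doc_name):
    """Hypothetical naming rule: file names look like '<SN>_<anything>'."""
    return os.path.basename(doc_name).split("_")[0]


def average_topics_by_sn(composition_path, sn_of=sn_from_doc_name):
    """Return {sn: [mean proportion per topic]} from a composition file."""
    sums, counts = {}, {}
    with open(composition_path, encoding="utf-8", newline="") as handle:
        reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
        for row in reader:
            # Skip blank lines and a possible '#'-prefixed header line.
            if not row or row[0].startswith("#"):
                continue
            sn = sn_of(row[1])
            weights = [float(value) for value in row[2:] if value]
            totals = sums.setdefault(sn, [0.0] * len(weights))
            for topic, weight in enumerate(weights):
                totals[topic] += weight
            counts[sn] = counts.get(sn, 0) + 1
    # Unweighted mean over documents: every chunk counts equally.
    return {sn: [t / counts[sn] for t in totals] for sn, totals in sums.items()}
```

Averaging over documents gives each chunk the same weight regardless of its length; weighting by token count would be an alternative design.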

## Set 2: Labor and mainstream newspapers 01/1909-12/1911

The material was downloaded from Chronicling America (all pages matching a search with the above dates, no search terms, and the selected papers; search string URL below). The download was made in 2018 and resulted in 25,897 pages. These are in **`laboretal_by_state_ocrtxts.tgz`**.

The papers in the set are:

Affil | Paper title | Publ. frequency | Location
---- | ---- | ----- | -----|
DEM | Valentine Democrat | Weekly | Valentine, NE
DEM | Little Falls Herald | Weekly | Little Falls, MN
GOP | Fairmont West Virginian | Daily | Fairmont, WV
GOP | Omaha Daily Bee | Daily | Omaha, NE
GOP | Clarksburg Telegram | Weekly | Clarksburg, WV
GOP | Bemidji Daily Pioneer | Daily | Bemidji, MN
GOP | McCook Tribune | Triweekly | McCook, NE
GOP | Colfax Gazette | Weekly | Colfax, WA
LABOR | Labor Journal | Weekly | Everett, WA
LABOR | Labor World | Biweekly | Duluth, MN
LABOR | Labor Argus | Weekly | Charleston, WV
LABOR | Wageworker | Weekly | Lincoln, NE
UNAFF | Princeton Union | Weekly | Princeton, MN
UNAFF | Leavenworth Echo | Weekly | Leavenworth, WA
UNAFF | Ellensburg Dawn | Weekly | Ellensburg, WA
UNAFF | Lynden Tribune | Weekly | Lynden, WA

**Topic modeling**: For topic modeling, this material was chunked as above (with the same script) by four consecutive capital letters, resulting in 376,704 text snippets. Again as above, after topic modeling with MALLET, the same topic summing script aggregates the results in the composition file by newspaper SN (the newspaper identifier), making it possible to compare content across newspaper titles. The various calculations were made in Excel; the Excel file is **`topicssummedbysn-laboretal-150.xlsx`**.

**Word embeddings**: The word embeddings analysis was performed on the same data (the pages); the scripts that do this are **`embeddings.py`**, **`embeddings_analyze.py`**, and **`embeddings_analyzecontext.py`**. The latter two need the helper script **`embeddingtools.py`** to run.