The accompanying data and code are a companion to "Synthesis and Large-Scale Textual Corpora: 
A Nested Topic Model of Britain’s Debates over Landed Property in the Nineteenth Century" (Guldi and Williams, 2018).

This code is used to hierarchically nest 4 LDA topic models and produce sunburst diagrams of the hierarchies as Shiny apps. 
This zip file contains the necessary data and code to reproduce the results given in the above article.

SOFTWARE:

The LDA topic models were constructed with MALLET (http://mallet.cs.umass.edu/topics.php). 
The remainder of the analysis uses the R programming language (https://www.r-project.org/). 
The authors used the IDE RStudio (https://www.rstudio.com/products/rstudio/) and highly recommend it.

When using RStudio, users may begin a new project and put all of the code and data in the project directory. The user then does not need to set the working directory, because R treats the project directory as the working directory. File names only need to be quoted when used in the R code, just as they are in the attached scripts.

DATA: 

4 LDA topic models of nineteenth-century debates of Great Britain’s House of Commons and House of Lords, colloquially known as Hansard 
(https://api.parliament.uk/historic-hansard/index.html):
	The raw MALLET file of the corpus is in the zip file and is called "stemmed_20170517_3_2_m.mallet"
	A topic model with 4 topics is in the zip file and is called "topic_state_all_hans3_2_stops_4.gz"
	A topic model with 20 topics is in the zip file and is called "topic_state_all_hans3_2_stops_20.gz"
	A topic model with 4 topics is in the zip file in 2 files and is called "topic_state_all_hans3_2_stops_4.gz" 
	A topic model with 500 topics is in the zip file in 2 files and is called "topic_state_all_hans3_2_stops_500.gz"

The hierarchy as determined by scholars is in the file “500_topics_first_try.csv”. This dataset has 2 columns. The first column gives the first three levels of a path through the hierarchy, leading down toward the topics at the lowest level: the name of a root topic, followed by the name of a bough topic, followed by the name of a branch topic. The second column is the number of leaf topics that fall under that path.
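As a rough illustration of that layout, the R sketch below builds a small stand-in table and splits the path column into its root, bough and branch parts. The column names (path, leaves), the "-" separator and the topic names are all assumptions made for this example; check the actual csv for its real header and delimiter.

```r
# Toy stand-in for the handmade hierarchy file.
# Column names, separator and topic names are hypothetical.
hand <- data.frame(
  path   = c("root_a-bough_a-branch_a",
             "root_a-bough_a-branch_b",
             "root_b-bough_c-branch_d"),
  leaves = c(3, 2, 4),
  stringsAsFactors = FALSE
)

# Split each path into its three levels (one row per path).
parts <- do.call(rbind, strsplit(hand$path, "-", fixed = TRUE))
colnames(parts) <- c("root", "bough", "branch")
hand <- data.frame(hand, parts, stringsAsFactors = FALSE)

# Total number of leaf topics under each root topic.
leaves_per_root <- tapply(hand$leaves, hand$root, sum)
print(hand)
print(leaves_per_root)
```

If the real topic names themselves contain the separator character, a plain split like this would break the path into too many pieces, so a different delimiter or a fixed number of splits would be needed.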

The scholar-assigned topic names are in a multi-sheet Excel file called “topic_names.xlsx”.

The colors used for the sunburst diagrams are in a csv file called “colors2.csv”. The color of each topic is consistent between the sunburst diagram of the automated hierarchy and the sunburst diagram of the handmade hierarchy.

“lev.sunburst.csv” is a csv written by “nesting_full_hansard_topics.R” which is necessary for “naming_computer_nested_topics.R”. It is included in the zip file, but also is an output from the code below.

“lev_sunburst_with_names.csv” is a csv written by “naming_computer_nested_topics.R” which is necessary for the sunburst diagram. It is included in the zip file, but also is an output from the attached code.

"js.topic.dists.all.norm.csv" is a csv written by “read_in_mallet_models.R” which is an input in “nesting_full_hansard_topics.R”. It is included in the zip file, but also is an output from the attached code.

CODE: 

First, run “read_in_mallet_models.R”. This script reads the MALLET topic model objects into R. For the 500-topic model of Hansard, it calls on the file “big_load_mallet_function.R”, which adjusts a necessary function to handle a large dataset. The script also writes a csv file of normalized distances between topics, called "js.topic.dists.all.norm.csv". The divergence metric used is Jensen-Shannon divergence. This script will take a long time to run, especially on a personal laptop. The authors recommend running it overnight or on a computer cluster.
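For readers unfamiliar with the metric, here is a minimal base-R sketch of Jensen-Shannon divergence between two topics, each represented as a vector of word weights. It is not the code in “read_in_mallet_models.R”: the function names and the toy vectors are made up for illustration, and the normalization applied by the script may differ from the simple rescaling to probabilities used here.

```r
# Kullback-Leibler divergence KL(p || q) with base-2 logarithms.
# Terms where p == 0 contribute nothing, so they are dropped.
kl_divergence <- function(p, q) {
  keep <- p > 0
  sum(p[keep] * log2(p[keep] / q[keep]))
}

# Jensen-Shannon divergence: the average KL divergence of p and q
# from their midpoint distribution m.
js_divergence <- function(p, q) {
  p <- p / sum(p)   # rescale weights to probabilities
  q <- q / sum(q)
  m <- (p + q) / 2
  0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)
}

# Hypothetical word weights for two topics over the same vocabulary.
topic_a <- c(land = 8, rent = 2, tenant = 0)
topic_b <- c(land = 1, rent = 1, tenant = 8)

print(js_divergence(topic_a, topic_b))
```

With base-2 logarithms the value is 0 for identical distributions and 1 for distributions that share no words, and it is symmetric in its two arguments.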

Next, “nesting_full_hansard_topics.R” is run to hierarchically nest the topic models. This writes a csv called “lev.sunburst.csv” which is used to make the sunburst diagram. 
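One plausible way to nest two adjacent levels, given a matrix of distances, is to attach each finer topic to the coarser topic it is closest to. The sketch below shows that rule in base R on a made-up distance matrix. It is an assumption about the general approach, not a copy of the logic in “nesting_full_hansard_topics.R”, and the topic labels, distance values and function name are invented.

```r
# Hypothetical distances: rows are topics from the finer model,
# columns are topics from the coarser model.
dists <- matrix(
  c(0.10, 0.80,
    0.70, 0.20,
    0.30, 0.60),
  nrow = 3, byrow = TRUE,
  dimnames = list(c("child_1", "child_2", "child_3"),
                  c("parent_1", "parent_2"))
)

# For each row, return the name of the column with the smallest distance.
nearest_parent <- function(d) {
  unname(colnames(d)[apply(d, 1, which.min)])
}

nesting <- data.frame(
  child  = rownames(dists),
  parent = nearest_parent(dists),
  stringsAsFactors = FALSE
)

# A parent-child path string, one per finer topic.
nesting$path <- paste(nesting$parent, nesting$child, sep = "-")
print(nesting)
```

Repeating the same step between each pair of adjacent levels would give a full path from the coarsest model down to the finest one.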

Then, “naming_computer_nested_topics.R” assigns the scholar-assigned names to the topics that were hierarchically nested via Jensen-Shannon divergence.
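The naming step amounts to a lookup of each nested topic in a table of names. A minimal sketch with base R's merge() follows. Both tables and all column names (level, topic_id, name) are hypothetical stand-ins, not the actual layout of “lev.sunburst.csv” or “topic_names.xlsx”.

```r
# Hypothetical nested topics, identified by model level and topic number.
nested <- data.frame(
  level    = c("root", "bough", "bough"),
  topic_id = c(1, 1, 2),
  stringsAsFactors = FALSE
)

# Hypothetical lookup table of scholar-assigned names.
topic_names <- data.frame(
  level    = c("root", "bough", "bough"),
  topic_id = c(1, 1, 2),
  name     = c("name_root_1", "name_bough_1", "name_bough_2"),
  stringsAsFactors = FALSE
)

# Left join: keep every nested topic, attach its name where one exists.
named <- merge(nested, topic_names, by = c("level", "topic_id"), all.x = TRUE)
print(named)
```

Using all.x = TRUE keeps topics that have no entry in the names table (their name is NA), which makes missing names easy to spot.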

Finally, “app_comp.R” and “app_hand.R” deploy Shiny apps depicting the sunburst diagrams for the computer-nested (automatic) hierarchy and the scholar-nested hierarchy, respectively. The commented section of each of these 2 scripts contains code to make just a sunburst diagram rather than a Shiny app.