Seshat, The Global History Databank

Paper found here. Excerpts below:

~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~

From Theories to Predictions. The Evolution of Social Complexity.

The theoretical framework that has informed our research is Cultural Multilevel Selection (CMLS). CMLS offers an elegant and parsimonious solution to the problem of how to combine the functionalist and conflict elements. Cooperation and coercion are enjoined in a very special way: cooperation takes place among lower-level units (but is supplemented with punishment of freeriders), whereas conflict takes place between higher-level entities (Richerson and Boyd 1998, Wilson 2002, Bowles 2009, Turchin 2013).
This brief overview shows that we do not lack theories. What we have lacked so far is a process by which we can reject some theories in favor of others. Different theories, however, make very different predictions as to where, when, and under what circumstances we should see the rise of large-scale complex societies during the last 10,000 years across the globe. Because different theories postulate different causal factors responsible for the evolution of social complexity, an analysis of which potentially explanatory variables correlate best with rising social complexity has a direct bearing on the empirical adequacy of rival theories.
Moreover, we can do better than employing a purely correlational approach. Because causes precede effects, different theories make divergent predictions about temporal sequences of events. Thus, we need a dynamical databank that allows us to trace how different characteristics of past societies changed with time and determine which variables’ changes precede changes in other variables. Of course, such an approach will not work in situations where temporal changes cannot be resolved on a sufficiently fine temporal scale; however, the continuous improvement in our methodologies of studying the past gives us hope that there will be a growing ensemble of case studies in which temporal resolution is sufficient for the purpose of testing theories.
In the next section, we make this discussion more concrete by focusing on one particular aspect of the evolution of human societies: the interaction between social complexity, hierarchy, and inequality.

The Evolution of Hierarchy and Inequality in Human Societies.

Egalitarianism and a fierce preference for equality have characterized modern humans for the greater part of our evolutionary history. Although there are distinctions based on age, gender, and achievement, human foraging societies lack clear-cut dominance hierarchies, such as the ones present in chimpanzee troops. Equality in foraging societies is not simply a consequence of their relative poverty; it requires active maintenance. Egalitarian societies possess social norms and institutions designed to control those individuals who attempt to dominate others and to obtain an unfair share of resources (‘upstarts’) (Cashdan 1980). These ‘leveling mechanisms’ range from gossip and ridicule to ostracism and, ultimately, assassination. Because of their small scale, societies of hunter-gatherers were integrated by face-to-face sociality, which enabled a diffuse, non-centralized form of social organization that was well-suited to maintaining an egalitarian ethos (Boehm 2001).

The adoption of agriculture c.10,000 years ago saw the evolution of larger-scale human societies (Flannery and Marcus 2012). More routinized forms of cooperation were required to sustain novel forms of specialized labor, reciprocity, pooling, and storage (Whitehouse 2004, Atkinson and Whitehouse 2010, Whitehouse and Hodder 2010). Agriculture, a sustainable exploitation of the commons, required the dissolution of small-group boundaries and inter-group rivalry in favor of larger-scale forms of collective identity, trust, and cooperation that extended to tens of thousands of individuals (Whitehouse and Lanman 2014), and, ultimately, millions. Computational demands on memory and information-processing systems increase dramatically with the size of the cooperating group (Dunbar and Shultz 2007). When compared with other organisms, humans have evolved to possess a number of cognitive advantages, including a more complex memory, more predictive capacities for future simulation, more fine-tuned systems for correctly identifying a number of individuals, and more precise numerical discriminatory abilities (Axelrod and Hamilton 1981, Fehr and Gächter 2002). As a result, humans are able to cooperate in the largest groups of all primates (Dunbar 1992). Despite these advantages, our cognitive systems are quickly overwhelmed when the size of a cooperating group grows beyond a few hundred.

The nature and quantity of resources available to societies affects how much wealth can be inherited, and whether inequalities can persist over multiple generations (Borgerhoff Mulder et al. 2009). Sedentary agricultural societies are able to produce food surpluses that can be used to support non-productive (‘elite’) members of society, whereas hunter-gatherer societies generally do not produce a surplus (Morrison 1994, Hayden 1995).

However, the most unequal—even despotic—human societies ever were ‘archaic states’ that first appeared c.3000 BCE (Feinman and Marcus 1998). These early states were characterized by extreme forms of structural inequality such as human sacrifice, slavery, unequal rights of commoners and nobles, and deification of the rulers (Trigger 2003, Kirch 2010). Religious and ritual mechanisms that evolved for the legitimation of hierarchy and structural inequality, initially serving the interests of society at large, may have been hijacked by coercive elites and rulers to drive inequality levels to unprecedented heights.

Religion may have played an extremely important, yet little-appreciated role in this second turn. Robert Bellah (2011) has recently argued that a major driver in the evolution of religion was the need to reconcile the tension between the benefits of hierarchy and the need for legitimacy and equity, resulting in the new forms of spirituality associated with the rise of world religions during the Axial Age. One aspect of this change was the first appearance of a universally egalitarian ethic, which was largely due to the emergence of “prophet-like figures who, at great peril to themselves, held the existing power structures to a moral standard that they clearly did not meet” (Bellah 2011).

For individuals to cooperate and coordinate their actions in larger groups, cultural workarounds are required to overcome these cognitive constraints (Richerson and Boyd 2001). The main solution that social evolution found was hierarchical organization, with large human groups integrated by chains of command. A member of a hierarchically organized group needs to have face-to-face interactions with only a few individuals: a superior and several subordinates (Turchin and Gavrilets 2009). This move to more hierarchical forms of social structure entailed formal offices of leadership and the development of hereditary systems of ranking and social class that were very different from the dominance hierarchies in non-human primates (Dubreuil 2008).

From the perspective of CMLS, the evolution of cultural traits, such as religious practices and equity norms, can be modeled as resulting from the action of selective forces acting at different levels of social organization. Costly social institutions that enable large-scale cooperation can evolve and be maintained as a result of competition between societies: societies with traits that enable greater control and coordination of larger numbers will out-compete those that lack such traits (Richerson and Boyd 1998, Wilson 2002, Bowles 2009, Henrich et al. 2010).

Building the Databank

Our long-term goal is for Seshat to be a vast repository for structured data on theoretically relevant variables for any past human society, for which such data exist.

Seshat is a theory-neutral databank; future investigators will also have the freedom to re-conceptualize and redefine variables using their preferred theoretical frameworks.

It should be noted, however, that as Seshat grows by including more variables, it becomes easier to ask additional Big Questions. As we accumulate an increasingly large stock of explanans variables, we need fewer new variables to code. Furthermore, as we develop increasingly sophisticated software for automating the harvesting of structured data from semi-structured and unstructured sources, the effort required to capture the values of new variables will progressively decrease.

In summary, the World Sample-30 was designed with two goals in mind: (1) to include as much variation amongst sampled societies as possible, at least along the social complexity dimension, and (2) to ensure that representation of different parts of the world was maximized.

Information Architecture of the Seshat Databank

The basic entity classes (types) in the databank’s information architecture are Organizations and Territories, each of which has a set of temporally-scoped Variables associated with it. Entity classes serve as units of data collection and entry and as potential units of statistical analysis. Territories have fixed geographical bounds that do not change with time, whereas Organizations are defined by temporal bounds and may be associated with specific Territories at specific intervals. A subset of variables relate Organizations to Territories, and the temporal dynamics of these Relationship Variables allow us to capture both temporal and geographical dynamics of the features of human societies. One of the most important relationship variables is the controls relation that specifies which social organization controls which territories over time.

– Classes of Territory

  • Natural Geographic Region (NGA)

    Defined spatially by the area enclosed within a boundary drawn on the world map. It does not change with time.

  • Free Form Area (FFA)

    Defined spatially by the area enclosed within a boundary drawn on the world map. It can have any dimensions. The purpose of FFAs is to tie various characteristics of societies and organizations to a specific set of geographic coordinates. For example, it is used to indicate what territory was controlled by a certain polity, e.g. Roman Empire, during a particular period of time (from start year to end year).

– Classes of Organization

  • Polity

    Independent political units. Range from villages and local communities through simple and complex chiefdoms to states and empires. What distinguishes a polity from other human groupings and organizations is that it is politically independent of any overarching authority; it possesses sovereignty. Polities are defined spatially by the Territories (NGAs and FFAs) with which they have spatial relationships, e.g., controls.

  • Quasi-Polity

    Quasi-polity is defined as a cultural area with some degree of cultural (including linguistic, if known) homogeneity that is distinct from surrounding areas. For example, the Marshall Islands before German occupation had no overarching native or colonial authority (chiefs controlled various subsets of islands and atolls) and therefore it was not a polity. But it was a quasi-polity because of the significant cultural and linguistic uniformity.

  • Religious System (RS)

    This type of Organization is defined in ways that are analogous to a polity, except it reflects religious, rather than political authority. Religious systems are dynamical and are typically defined by a set of dated boundaries. Religious systems are more likely to overlap in space than Polities.

  • City

    Cities are represented by a single point on the map that doesn’t change with time. Although it is possible to reflect their spatial expansion dynamically, we chose not to do so in the current implementation.

  • Interest Group (IG)

    An IG is a social group that pursues some common interest, so that its members are united by a common goal or goals. Religious systems are also interest groups, but the IG category is broader. It also includes ethnic groups, professional associations, warrior bands, solidarity associations, mutual aid societies, firms and banks (including their pre-modern variants), etc. The IG is defined sociologically, not geographically. However, if desired, a territory may be associated with it in the same way as with a polity or a RS.

  • Sub-Polity

    A sub-polity is an area within a polity for which variable values differ from the overarching polity. It is a general modeling feature for capturing situations in which, for example, provinces or regions within a polity have significant differences in social organization from the rest. A sub-polity is basically a polity that lacks sovereignty.


Seshat Meta-model, showing entity class hierarchy and the most important relationships between entity classes (note: classes with black text are abstract and thus never directly coded; they define sets of variables that are common to all sub-classes). Thin arrows indicate subclass relationships. For example, FFA is a subclass of Territory, and NGA is a subclass of FFA. Thick arrows are examples of relationships.

The current version of the Seshat Code Book has been developed primarily for historical societies. Coding data for societies that are known only archaeologically poses an additional set of challenges. We are currently developing an archeological Seshat Code Book that will address these challenges (Marciniak et al., forthcoming; Palmisano et al. forthcoming).

Coding Procedure

Transforming Raw Data into ‘Facts’

To design statistical tests of hypotheses such as those listed above, we need to define and operationalize such concepts as social scale, inequality, and intensity of military competition. In order to do this, we need to collect and code data systematically and we need to be aware of the complications and complexities involved in such an endeavor. For each area of interest, we have developed a coding sheet that outlines the particular variables to be coded and how they are to be categorized or quantified. These coding sheets have been developed in conjunction with expert historians and archaeologists who have given valuable feedback on how best to classify these phenomena based on their specialist knowledge of these issues. These coding sheets are derived from our overall Codebook, which also discusses some of the complexities involved in coding data of this kind. The current version of the Seshat Codebook is included in the Supplementary Materials.

The historical and archaeological records for even the best-attested societies are incomplete, so we need to deal with the issue of missing data. Therefore, for each variable of interest, we collect data on a number of different measures, and a certain degree of redundancy in these variables is implemented by design. For many past societies (some of which are known only archaeologically), we will not be able to code every variable. Thus, different variables can serve as proxies for the same underlying factor, enabling us to compare different societies even in the face of missing data. For example, estimating populations of historical states and empires is a notoriously difficult problem; as a result, we developed a number of proxies that correlate with population numbers (e.g., the size of the largest urban center, the extent of territory controlled, etc.). Furthermore, our statistical analyses are designed to cope with missing data.

Socio-cultural phenomena described by terms such as ‘social scale’ or ‘inequality’ are often actually multidimensional and cover a range of related, but quite distinct, aspects of the human condition. Confusion or misunderstandings can sometimes occur when researchers have different definitions or conceptions of a particular term that is being used. Intriguingly, Seshat has the potential to uncover that the different dimensions of these aspects of societies may have evolved somewhat independently. For example, structural and reproductive inequality may have seen reductions since the Axial Age, but economic inequality may have been unaffected or even increased.

Seshat Wiki: an Initial Implementation of the Databank

During 2012, our group developed a flexible architecture for this databank. The databank has two distinct aspects: textual/descriptive sections (including references), and coded data. The text-based sections describe what is known about a particular variable based on previous scholarship. These ‘thick’ descriptions provide important context about the variables being addressed, the sources of information used, and make explicit how a decision was made on a particular coding (for a more extended discussion of these issues, see Hoyer and Manning, forthcoming). The coded data are based on the narrative sections, and enable this information to be translated into a form that is amenable to statistical analysis. The databank therefore combines the best features of traditional humanistic and scientific approaches to investigating the past. The databank is implemented as a ‘Wiki’ (i.e., a web-based application that enables an open access, collaborative project), which allows text on different variables to be easily added and updated. The information from the Wiki is ‘scraped’ automatically and translated into formats (e.g., tab-delimited files) that can be analyzed in statistical software packages.

As of the writing of this paper, we have developed a plan to port the Wiki data into an RDF-based triplestore (Schreiber and Raimond 2014). This type of graph-based representation is particularly suitable for capturing rich, structured knowledge about complex entities and multi-faceted relationships between them. In contrast with a traditional SQL-based relational database, graph-based knowledge models, specified in OWL (the web ontology language) or RDFS (resource description framework schema), facilitate constant evolution of schema. In addition to the underlying triplestore, we will use cutting-edge techniques developed by Kevin Feeney, Rob Brennan, and colleagues to further facilitate data collection, curation and use, as described in the next section.

Seshat Databank: Overarching Plan

The design goal is to maximize the research community’s collective intelligence while minimizing overheads. Ultimately, we aim to create a system with easy-to-use software tools that support the following roles and features:

  • Seshat Contributors. Non-technical users, such as historians and archaeologists, can easily add data to the system and update existing data (e.g., using graphical input tools). Ultimately, we would like to make this open and robust enough so that any web user can suggest updates to the datasets in an easy way (crowd-sourcing).

  • Seshat Editors. Data administrators can moderate, correct, and manage the data in the system over time. The downside to harvesting data from a broad community is that inaccuracies, disagreements, user errors, and malicious use must be managed or else the dataset will degrade in quality over time.

  • Seshat Data Architects. Knowledge engineers can make changes to the schema (data structures) over time and manage transitions between versions of the schema without breaking databank integrity (maintaining and assuring the accuracy and consistency of data over its entire life-cycle).

  • Seshat Analysts. Statisticians and mathematical modelers will prepare and analyze time series data to investigate big questions about human societies. Seshat Readers. General end-users can browse, search, download, and view arbitrary slices of the data in a very wide range of attractive and helpful ways.

  • Seshat Administrators. Technical administrators can manage the data curation and publication platform or servers to deal with changes in data, schemata, collection tools and publication formats or tools (e.g., visualizations) over time in a scalable fashion.

Seshat will be managed through the Dacura Linked Data platform (http://dacura.scss.tcd.ie), developed at Trinity College Dublin. Dacura provides support for dataset capturing, curation, and publication (Feeney 2014). Nonetheless, when dealing with a complex and multi-faceted domain, the development of formal schema and tools to facilitate convenient and accurate data input requires considerable experimental and developmental effort. The nature of the Seshat data is such that there are many opportunities to take advantage of maps and timelines to capture the spatial and temporal aspects of the data. However, these tasks are labor-intensive, so the development of the databank system will proceed incrementally. The Seshat databank will be progressively migrated from the current Wiki to the Dacura Linked Data platform in a number of phases, each designed to add functionality to the system, progressively increasing its ability to gather and present high quality structured data, without interrupting researchers’ ability to add new data.

Further Phases. The goal is to progressively replace the Wiki input format with graphical user interface tools which use the schema definition to maintain data quality over time and to ease the task of creating and updating complex instances of data types. We will progressively add enhancements to the publishing capabilities—developing richer interfaces for browsing, visualizing, analyzing, and interlinking the Seshat databank with external sources of knowledge. Once this process is sufficiently advanced, we will switch the authoritative source of data from the wiki to the triplestore, retaining the Wiki as a publishing outlet.

The Future of Seshat

As described in the previous section, at the time of writing this paper, the main focus of the Seshat Databank development team is on coding the NGAs in the World Sample-30 for a set of variables defined by the current list of Big Questions. The variable classes include social complexity, warfare, rituals, agriculture and resources, institutions and equity, and economics and well-being (see the current version of the Code Book in the Supplementary Online Materials for a detailed description of the variables). Simultaneously, we are making a transition from the current Wiki to the Seshat triplestore managed through the Dacura Linked Data platform. Accomplishing these near-term goals will provide us with a solid basis from which to progressively expand the Databank in multiple mutually supportive directions.

As far as new variables are concerned, it would be extremely interesting to add data on the evolution of technology and on linguistic evolution.

Another exciting area of research is gene-culture coevolution, which has, so far, been primarily investigated with theoretical models. There is little reason to doubt that the ongoing methodological advances in ancient DNA will soon enable us to test these theories with data (Callaway 2015).

One direction that we are currently exploring is crowd sourcing—developing software that will support recruitment of volunteers to assist in manual data processing. More radically, we will need to transition to a technology that automates the harvesting of the required variable values from open web-sources, including, but not limited to, the repositories of scholarly publications. Thus, we intend to explore the possibility of harvesting data from such source as JSTOR’s archive of academic publications.

Finally, the value of the Databank will be enhanced by developing support for applying different interpretative frameworks to the underlying data, supporting multiple different views of history, perhaps associated with a particular expert, research team, or research methodology. For example, an estimate of the carrying capacity, which feeds into the estimated population, itself depends on many assumptions of the models used to calculate it. The impact of using alternative assumptions in the models or theories, leading to different interpretations, can then be quickly analyzed and evaluated. The next logical step is to implement a catalog of simulation models, which each can generate predictions about the data based on different mechanisms. As the background data become more accurate and plentiful, simulations could then be re-run by any Seshat user to discover how such improved or alternative data affect model predictions. This novel capability will lead to a more rapid process of theory evaluation, and to quantification of the limits for the application of a given theory.

Despite the current excitement associated with the digital humanities, historians and social scientists have not really begun to utilize the full power of what modern Information Technology can deliver. We believe that the new IT capabilities will eventually (and in the not-too-distant future) transform the field of historical social sciences into what the sociologist Randall Collins termed rapid-discovery science (Collins 1994, although it should be noted that Collins himself was sceptical that such transformation is likely). It is our hope that Seshat: the Global History Databank will be one of the key mechanisms by which such a transformation will be effected.

Written on December 28, 2018