Statistical Analysis of Social Networks
This page is part of an online textbook by Jacob Apkarian (Department of Behavioral Sciences, City University of New York, York College) and Robert A. Hanneman (Department of Sociology, Emeritus, University of California, Riverside). Feel free to use and distribute this textbook, with citation. Your comments and suggestions are very welcome. Please send all correspondence to Jacob Apkarian.

Chapter 1. The Social Network Perspective

The statistical analysis of social networks is a specialized application of the general ideas of describing distributions, estimating parameters of those distributions, and testing hypotheses about those parameters. We’re assuming that readers already have knowledge of these general ideas. So, to get started, let’s first get an understanding of what makes the application of statistics to social network data “special.”

1.1 Psycho-metrics, Econo-metrics, and Socio-metrics

At the introductory level, the applied statistics that are taught in all of the social science disciplines are pretty much the same.  Students learn to describe the distributions of scores on variables measured across independently sampled cases.  The notions of association, partial association, and inference from sample to population are learned.  Training at the “intermediate” level is also very similar across the social sciences, where almost everyone gets a heavy dose of applications of generalizations of the general linear model for testing hypotheses about relations between variables.  But, beyond this point, each of the social sciences has developed applications that are quite distinctive and attuned to needs of particular subject matter emphases.

Over-simplifying, “psycho-metrics” responds to the challenge of attempting to systematically and reliably assess latent (mental) states that cannot be measured directly.  Psychometricians have developed highly sophisticated tools for working with multiple indicators, factors, and scaling.  Also over-simplifying, “econo-metrics” responds to challenges of distinguishing signal from noise, characterizing trends, and assessing causal hypotheses from observational (rather than experimental) data.

“Socio-metrics” as a special flavor of formal and quantitative analysis has existed for quite some time (Moreno, 1951).  Socio-metricians deal with the special problems and issues that arise when the units of analysis, across which variance is distributed, are relations between social actors, rather than attributes of individual actors.

Most graduate students being trained in quantitative analysis for Sociology learn at least the basics of the special tools of psycho-metrics and econo-metrics.  Oddly, few learn anything about “socio-metrics.”  But, this is changing with the growing popularity of social network analysis (SNA), along with the convergence and cross-fertilization of interest in complex networks in many disciplines. The purpose of this book is to provide an introduction to thinking statistically about data that describe social relations, rather than social actors.

1.2 The Social Network Perspective

The use of graphs to represent relational data is commonplace in a wide range of sciences. The formal analysis of graphs has a very long history in mathematics and the use of statistical methods to analyze relational data has become particularly important and commonplace in physics and bio-sciences. Social scientists borrow from (and in a few cases, contribute to) these rich histories. The application of graph theory and the statistical analysis of relational data in the social sciences have a particular flavor, due to the subject matter and the theoretical questions of interest in these disciplines.

Social network analysis differs from the mainstream methodological tradition in most of the social sciences, which emphasizes the analysis of individual cases and their variable attributes (rather than relations) and experimental planned comparisons (rather than uncontrolled observational data).

Most social scientists are well versed in the more main-stream “independent-cases” and “relations-among-variables” approach to statistical analysis.  Statistical methods for relational data adapt and use most of the same ideas, but with particular emphases.  It’s worth taking just a few minutes to get a sense of the distinctive flavor of relational, rather than variable/attribute analysis.

1.2.1 Focus on Relations

Social network analysis seeks to identify (and describe, and predict) regular patterns in the statics and dynamics of relations among social actors.  The actors are most often individual humans, but they might also be populations, organizations, or symbols and cultural categories.

The emphasis on the “social,” of course, is what social science is about. The emphasis on “relations” is another way of saying that what is of primary interest are “structures” composed of multiple individuals, and not the individuals themselves. Sociologists often use the adage that “sociology is not about people,” by which they mean that the subject matter is regularities of social structures, not individuals.

In mainstream statistical methods, the most common approach is to examine distributions of, and associations between scores on variables, measured across individuals.  Statistical methods for relational data also examine distributions of, and associations between scores – but the scores describe the relations between individuals, rather than attributes of each individual.  Relational methods examine the distribution of relations, measured across pairs of individuals.

Let’s suppose that we had a sample of 10 people that we were observing.  For a variable-oriented analysis, the sample size is 10.  We assume, for inferential purposes, that the observations are independent; or, alternatively, that we can specify the non-independence as some form of correlated error.

In a relational analysis of observations on the same 10 people, we have a “sample size” of 45 if we assume that the relations are symmetric or bonded ((10 * 9)/2), or a sample size of 90 if the relations are asymmetric or directed.  That is, the unit of observation is the relation between pairs of individuals, not the individual.  Obviously, these observations are not independent as multiple relations are “nested” within persons.  That is, the same person (or node) is part of many of the observations.
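The counting argument above can be sketched in a few lines of Python. The labels 0 through 9 are placeholders for the 10 hypothetical people, not data from this text; the snippet simply enumerates unordered and ordered pairs and shows how often one person recurs across dyadic observations.

```python
from itertools import combinations, permutations

people = list(range(10))  # hypothetical labels for the 10 sampled people

# Symmetric relations: each unordered pair is one observation.
undirected_dyads = list(combinations(people, 2))

# Asymmetric (directed) relations: each ordered pair is one observation.
directed_dyads = list(permutations(people, 2))

# Observations are "nested" within persons: person 0 appears in many dyads.
dyads_with_person_0 = [d for d in undirected_dyads if 0 in d]

print(len(undirected_dyads))     # 45
print(len(directed_dyads))       # 90
print(len(dyads_with_person_0))  # 9
```

Each person takes part in 9 of the 45 undirected dyads, which is one simple way to see why the dyadic observations cannot be treated as independent.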

But, there is a critical conceptual difference.  Relational analysis is all about describing, testing hypotheses about, and modeling social structures, or relations between actors.  Conventional variable analysis is all about describing, testing hypotheses about, and modeling relations among attributes.

1.2.2 Relations and Attributes

Social network analysis is not a substitute for attribute/variable-oriented analysis.  SNA is an additional perspective that is used in conjunction with attribute/variable analysis.

For many research questions, network influences are seen as a cause or predictor of individual attributes.  For example, the happiness of one’s friends may influence one’s own happiness.  For other research questions, networks can be seen as the result of individual attributes.  For example, people who are happy may be more likely to initiate friendship relations with others.  As the example suggests, sometimes networks and attributes may determine one another.  Individual differences may select for patterns of building networks, while individual differences may also be modified by network influences.

In addition to research questions that combine both attributes and relations, there are also some questions that may be purely relational.  Do the patterns in a social network (e.g. who are friends with whom) influence patterns in another relational network of the same actors (e.g. who seeks advice from whom)?  Do features of a pattern of social ties at one point in time affect the pattern of ties at a later point in time?

As we move through the chapters that follow, we will look at techniques for addressing questions in which relational data serve as either the independent or the dependent variable alongside attribute (variable-oriented) data, as well as techniques for examining association among relational data.

1.2.3 Statics and Dynamics of and on Networks

Social network analysts commonly distinguish between “dynamics on a network” and “dynamics of a network” (or, somewhat ambiguously, “network dynamics”).

Dynamics on a network assume that the relational variable(s) are fixed and affect changes in attributes.  For example, the attitudes of actors who are more central in a network may be expected to be more influential on the attitudes of others than are the attitudes of actors who are more peripheral.  The pattern of social relations among actors is being seen as a determinant of how the attributes of actors are related.

Dynamics of networks focus on the change in pattern of relational ties itself.  The outcomes to be explained are the making and breaking of relational ties among actors.  As ties are made or broken, the network changes, or becomes dynamic.  Analyses of the dynamics of networks are central research questions in the broader fields of complexity and network science.  In the study of social networks, the dynamics of networks is the study of how social structures change.  Changes in structure may be due to inherent tendencies in structures themselves and/or due to the attributes of the actors embedded in those structures.

The statistical analysis of social networks has tool-kits for examining both dynamics on networks (usually observed as one cross-section) and the dynamics of networks (usually observed as a time-series or fully time-continuous set of changes in relations).

1.2.4 Multiple Levels of Analysis

In variable oriented statistical analysis the individual cases or observations may sometimes be seen as “embedded” in “contexts.”  This implies a degree of “non-independence” among cases which can be conceptualized as occurring within multiple levels of analysis.  It is helpful to draw a distinction among three rather different ways in which multiple levels of analysis commonly enter statistical analysis of attribute/variable data.

First, sometimes cases are part of (network analysts would say “affiliated with”) larger social units.  Individual students may be nested within classrooms that are nested within schools that are nested within districts or neighborhoods.  Cases like this are not wholly independent observations, and mixed-models or multi-level modeling methods are often applied.

Second, cases may not be independent of other particular cases due to co-existence in some local space.  In geo-spatial statistics, the attributes of a spatial area may be correlated with the attributes of adjacent spatial areas either because the boundaries of plots are arbitrary and the variables are continuously distributed in space, or because of omitted variables that have “local” influences.  Spatial auto-regressive and spatial auto-correlation modeling is sometimes used when this type of non-independence exists.

Of course, the effects of adjacent cases on a focal case need not be geo-spatial; the effects may be social-spatial, or due to adjacency in a social network.  Statistical methods for spatial autocorrelation and autoregression can also be applied to cases that are at known “social distances” from other cases. In the above instances, statistical corrections are used to remove non-independence due to the embedding of social actors in some higher level geographic or social space.
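To make the idea of "social-spatial" autocorrelation concrete, here is a small sketch that computes Moran's I, one widely used autocorrelation index, with network adjacency standing in for geographic adjacency. The choice of Moran's I, the four-actor path network, and the attitude scores are all assumptions made for illustration; the discussion above does not commit to a particular statistic.

```python
def morans_i(weights, scores):
    """Moran's I for scores x and a weight matrix w (here, network adjacency).

    I = (n / S0) * sum_ij w_ij (x_i - mean)(x_j - mean) / sum_i (x_i - mean)^2
    where S0 is the sum of all weights.
    """
    n = len(scores)
    mean = sum(scores) / n
    dev = [x - mean for x in scores]
    s0 = sum(sum(row) for row in weights)
    cross = sum(weights[i][j] * dev[i] * dev[j]
                for i in range(n) for j in range(n))
    return (n / s0) * cross / sum(d * d for d in dev)

# Hypothetical undirected path network a-b-c-d (1 = adjacent in the network).
adjacency = [
    [0, 1, 0, 0],
    [1, 0, 1, 0],
    [0, 1, 0, 1],
    [0, 0, 1, 0],
]
attitude = [1.0, 2.0, 3.0, 4.0]  # made-up attribute scores

i_value = morans_i(adjacency, attitude)
print(round(i_value, 3))  # 0.333
```

A positive value here indicates that actors who are adjacent in the network tend to have similar scores. Replacing the 0/1 adjacency with weights that decline with "social distance" would follow the same logic.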

Third, cases might be thought of as being non-independent because they share the same or similar scores on some variable or attribute.  Two persons who both identify as women might be thought to be non-independent because of the influence of this common attribute.  This kind of non-independence is, of course, at the core of variable-oriented analysis.  Here social actors can be viewed as being embedded in higher level social categories.

Social network analysis recognizes this type of non-independence in two rather different ways.  If two nodes in a network share an attribute (say, both identify as women), network analysts would often look for “homophily” effects in their relational data.  That is, the fact that two nodes have the same or a similar attribute might be hypothesized to affect the likelihood that there is a social network tie between them and, if there is a tie from one to the other, the likelihood that it is reciprocated.  Two nodes that are “closer” to one another in a network might also be likely to influence one another in the direction of becoming more similar (if the attribute in question is mutable).
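A descriptive first look at homophily might compare how often ties occur among dyads that share an attribute with how often they occur among dyads that do not. The sketch below does this for a made-up five-actor network; the actor labels, attribute codes, and ties are all hypothetical, and the comparison is purely descriptive rather than a hypothesis test.

```python
from itertools import combinations

# Hypothetical data: five actors, one categorical attribute, undirected ties.
gender = {"a": "w", "b": "w", "c": "w", "d": "m", "e": "m"}
ties = {frozenset(p) for p in [("a", "b"), ("a", "c"), ("d", "e"), ("c", "d")]}

def tie_density(same_attribute):
    """Share of dyads with a tie, among dyads that do (or do not) share the attribute."""
    dyads = [frozenset(p) for p in combinations(gender, 2)
             if (gender[p[0]] == gender[p[1]]) == same_attribute]
    return sum(d in ties for d in dyads) / len(dyads)

density_same = tie_density(True)   # ties among same-attribute dyads
density_diff = tie_density(False)  # ties among different-attribute dyads

print(density_same)            # 0.75
print(round(density_diff, 3))  # 0.167
```

In this toy example, ties are more common within categories than across them. Because the dyads share actors, a conventional significance test on these two proportions would not be appropriate without further adjustment.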

Alternatively, network analysts might treat the non-independence of cases due to a common attribute as a “two-mode” network problem.  We won’t deal with approaches based on this way of thinking in this text, but the idea is straightforward.  In the two-mode way of thinking in network analysis, cases may be one of the modes and variables the other.  Associations between variables are observed when cases share the attributes.  For example, if being older and being female are associated, it is because some cases that are more likely to be affiliated with the category “old” are also more likely to be affiliated with the category “female.”  Simultaneously, two cases are closer, or more similar, or share common affiliations if each case is tied to “old” and to “female.”
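The two-mode idea can be illustrated with a tiny case-by-category table. In the sketch below, the four cases and their 0/1 affiliations are invented for illustration; one function counts the categories two cases share, and the other counts the cases that two categories share.

```python
# Hypothetical two-mode data: rows are cases, columns are categories (1 = affiliated).
categories = ["old", "female"]
affiliation = {
    "case1": [1, 1],
    "case2": [1, 1],
    "case3": [0, 1],
    "case4": [0, 0],
}

def shared_categories(case_i, case_j):
    """Number of categories to which both cases are tied (case-by-case view)."""
    return sum(x * y for x, y in zip(affiliation[case_i], affiliation[case_j]))

def shared_cases(cat_k, cat_l):
    """Number of cases tied to both categories (category-by-category view)."""
    k, l = categories.index(cat_k), categories.index(cat_l)
    return sum(row[k] * row[l] for row in affiliation.values())

print(shared_categories("case1", "case2"))  # 2
print(shared_categories("case1", "case3"))  # 1
print(shared_cases("old", "female"))        # 2
```

The same table thus yields both a measure of similarity between cases and a measure of co-occurrence between categories, which is the sense in which the two views are "simultaneous."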

Many variables/attributes-oriented analysis questions have complexities arising from non-independence of observations, particularly in observational rather than experimental data.  Many of these complexities can easily be seen as arising from the embedding of cases in networks.  So, one important set of issues to be dealt with in the text that follows is how to do “conventional” or “variables and attributes” analysis in the presence of network embedding.  It is useful to think of these kinds of problems as multi-level problems where cases are embedded in a network.

Many network analysis questions are also usefully cast as multi-level problems.  Social networks are structures (patterns of relations among cases) that arise out of the agency of the individual social actors.  To understand the dynamics on networks, and the dynamics of networks, the attributes of actors almost always need to be taken into account.  Actors with different attributes (i.e. different scores on variables) are likely to have different networking behaviors.

The social network analysis perspective takes structures, rather than individuals, as its central concern.  But, the perspective is inherently multi-level.  The clearest statement of this idea continues to be that of Ronald Breiger (1974).  Variables/attributes analysis takes individuals as its central concern.  In many cases, though, it is also multi-level; individual cases are not independent of one another because of their embedding in social networks, contexts, or categories.

Socio-metrics, then, is a distinctive branch of quantitative analysis because it focuses on structures, or relations.  But, it cannot be separated from variables/attributes analysis.   Sociologists should re-cast their thinking about “conventional” statistical analysis of case-wise data to be explicit about how they treat structural effects.

1.3 Organization of the Book

Our plan for the book assumes that the reader is reasonably comfortable with conventional statistical analysis (i.e. the analysis of distributions and joint distributions of variables, measured across cases).  One of our goals is to show how the analysis of variables (or “attributes” in network jargon) can be connected in powerful and useful ways with the analysis of relational data.  Relational data are also used to address certain research questions and hypotheses that are unique to the social network perspective.  Several of our chapters will focus on approaches and techniques for testing hypotheses about networks as the outcome to be explained.

We will begin (in Chapter 2) with a brief look at how social network data are structured.  From a strictly mathematical point of view, there is nothing all that unusual about relational data – relational data are simply collections of matrices and vectors.  But social network analysis does have a specialized language, and draws some analytical distinctions among types of data that are useful in helping to translate substantive problems into formal statistical analyses. There are many different software systems for working with network data, and they do vary in the details of how data are prepared for analysis (Huisman and van Duijn, 2011).  Fortunately, most data structures are quite simple and it is usually easy to move data from one application to another -- which is often necessary.

We only briefly touch on descriptions of univariate distributions of variables (i.e. attributes) in Chapter 3. Descriptions of (and hypothesis tests about) univariate distributions are important, but are covered in any basic course in conventional statistics.  We also won’t spend much time discussing the description of and hypothesis testing about the univariate distributions of relational data.  Network analysis in general, and social network analysis in particular, have developed an extremely large number of tools for characterizing all kinds of interesting things about the shape and texture of a network.  There are now a number of useful sources (including Wasserman and Faust, 1994 and Hanneman and Riddle, 2005) that cover these issues.  We will spend a little time reviewing the ideas of degree-distributions and triad-censuses in our later chapters, as these are critical to understanding the theory underlying exponential random graph theory.

Chapters 3 and 4 discuss measures of association and tests of significance when dealing with “monadic” (that is, attribute or “variables”) data and “dyadic” (that is, network or structural level) data.

Chapter 3 takes a look at conventional attribute/variable analyses with more explicit attention to the problem of non-independence of observations that arises from network embedding.  We will be looking at some approaches to understanding the association between two attributes/variables when the cases are drawn from a network rather than from independent sampling.

Chapter 4 looks at some simple approaches to studying the relationship between two networks, or the association between two dyadic or relational variables.  Actors may have multiple forms of social ties that covary (e.g. both friendship and authority relations).  Similarly, we may have panel data on a social relation and be interested in the correlation between earlier and later observations of the structure.
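As a rough preview of what "association between two relations" can mean, the sketch below correlates the dyad scores of two networks observed on the same actors. The two 4-by-4 matrices are invented, and the use of a plain Pearson correlation over off-diagonal cells is an illustrative assumption rather than a statement of the specific methods Chapter 4 presents.

```python
from math import sqrt

# Hypothetical directed networks among the same four actors (row sends tie to column).
friendship = [
    [0, 1, 1, 0],
    [1, 0, 0, 0],
    [1, 0, 0, 1],
    [0, 0, 1, 0],
]
advice = [
    [0, 1, 0, 0],
    [1, 0, 0, 0],
    [1, 0, 0, 1],
    [0, 1, 1, 0],
]

def off_diagonal(matrix):
    """Flatten a square matrix into a list of dyad scores, skipping self-ties."""
    n = len(matrix)
    return [matrix[i][j] for i in range(n) for j in range(n) if i != j]

def pearson(x, y):
    """Ordinary Pearson correlation between two equal-length lists."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    return cov / sqrt(sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y))

r = pearson(off_diagonal(friendship), off_diagonal(advice))
print(round(r, 3))  # 0.667
```

The coefficient describes the strength of association only. Because the 12 dyads share actors, the usual standard error for a correlation would not apply directly.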

Chapters 3 and 4 look at how we study association between two attributes and between two relations, respectively.  In Chapter 5, we take the next logical step by examining the association between an attribute and a relation.  All three of these chapters focus on symmetric association rather than prediction and modeling of hypothesized causal relations.

In Chapter 6 we shift our focus to the study of asymmetric association, which predicts, or models hypotheses about causal effects.  Chapter 6 focuses on the analysis of “network influence.”  That is, how are the attributes of an actor (i.e. the scores of cases on variables) affected by the ways that the node is embedded in a network, and the attributes of the “alters” to which each “ego” is connected?  A very wide range of important substantive problems in sociology deal with questions of these kinds of “social influences.”  Do the attitudes and behaviors of those with whom I interact affect my attitudes and behaviors?

Chapter 7 turns the prediction problem around:  how can we use individuals’ attributes to predict the ways in which they become embedded in a network?  This kind of problem is often called “network selection.”  That is, how do the attributes of social actors shape the ways in which they make or break social relations to others and, in the process, “select” one possible emergent network instead of another?  In “network selection” problems, the relation or network is the dependent variable.  This is a rather new way of thinking about things for many readers.  So, Chapter 7 will spend some time looking at how SNA theorizes the processes that create networks.  These ideas become quite important in understanding the remaining chapters.

In “conventional” statistics, the primary tool for prediction and modeling problems that involve multiple hypothesized causal influences and a need for statistical control is the generalized linear model.  In Chapter 8, we tackle the same problem for networks, rather than attributes, as outcomes.  At the time of the writing of this text, there are two somewhat related, but not yet fully integrated, approaches to multiple-variable prediction of networks as outcomes.

Specialists in the statistical analysis of social network data have developed a quite distinctive approach to relational variable outcomes based on the underlying theory of “exponential random graph” development.  These approaches place an emphasis on using the structural tendencies of social networks (for example, the tendencies toward “reciprocity” or “closure”) as predictors in explaining complex patterns of relations.  The field of “exponential random graph” modeling is a distinctive approach to the analysis of relational data that is firmly grounded in social science theory, and underlies the analysis of network development and co-evolution that are discussed in Chapter 9.

The prediction of network relations as outcomes, however, can also be cast as a rather straightforward general linear mixed-model type of problem in which relations between two actors are nested in the cross of the two actors (and their attributes, as well as the attributes of the dyad).  At the time of this writing, the mixed-models approach to network data has the comparative advantages of dealing with relations that are measured at the nominal, ordinal, or interval-ratio levels; exponential random graph models, to date, deal with binary outcomes.  Mixed models are also familiar to many analysts, and integrate with a wide body of approaches to complicated data structures.  But, so far, mixed models approaches to relational data do not have the underpinnings of SNA theories of where social structures come from and do not deal easily with the issues of structural effects and complex underlying distributions that vary with graph density – the great strength of exponential random graph models.

In Chapter 9 we take a brief look at two very important areas at the “cutting edge” (at the time of this writing) of modeling in the exponential random graph tradition.  Exponential random graph theory is particularly useful as a statement of how social relations develop and change over time as actors select network structures by making and breaking social ties.  Sometimes SNA data have repeated cross-sections (or “panels”) of observations on the patterns of ties among the same actors as they change over time.  Some models (“SIENA”) have been developed specifically for studying network development, or “evolution.”

The earlier chapters developed two related themes.  On one hand, networks develop and are shaped (i.e. one network is “selected” instead of another) by choices made by actors in forming or breaking ties.  These choices may be “biased” by the attributes of the actors.  On the other hand, some attributes of the actors making these choices may be influenced or shaped by the attributes or behaviors of the “alters” to which each “ego” is connected.  For example, a student might experiment with drugs because his/her friends do, and consequently drop some friendship ties with non-user friends and make new ties with others who are drug users.  That is, the attributes of an actor (e.g. being a drug user or not) may “co-evolve” with their position in the network (e.g. the likelihood of having friendship ties with others who use drugs).  At the cutting edge of statistical applications in network analysis are some “SIENA” models that treat both actor attributes and relations as joint outcomes of joint processes of “network selection” and “network influence.”

1.4 Summary

Applied statistics in the social sciences have a common set of core concepts and techniques that differ little across the disciplines.  Several of the disciplines have also developed more specialized emphases that address problems that are particularly common in the types of data that arise from the research designs and measurement methods that the disciplines often employ.  Psycho-metrics emphasizes the use of multiple measures to assess underlying traits that are not directly observable.  Econo-metrics emphasizes approaches to time-series and multiple-time series of observational data.  Both of these branches take the individual case as the unit of analysis.  The distinctive feature of socio-metrics arises from its emphasis on the relation between cases, rather than the attributes of individual cases, as the unit of analysis.

SNA is a particular application of the analysis of relational structures to patterns of ties among social actors.  SNA has developed quite a large toolkit for the description of the distributions of relations among actors, such as degree-distribution, centrality, clustering, path-length, etc. (see, for example, Hanneman and Riddle, 2005; Kadushin, 2012; Wasserman and Faust, 1994).  Additionally, a great deal of work has been done over the past 50 years on modeling and hypothesis testing of social network data.

Much of the work on statistical analysis of social networks focuses on relationships among two or more networks, or the change in a single network over repeated observations (the dynamics of networks).  Other work integrates the analysis of network data with the analysis of data on the attributes of the individuals who make up (or are “embedded in”) the network.  Sometimes the network plays the role of independent variable in analysis of how the attributes of other actors influence the attributes or behavior of a focal actor.  Sometimes networks are taken as the dependent variable; i.e. the selection of a particular pattern of relations among the actors is seen as arising from the attributes of the embedded actors.  Recent work has explored the “co-evolution” of the distributions of individual actor attributes and distributions of dyadic (or relational, or network) ties.

The text will first introduce the most common data structures used in the statistical analysis of social network data.  It will then explore the analysis of attribute data, when the cases being observed are embedded in a network.  We start with the analysis of symmetric bivariate association, then move to asymmetric association in which either the network or the attribute may be the dependent variable.  From there, we move to a more extended treatment of approaches examining the relations between actors as the outcome of interest.

1.5 References

Breiger, Ronald L. 1974. “The Duality of Persons and Groups.” Social Forces 53(2): 181-190.

Hanneman, Robert A. and Mark Riddle. 2005. Introduction to Social Network Methods.  http://faculty.ucr.edu/~hanneman/nettext.

Huisman, Mark and Marijtje A. J. van Duijn. 2011. “A Reader’s Guide to SNA Software.” Pp. 578-600 in The SAGE Handbook of Social Network Analysis, edited by J. Scott and P. J. Carrington. London: Sage.

Kadushin, Charles.  2012. Understanding Social Networks:  Theories, Concepts, and Findings.  Oxford: Oxford University Press.

Moreno, Jacob L. 1951. Sociometry, Experimental Method, and the Science of Society.  Ambler, PA: Beacon House.

Wasserman, Stanley, and Katherine Faust. 1994. Social Network Analysis: Methods and Applications. Cambridge, UK: Cambridge University Press.