17 Complex survey design
17.1 Complex sample design
You will rarely find large-scale social surveys that apply a simple random sample because they can be impractical to administer and can result in imprecise estimates, especially for certain groups in a population. Stratified sampling is an alternative probability sample that can increase precision by dividing the population up into groups or strata and sampling from each. Stratification ensures that a prespecifed number of observations from each stratum end up in the sample, therefore the sample is less varied and gives more precise estimates. Stratified sampling is feasible for telephone or mail questionnaires, but for in-person interviews it has the problem that the sample is spread out over the entire population. Cluster sampling, sampling of a relatively small number of groups of people, is often used when interviewers are required to attend respondents in person to save time and cost. In contrast to stratified sampling, cluster sampling decreases precision for a specified sample size, but can increase sample size and precision for a specified cost.
The Health Survey for England (HSE) and Scottish Health Survey (SHE) are both drawn in two stages. In HSE, at the first stage is a random sample of primary sampling units (PSUs), based on postcode sectors (i.e. the first 3-4 characters of a UK postcode), was selected. Within each selected PSU, a random sample of postal addresses was then drawn. To maximise the precision of the sample, PSUs were selected using stratified sampling. The list of PSUs in England was ordered by local authority and, within each local authority, by the percentage of households in the 2001 Census with a head of household in a non-manual occupation (NS-SEC groups 1-2).
The sample of PSUs was then selected by sampling from the list at fixed intervals from a random starting point. 900 PSUs were selected with probability proportional to the total number of addresses within them. Selecting PSUs with probability proportional to number of addresses and sampling a fixed number of addresses in each ensures that an efficient (equal probability) sample of addresses is obtained. Once selected the 900 PSUs were randomly allocated to one of two groups; 720 PSUs were allocated to the core sample and 180 PSUs were allocated to the additional child boost sample. The core PSUs contained sampled addresses for both the core and child boost sample, the additional child boost PSUs only.
A similar clustering and stratification is applied in the SHE.
17.2 Introducing the R survey
package
The survey
package has two main purposes. The first is to bind the necessary design metadata to the data so that the correct analysis adjustments can be performed reliably and automatically. The svydesign
function takes this description and adds it to the data set to produce a survey design object. The survey design object is then used in all analyses. Wrapping the design information and the data up in a single object ensures that the design information cannot be inadvertently separated or used with the wrong data set.
Let’s first consider how to specify a weighting variable. Non-response weights are calculated to adjust for non-contact and for refusals of entire sampling units (e.g. households). Values above 1 indicate a respondent is underrepresented and values below 1 indicate a respondent is overrepresented. There are number of weight variables in hse_and_shes12
dataset. We will use wt_GAP
, which is non-response among individuals in responding households for a self-completion questionnaire on gambling. You will often find there are different weight variables depending on the subsample you are using in a dataset. For example, the main HSE dataset has separate weights for respondents who completed a main interview, nurse examianation and gave a blood sample to name but a few. Before creating the survey design object we need to remove respondents without a valid weighting variable value, which are coded as non-applicable (-1)
hse_and_shes12.weighted<-subset(hse_and_shes12, wt_GAP>0)
Now we create the survey design object, where the id equals to the primary sampling units (here we effectively say there are none by specificying the respondent id as the PSUs).
library(survey)
srs_design <- svydesign(id=~PserialC, weights=~wt_GAP, data=hse_and_shes12.weighted)
srs_design
#> Independent Sampling design (with replacement)
#> svydesign(id = ~PserialC, weights = ~wt_GAP, data = hse_and_shes12.weighted)
Once you have your survey design object, many statistical functions (e.g. mean) can be executed by prefixing svy
.
17.2.1 Summary statistics
Let’s estimate the mean of age assuming a simple random sample and without adjusting for non-response weight (i.e. the value you would get if the ask R to recall the mean). We will use describe
from the psych
package beacuse it will recall the standard error of the mean as well as other summary statistics.
library(psych)
describe(hse_and_shes12.weighted$age)
#> hse_and_shes12.weighted$age
#> n missing distinct Info Mean Gmd .05 .10
#> 11774 0 84 1 50.4 21.02 21 25
#> .25 .50 .75 .90 .95
#> 36 50 65 75 80
#>
#> lowest : 16 17 18 19 20, highest: 95 96 97 98 99
Now see what happens when we apply the weight variable.
svymean(~age, srs_design)
#> mean SE
#> age 46.5 0.22
Notice that the point estimate of the mean and its standard error have changed. The mean is now lower, reflecting the fact that fewer younger people responded to the survey.
Next we will start to take into consideration the sample design, starting with the clustering by primary sampling unit (psuC
). We replace the id with the psuC
variable. And then we will report the age taking into account survey clustering. N.B. not all surveys that use complex survey design will provide data on PSUs, especially if they identify small geographical areas. Why do you think that is? Invariably, large-scale government surveys will provide non-response weights in file availble on the UK Data Service.
clus_design<-svydesign(id = ~psuC, data = hse_and_shes12.weighted, weights = ~wt_GAP)
svymean(~age, clus_design)
#> mean SE
#> age 46.5 0.32
The estimate of the mean remains identifical to when only the weighting variable was applied, but the standard error (and therefore the confidence intervals of a point estimate, such as the mean) are much bigger.
You can also run frequency tables using the svy prefix. Let’s try producing a one-way table of Sex
.
svytable(~Sex,clus_design)
#> Sex
#> Refused Don't know Schedule not applicable
#> 0 0 0
#> Item not applicable Male Female
#> 0 5755 6019
Compare the table produced under the assumption of a simple random sample with adjustment for the sample weights.
table(hse_and_shes12.weighted$Sex)
#>
#> Refused Don't know Schedule not applicable
#> 0 0 0
#> Item not applicable Male Female
#> 0 5188 6586
What does the weighting do to the sex distribution? Did you expect that?
You will have noticed there were a number of redundant levels within the Sex
variable (e.g. “Refused”). You can remove these by respecifying a variable as the factor.
levels(hse_and_shes12.weighted$Sex) <- factor(hse_and_shes12.weighted$Sex)
This will not change the levels within your survey design object. You will have to create it again to update the levels of sex.
You can also run two-way tables using the svytables
function or calculate means and proportions by other variables using the svyby
function. The following code runs crosstabulations of sex and whether a respondent has ever been diagnosed with high blood pressure.
svytable(~Sex+everdi,clus_design)
#> everdi
#> Sex Refused Don't know Schedule not applicable
#> Refused 0.000 0.000 0.000
#> Don't know 0.000 0.000 0.000
#> Schedule not applicable 0.000 0.000 0.000
#> Item not applicable 0.000 0.000 0.000
#> Male 0.924 2.098 0.000
#> Female 0.000 1.204 0.000
#> everdi
#> Sex Item not applicable Yes No
#> Refused 0.000 0.000 0.000
#> Don't know 0.000 0.000 0.000
#> Schedule not applicable 0.000 0.000 0.000
#> Item not applicable 0.000 0.000 0.000
#> Male 0.000 379.673 5371.922
#> Female 0.000 344.297 5673.881
svyby(~Sex, ~everbp, clus_design, svymean)
#> everbp SexRefused SexDon't know SexSchedule not applicable
#> Don't know Don't know 0 0 0
#> Yes Yes 0 0 0
#> No No 0 0 0
#> SexItem not applicable SexMale SexFemale se.SexRefused
#> Don't know 0 0.630 0.370 0
#> Yes 0 0.464 0.536 0
#> No 0 0.496 0.504 0
#> se.SexDon't know se.SexSchedule not applicable
#> Don't know 0 0
#> Yes 0 0
#> No 0 0
#> se.SexItem not applicable se.SexMale se.SexFemale
#> Don't know 0 0.13386 0.13386
#> Yes 0 0.00995 0.00995
#> No 0 0.00583 0.00583
17.2.2 Regression
We can also use the our survey design object to adjust regression models for non-response and cluster sampling. Let’s return to the final model we fitted last week predicting mental wellbeing on age and sex and an interaction between these two terms. We will use the glm
function because lm
is not available in the survey
package. The glm
function enables you to fit a whole suite of models with different dependent variable types (e.g. binary, multinominal and count variables). You will encounter these models in Term 2 of POLS0010.
You can adjust for weights without using a survey design object by specifying the weights
option (e.g. weights=wt_GAP) when using the lm
. However, you cannot take into account clustering or stratification.
Let’s adjust for the survey weights using our survey design object.
wemwbs.model.srs<-svyglm(formula = wemwbs~age*Sex, design = srs_design)
summary(wemwbs.model.srs)
#>
#> Call:
#> svyglm(formula = wemwbs ~ age * Sex, design = srs_design)
#>
#> Survey design:
#> svydesign(id = ~PserialC, weights = ~wt_GAP, data = hse_and_shes12.weighted)
#>
#> Coefficients:
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) 24.8322 1.4182 17.51 <2e-16 ***
#> age 0.1676 0.0263 6.37 2e-10 ***
#> SexFemale 5.3801 1.8655 2.88 0.0039 **
#> age:SexFemale -0.1103 0.0349 -3.16 0.0016 **
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> (Dispersion parameter for gaussian family taken to be 719)
#>
#> Number of Fisher Scoring iterations: 2
You will notice that the R output for the model fit information (e.g. r-squared and f-statistics) are excluded from the glm
function. You can call on the Akaike information criterion (AIC) statistic when using glm
by using the function AIC
followed by the object/model name. A lower value indicates better model fit for a nested model (i.e. a model adding additional explanatory variables). The estimates, standard errors, t-statistic and p-value are identical to a model fitted using the lm
function. More importantly, the cofficients will be different because you have taken into account non-response weighting. Try fitting the same regression model using te lm
function and specifying the option weights=wt_GAP
.
wemwbs.model.clus<-svyglm(formula = wemwbs~age*Sex, design = clus_design)
summary(wemwbs.model.clus)
#>
#> Call:
#> svyglm(formula = wemwbs ~ age * Sex, design = clus_design)
#>
#> Survey design:
#> svydesign(id = ~psuC, data = hse_and_shes12.weighted, weights = ~wt_GAP)
#>
#> Coefficients:
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) 24.8322 1.6021 15.50 < 2e-16 ***
#> age 0.1676 0.0282 5.95 3.8e-09 ***
#> SexFemale 5.3801 1.7344 3.10 2e-03 **
#> age:SexFemale -0.1103 0.0331 -3.33 9e-04 ***
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> (Dispersion parameter for gaussian family taken to be 719)
#>
#> Number of Fisher Scoring iterations: 2
Compare the models. Is there any difference? Due to clustering in the selection procedure, individuals are not selected independently. This results in correlation within clusters that can inflate variances of estimates compared to those obtained from a simple random sample (SRS) of the same size. The standard errors are greater in the wemwbs.model.clus
, except for sex.
17.2.3 Stratification
Generally speaking, the effects of clustering on the efficiency of survey estimates tend to be greater than those of stratification, meaning that ignoring stratification can be less of a concern than ignoring clustering, although weights should not be ignored, often even more so where disproportionate stratification has been employed. In the call to svydesign
that describes the survey design there is a new option strata=“stype” that specifies the stratifying variable. The argument nest=TRUE is used to indicate that the PSU identifier needs to be interpreted as nested within stratum: the same PSU id is recycled in different strata. Without nest=TRUE, svydesign() will check that PSUs do not overlap more than one stratum and report an error if they do.
strat_design <- svydesign(id = ~psuC, strata=~stratC, nest=TRUE,data = hse_and_shes12.weighted, weights = ~wt_GAP)
Now let’s run our model taking into account stratification:
wemwbs.model.strat<-svyglm(formula = wemwbs~age*Sex, design = strat_design)
summary(wemwbs.model.strat)
#>
#> Call:
#> svyglm(formula = wemwbs ~ age * Sex, design = strat_design)
#>
#> Survey design:
#> svydesign(id = ~psuC, strata = ~stratC, nest = TRUE, data = hse_and_shes12.weighted,
#> weights = ~wt_GAP)
#>
#> Coefficients:
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) 24.8322 1.6275 15.26 < 2e-16 ***
#> age 0.1676 0.0290 5.77 1.1e-08 ***
#> SexMale 5.3801 1.7985 2.99 0.0029 **
#> age:SexMale -0.1103 0.0341 -3.24 0.0013 **
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> (Dispersion parameter for gaussian family taken to be 719)
#>
#> Number of Fisher Scoring iterations: 2
What do you notice about the difference between the model with and without stratification? The standard errors are slightly smaller when we included stratification, but overall not a big enough change in the current example to alter statistical significance of coefficients.
17.2.4 Strata with only one PSU
A common problem encountered when including stratification in your analysis is that of singleton stratum. When there is only one PSU within a stratum, there is not enough information from which to compute the stratum’s variance, making it impossible to compute the variance of an estimated parameter in a stratified clustered design.
The best way to handle a stratum with only one PSU is to combine it with another stratum, one that is chosen to be similar based on population data available before the study was done. The variance estimates will then be conservative: population estimates from the survey will be more accurate than they would have been if the strata had been combined at the design stage.
When strata with a single PSU cannot be (or at least have not been) combined to fix the problem, the default behaviour of the survey package is to report an error, but it also provides two approximations to fix the problem. These are controlled by options(survey.lonely.psu). Setting options(survey.lonely.psu=“adjust”) gives a conservative variance estimator that uses residuals from the population mean rather than from the stratum mean, and options (survey.lonely.psu = “average”) sets the variance contribution to the average for all strata with more than one PSU. The adjust option is conservative. The average option is intended for designs where the lonely PSUs result from nonresponse and where strata are roughly comparable, e.g., a design with two individuals sampled per stratum. Strata with only a single PSU in a subpopulation are likely to lead to poor variance estimation, and a warning is given. The same adjustments are applied at each level of sampling, e.g. to second-stage strata with only a single SSU. A single-PSU strata can also be dropped from the variance calculation with options (survey.lonely.psu =“remove”).
17.2.5 Incorporating (or ignoring) multi-stage design features in your analysis
Remember that we said the Health Survey for England (HSE) is a multi-stage stratified random sample. The primary sampling unit variable is the postcode sector (psuC
in our dataset) and the Secondary Sampling Unit (SSU) is household (HserialC
in our dataset). The following example considers what happens when we try to specify the secondary sampling unit:
strat_multi_design <- svydesign(id = ~psuC+HserialC, strata=~stratC, nest=TRUE,data = hse_and_shes12.weighted, weights = ~wt_GAP)
Refit your model taking into account the multistage cluster design:
wemwbs.model.multi<-svyglm(formula = wemwbs~age*Sex, design = strat_multi_design)
summary(wemwbs.model.multi)
#>
#> Call:
#> svyglm(formula = wemwbs ~ age * Sex, design = strat_multi_design)
#>
#> Survey design:
#> svydesign(id = ~psuC + HserialC, strata = ~stratC, nest = TRUE,
#> data = hse_and_shes12.weighted, weights = ~wt_GAP)
#>
#> Coefficients:
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) 24.8322 1.6275 15.26 < 2e-16 ***
#> age 0.1676 0.0290 5.77 1.1e-08 ***
#> SexMale 5.3801 1.7985 2.99 0.0029 **
#> age:SexMale -0.1103 0.0341 -3.24 0.0013 **
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> (Dispersion parameter for gaussian family taken to be 719)
#>
#> Number of Fisher Scoring iterations: 2
Notice anything different? No. When using statistics packages that compute complex standard errors from multi-stage clustered samples it is generally only necessary to have a PSU variable in the dataset. Any clustering after the first stage generally does not have to be identified - the variance between PSUs automatically incorporates later stages of clustering. The main exception to this rule is where a multi-stage sample design and Finite Population Correction (FPC) is specified, then information from further sampling units is required. The FPC adjusts design effects downwards to account for the effects of increased sample sizes, although to have a substantial effect a very large sample size is needed. However, because we did not specify a Finite Population Correction (FPC) at the first stage, the sampling information at the second stage is irrelevant to variance estimation. Overall FPC often has little or minor impact on survey research.
17.3 Acknowledgements
This exercise was adapted from Lumley, T. (2010) Complex Surveys: A Guide to Analysis Using R. Wiley and Rafferty (n.d) Introduction to Complex Sample Design: Workbook. ESDS.
17.4 Additional Exercise
Q1a. You are conducting a survey of students at UCL, where you want to estimate the mean number of hours spent on private study. You can either send out a single questionnaire to all students, or send out a questionnaire to about 10% of the student body and send follow-up emais for questionnaires that are not returned. What are disadvantages of each approach?
Q1b. You choose to survey just a sample. What would be useful variables to stratify the sampling, and why?
Q2. Download the National Health and Nutrition Examination Survey (NHANES) 2003 - 2004 Demographic Variables and Sample Weights dataset (nhanes2003to04.csv
) from Moodle. NHANES is a program of studies designed to assess the health and nutritional status of adults and children in the United States, similar to the HSE.
Construct a survey design object with these data. The sampling weights are WTINT2YR
for the main interview, the primary sampling units are identified by SDMVPSU
and the strata by SDMVSTRA
. You might want to look through the coodebook available here.
Q2a. Estimate the mean age RIDAGEYR
of the US adult population.
Q2b. How does mean age vary by sex RIAGENDR
?
Q3. The Living Costs and Food Survey collects information on spending patterns and the cost of living that reflects household budgets across the country. The 2013 survey is documented here. How many stages of sampling are there, and what are the strata and the sampling units at each stage?