# Packages used throughout this chapter
library(tidyverse)
library(survey)

subsetQ2 <- data %>%
  # Select the variable(s) of interest, and the Zone and weights columns
  select(Q2, Zone, WTFINAL) %>%
  # Remove NA values in the weight column
  filter(!is.na(WTFINAL))
Survey Weights & Complex Survey Designs
1 Introduction
In this chapter we explore the role and importance of survey weights in the Delta Residents Survey (DRS) and discuss complex survey design more broadly. In general, quantitative inferential survey analysis draws on a special branch of statistics that accounts for the unique qualities and assumptions of survey methodologies. Be advised that a proper statistical analysis of the DRS cannot be conducted without considering its complex survey design.
Please note that this document is in draft form! Some of the technical notes and citations have yet to be reviewed. Although this chapter was prepared by someone who is trained in statistics, the author is not a subject matter expert in the field of statistics. The guidance and information provided below should be considered provisional.
2 DRS Complex Survey Design
The Delta Residents Survey uses a probability-based sampling plan with stratified sampling across geographical zones, together with survey weights that account for varying response rates across demographic groups. The plan is considered a probability sampling method because it involves random selection within predefined strata (i.e., the Delta zones). Specifically, 100% of the rural primary zone of the Delta was sampled (that is, invited to participate in the survey), and a randomly selected 25% of residential addresses in the suburban/urban secondary and tertiary zones of the Delta were sampled. These sampling rates were determined by project budgetary constraints.
3 Why are weights important, and what are they?
To mitigate issues with bias, we use post-hoc survey weights to estimate the survey population (Heeringa, West, and Berglund 2017, p38). Weights are often used to adjust for differences in selection probabilities, non-response, and self-selection, and to align the sample more closely with the population structure. Weights are needed in most surveys because survey sampling plans usually deviate from Simple Random Sampling (SRS), in which every unit in the population has an equal chance of selection. SRS is often impractical or insufficient for complex survey objectives.
Relative to SRS, the need to apply weights to complex sample survey data changes the approach to estimation of population statistics or model parameters. Also relative to SRS designs, stratification, cluster sampling, and weighting all influence the sizes of standard errors for survey estimates. (Heeringa, West, and Berglund 2017, p26)
In other words, the choices made in designing the survey sampling procedure affect the accuracy and precision of survey estimators (ibid., p26). In the DRS, survey weights are employed to account for different inclusion probabilities due to stratification (i.e., the differential sampling between Delta zones), and for external factors like non-response or self-selection bias.1
In the excerpt below, Heeringa et al discuss one approach to conceptualizing how weights relate to individual survey responses:
A simple but useful device for “visualizing” the role of case-specific weights in survey data analysis is to consider the weight as the number (or share) of the population elements that is represented by the sample observation. Observation i, sampled with probability \(f_i = \frac{1}{10}\), represents 10 individuals in the population (herself and 9 others). (Heeringa, West, and Berglund 2017, p38)
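As a minimal sketch of this idea in R, a base weight can be computed as the inverse of the selection probability. The 1/10 probability comes from the quote above, and the 100% and 25% rates are the DRS zone sampling rates described in Section 2. These are illustrative base weights only: the final DRS weights (WTFINAL) also include the adjustments made when the weights were constructed, so they should not be expected to equal these values.

```r
# Selection probabilities: the quoted example plus the DRS zone sampling rates
f <- c(quoted_example = 1 / 10, primary_zone = 1.00, other_zones = 0.25)

# Base weight = number of population elements each sampled unit represents
base_weight <- 1 / f
print(base_weight)
```

Under these assumptions, a sampled address in the secondary or tertiary zones stands in for four addresses, while a primary-zone address stands in only for itself.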
In summary, survey weights can be conceptualized as a numeric score that increases or decreases each respondent’s importance in building population estimators, like the mean length of residency or the proportion of a categorical variable like types of housing. Survey weights are an important (partial) corrective to biases introduced by a complex survey design and external factors. Given our intent to correctly represent the views expressed by residents of the Delta, we should use survey weights.
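The short base R sketch below illustrates how weights change an estimate such as a mean. The residency values and weights are made up for illustration and are not DRS data.

```r
# Hypothetical respondents: years of residency and a made-up weight for each
years <- c(5, 10, 20, 40, 2)
w     <- c(1, 1, 4, 4, 10)

# Every respondent counts equally
unweighted_mean <- mean(years)

# Each respondent counts in proportion to their weight
weighted_mean <- sum(w * years) / sum(w)

print(c(unweighted = unweighted_mean, weighted = weighted_mean))
```

The heavily weighted respondent with a short residency pulls the weighted mean below the unweighted one, which is the sense in which a weight raises or lowers a respondent's importance.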
If the data are subset by a demographic level (e.g., evaluating how residence time affects sense of place in the Delta for non-English speakers only), then please consider carefully whether weights should still be used.
4 The Construction of DRS Weights
The DRS survey weights were constructed by Marketing Systems Group (MSG).
Weights for this survey were computed using the WgtAdjust procedure of SUDAAN, which relies on a constrained logistic model to predict the likelihood of response as a function of a set of explanatory variables (Marketing Systems Group and Fahimi 2023).2
A more expansive description of the weighting methodology was prepared by MSG and is supplied in the 2023 Summary Report available on the home page.
5 Using DRS Weights: An Example
In the following code chunks, we demonstrate the basic process of developing weighted proportions or frequencies in the R language for a single variable from the survey, Q2. This demonstration is provided to explain how to approach such an analysis; however, we encourage you to use the {cdrs} package in your actual analysis (see Section 6).
In the code that follows, we make the following assumptions:
- The R object data represents the full survey results from the public version of the DRS data set.
- The column Q2, which represents question 2 on the survey, has missing values represented as NA. In other words, the different types of missing values have already been reduced to a uniform NA value. We also assume Q2 started out as the R class factor. Remember, R factors are a data type that represents categorical information with a limited number of levels, e.g. “Lives in the Delta”.
- We primarily rely on two packages: {survey} and {tidyverse}. We assume you have a basic understanding of the {tidyverse}, such as the role of pipes (%>%).
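If your copy of the data does not yet meet the second assumption, the preparation step might look something like the base R sketch below. The missing-value codes and the second response level are hypothetical placeholders, not necessarily those used in the DRS data set.

```r
# Hypothetical raw responses. "<Missing>", "<Decline to answer>" and
# "Works in the Delta" are made-up values used only for illustration.
raw_Q2 <- c("Lives in the Delta", "<Missing>", "Works in the Delta",
            "<Decline to answer>", "Lives in the Delta")
missing_codes <- c("<Missing>", "<Decline to answer>")

# Reduce every kind of missing value to a uniform NA, then store as a factor
Q2 <- raw_Q2
Q2[Q2 %in% missing_codes] <- NA
Q2 <- factor(Q2)

print(levels(Q2))
print(table(Q2, useNA = "ifany"))
```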
First, we create a subset of the data in which we’re interested. We will isolate three columns: the variable of interest Q2, the survey strata Zone, and the final weights WTFINAL.
Second, we stipulate the “survey design”, which describes the design of our complex survey in a format that statistical functions from the {survey} package can utilize. Here, we will explain our reasoning for how we defined each parameter in survey::svydesign(). For more details, please see the documentation for this function.
- ids: This refers to cluster sampling, a method we did not employ. The ~1 simply indicates there is no clustering.
- fpc: This defines the ‘finite population correction’. According to Heeringa et al. (2017), the fpc “reflects the expected reduction in the sampling variance of a survey statistic due to sampling [without replacement]” (p24). We would assume our fpc has little effect (and approaches 1) when our sample is much smaller than the total population (N) we’re estimating. For our survey’s three zones, only Zone 1 has a sample that might be considered important for the fpc: we have 326 responses out of a total population of 11,727 (2.78%). We are thus presented with two options: specify the fpc as NULL, which assumes no correction needs to be made (given that 2.78% may still be considered sufficiently small), or specify the fpc to account for the different populations by Zone.3 We did robustness checks to see if adjusting fpc affected results and we did not see a difference when estimating weighted proportions/frequencies. We believe setting fpc as NULL is both appropriate and beneficial as it is parsimonious.
- data: The subsetted data. Importantly, this data cannot have any missing values for the weight column, WTFINAL.
- strata: The column that specifies the strata, in our case Zone.
- weights: The column that specifies the weights, in our case WTFINAL.
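To make the size of the Zone 1 correction concrete, the arithmetic can be sketched in base R. This assumes the common form of the correction, 1 - n/N, applied to the sampling variance (so its square root applies to the standard error); the two counts are the ones quoted in the fpc item above.

```r
n_zone1 <- 326     # Zone 1 responses (from the text above)
N_zone1 <- 11727   # Zone 1 total population (from the text above)

sampling_fraction <- n_zone1 / N_zone1
fpc_variance <- 1 - sampling_fraction   # factor applied to the sampling variance
fpc_se <- sqrt(fpc_variance)            # factor applied to the standard error

print(round(100 * sampling_fraction, 2))  # sampling fraction, in percent
print(round(fpc_variance, 4))
print(round(fpc_se, 4))
```

Under this assumption the correction would shrink a Zone 1 standard error by roughly 1.4%, which is consistent with the observation that the fpc has little effect here.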
designQ2 <- svydesign(
  # This arg specifies cluster ids. In our case, the DRS has none.
  ids = ~1,
  # The finite population correction. In the DRS case, the fpc is not
  # significant, so it is set to NULL.
  fpc = NULL,
  # The cleaned data set, including the removal of missing values.
  data = subsetQ2,
  # Specify the strata (in our case Zone geographies)
  strata = ~ Zone,
  # Specify the column with weights.
  weights = ~ WTFINAL
)
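For readers who prefer the second fpc option, the three steps described in the footnote on specifying the fpc could be sketched as follows. This is an illustration under stated assumptions rather than the DRS team's own code: the toy data frame is a made-up stand-in for subsetQ2 (its responses and weights are not real DRS values), and only the Zone population totals come from the footnote.

```r
library(dplyr)
library(survey)

# Toy stand-in for subsetQ2 (hypothetical values, NOT real DRS responses)
toy <- data.frame(
  Q2 = factor(c("Yes", "No", "Yes", "No", "Yes", "Yes", "No", "No", "Yes")),
  Zone = factor(rep(c(1, 2, 3), each = 3)),
  WTFINAL = c(1.2, 0.8, 1.0, 4.1, 3.9, 4.0, 3.5, 4.5, 4.0)
)

# 1) Total population per Zone, as given in the footnote
zone_pop <- data.frame(
  Zone = factor(c(1, 2, 3)),
  N = c(11727, 540340, 166085)
)

# 2) Attach the population total to every response in that Zone
toy_N <- dplyr::left_join(toy, zone_pop, by = "Zone")

# 3) Pass the population column as the fpc
design_fpc <- svydesign(
  ids = ~1,
  fpc = ~N,
  data = toy_N,
  strata = ~Zone,
  weights = ~WTFINAL
)

print(svymean(~Q2, design = design_fpc, na.rm = TRUE))
```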
Finally, we can calculate the weighted proportions of the various categories in Q2.
prop <- survey::svymean(
  # x is the formula
  x = ~ Q2,
  design = designQ2,
  # Drop missing Q2 responses before estimating
  na.rm = TRUE
)
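To see what these point estimates amount to, a weighted proportion can be reproduced by hand in base R: sum the weights within each category and divide by the total weight. The values below are a hypothetical toy example, not DRS data, and this sketch covers the point estimates only; standard errors still require the survey design object.

```r
# Toy stand-in (hypothetical responses and weights)
Q2 <- factor(c("Yes", "No", "Yes", NA, "No", "Yes"))
w  <- c(1, 1, 4, 4, 2, 2)

# Drop missing responses, mirroring na.rm = TRUE
keep <- !is.na(Q2)

# Sum of weights per category, divided by the total weight
weighted_prop <- tapply(w[keep], Q2[keep], sum) / sum(w[keep])
print(weighted_prop)
```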
For more advanced users, be aware that you can also write the formula in a way that’s more programmatic. For instance, if you stored the name of your variable of interest, Q2, in a vector (e.g. var_ <- "Q2"), you could write the formula for svymean (or any other formula-taking function) with as.formula(). In this case, we would specify it like this:
svymean(
  x = as.formula(paste0("~", var_)),
  design = designQ2,
  na.rm = TRUE
)
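As an aside, base R's reformulate() offers another way to build the same one-sided formula, and it extends naturally to several variables. The sketch below only constructs and compares the formulas; it does not call any {survey} function.

```r
var_ <- "Q2"

# The approach shown above: paste the name into a string, then convert
f1 <- as.formula(paste0("~", var_))

# An alternative from base R (stats): build the formula from term labels
f2 <- reformulate(var_)

# With several variables, e.g. for a two-way table
f3 <- reformulate(c("Q2", "Zone"))

print(f1)
print(f2)
print(f3)
```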
Here is another example, this time calculating weighted frequencies with a breakdown of the variable Q2 by Zone. To accomplish this, we can use svytable().
freq <- survey::svytable(
  # Note, `x` is now `formula`
  formula = ~ Q2 + Zone,
  design = designQ2
  # And note, svytable does not have `na.rm`
)
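Conceptually, each cell of such a table is the sum of the weights of the respondents who fall in it. A base R sketch with xtabs() shows the idea on a hypothetical toy data frame (not DRS data):

```r
# Toy stand-in (hypothetical responses, zones, and weights)
toy <- data.frame(
  Q2 = factor(c("Yes", "No", "Yes", "No", "Yes", "No")),
  Zone = factor(c(1, 1, 2, 2, 3, 3)),
  WTFINAL = c(1, 2, 4, 4, 3, 5)
)

# Sum the weights within each Q2-by-Zone cell
weighted_freq <- xtabs(WTFINAL ~ Q2 + Zone, data = toy)
print(weighted_freq)

# Column totals: estimated number of population members per Zone
print(colSums(weighted_freq))
```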
6 Using the cdrs package
In practice, we highly recommend you use the package {cdrs} to reduce chances of user error, to reduce coding, and to create more reproducible code documents. Learn more:
References
Footnotes
1. Non-response introduces bias when households selected to be surveyed do not respond. This may occur regardless of the household’s willingness to participate, e.g. when socio-economic barriers affect the timely delivery of surveys. By contrast, self-selection bias occurs when the likelihood of participation in the survey is not random but is influenced by characteristics of the individuals. This bias can lead to certain viewpoints or demographic profiles being overrepresented or underrepresented in the survey data.
2. MSG cites the SUDAAN Manual (Research Triangle Institute 2012), and they likely meant to refer to the WTADJUST procedure. A copy of the SUDAAN manual can be made available by the DRS team upon request.
3. In order to specify the fpc, this is the approach to take: 1) Create a data.frame with the total population (N = c(11727, 540340, 166085)) for each Zone (Zone = factor(c(1, 2, 3))). 2) Merge the DRS data with this data.frame using dplyr::left_join. 3) Specify svydesign such that the parameter fpc equals ~ N, where N is the column with the total population for the three Zones.