Types of Missing Data and Ordinal Variables

Published: Last updated in February 2024

In this chapter, we review the types of missing data and ordered (or ordinal) variables present in the Delta Residents Survey (DRS) data set. As this documentation is intended to serve all versions of the DRS data set, you may find some descriptions that do not apply to the version to which you have access. Regardless, this chapter should answer questions about whichever data set you have.

1 Automated Data Schema Handling

Before exploring the complexities of the DRS data and the data product, the reader should be aware that the {cdrs} package simplifies the task of wrangling our complex data schema. The data analyst need only be aware of the kinds of values present in the DRS data. By default, cdrs_read() loads the DRS data into R without requiring you to devise an algorithm yourself. This function corrects ordinal categorical factors and converts most missing values to NA. Only values that express uncertainty, e.g. “I don’t know”, are left untouched. The default settings of cdrs_read() are intended to ease the application of {survey} package functions (e.g. survey::svymean) that estimate population statistics. The bulk of the heavy lifting is done by the function cdrs_revise(). We recommend you read the documentation for both functions (see footnote 1) and the following sections so that you may manually alter parameters as needed.

2 History of DRS Data

The DRS involved a large team of collaborators, advisers, and contractors. The survey was implemented through the Qualtrics web service, while the de-identification and weighting of the data were performed by the team at the Institute for Social Research (ISR) at the California State University, Sacramento, with the involvement of their sub-contractor Marketing Systems Group (MSG). As such, the original data collection and a portion of the data preparation were completed before the authors of this chapter had access to the data set. Although it is not strictly accurate to describe the product of ISR’s work as “raw data,” this is how we refer to it for the remainder of this chapter.

The raw data was provided to us in the SPSS .sav format. As non-SPSS users, the core DRS team has little familiarity with the intricacies of this data format. We found that the .sav format takes a distinctive approach to conveying metadata, one that is moderately challenging to access in the programming language we use, R. The R language does not have an innate capacity to handle .sav files, and as such we used the {haven} package. While both {haven} and the SPSS .sav format are sufficiently documented, we considered that the .sav format might pose a barrier to researchers interested in the DRS data set, as it requires a basic understanding of object-oriented programming systems. In order to reduce this burden, we chose to supply the data in a simpler form (as a CSV), and to provide the R package {cdrs}, which does the heavy lifting with regard to accessing the advanced features of .sav data.

The primary concern with the .sav format and its manipulation in R is the manner in which certain metadata are embedded directly into each column of the table of data. For instance, an ordinal, diverging, or Likert-scale variable, with values that range from "Very Satisfied" to "Very Dissatisfied", has an inherent structure that affects the way the data are processed. Since the order matters, each “label” (e.g. "Satisfied") is coupled with a numeric encoding (e.g. 2). Additionally, Qualtrics provides nuanced information regarding how survey respondents choose not to respond. This matter is discussed at length in the next section. In summary, the raw .sav format has a complex data structure that encodes both the order of categories and several distinct ways of conveying missing data, and this poses a challenge to casual R users.

3 Final Data Format

The final data product that the DRS team provides primarily consists of a Comma-Separated Values file, or CSV. These files are known as “flat files”, meaning they are simple text files that rely on a combination of commas and new lines to convey a tabular data structure. In simpler terms, CSV files can be opened in any basic text editor because they are written as plain text. The pitfall with CSV files is that they do not convey any information on the attributes of the columns present in the table. While R’s read.csv() and readr::read_csv() have options for classifying a column by the base data types in the language, e.g. the R class “character” or “factor”, the CSV format itself does not incorporate this information directly. This leaves some algorithm (i.e. function) to do the guesswork of determining which R class is appropriate, or the user to define it manually. These issues are further discussed in the following two sections.
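As a minimal sketch of this limitation, the base R snippet below reads the same CSV text twice: once letting read.csv() guess the column classes, and once declaring them by hand through the colClasses argument. The column names id and answer, and the toy values, are made up for illustration and are not taken from the DRS data.

```r
# Toy CSV text; the columns `id` and `answer` are hypothetical.
csv_text <- "id,answer\n1,Satisfied\n2,Unsure\n3,Satisfied"

# 1. Let read.csv() guess the classes. stringsAsFactors is set explicitly
#    because its default changed in R 4.0.0.
guessed <- read.csv(text = csv_text, stringsAsFactors = FALSE)
sapply(guessed, class)

# 2. Declare the classes manually, since the CSV itself cannot carry them.
declared <- read.csv(
  text = csv_text,
  colClasses = c(id = "character", answer = "factor")
)
sapply(declared, class)
```

The same text yields different R classes depending on who decides: the guessing algorithm or the user. Neither outcome is recorded in the file itself.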

3.1 Ordinal Variables

In the case where one reads a CSV file and applies the class “factor” to a column, R correctly converts the column into a categorical variable. However, the order it applies has nothing to do with the intended scale: base R’s factor() sorts the levels alphabetically, while readr’s col_factor(), when no levels are supplied, uses the order in which each category first appears. For instance, the command factor(c("Satisfactory", "Unsure", "Very Satisfactory")) yields a factor (or categorical) variable in R whose “levels” are sorted alphabetically, which here happens to match the order in which they were written. In this case, the first level is "Satisfactory", even though it should be "Very Satisfactory". This classification also omits levels (or categories), like "Unsatisfactory", that are absent from the values passed to factor(). While we can manually adjust all of these issues, the important point is that these matters are not addressed directly by the CSV format.
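The behavior described above can be checked with a short base R sketch. The explicit level vector below is an assumption made for illustration (in particular, placing "Unsure" last is our choice here, not a rule taken from the DRS metadata).

```r
responses <- c("Satisfactory", "Unsure", "Very Satisfactory")

# Default behaviour: the levels are derived from the data alone,
# so "Satisfactory" comes first and "Unsatisfactory" does not exist.
f_default <- factor(responses)
levels(f_default)

# Explicit levels: the order is stated by hand and may include
# categories that never occur in the data.
scale_levels <- c("Very Satisfactory", "Satisfactory",
                  "Unsatisfactory", "Unsure")
f_explicit <- factor(responses, levels = scale_levels)
levels(f_explicit)

# The unused level is kept and reported with a count of zero.
table(f_explicit)
```

Supplying the levels manually is exactly the information a CSV file cannot carry on its own.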

3.2 Missing Data

Qualtrics has the ability to convey missing data according to the way its absence is recorded by the web service. These include matters like, “Did the respondent see the question, but then decline to answer it?” Or, “Did the respondent not see the question?” These different types of missing data (described in other DRS documents as “missingness”) are embedded into the structure of the raw .sav data in a way that cannot be reproduced by a CSV file. In the following sections we discuss the way in which the DRS team accommodated issues around order and missingness, and we provide a list of all the different types of missing data.

4 The Schema of the DRS Data Product

The final data output of the DRS project is a combination of three files:

  1. A metadata file written as a Microsoft xlsx file.

  2. A CSV data file.

  3. A hash value stored in a txt file.

The purpose of the hash file is not discussed at length in this chapter, but a brief description is provided in footnote 2. As discussed previously, the CSV file contains the bulk of the recorded survey data; however, the structure of this data is stored in the xlsx file. The xlsx, or metadata, file contains information on each column/variable in the CSV file. Each variable in the metadata includes, wherever appropriate, the order and factors of the variable. As a means to reproduce the different types of missing data as recorded by Qualtrics, we provide different categories/factors for variables that have this information.
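To illustrate how a separate metadata table can restore the order that a CSV file loses, here is a minimal base R sketch. Everything in it is hypothetical: the metadata columns (variable, label, order), the variable name Q_example, and the helper apply_levels() are invented for this example and do not reflect the actual layout of the xlsx file or the internals of {cdrs}.

```r
# Hypothetical metadata: one row per (variable, label) with its rank.
# These column names are illustrative only, not the real xlsx layout.
meta <- data.frame(
  variable = c("Q_example", "Q_example", "Q_example"),
  label    = c("Very Dissatisfied", "Very Satisfied", "Satisfied"),
  order    = c(3L, 1L, 2L),
  stringsAsFactors = FALSE
)

# Toy stand-in for the CSV data, read as plain character.
dat <- data.frame(
  Q_example = c("Satisfied", "Very Dissatisfied", "Satisfied"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: convert each described column to an ordered
# factor whose levels follow the metadata's `order` column.
apply_levels <- function(data, meta) {
  for (v in intersect(names(data), unique(meta$variable))) {
    m <- meta[meta$variable == v, ]
    data[[v]] <- factor(data[[v]],
                        levels = m$label[order(m$order)],
                        ordered = TRUE)
  }
  data
}

dat <- apply_levels(dat, meta)
levels(dat$Q_example)
```

In practice, cdrs_read() is meant to spare you from writing this kind of code by hand.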

4.1 Missingness

Broadly speaking, we need to account for four categories by which missing values appear in the DRS data. We provide a table below that explores each category and its various permutations. It should be noted that the column “R Value” represents the way the missing data would appear in the comma-separated values (CSV) file, or in R once it is read using read.csv(). The column labeled “Algorithm” describes the way in which the cdrs::cdrs_revise() function handles each “R Value”.

| Category | R Value | Description | Algorithm |
|---|---|---|---|
| System Missing | NA | These values indicate survey respondents never read the question. Or stated differently, respondents were never shown the question. Typically, this means the respondents ended the survey without completing it. | As is. (This is why we must always utilize the parameter na.rm = T when using functions from the {survey} package.) |
| Refused | <Decline to answer> | Originally, this value was recorded as 99 in the raw data. It means respondents read the question, but declined to select or enter a response, or a button to “Decline to answer” was selected. | By specifying preserve_refused = F, <Decline to answer> in categorical columns are converted to NA. By specifying preserve_factor_numeric = F, these values are converted to NA for numeric entry fields. |
| Refused | <Not applicable> | Originally, this value was recorded as 98 in the raw data. It typically means the respondent decided to end the survey on Question 39, which asks if the respondent would like to answer supplemental questions. However, in one case, Q1a, this can mean the respondent entered “NA”. | By specifying preserve_refused = F, <Not applicable> in categorical columns are converted to NA. By specifying preserve_factor_numeric = F, these values are converted to NA for the numeric entry field Q1a. |
| Uncertainty | <I don't know>, <I Dont know why the Delta is important>, <Unsure> | Originally, this value was recorded as 97, with varying textual descriptions in the raw data. These values were selected from a list of options for a question on the survey. In other words, it was a choice in a multiple-choice question. Unlike some instances of the previous missing values, this one was intentionally selected by the respondent. | By specifying preserve_uncertainty = F, these values are converted to NA. |
| Editorial Omission | <Missing> | These values did not exist in any form on the Qualtrics survey. Rather, in the process of building variables that align with census or marketing data, e.g. converting the question on gender to a gender binary, missing values or identifiable segments were recorded as 99 in the raw data. | By specifying preserve_editorials = F, these values are converted to NA. |
| Editorial Omission | <Erased> | This value was added in the process of creating the public data set to reduce identifiability for the geoid.county column. | |
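The general idea of turning placeholder values into NA can be sketched in base R as follows. This is only an illustration of the concept under our own assumptions: the helper to_na() and the toy answers vector are hypothetical, and the code is not how cdrs_revise() is implemented. The placeholder strings are grouped by the categories listed in the table above.

```r
# Toy column mixing real answers with placeholder values.
answers <- c("Satisfied", "<Decline to answer>", "<I don't know>",
             NA, "<Not applicable>", "Very Satisfied", "<Missing>")

# Placeholders grouped by the table's categories.
placeholders <- list(
  refused     = c("<Decline to answer>", "<Not applicable>"),
  uncertainty = c("<I don't know>",
                  "<I Dont know why the Delta is important>",
                  "<Unsure>"),
  editorial   = c("<Missing>", "<Erased>")
)

# Hypothetical helper (not part of {cdrs}): set every placeholder
# belonging to the chosen categories to NA.
to_na <- function(x, categories) {
  drop <- unlist(placeholders[categories], use.names = FALSE)
  x[x %in% drop] <- NA
  x
}

# Drop refusals and editorial omissions, but keep uncertainty.
cleaned <- to_na(answers, c("refused", "editorial"))
table(cleaned, useNA = "ifany")
```

Keeping the uncertainty responses while dropping the others mirrors the default described earlier in this chapter, where only values such as “I don’t know” are left untouched.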

5 Conclusion

The DRS data was rendered as a CSV file with a separate metadata xlsx file to convey the complex structure of the data, including the order of factors and the various types of missing data.

As a parting note, please use the {cdrs} package’s cdrs_read(), as it produces a properly structured data set and reduces the burden when sharing reproducible code.

Footnotes

  1. Run ?cdrs::cdrs_revise and ?cdrs::cdrs_read after you’ve installed the {cdrs} package.

  2. The hash value stored in this txt file was produced by running the SHA-256 cryptographic hash algorithm, using digest::digest(), on the CSV file when it was initially created. In simpler terms, this hash value provides a way to confirm that the original data file has not been altered in transit from the computer that generated the CSV file (i.e. the author’s computer) to the end user’s computer. Due to the nature of electronic storage, data can become corrupted, whether in the process of transmitting the data over the internet, reading or writing it, or when the hardware itself is exposed to cosmic radiation that alters the circuitry. By applying a cryptographic hash function to a file, we are able to provide a brief description of the content in a succinct form: a 64-character series of numbers and letters. Should an error in the data occur, the hash function applied to this error-ridden data set would almost certainly yield a different hash value. While it is possible for two different files to share the same hash value, this is extremely improbable. As such, this is a simple and quick way to confirm that you have a good copy of the DRS data set.