• To guide you in dealing with missing and erroneous data in cross-sectional nurse staffing surveys
• To understand that methods of managing missing and erroneous survey data are important for transparent and replicable research
• To ensure surveys are piloted and contain unambiguous questions so that participants understand them
Background Analysis can be problematic in research when data are missing or erroneous. Various methods are available for managing missing and erroneous data, but little is known about which are the best to use when conducting cross-sectional surveys of nurse staffing.
Aim To explore how missing and erroneous data were managed in a study that involved a cross-sectional survey of nurse staffing.
Discussion The article describes a study that used a cross-sectional survey to estimate the ratio of registered nurses to patients, using self-reported data by nurses. It details the techniques used in the study to manage missing and erroneous data and presents the results of the survey before and after the treatment of missing data.
Conclusion Managing missing data effectively and reporting procedures transparently reduces the possibility of bias in a study’s results and increases its reproducibility. Nurse researchers need to understand the methods available to handle missing and erroneous data. Surveys must contain unambiguous questions, as every participant should have the same understanding of a question’s meaning.
Implication for practice Researchers should pilot surveys – even when using validated tools – to ensure participants interpret the questions as intended.
Nurse Researcher. doi: 10.7748/nr.2023.e1878
Peer review: This article has been subject to external double-blind peer review and checked for plagiarism using automated software
Conflict of interest: None declared
Al-Ghraiybah T, Sim J, Fernandez R et al (2023) Managing missing and erroneous data in nurse staffing surveys. Nurse Researcher. doi: 10.7748/nr.2023.e1878
Published online: 30 March 2023
This paper describes methods for dealing with missing and erroneous data in a cross-sectional survey of nurses undertaken as part of a doctoral research project.
Nurse researchers have used various methods to collect data about nurse staffing, nurses’ work environments, nursing care and patient outcomes. One method is the survey, which is a timely, low-cost way of acquiring primary quantitative data from nurses, patients and other healthcare providers (Kelley et al 2003).
Data obtained from self-reported surveys are prone to various sources of error. Missing data is a problem because most statistical models only work with complete observations of the variables (Salgado et al 2016). Researchers must therefore deal with missing data to minimise bias and reach appropriate conclusions (Marston et al 2010).
This includes understanding and reporting on the rate of missing data, the type of missing data and why they are missing, as well as dealing appropriately with missing data during analysis (Sharma et al 2021).
Numerous techniques exist for gathering data about nurse staffing. However, no approaches for dealing with missing data or addressing potential bias in the data collected by cross-sectional surveys have been documented previously (Sherenian et al 2013).
• Nurse researchers should use clear language, and possibly provide example responses, to clarify survey questions
• When cleaning data, nurse researchers should check that the data are consistent and within the expected boundaries when similar responses are anticipated
• They should determine a suitable approach to handle missing data, depending on the reason for data being missing and the intended analysis
• When publishing study findings, nurse researchers should describe how they handled missing and erroneous data
Missing data occur in surveys when respondents accidentally omit questions, cannot provide requested information, prefer not to respond or do not complete the entire survey (Salgado et al 2016, Pedersen et al 2017). They can occur at the unit-level or the item-level (Fox-Wasylyshyn and El-Masri 2005, Yan and Curtin 2010): unit-level missing data happen when all items of a multi-item instrument are absent (Yan and Curtin 2010); item-level missing data occur when participants skip one or more items from a multi-item survey (Groves et al 2009). Unit-level non-response will not be covered further in this paper.
A common cause of item-level missing data is an inadequate understanding of a question’s purpose (Groves et al 2009). Erroneous data provided by respondents are also considered item-level missing data (Brick and Kalton 1996, Čehovin et al 2019).
Item-level missing data can be classified into three types: missing completely at random, missing at random and missing not at random (Rubin 1976, Walani and Cleland 2015, Sharma et al 2021).
If the likelihood of a missing value is the same for all participants, ‘missing completely at random’ occurs (Walani and Cleland 2015, Audet et al 2022). This is not usually a cause of major concern, as it does not cause a systematic bias in the results, although it does reduce the power of the study.
This contrasts with ‘missing at random’, which may result in bias in the results. This happens when the probability of missing data depends on observed information (Audet et al 2022). For example, younger nurses may be less likely to respond to a survey question about experience. Data that are missing at random can be dealt with by imputing the missing data item using available data, such as imputing years of experience using a model based on nurse age.
If the likelihood of there being missing data depends on unobserved data, this is considered ‘missing not at random’ (Pedersen et al 2017). For example, nurses with higher levels of burnout may be less likely to respond to questions about symptoms of burnout, resulting in a systematic bias in the estimated rate of burnout. This cannot be corrected as the unobserved data are unknown and cannot be modelled appropriately.
There are two common approaches to managing missing data: deletion and imputation.
Deletion refers to the exclusion from statistical analysis of participants for whom data are missing (Fox-Wasylyshyn and El-Masri 2005). Deletion strategies include complete case analysis and pair-wise deletion (Jamshidian and Mata 2007, Mirzaei et al 2022).
Complete case analysis – also known as list-wise deletion – is carried out by identifying participants (‘cases’) for whom data are missing for any survey item and excluding them completely from the overall analysis (Fox-Wasylyshyn and El-Masri 2005, Mirzaei et al 2022).
Pair-wise deletion excludes a participant's data only from analyses that require a variable they have not provided – for example, if a participant has not responded to one of two variables used in a bivariate correlation analysis, that pair of data is excluded from the correlation but the participant's other responses are retained (Laaksonen 2018). There is no consensus on an acceptable amount of missing data: some statisticians suggest that an item should be deleted when more than 15% of its responses are missing, while others set the threshold at 40% (Hertel 1976, Raymond and Roberts 1987).
However, the type of analysis and the reason for the data being missing are crucial when choosing between complete case analysis and pair-wise deletion (Allison et al 2014) or deciding whether an alternative strategy is required.
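The two deletion strategies can be sketched with a small pandas example. The data frame and variable names here are hypothetical, invented for illustration, not taken from the study:

```python
import numpy as np
import pandas as pd

# Hypothetical survey responses; NaN marks an item-level missing value
df = pd.DataFrame({
    "years_experience": [5, np.nan, 12, 3],
    "burnout_score": [2.1, 3.4, np.nan, 1.8],
    "job_satisfaction": [4, 3, 5, np.nan],
})

# Complete case (list-wise) deletion: drop any respondent who is
# missing a response to any item
complete_cases = df.dropna()

# Pair-wise deletion: each analysis keeps only the rows complete for the
# variables it needs, e.g. a correlation between experience and burnout
pair = df[["years_experience", "burnout_score"]].dropna()
correlation = pair["years_experience"].corr(pair["burnout_score"])
```

Note how pair-wise deletion retains two respondents for the correlation even though only one respondent answered every item, which is why the choice between the two strategies can change the effective sample size of each analysis.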
Imputation is the replacement of a missing value with a substituted value. Single imputation involves replacing each missing value with an imputed value (Penny and Atkinson 2012) while multiple imputation requires creating multiple values for each missing value, resulting in multiple imputed datasets (Groves et al 2009). Single imputation is quicker to implement and simpler for analysis, as there is a single dataset (Gómez-Carracedo et al 2014, Zhang 2016).
Two modelling methods for single imputation are ‘mean imputation’ and ‘regression imputation’ (Little and Rubin 2014). Mean imputation involves substituting sample means of a variable for each missing observation of that variable. Means may be calculated using classes, such as hospitals or wards.
Regression imputation extends mean imputation by substituting missing values with predicted values from a regression of the missing item on one or more observed items, generated from participants with observed and missing variables (Musil et al 2002, Little and Rubin 2014).
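As a rough sketch of these two single-imputation methods, assuming hypothetical age and experience data (the example and its values are ours, not the study's):

```python
import numpy as np
import pandas as pd

# Hypothetical data: some respondents omitted their years of experience
df = pd.DataFrame({
    "age": [25, 32, 41, 29, 50, 36],
    "experience": [3, 9, np.nan, 6, np.nan, 13],
})

# Mean imputation: replace each missing value with the observed sample mean
mean_imputed = df["experience"].fillna(df["experience"].mean())

# Regression imputation: predict missing experience from age using a
# least-squares fit estimated on the complete cases only
obs = df.dropna()
slope, intercept = np.polyfit(obs["age"], obs["experience"], 1)
reg_imputed = df["experience"].fillna(intercept + slope * df["age"])
```

Regression imputation uses the relationship between variables, so it can give more plausible values than a single overall mean when the missing item is correlated with an observed one.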
Nurse staffing in New South Wales (NSW) in Australia is regulated by industrial agreements that specify a minimum number of nursing hours per patient day (NSW Nurses and Midwives’ Association 2021). This regulation applies to tertiary teaching hospitals and mandates a registered nurse-patient ratio in a ward of 1:4 during morning and afternoon shifts and 1:7 during night shifts. Critical care wards such as intensive care units (ICU) are not subject to the same ratios but generally have a 1:1 ratio for all shifts.
The study described in this article was undertaken as part of a doctoral research project to investigate the relationship between the nursing practice environment and patient outcomes in 16 medical, surgical and sub-acute wards and one ICU in a large tertiary teaching hospital in NSW. Six validated tools with a total of 41 items were used to gather information about nurse staffing, nursing practice environments, missed care, care left undone, nurses’ perceptions of quality and safety, nurses’ perceptions of the occurrence of adverse events, working patterns, experience, work-related burnout, job satisfaction and whether nurses intended to leave their current job or profession over the next one to five years.
The survey was administered using PaperSurvey.iO. This online software enables paper surveys to be printed, the completed surveys to be scanned and the responses determined automatically using optical mark recognition, removing the need for manual data entry.
The survey asked a series of questions to help determine the ratio of nurses to patients on each shift. The survey measured workload by asking respondents how many patients there were on their wards on their latest morning, afternoon and night shifts. The staffing levels were measured by asking how many registered nurses (RNs), enrolled nurses and assistants in nursing were working on these shifts.
This section describes how missing and erroneous data were managed when calculating the number of RNs, the number of patients and the RN-patient ratio on the morning shift. Figure 1 includes some of the questions intended for those working the morning shift.
The first author (TA) initially verified the data on the paper forms. This included checking for missing pages.
All the paper-based surveys were then uploaded to the PaperSurvey.iO website for optical character recognition. TA validated the resulting data to determine if the software had been unable to read any items. Unreadable data included messy writing, entries written outside the boxes and entries where participants had corrected their original responses.
The collated data were then reviewed by the first author at the ward level to assess for missing data or errors or inconsistencies in responses in specific wards.
Single value group mean imputation (Brick and Kalton 1996) was used to treat missing and erroneous responses. We chose this method because we could use the information available from respondents in each ward and on each shift to estimate missing values and accurately predict the variables of interest.
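A minimal sketch of single value group mean imputation with pandas, using invented ward-level patient counts rather than the study's data:

```python
import numpy as np
import pandas as pd

# Hypothetical responses: patient counts per ward on the morning shift,
# with NaN for entries flagged as missing or erroneous during cleaning
df = pd.DataFrame({
    "ward": ["A", "A", "A", "B", "B", "B"],
    "patients_am": [28, np.nan, 30, 22, 24, np.nan],
})

# Single value group mean imputation: replace each missing value with the
# mean of the observed responses from the same ward
df["patients_am"] = df.groupby("ward")["patients_am"].transform(
    lambda s: s.fillna(s.mean())
)
```

Grouping by ward (and, in the study, by shift) means each imputed value is drawn from respondents who were describing the same staffing situation, rather than from an overall sample mean.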
All data were exported to SAS 9.4 for analysis. Frequencies were computed for the number of patients and the number of RNs in the ward per shift. The patient-RN ratio was calculated for each ward and shift by dividing the corresponding number of patients by the number of RNs; it was also calculated for each respondent. We considered responses where the patient-RN ratio was less than 3:1 to require further assessment unless there was evidence of consistent responses below this threshold in the same ward.
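Although the study used SAS 9.4, the ratio calculation and threshold flagging can be sketched in Python; the respondent counts below are hypothetical, and the 3:1 cut-off follows the rule described above:

```python
import pandas as pd

# Hypothetical per-respondent counts for one shift; the second respondent
# may have reported only their own patients rather than the ward total
df = pd.DataFrame({
    "ward": ["A", "A", "B", "B"],
    "patients": [28, 4, 24, 26],
    "rns": [7, 2, 6, 6],
})

# Patient-RN ratio per respondent; ratios below 3 are flagged for further
# assessment unless the whole ward consistently reports below the threshold
df["ratio"] = df["patients"] / df["rns"]
df["flag"] = df["ratio"] < 3
```

Flagged responses are candidates for review, not automatic deletion: as described below, responses close to the ward mean were judged consistent and retained.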
The impact of the method used to manage missing and erroneous data on the patient-RN ratio was assessed for three approaches:
• All reported data, including inconsistent responses.
• Complete data only, deleting inconsistent responses.
• All reported data, with mean imputation for missing and inconsistent responses.
The mean and standard errors for the estimated number of patients per RN in each ward were compared for each approach.
A total of 361 nurses from 17 wards responded to the survey; 94.7% (n=342) of the responses had no missing data.
Data were missing from a total of 35 survey items in the other 5.3% (n=19) of responses. Surveys from seven wards were missing data about the numbers of patients or RNs on the ward; 19 surveys were missing the number of patients, while 16 surveys were missing the number of RNs (Table 1).
We compared the RN-patient ratios reported in each ward and found that 22.2% (n=76) of the 342 surveys not missing any data nevertheless potentially contained incorrect data (Figure 2). For example, 11.5% (n=3) of respondents on ward B reported that there were four patients on their latest day shift, while the remaining 88.5% (n=23) of participants reported there were 26-31 patients. We postulated that some respondents may have indicated how many patients they cared for themselves, rather than the total number of patients on the ward.
To identify the extent of this problem, we calculated the RN-patient ratio for each of the 16 medical, surgical and sub-acute wards (Table 2). We then applied a threshold ratio of 3:1, to highlight potentially erroneous data, and found 66 surveys reported ratios below that threshold.
Further analysis was required for three additional surveys in wards F, N and P because the responses given concerning the numbers of patients and RNs were close to the ward mean. We determined these three surveys were not erroneous, so we did not include them in imputation. Wards F and G had the lowest mean RN-patient ratio, which was 0.6.
A different set of criteria was used for Ward L – the ICU – as the RN-patient ratio was expected to be approximately 1:1. Using those criteria, we found 10 of the 53 surveys contained erroneous data.
In total, we found that 73 surveys (21.3%) – 63 from the medical, surgical and sub-acute wards and 10 from the ICU – were inconsistent and contained erroneous data.
The mean numbers of patients and RNs at the ward level were used to impute 92 (25.5%) surveys: the 19 that were missing data and the 73 judged to contain erroneous data. A total of 115 data items needed imputation: 87 patient numbers and 28 RN numbers (Figure 2).
Following imputation, the mean number of patients increased in all but one ward (Table 3). For example, the mean number of patients in Ward E before imputation was 20.5 (excluding missing data) but was 30.0 following imputation.
The mean number of RNs changed in Wards E, M and L: one response from both Wards E and M was imputed, while nine from Ward L were imputed.
The RN-patient ratio for each ward when using the raw data was compared to the same ratio following imputation (Table 3). In the 16 main wards, the pre-imputation RN-patient ratio was lowest in Ward F (a surgical ward) (mean: 2.7; standard error of the mean (SEM): 0.3; range: 0.4-4.7) and highest in Ward A (a medical ward) (mean: 4.8, SEM: 0.6, range: 2.0-9.0).
Post-imputation, the RN-patient ratio increased in all wards except L and Q. Figure 3 shows the mean number of patients per RN and the 95% confidence interval for each ward pre- and post-imputation, ordered from the highest to the lowest mean number of RNs per patient in the raw data. The largest change to the estimated patient-RN ratio occurred in Ward E, where it increased from 3.2:1 to 4.66:1; the smallest change was in Ward F, where it increased from 2.7:1 to 3.6:1.
The confidence intervals were narrower after imputation for missing and erroneous data. For instance, the 95% confidence interval was reduced from 1.0 to 0.2 post-imputation in Ward E. This is expected when missing and erroneous data are imputed using the mean. The number of patients per RN should be similar across shifts, so it is also a desirable outcome.
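A small numerical sketch of why mean imputation narrows the confidence interval: values imputed at the mean add no spread while increasing n, so the standard error of the mean shrinks. The numbers here are invented for illustration:

```python
import numpy as np

# Hypothetical patient-RN ratios observed in one ward, with two missing
# responses filled in at the observed mean
observed = np.array([3.0, 3.4, 2.8, 3.2])
imputed = np.append(observed, [observed.mean()] * 2)

def sem(x):
    """Standard error of the mean, using the sample standard deviation."""
    return x.std(ddof=1) / np.sqrt(len(x))

# The imputed sample has the same mean, but a smaller standard error:
# the sum of squared deviations is unchanged while n grows from 4 to 6
```

This is why mean imputation is best suited to situations like this one, where responses within a ward and shift are genuinely expected to be similar; otherwise the narrowed interval can overstate the precision of the estimate.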
In our study, the item-level missing data were either missing (n=19; 5.3%) or erroneous (n=73; 21.3%). The erroneous data were likely to have occurred because respondents did not understand the question’s purpose (Groves et al 2009) or misinterpreted the questions being asked. For example, the erroneous responses to the question about how many patients were receiving care (Figure 1) may have resulted from respondents misinterpreting the question and providing the number of patients they had seen themselves.
This illustrates that how you present questions can greatly alter their interpretation. It is important to consider how you phrase and frame questions (Audet et al 2022). We recommend that you use simple language that clarifies the question and potentially include an example response; multiple choice answers can also reduce the number of missing or erroneous entries (Keough and Tanabe 2011). You should also pilot surveys in a relevant population, even if you are using a previously validated tool.
Having accurate data that fall within a predicted range is crucial for appropriate inference. How you handle missing and erroneous data is critical to your analysis and to making valid inferences. When cleaning data, check that they are consistent and within the boundaries you expect. For example, in our study, the number of patients reported in Ward B ranged between four and 30; after imputation, the number ranged between 26 and 30.
Missing data are typically a cause of frustration when you are analysing data. The extent to which erroneous data may impair your conclusions is based on the type and quantity of the discrepancy (Fox-Wasylyshyn and El-Masri 2005). The pattern and degree of error in the study discussed in this article had the potential to jeopardise its findings as the errors could have led to our underestimating how many patients there were per nurse, which may have affected the accuracy of this important explanatory variable. Our strategy for resolving missing and erroneous data increased the quality of the data we collected from the participants and improved our estimates of the RN-patient ratio in each ward.
Guidance is available about how to manage missing data; however, there is limited information available about how to handle inconsistent and erroneous data. This article addresses this issue using the example of a doctoral study examining nurse staffing. It outlines a systematic approach to assessing and handling inconsistent data and is to our knowledge the first to detail the variations pre- and post-imputation of missing data in cross-sectional surveys that estimate RN-patient ratios using data reported by nurses.
Research findings must be the result of rigorous and reproducible methods, to ensure that research evidence can be appropriately applied in clinical practice. It is important when reporting research findings to publish information about how missing data were handled and to describe the techniques used to manage missing and erroneous data. Managing missing data effectively and reporting procedures transparently reduces the possibility of results being biased and increases the reproducibility of research.