Missing Observations in Survey Data: An Experimental Approach (CBR project)

Overview

Aims and objectives

In this short research project our principal objective was to construct a quasi-experimental framework for examining the performance of a number of imputation methods applied to the problem of non-response in business survey data, focusing upon missing data in firm-level profits. Analogous to the problem of evaluating non-experimental methods for controlling for sample selection, we adopted a methodology that exploits the availability of a comparable profit series with no missing data. Since this series has observed profit entries for a high percentage of the firms in the CBR SME survey, covering both respondents and non-respondents, we could use it as a reliable benchmark. We evaluated the performance of a number of imputation procedures. First, we used the simple conditional expectation approach, which assumes that non-response is ignorable and imputes missing profit data using a model based solely upon the sub-sample of respondents. Second, we evaluated the hot-deck approach, which uses the principle of matching to identify donor respondents for missing data. We compared these approaches with a procedure which allows for a non-ignorable response mechanism and is based upon joint estimation of the parameters of a probability model of non-response and a structural profits equation. Given that we had access to a fully observed profit series, we could evaluate the extent of the imputation bias under each approach.
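The evaluation design described above can be illustrated with a minimal Python sketch. It is not the project's code and does not use the CBR or ICC data: the firms are synthetic, the single matching variable (firm size), the response rule and all function names are assumptions made for illustration, and the non-ignorable joint model is omitted. Profits are deleted for some firms, imputed by conditional expectation and by a nearest-neighbour hot deck, and then compared with the retained "benchmark" values.

```python
import random
import statistics


def fit_ols(xs, ys):
    """One-regressor least squares; returns (intercept, slope)."""
    mean_x, mean_y = statistics.fmean(xs), statistics.fmean(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def regression_impute(respondents, missing_sizes):
    """Conditional expectation: fit on respondents only, predict the rest."""
    intercept, slope = fit_ols([s for s, _ in respondents],
                               [p for _, p in respondents])
    return [intercept + slope * s for s in missing_sizes]


def hot_deck_impute(respondents, missing_sizes):
    """Nearest-neighbour hot deck: copy the profit of the closest donor."""
    return [min(respondents, key=lambda donor: abs(donor[0] - s))[1]
            for s in missing_sizes]


def mean_bias(imputed, benchmark):
    """Average of (imputed - benchmark) over the non-respondents."""
    return statistics.fmean(i - b for i, b in zip(imputed, benchmark))


def simulate(n=500, seed=0):
    """Synthetic firms; response depends on size only (ignorable)."""
    rng = random.Random(seed)
    respondents, missing_sizes, benchmark = [], [], []
    for _ in range(n):
        size = rng.uniform(1.0, 10.0)
        profit = 2.0 + 1.5 * size + rng.gauss(0.0, 1.0)
        if rng.random() < 0.1 + 0.04 * size:  # larger firms respond less
            missing_sizes.append(size)
            benchmark.append(profit)  # known only through the benchmark
        else:
            respondents.append((size, profit))
    return respondents, missing_sizes, benchmark


respondents, missing_sizes, benchmark = simulate()
reg_bias = mean_bias(regression_impute(respondents, missing_sizes), benchmark)
deck_bias = mean_bias(hot_deck_impute(respondents, missing_sizes), benchmark)
print(f"regression bias: {reg_bias:.3f}, hot-deck bias: {deck_bias:.3f}")
```

Because non-response in this toy example depends only on an observed variable, both methods should show little bias. Making the response rule depend on profit itself would be one way to mimic the non-ignorable case that the joint model is intended to handle.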

The drawback of most standard imputation techniques is that the process of replacing missing values with fitted values ignores any residual variation, insofar as the imputed values are subsequently treated as if they were known. As a result the unconditional variance (and the conditional variance in regression) will be less than that of the original unobserved series, with similar implications for estimates of the standard errors of model parameters. Thus, we also examined a number of techniques designed to assess the variability of parameter estimates where a subset of the data points has been imputed. Based upon the work of Rubin, multiple imputation proceeds by nesting the estimation procedure within an outer loop which iterates over multiple random imputations of the missing data, rather than a single imputed value. In this way, we were able to construct a distribution of likely imputed values and thereby properly integrate this element of uncertainty in model estimation.
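The contrast between single and multiple imputation can be sketched in Python as follows. This is an illustrative simplification on synthetic data, not the procedure used in the project: the variable names, the choice of the mean profit as the estimand and the residual-resampling step are assumptions, and the regression coefficients are held fixed across imputations, whereas a proper multiple imputation in Rubin's sense would also draw the model parameters. The pooling step follows Rubin's combining rules: total variance equals the average within-imputation variance plus (1 + 1/m) times the between-imputation variance.

```python
import random
import statistics


def fit_ols(xs, ys):
    """One-regressor least squares; returns (intercept, slope)."""
    mean_x, mean_y = statistics.fmean(xs), statistics.fmean(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def rubin_combine(estimates, variances):
    """Pool m completed-data estimates and their variances."""
    m = len(estimates)
    pooled = statistics.fmean(estimates)
    within = statistics.fmean(variances)      # average sampling variance
    between = statistics.variance(estimates)  # spread across imputations
    return pooled, within + (1.0 + 1.0 / m) * between


def mean_and_variance(values):
    """Sample mean and the usual variance of that mean."""
    return statistics.fmean(values), statistics.variance(values) / len(values)


# Synthetic firms: profit depends on size; about 30% of profits are missing.
rng = random.Random(1)
sizes = [rng.uniform(1.0, 10.0) for _ in range(500)]
profits = [2.0 + 1.5 * s + rng.gauss(0.0, 1.0) for s in sizes]
observed = [rng.random() > 0.3 for _ in sizes]

obs_x = [s for s, o in zip(sizes, observed) if o]
obs_y = [p for p, o in zip(profits, observed) if o]
mis_x = [s for s, o in zip(sizes, observed) if not o]

intercept, slope = fit_ols(obs_x, obs_y)
residuals = [y - (intercept + slope * x) for x, y in zip(obs_x, obs_y)]

# Single imputation: fitted values are treated as if they were known.
single = obs_y + [intercept + slope * x for x in mis_x]
single_mean, single_var = mean_and_variance(single)

# Multiple imputation: add a randomly drawn residual, repeat m times, pool.
m = 20
estimates, variances = [], []
for _ in range(m):
    completed = obs_y + [intercept + slope * x + rng.choice(residuals)
                         for x in mis_x]
    estimate, variance = mean_and_variance(completed)
    estimates.append(estimate)
    variances.append(variance)
mi_mean, mi_var = rubin_combine(estimates, variances)

print(f"single: mean={single_mean:.3f}, var={single_var:.5f}")
print(f"pooled: mean={mi_mean:.3f}, var={mi_var:.5f}")
```

In this sketch the pooled variance exceeds the single-imputation variance, which is the understatement of uncertainty that the paragraph above describes.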

Progress and findings

We carried out an extensive survey of the missing data literature (Weeks, 2000). This survey critically evaluates various approaches to imputation and, in addition, considers the availability of software to treat missing value problems. We then assembled the necessary data to carry out the evaluative empirical work using the ICC and CBR SME databases. The techniques evaluated ranged from simple regression-based methods, with and without corrections for non-random non-response, to matching procedures designed to locate firms which, if similar on a set of observed firm-level attributes, may (given a number of assumptions) be used as donors, i.e. as suppliers of profit data for otherwise similar firms which do not report profits.

The principal problem encountered in undertaking the research turned out to be the presence of inconsistencies in the reported profits for the same firm in the two data series where profit data was present in both. This is obviously important given that we used the ICC data as the benchmark against which to evaluate the performance of the different imputation techniques in predicting missing values in the CBR series. Thus, in interpreting our results we recognise that imputation error will contain two components: a constant term (across techniques) reflecting the lack of comparability, and true prediction error, which will vary across techniques. This problem was less serious in estimating missing employment data. Notwithstanding this problem, we were able to identify a definite ranking across the techniques: the re-sampling technique, referred to as Multi-Imputation, performs best, with both the regression-based and the matching procedures providing disappointing results.

In identifying directions for future research we emphasise that our analysis has been conducted in a univariate setting. That is, we have conducted imputation and evaluated different techniques by focusing upon the incidence of missing data in a single variable. Although we can readily motivate the importance of profit data, and why one might expect systematic patterns in non-response, it is also the case that in large databases the incidence of missing data is widespread and, importantly, the pattern of missing data may be correlated across survey questions. We found this to be the case in this study, noting the explanatory power of a variable capturing the extent of missing data across a wide set of firm characteristics in a model of missingness for profits. Consequently, we believe that a useful extension of the research conducted here would be to work with multivariate models of missing data.

Project leaders

Alan Hughes
Melvyn Weeks

Project status

Completed

Output

Working papers

Weeks, M. and Hughes, A. (2001) ‘Missing Observations in Survey Data: An Experimental Approach to Imputation’, mimeo, Centre for Business Research, University of Cambridge.

Weeks, M. (2000) ‘Methods of imputation for missing data’, mimeo, Centre for Business Research and Department of Applied Economics, University of Cambridge.
