A comparative study of optimal stratification in business and agricultural surveys
Degree Grantor: University of Canterbury
Degree Name: Master of Science
This thesis is a comparative study of optimal design-based univariate stratification as applied to highly skewed populations such as those observed in business and agricultural surveys. Optimal stratification is a widely used method for reducing the variance or cost of estimates, and this work considers various optimal stratification algorithms, and in particular optimal boundary algorithms, to support this objective.
We first provide a background to the theory of stratification and stratified random sampling, and extend this through the derivation of optimal allocation strategies. We then examine the effect of allocation strategies on the variance and design effect of estimators, and in particular find several issues in applying optimal or Neyman allocation when there is little correlation between the survey population and auxiliary information.
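As a minimal illustration of the allocation step, the sketch below computes a Neyman allocation under the standard textbook rule, in which the sample size in stratum h is proportional to N_h * S_h (stratum size times stratum standard deviation). The function name, argument names, and example figures are hypothetical and are not taken from the thesis; the result is left unrounded, and practical constraints such as minimum stratum sample sizes or n_h exceeding N_h are not handled.

```python
def neyman_allocation(stratum_sizes, stratum_sds, total_sample):
    """Return unrounded Neyman sample sizes, n_h = n * N_h*S_h / sum(N_h*S_h).

    stratum_sizes : population counts N_h per stratum
    stratum_sds   : standard deviations S_h of the stratification variable
    total_sample  : overall sample size n
    (All names are illustrative assumptions.)
    """
    if len(stratum_sizes) != len(stratum_sds):
        raise ValueError("stratum_sizes and stratum_sds must have equal length")
    # Weight of each stratum is its size times its standard deviation.
    weights = [N_h * S_h for N_h, S_h in zip(stratum_sizes, stratum_sds)]
    total_weight = sum(weights)
    if total_weight == 0:
        raise ValueError("all strata have zero size or zero variability")
    return [total_sample * w / total_weight for w in weights]


# Hypothetical skewed population: a few large, highly variable units.
allocation = neyman_allocation([900, 90, 10], [2.0, 20.0, 180.0], 100)
```

In this made-up example each stratum has the same product N_h * S_h, so the smallest stratum receives as much sample as the largest, which is the behaviour that makes the rule attractive for skewed populations and sensitive to poorly estimated standard deviations.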
We present a derivation of the intractable equations for the construction of optimal stratum boundaries, based on the work of Dalenius (1950), and derive the cumulative square root of frequency approximation of Dalenius & Hodges (1957). We then note a number of issues within the implementation of the cumulative square root of frequency rule surrounding the construction of initial intervals, and find that the placement of boundaries and the variance of estimates can be affected by the number of initial intervals. This then leads us to propose two new extensions to the cumulative square root of frequency algorithm, using linear and spline interpolation, and we find that these result in some improvements in the results for this algorithm.
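The basic cumulative square root of frequency rule can be sketched as follows. This is a simplified version under stated assumptions: equal-width initial intervals, and each boundary placed at the upper edge of the first interval whose cumulative sqrt(f) reaches k/L of the total. It does not include the linear or spline interpolation extensions proposed in the thesis, and the function and parameter names are hypothetical.

```python
import math


def cum_sqrt_f_boundaries(values, n_strata, n_intervals):
    """Approximate stratum boundaries with the cumulative sqrt(f) rule.

    values      : auxiliary variable for every population unit
    n_strata    : number of strata L
    n_intervals : number of equal-width initial intervals
    (Illustrative names; boundaries fall on initial-interval edges only.)
    """
    lo, hi = min(values), max(values)
    if hi == lo:
        raise ValueError("values must not all be equal")
    width = (hi - lo) / n_intervals

    # Frequency of units in each initial interval.
    counts = [0] * n_intervals
    for x in values:
        j = min(int((x - lo) / width), n_intervals - 1)  # max value goes in last bin
        counts[j] += 1

    # Cumulative sum of the square roots of the frequencies.
    cum = []
    running = 0.0
    for c in counts:
        running += math.sqrt(c)
        cum.append(running)
    total = cum[-1]

    # Place boundary k where the cumulative sqrt(f) first reaches k/L of the total.
    boundaries = []
    for k in range(1, n_strata):
        target = k * total / n_strata
        j = next(i for i, c in enumerate(cum) if c >= target)
        boundaries.append(lo + (j + 1) * width)
    return boundaries
```

Because boundaries can only fall on the edges of the initial intervals, calling the function with different values of `n_intervals` on the same data can move the boundaries, which is the sensitivity noted above.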
We also present a complete derivation of the Ekman algorithm, and consider the extended approach of Hedlin (2000). We derive several new results relating to the Ekman algorithm, and propose a new kernel density based algorithm. We find all three Ekman based algorithms produce similar results for larger populations, and provide some recommendations on the use of these algorithms depending on the size of the population.
We look at the derivation and implementation of the Lavallee-Hidiroglou algorithm, and find that it is often slow to converge or does not converge for Neyman allocation. We therefore adopt a random search model of Kozak (2004), and note that the Lavallee-Hidiroglou algorithm generally produces superior results across all populations used in this thesis.
We briefly investigate the optimal number of strata by examining the work of Cochran (1977) and Kozak (2006), and find that there is a diminishing marginal effect from increasing the number of strata and possibly some benefit from constructing more than six strata. However, we also acknowledge that the cost of constructing such strata may offset any potential gain in precision from constructing more than five or six strata.
Finally, we consider how many of these problems can be developed further, and ultimately find that the problems of deciding the number of strata, constructing stratum boundaries, and allocating sample units among the strata may require an approach that takes account of the relationship between the auxiliary variable and the survey information. We therefore suggest investigating these algorithms further within a model-assisted environment in order to help account for the relationship between the auxiliary information and the survey population.