Abstract
Background
Taking a representative sample to determine the prevalence of variables such as disease is difficult when little is known about the target population. Several methods have been proposed that apply cluster sampling techniques to Primary Sampling Units (PSUs), which are typically towns or census tracts. Some methods are based on random walks within towns, e.g., the World Health Organization's original Expanded Programme on Immunization ('EPI') surveys and their variants, including sampling from four quadrants of each town ('Quad'). Several major international surveys take random samples from small areas ('SA') such as census tracts. Another method uses satellite images and Global Positioning Systems to sample randomly within PSUs from squares in a superimposed grid ('Square'). We used computer simulations to compare these sampling methods with simple random sampling within towns ('SRS') in virtual populations. SRS was our standard, even though it is impractical in low-information settings.
Methods
We constructed 50 virtual populations with varying characteristics, each comprising about a million people spread over 300 towns. The risk of disease for each person varied within and between towns. We created a binary exposure variable and allocated disease statuses to individuals assuming four relative risks (RRs) from exposure. We added three populations with equal risk of disease for every person in the population. For each population, each of the sampling methods (EPI, Quad, SA, Square, and SRS), and each of three sample sizes per PSU (7, 15, and 30), we simulated 1000 samples. For each simulation we estimated the prevalence and the RRs. We used the bias and variance of the estimates to calculate their Root Mean Squared Error (RMSE). For each population, we ranked the RMSEs of the methods and computed the ratio of each method's RMSE to that of SRS. We then computed the mean ranks and mean ratios across the 50 populations.
Results
Apart from SRS, Square had the lowest mean rank of RMSEs for all sample sizes when estimating prevalence. When estimating RR, Square had the lowest mean ranks for sample sizes of 15 and 30 per PSU; for n=7 per PSU, the Quad mean rank was the lowest. The mean ratios of RMSEs showed the same pattern: Square had the lowest values for all sample sizes when estimating prevalence and the lowest values for the two larger sample sizes when estimating RRs. Notably, when estimating prevalence, the ratios increased with sample size per PSU for SA, Quad, and EPI, suggesting that those methods did not benefit as much from the larger sample sizes as statistical theory would predict.
Conclusions
Of several methods that are practical in an imperfectly known population, the Square method generally performed best, especially for the larger sample sizes. The methods that sample within small areas (Quad, SA, and EPI) do not gain as much statistical benefit as expected from larger sample sizes per PSU, because of some clustering within the areas.
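The comparison described in the Methods (RMSE from bias and variance, then ranks and ratios relative to SRS within one population) can be sketched as follows. This is only an illustrative sketch, not the authors' code: the function names `rmse` and `rank_and_ratio`, the use of the population variance (dividing by the number of simulations), and the toy RMSE values are all assumptions made for the example, and ties in rank are not handled.

```python
import math


def rmse(estimates, true_value):
    """Root mean squared error built from bias and variance.

    Assumes the variance is taken over the simulated estimates with
    divisor n, so that RMSE^2 = bias^2 + variance holds exactly.
    """
    n = len(estimates)
    mean_est = sum(estimates) / n
    bias = mean_est - true_value
    variance = sum((e - mean_est) ** 2 for e in estimates) / n
    return math.sqrt(bias ** 2 + variance)


def rank_and_ratio(rmse_by_method, reference="SRS"):
    """Rank methods by RMSE (1 = smallest) and divide each RMSE by the reference's.

    `rmse_by_method` maps a method label to its RMSE in one population.
    Ties are not treated specially in this sketch.
    """
    ordered = sorted(rmse_by_method, key=rmse_by_method.get)
    ranks = {method: position + 1 for position, method in enumerate(ordered)}
    ref = rmse_by_method[reference]
    ratios = {method: value / ref for method, value in rmse_by_method.items()}
    return ranks, ratios


if __name__ == "__main__":
    # Hypothetical RMSEs for a single population, invented for illustration only.
    toy = {"SRS": 0.020, "Square": 0.024, "Quad": 0.030, "SA": 0.033, "EPI": 0.040}
    ranks, ratios = rank_and_ratio(toy)
    for method in toy:
        print(method, ranks[method], round(ratios[method], 2))
```

Averaging the per-population ranks and ratios over all populations would then give the mean ranks and mean ratios reported in the Results.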
Publisher
Cold Spring Harbor Laboratory