Cluster Sampling

From ICE Primer: A Tobacco Control Research Methodology Primer

The term cluster sampling refers to a situation where the sampling frame is made up of small groups or clusters of individual units; the sampling design is carried out by sampling clusters, and then taking all individuals within each selected cluster. If the survey population is stratified, each stratum is made up of clusters, and cluster sampling can take place in each stratum.

Some examples:

(i) If a sample of schoolchildren is obtained by taking all children in each of a sample of classes, we have cluster sampling of children, where the clusters correspond to classes.

(ii) If a sample of households is obtained by selecting a sample of postal codes, and then visiting every household in each sampled postal code, we have cluster sampling of households, where the clusters correspond to postal codes.

Note that cluster sampling does not require detailed knowledge of the frame below the cluster level.

With cluster sampling, the inclusion probability of an individual unit is the same as the inclusion probability of its cluster. Thus typical practice is to construct a sampling design which gives every cluster (or every cluster within a stratum) the same inclusion probability. The inclusion probabilities are then also uniform for individuals.
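The selection step described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the class names, pupil identifiers, and the function `cluster_sample` are hypothetical, and clusters are drawn by simple random sampling without replacement, so that every cluster (and hence every individual) has inclusion probability n/N.

```python
import random


def cluster_sample(clusters, n_clusters, seed=None):
    """Draw n_clusters clusters by simple random sampling without
    replacement and return every unit in the selected clusters.

    clusters: dict mapping a cluster label to the list of its units.
    Returns (selected labels, sampled units, inclusion probability).
    """
    rng = random.Random(seed)
    # Sort the labels so the draw is reproducible for a given seed.
    chosen = rng.sample(sorted(clusters), n_clusters)
    # One-stage cluster sampling: take ALL units in each chosen cluster.
    units = [u for c in chosen for u in clusters[c]]
    # Each unit is included exactly when its cluster is, so the unit-level
    # inclusion probability equals the cluster-level one, n / N.
    inclusion_prob = n_clusters / len(clusters)
    return chosen, units, inclusion_prob


# Hypothetical frame: four classes of schoolchildren (made-up labels).
classes = {
    "class_A": ["a1", "a2", "a3"],
    "class_B": ["b1", "b2"],
    "class_C": ["c1", "c2", "c3", "c4"],
    "class_D": ["d1", "d2", "d3"],
}

chosen, units, prob = cluster_sample(classes, n_clusters=2, seed=1)
print(chosen, units, prob)
```

Note that the function only needs the list of clusters in order to make the draw; the units inside a cluster are looked up after that cluster has been selected, which mirrors the point that no detailed frame is needed below the cluster level.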

Information and efficiency of estimation

Cluster sampling is often more economical and convenient than a design which draws units one by one. However, there is usually a price to pay in the information obtained and the precision of estimation. Intuitively, this is because individuals within a cluster are typically more alike than individuals in the population as a whole, so a cluster sample may carry less information about the population than a dispersed sample of the same size. It can be shown mathematically that in this typical situation, where individuals are alike within clusters, estimates from a sample with equal-sized clusters are less precise than estimates from a dispersed design with the same sample size and inclusion probabilities. If we compare the estimate of a population average of y from a simple random sample of n clusters of size m with the corresponding estimate from a simple random sample of nm individual units, we find that in a large population the variance of the former is

$$1 + (m-1)\,\mathrm{ICC}$$

times the variance of the latter, where ICC is the intra-cluster correlation coefficient of the variable y. Such a variance inflation factor is sometimes referred to as a design effect. The design effect is automatically accounted for in analysis if correct cluster-based variance estimates are used. At the planning stage, however, the design effect must enter the sample size calculations before any data exist, so it has to be guessed in advance, drawing on experience with similar variables in other surveys.
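As a rough planning aid, the design effect formula can be turned into a small calculation. The sketch below assumes equal-sized clusters and a guessed ICC, as in the text; the function names and the numerical values (cluster size 11, ICC 0.1, a target of 400 independent observations) are purely illustrative and not taken from any particular survey.

```python
import math


def design_effect(m, icc):
    """Variance inflation factor 1 + (m - 1) * ICC for clusters of size m."""
    return 1 + (m - 1) * icc


def effective_sample_size(n_clusters, m, icc):
    """Size of a simple random sample of individuals that would give
    roughly the same precision as n_clusters clusters of size m."""
    return n_clusters * m / design_effect(m, icc)


def clusters_needed(n_srs, m, icc):
    """Number of clusters of size m needed to match the precision of a
    simple random sample of n_srs individuals (rounded up)."""
    return math.ceil(n_srs * design_effect(m, icc) / m)


# Illustrative planning values (assumptions, not survey results).
m, icc = 11, 0.1
print("design effect:", design_effect(m, icc))
print("effective size of 50 clusters:", effective_sample_size(50, m, icc))
print("clusters needed to match n = 400:", clusters_needed(400, m, icc))
```

Two limiting cases are a useful check: with ICC equal to 0, or with clusters of size 1, the design effect is 1 and the cluster sample is as precise as a simple random sample of the same size.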

Sampling in stages

A more general sampling design is two-stage sampling, where frame clusters are often called primary sampling units or PSUs. A sample of PSUs is taken, followed by a subsample of individual units in each selected PSU. For example, the PSUs may correspond to postal codes and a simple random sample of 10 households may be selected within each sampled postal code. In this case, the inclusion probability of an individual unit is the inclusion probability of its PSU multiplied by the inclusion probability of the unit in the PSU subsample. If the same number of units is to be taken from each selected PSU, it is usual practice to select the PSUs with inclusion probabilities proportional to their sizes (probability proportional to size, or pps, sampling). A unit in a large PSU then has a high chance that its PSU is selected but a correspondingly low chance of being picked within it, so the two factors cancel and the inclusion probabilities of individual units are uniform.
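The cancellation behind pps sampling can be checked numerically. The following sketch assumes that a PSU of size M_i is selected with inclusion probability n M_i / M (where M is the total number of units and n the number of PSUs drawn) and that k units are then drawn by simple random sampling within it. The postal code labels, household counts, and function names are hypothetical. For contrast, it also computes the inclusion probabilities when PSUs are drawn with equal probability.

```python
def two_stage_probs_pps(psu_sizes, n_psus, k):
    """Unit inclusion probabilities when PSUs are drawn with probability
    proportional to size and k units are subsampled in each PSU."""
    total = sum(psu_sizes.values())
    probs = {}
    for psu, size in psu_sizes.items():
        pi_psu = n_psus * size / total      # first-stage inclusion probability
        if pi_psu > 1 or k > size:
            raise ValueError(f"design not feasible for PSU {psu!r}")
        pi_within = k / size                # second-stage inclusion probability
        probs[psu] = pi_psu * pi_within     # product equals n_psus * k / total
    return probs


def two_stage_probs_equal(psu_sizes, n_psus, k):
    """Same design, but PSUs are drawn with equal probability n / N."""
    pi_psu = n_psus / len(psu_sizes)
    return {psu: pi_psu * k / size for psu, size in psu_sizes.items()}


# Hypothetical postal codes with their numbers of households.
sizes = {"code_1": 120, "code_2": 300, "code_3": 80, "code_4": 500}

pps = two_stage_probs_pps(sizes, n_psus=2, k=10)
equal = two_stage_probs_equal(sizes, n_psus=2, k=10)
print("pps:  ", pps)     # the same value for every postal code
print("equal:", equal)   # households in small postal codes are over-represented
```

Under equal-probability selection of PSUs the unit inclusion probabilities vary inversely with PSU size, which is why a fixed take per PSU is usually paired with pps selection at the first stage.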

A two-stage sample with several units selected from each PSU is still "clustered", though not to the same extent as a cluster sample as defined in this article. Thus the convenience of sampling in stages is again offset by a loss in precision of estimation.

The term multistage sampling refers to a sampling design where the frame construction and sampling are carried out in several stages within strata. For example, the primary sampling units or PSUs might be counties, the second-stage sampling units neighbourhoods, and the third-stage units households.