# Using a “Proxy” Variable for Stratification

In practice, one will not know the frequency distribution of the variable of interest – in this example, the number of stores selling one case of Wheaties, two cases, and so on. If this were known, there would be no need for sampling; the mean monthly sales per store would be calculated directly from the frequency distribution for the entire universe. However, information will often be available on a variable that is highly correlated with the variable of interest. Such a “proxy” variable may then be used to define the strata.

Continuing with the Wheaties example, one would expect Wheaties sales to be highly correlated with total store dollar sales (the bigger the store, the larger its Wheaties sales are likely to be). So total store dollar sales would likely be a good proxy for Wheaties sales themselves, which are the ideal (but unknown) basis for stratifying the universe. By sampling within strata based on store dollar sales, one could achieve much of the benefit of stratifying on the basis of actual Wheaties sales.

Estimation of the Universe Mean:

Because a stratified random sample is a group of simple random samples, the sample mean of each stratum is an unbiased estimate of the actual mean of that stratum. Therefore, the individual stratum sample means can be combined (weighted) into an unbiased estimate of the overall universe mean. Thus, the estimate of the overall universe mean is simply a weighted average of the strata sample means.

For example, suppose the following data for Wheaties were available for the three grocery store size strata:

| Store size stratum | Sample mean Wheaties sales per store | Number of stores | % of stores |
|---|---|---|---|
| Large | 200 | 20,000 | 20% |
| Medium | 80 | 30,000 | 30% |
| Small | 40 | 50,000 | 50% |
| Totals | | 100,000 | 100% |

Then, to estimate the universe mean of monthly Wheaties sales, each stratum sample mean is multiplied by its relative weight (i.e. percent of all stores), and the results are added together. In this illustration, the estimated universe mean is

(200)(20%) + (80)(30%) + (40)(50%) = 84 units per store.
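The weighting arithmetic can be sketched in a few lines of Python. This is only an illustration of the calculation above: the function name `stratified_mean` and the dictionary layout are assumptions made for this sketch, and the weights are derived from the store counts in the table rather than typed in as percentages.

```python
# Stratum data from the table above:
# stratum name -> (sample mean sales per store, number of stores in the universe)
strata = {
    "Large": (200, 20_000),
    "Medium": (80, 30_000),
    "Small": (40, 50_000),
}


def stratified_mean(strata):
    """Weighted average of stratum sample means.

    Each stratum's weight is its share of all stores in the universe.
    """
    total_stores = sum(count for _, count in strata.values())
    return sum(mean * count / total_stores for mean, count in strata.values())


estimate = stratified_mean(strata)
print(f"Estimated universe mean: {estimate:.1f} units per store")
```

Deriving the weights from the store counts keeps them consistent with the universe totals, so they always sum to 100 percent.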

Confidence Interval Estimation with Stratified Random Samples:

Conceptually, little new theory is needed to develop confidence limits for a universe mean estimated from a stratified random sample. As in the case of simple random sampling, an estimated standard error of the mean is first obtained. Then the appropriate multiple of this figure (e.g. Z = 2 for 95.4 percent confidence) is added to and subtracted from the estimated mean to provide the desired confidence limits.

Because of the complexity of the formula for estimating the standard error of the mean in stratified random sampling, this topic will be discussed only briefly. The formula involves three elements: (1) the size of the sample within each stratum, (2) the relative size of each stratum compared to the total universe, and (3) the variance within each stratum. Once obtained, the estimated standard error of the mean is used to construct a confidence interval in the manner described in the preceding paragraph.
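To make the three elements concrete, here is a minimal sketch of one common textbook form of the calculation: the estimated variance of the stratified mean is the sum, over strata, of (stratum weight)² × (stratum sample variance) ÷ (stratum sample size). The sketch assumes the finite population correction can be ignored, which is reasonable only when each sample is a small fraction of its stratum. The stratum weights and sample means come from the Wheaties table; the standard deviations and sample sizes are hypothetical values invented purely for illustration, and the function name `stratified_estimate` is likewise an assumption.

```python
import math

# Each tuple: (weight, sample mean, sample std dev, sample size).
# Weights and means are from the Wheaties example.
# Std devs and sample sizes are HYPOTHETICAL, chosen only for illustration.
strata = [
    (0.20, 200, 60.0, 100),  # Large
    (0.30, 80, 30.0, 100),   # Medium
    (0.50, 40, 20.0, 100),   # Small
]


def stratified_estimate(strata, z=2.0):
    """Return (mean, standard error, (lower, upper)) for a stratified sample.

    Assumes simple random sampling within each stratum and ignores the
    finite population correction.
    """
    mean = sum(w * m for w, m, _, _ in strata)
    variance = sum(w**2 * s**2 / n for w, _, s, n in strata)
    se = math.sqrt(variance)
    return mean, se, (mean - z * se, mean + z * se)


mean, se, (low, high) = stratified_estimate(strata)
print(f"mean = {mean:.1f}, standard error = {se:.2f}")
print(f"approx. 95.4% confidence interval: {low:.1f} to {high:.1f}")
```

With different standard deviations or sample sizes the interval changes, but the estimated mean does not, since the mean depends only on the stratum weights and stratum sample means.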

It bears emphasis that calculation of the standard error is different in stratified random sampling than in simple random sampling. In general, different methods of probability sampling require different methods of evaluating means and associated standard errors. The more complex the design, the more complex is the method of analysis.

A Financial Publication Example:

Another example is the financial publication problem. To meet the need for reporting data separately for the three leading areas, each with adequate precision, one could stratify by geography to create four strata: New York, Boston, Philadelphia, and All Remaining US. Then a simple random sample of 125 subscribers could be chosen from within each stratum. This would meet the advertising department’s requirement for samples with adequate precision.

Suppose the sample results showed that the mean incomes of subscribers in the four strata used were: New York (\$120,000 per year); Boston (\$80,000); Philadelphia (\$75,000); and Remaining US (\$70,000). Then the US mean subscriber income would be estimated as:

\$120,000(41%) + \$80,000(11%) + \$75,000(10%) + \$70,000(38%) = \$92,100.
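The short sketch below reproduces this calculation and contrasts it with the unweighted average of the four stratum means. It assumes the percentages are each stratum's share of all subscribers, by analogy with the store percentages in the Wheaties example. Because every stratum contributes the same number of respondents (125), simply pooling all 500 responses would be equivalent to the unweighted average, which gives New York far less influence than its share of subscribers warrants. Variable names are illustrative only.

```python
# stratum -> (sample mean income in dollars, percent of all subscribers)
# The percent column is assumed to be each stratum's share of subscribers.
strata = {
    "New York": (120_000, 41),
    "Boston": (80_000, 11),
    "Philadelphia": (75_000, 10),
    "Remaining US": (70_000, 38),
}

# Stratified estimate: weight each stratum mean by its share of subscribers.
weighted = sum(mean * pct for mean, pct in strata.values()) / 100

# Naive estimate: treat the four equal-sized samples as one pooled sample.
unweighted = sum(mean for mean, _ in strata.values()) / len(strata)

print(f"Weighted (stratified) estimate: ${weighted:,.0f}")
print(f"Unweighted pooled estimate:     ${unweighted:,.0f}")
```

The gap between the two figures shows why the stratum weights, not the sample sizes, must drive the estimate when strata are sampled disproportionately.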