**Grouped data** are data that have been sorted into categories or classes, each recorded together with its frequency, that is, the number of observations that fall in it. Grouping simplifies the handling of large amounts of data and makes their trends easier to identify.

Once organized into these classes with their frequencies, the data form a *frequency distribution*, whose characteristics provide useful information about the data.

**Here’s a simple example of grouped data:**

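Suppose we measure the heights of 100 female students, selected from all the basic physics courses at a university, and obtain the following results:

| Height (cm) | Number of students |
|---|---|
| 155 – 159 | 6 |
| 160 – 164 | 14 |
| 165 – 169 | 47 |
| 170 – 174 | 28 |
| 175 – 179 | 5 |

The results were divided into 5 classes, which appear in the left column.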

The first class, between 155 and 159 cm, has 6 students; the second class, 160 – 164 cm, 14; the third class, from 165 to 169 cm, has the largest number of members: 47. Then follows the 170-174 cm class with 28 students and finally 175 to 179 cm with only 5.

The number of members of each class is precisely its *absolute frequency*, or simply *frequency*, and adding all the frequencies together gives the total number of data, which in this example is 100.

**Frequency distribution characteristics**

**Frequency**

As we have seen, frequency is the number of times a value, or a value within a given class, occurs in the data. To facilitate the calculation of distribution properties, such as the mean and the variance, the following quantities are defined:

– **Cumulative frequency**: obtained by adding the frequency of a class to the cumulative frequency of the previous class. The first cumulative frequency equals the frequency of the first class, and the last equals the total number of data.

– **Relative frequency**: calculated by dividing the absolute frequency of each class by the total number of data. Multiplying by 100 gives the percentage relative frequency.

– **Cumulative relative frequency**: the sum of the relative frequency of a class and the cumulative relative frequency of the previous class. The last cumulative relative frequency must equal 1.

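For our example, the frequencies look like this:

| Height (cm) | Frequency | Cumulative frequency | Relative frequency | Cumulative relative frequency |
|---|---|---|---|---|
| 155 – 159 | 6 | 6 | 0.06 | 0.06 |
| 160 – 164 | 14 | 20 | 0.14 | 0.20 |
| 165 – 169 | 47 | 67 | 0.47 | 0.67 |
| 170 – 174 | 28 | 95 | 0.28 | 0.95 |
| 175 – 179 | 5 | 100 | 0.05 | 1.00 |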
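As a minimal sketch of how these quantities could be computed, the following Python snippet derives the cumulative, relative, and cumulative relative frequencies from the absolute frequencies of the height example. The variable names are illustrative and not part of the original article.

```python
from itertools import accumulate

# Class limits (cm) and absolute frequencies from the height example.
classes = [(155, 159), (160, 164), (165, 169), (170, 174), (175, 179)]
freqs = [6, 14, 47, 28, 5]

n = sum(freqs)                                # total number of data
cum_freqs = list(accumulate(freqs))           # cumulative frequencies
rel_freqs = [f / n for f in freqs]            # relative frequencies
cum_rel_freqs = [cf / n for cf in cum_freqs]  # cumulative relative frequencies

for (lo, hi), f, cf, rf, crf in zip(classes, freqs, cum_freqs, rel_freqs, cum_rel_freqs):
    print(f"{lo}-{hi} cm: f={f:3d}  cum={cf:3d}  rel={rf:.2f}  cum_rel={crf:.2f}")
```

The last cumulative frequency should equal the total number of data, and the last cumulative relative frequency should equal 1, which gives a quick consistency check.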

**Limits**

The extreme values of each class or range are called *class limits*. As we can see, each class has a lower and an upper limit. For example, the first class in the height study has a lower limit of 155 cm and an upper limit of 159 cm.

This example has clearly defined limits. However, open limits can also be defined, for instance when the classes are stated as “height less than 160 cm”, “height less than 165 cm” and so on, instead of with exact values.

**Class boundaries**

Height is a continuous variable, so the first class can be considered to actually start at 154.5 cm, since rounding this value to the nearest whole number gives 155 cm.

This class covers all values up to 159.5 cm, since heights from that value on are rounded to 160 cm. A height of 159.7 cm, for example, already belongs to the next class.

The actual class limits for this example, also called class boundaries, are, in cm:

- 154.5 – 159.5
- 159.5 – 164.5
- 164.5 – 169.5
- 169.5 – 174.5
- 174.5 – 179.5

**Class width (amplitude)**

The width, or amplitude, of a class is obtained by subtracting its lower boundary from its upper boundary. For the first class of our example, we have 159.5 cm – 154.5 cm = 5 cm.

The reader can see that for the other intervals in the example, the amplitude is also 5 cm. However, it should be noted that distributions with intervals of different amplitudes can be constructed.

**Class mark**

It is the midpoint of the class and is obtained by averaging the upper limit and the lower limit.

For our example, the first class mark is (155 + 159) / 2 = 157 cm. The reader can verify that the remaining class marks are 162, 167, 172 and 177 cm.

Determining class marks is important as they are needed to find the arithmetic mean and variance of the distribution.
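The class boundaries, widths, and class marks described above can be sketched in a few lines of Python. This is only an illustration: it assumes that heights are recorded to the nearest centimetre, so each boundary lies half a unit beyond the stated limit, and the names `half_unit`, `boundaries`, `widths` and `class_marks` are hypothetical.

```python
# Class limits (cm) from the height example.
limits = [(155, 159), (160, 164), (165, 169), (170, 174), (175, 179)]

# Assumption: measurements are rounded to the nearest 1 cm,
# so the true boundaries extend half a unit beyond each limit.
half_unit = 0.5

boundaries = [(lo - half_unit, hi + half_unit) for lo, hi in limits]
widths = [upper - lower for lower, upper in boundaries]
class_marks = [(lo + hi) / 2 for lo, hi in limits]

print(boundaries)   # true class limits, in cm
print(widths)       # class widths (amplitudes), in cm
print(class_marks)  # midpoints of the classes, in cm
```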

**Measures of central tendency and dispersion for grouped data**

The most widely used measures of central tendency are the mean, the median, and the mode. They describe the tendency of the data to cluster around a certain central value.

**Mean**

It is one of the main measures of central tendency. For grouped data, the arithmetic mean can be calculated using the formula:

$$\bar{X} = \frac{\sum_{i=1}^{g} f_i m_i}{n}$$

Where:

- $\bar{X}$ is the mean
- $f_i$ is the frequency of class $i$
- $m_i$ is the class mark of class $i$
- $g$ is the number of classes
- $n$ is the total number of data
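As a short sketch, the grouped mean is the frequency-weighted average of the class marks, that is, the sum of each frequency times its class mark divided by the total number of data. The snippet below applies this to the height example, with illustrative variable names.

```python
freqs = [6, 14, 47, 28, 5]                # frequency of each class
class_marks = [157, 162, 167, 172, 177]   # class marks, in cm

n = sum(freqs)  # total number of data

# Weighted sum of class marks divided by the total number of data.
mean = sum(f * m for f, m in zip(freqs, class_marks)) / n

print(f"mean = {mean:.1f} cm")
```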

**Median**

For the median, the class containing observation n/2 must be identified. In our example, this is observation number 50, because there are 100 data in total. This observation lies in the 165 – 169 cm class.

Then you need to interpolate to find the numerical value that corresponds to that observation, using the formula:

$$\text{Median} = B_M + c \cdot \frac{\frac{n}{2} - f_{BM}}{f_m}$$

Where:

- $c$ = width of the class that contains the median
- $B_M$ = lower boundary of the class that contains the median
- $f_m$ = number of observations contained in the median class
- $n/2$ = half the total number of data
- $f_{BM}$ = total number of observations __before__ the median class
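The interpolation can be sketched as a small Python function. It assumes the usual linear interpolation for grouped data, median = B_M + c (n/2 − f_BM) / f_m, written with the symbols defined above. The function name `grouped_median` and its arguments are hypothetical and introduced only for illustration.

```python
def grouped_median(boundaries, freqs):
    """Interpolated median of grouped data.

    boundaries: list of (lower, upper) class boundaries, in increasing order
    freqs: absolute frequency of each class
    """
    n = sum(freqs)
    if n == 0:
        raise ValueError("the distribution has no data")
    half = n / 2
    before = 0  # f_BM: observations before the current class
    for (lower, upper), f in zip(boundaries, freqs):
        if before + f >= half:       # this class contains observation n/2
            c = upper - lower        # class width
            return lower + c * (half - before) / f
        before += f
    raise ValueError("boundaries and freqs have different lengths")


boundaries = [(154.5, 159.5), (159.5, 164.5), (164.5, 169.5),
              (169.5, 174.5), (174.5, 179.5)]
freqs = [6, 14, 47, 28, 5]

median = grouped_median(boundaries, freqs)
print(f"median = {median:.2f} cm")
```

The loop accumulates frequencies until it reaches the class that contains observation n/2, which is the same criterion used in the text to locate the median class.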

**Mode**

For the mode, identify the modal class, the one that contains the most observations. Its class mark is taken as the mode.
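A minimal sketch of this rule in Python, using the height example and illustrative variable names: the modal class is the one with the largest frequency, and its class mark is reported as the mode.

```python
classes = [(155, 159), (160, 164), (165, 169), (170, 174), (175, 179)]
freqs = [6, 14, 47, 28, 5]

# Index of the class with the highest frequency (the modal class).
modal_index = max(range(len(freqs)), key=lambda i: freqs[i])

lo, hi = classes[modal_index]
mode = (lo + hi) / 2  # class mark of the modal class

print(f"modal class: {lo}-{hi} cm, mode = {mode} cm")
```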

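**Variance and standard deviation**

Variance and standard deviation are measures of dispersion. If we denote the variance by $s^2$ and the standard deviation, which is the square root of the variance, by $s$, then for grouped data we have, respectively:

$$s^2 = \frac{\sum_{i=1}^{g} f_i \left(m_i - \bar{X}\right)^2}{n - 1}$$

and

$$s = \sqrt{\frac{\sum_{i=1}^{g} f_i \left(m_i - \bar{X}\right)^2}{n - 1}}$$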
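The following Python sketch computes both quantities for the height example. It assumes the sample form of the variance, with n − 1 in the denominator, which is the form used in the solved exercise of this article (division by 99 for 100 data). The variable names are illustrative.

```python
import math

freqs = [6, 14, 47, 28, 5]                # frequency of each class
class_marks = [157, 162, 167, 172, 177]   # class marks, in cm

n = sum(freqs)
mean = sum(f * m for f, m in zip(freqs, class_marks)) / n

# Frequency-weighted sum of squared deviations from the mean,
# divided by n - 1 (sample variance, an assumption stated above).
variance = sum(f * (m - mean) ** 2 for f, m in zip(freqs, class_marks)) / (n - 1)
std_dev = math.sqrt(variance)

print(f"s^2 = {variance:.2f} cm^2, s = {std_dev:.1f} cm")
```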

**Solved exercise on grouped data**

For the distribution of university student heights proposed at the beginning, calculate the values of:

a) Mean

b) Median

c) Mode

d) Variance and standard deviation.

**Solution a**

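Let’s create the following table to facilitate the calculations:

| Height (cm) | $f_i$ | $m_i$ (cm) | $f_i m_i$ (cm) |
|---|---|---|---|
| 155 – 159 | 6 | 157 | 942 |
| 160 – 164 | 14 | 162 | 2268 |
| 165 – 169 | 47 | 167 | 7849 |
| 170 – 174 | 28 | 172 | 4816 |
| 175 – 179 | 5 | 177 | 885 |
| Total | 100 | | 16760 |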

Substituting values and performing the sum directly:

$$\bar{X} = \frac{6 \times 157 + 14 \times 162 + 47 \times 167 + 28 \times 172 + 5 \times 177}{100}\ \text{cm} = 167.6\ \text{cm}$$

**Solution b**

The class to which the median belongs is 165 – 169 cm, because it contains observation number 50: the cumulative frequency is 20 up to the previous class and reaches 67 with this one.

Let’s identify each of the values of the median formula in the example, with the help of Table 2:

- $c$ = 5 cm (see the amplitude section)
- $B_M$ = 164.5 cm
- $f_m$ = 47
- $n/2$ = 100/2 = 50
- $f_{BM}$ = 20

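Substituting in the formula:

$$\text{Median} = 164.5\ \text{cm} + 5\ \text{cm} \times \frac{50 - 20}{47} \approx 167.7\ \text{cm}$$

**Solution c**

The class containing the most observations is 165 – 169 cm, whose class mark is 167 cm. The mode is therefore 167 cm.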

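**Solution d**

We expand the previous table by adding two additional columns:

| Height (cm) | $f_i$ | $m_i$ (cm) | $f_i m_i$ (cm) | $(m_i - \bar{X})^2$ (cm²) | $f_i (m_i - \bar{X})^2$ (cm²) |
|---|---|---|---|---|---|
| 155 – 159 | 6 | 157 | 942 | 112.36 | 674.16 |
| 160 – 164 | 14 | 162 | 2268 | 31.36 | 439.04 |
| 165 – 169 | 47 | 167 | 7849 | 0.36 | 16.92 |
| 170 – 174 | 28 | 172 | 4816 | 19.36 | 542.08 |
| 175 – 179 | 5 | 177 | 885 | 88.36 | 441.80 |
| Total | 100 | | 16760 | | 2114.00 |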

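We apply the formula:

$$s^2 = \frac{\sum_{i=1}^{g} f_i \left(m_i - \bar{X}\right)^2}{n - 1}$$

And we develop the sum:

$$s^2 = \frac{6 \times 112.36 + 14 \times 31.36 + 47 \times 0.36 + 28 \times 19.36 + 5 \times 88.36}{99}\ \text{cm}^2 = 21.35\ \text{cm}^2$$

Therefore:

$$s = \sqrt{21.35\ \text{cm}^2} = 4.6\ \text{cm}$$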
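As an optional cross-check, the results of the exercise can be reproduced with Python's standard `statistics` module. The sketch below is not part of the original solution. It replaces every observation with the class mark of its class, which is the same approximation the grouped formulas make, and then calls `statistics.mean`, `statistics.median_grouped` (with `interval=5`, the class width), `statistics.mode`, `statistics.variance` and `statistics.stdev`.

```python
import statistics

freqs = [6, 14, 47, 28, 5]                # frequency of each class
class_marks = [157, 162, 167, 172, 177]   # class marks, in cm

# Expand the grouped table: each class mark is repeated as many
# times as the frequency of its class (100 values in total).
data = [m for m, f in zip(class_marks, freqs) for _ in range(f)]

mean = statistics.mean(data)
median = statistics.median_grouped(data, interval=5)  # interpolated median
mode = statistics.mode(data)
variance = statistics.variance(data)  # sample variance, divides by n - 1
std_dev = statistics.stdev(data)

print(f"mean = {mean:.1f} cm, median = {median:.1f} cm, mode = {mode} cm")
print(f"s^2 = {variance:.2f} cm^2, s = {std_dev:.1f} cm")
```

The values printed should agree with those obtained by hand in solutions a) to d).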

We hope that after reading this article you have understood grouped data and its applications.