Lecture No. 6

Dated: 05-11-2024

John Tukey introduced the stem-and-leaf display in 1977 to overcome a limitation of frequency tables: once the data are grouped, the individual observations are lost.
The display splits each number into a stem (leading digits) and a leaf (trailing digits), separated by a vertical line, so the data are sorted and visualized at the same time.
For example, the number 243 can be split in either of two ways:

Leading Digit	Trailing Digits
2	43
Stem	Leaf

OR

Leading Digit	Trailing Digit
24	3
Stem	Leaf

Example

The ages of 30 hospital patients range from 12 to 74.
Construct a stem-and-leaf display by using the leading digit as the stem and the trailing digit as the leaf.
For example, 48 is split into a stem of 4 and a leaf of 8. The leaves are first recorded in the order in which the observations appear in the data.

Stem (Leading Digit)	Leaf (Trailing Digit)
1	82
2	967
3	17905
4	830289
5	412378
6	415278
7	14

It is, however, common practice to arrange the leaves in each row from smallest to largest.

Stem (Leading Digit)	Leaf (Trailing Digit)
1	28
2	679
3	01579
4	023889
5	123478
6	124578
7	14
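
The construction above can be sketched in Python. This is only an illustration under stated assumptions: the ages are read back from the unsorted display (the raw list of 30 observations is not reproduced here), `stem_and_leaf` is a hypothetical helper name, and the tens/units split assumes two-digit ages.

```python
from collections import defaultdict

# Ages read back from the unsorted display, row by row (an assumption:
# the original order of the 30 observations is not given).
ages = [18, 12, 29, 26, 27, 31, 37, 39, 30, 35,
        48, 43, 40, 42, 48, 49, 54, 51, 52, 53, 57, 58,
        64, 61, 65, 62, 67, 68, 71, 74]


def stem_and_leaf(values):
    """Map each stem (tens digit) to its leaves (units digits), sorted."""
    rows = defaultdict(list)
    for value in sorted(values):
        stem, leaf = divmod(value, 10)  # e.g. 48 -> stem 4, leaf 8
        rows[stem].append(leaf)
    return {stem: "".join(map(str, leaves))
            for stem, leaves in sorted(rows.items())}


display = stem_and_leaf(ages)
for stem, leaves in display.items():
    print(f"{stem} | {leaves}")
```

Sorting the values before splitting them is what puts the leaves in each row in ascending order.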

Frequency Distribution

Class Limits	Class Boundaries	Frequency
10 - 19	9.5 - 19.5	2
20 - 29	19.5 - 29.5	3
30 - 39	29.5 - 39.5	5
40 - 49	39.5 - 49.5	6
50 - 59	49.5 - 59.5	6
60 - 69	59.5 - 69.5	6
70 - 79	69.5 - 79.5	2

In the graph of this distribution, the X-axis represents age and the Y-axis represents the number of patients.
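
The frequency distribution can be rebuilt from the same ages with a short Python sketch. The ages are again read back from the stem-and-leaf display, `frequency_table` is a hypothetical helper, and the half-unit adjustment for the class boundaries assumes ages recorded to the nearest whole year.

```python
# Ages read back from the stem-and-leaf display (an assumption: the raw
# list is not reproduced in these notes).
ages = [18, 12, 29, 26, 27, 31, 37, 39, 30, 35,
        48, 43, 40, 42, 48, 49, 54, 51, 52, 53, 57, 58,
        64, 61, 65, 62, 67, 68, 71, 74]


def frequency_table(values, start=10, width=10, classes=7):
    """Return (class limits, class boundaries, frequency) for each class."""
    rows = []
    for k in range(classes):
        lower = start + k * width
        upper = lower + width - 1
        frequency = sum(lower <= v <= upper for v in values)
        # Boundaries extend the limits by half a unit on each side.
        rows.append(((lower, upper), (lower - 0.5, upper + 0.5), frequency))
    return rows


table = frequency_table(ages)
for limits, boundaries, frequency in table:
    print(limits, boundaries, frequency)
```

Each frequency equals the number of leaves in the corresponding row of the stem-and-leaf display.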

Description of Variable Data

In statistical inquiries, a concise numerical description is preferable to lengthy tables, especially if it helps visualize and interpret the data's significance.

Measures of Central Tendency and Measures of Dispersion

Averages enable us to measure the central tendency of variable data.
Measures of dispersion enable us to measure its variability.

Averages

An average is a single value that represents a data set or distribution, serving as a central value around which observations cluster.
It indicates the position of the distribution on the X-axis and is therefore called a measure of central tendency or location.

Example

Looking at these two frequency distributions, we should ask what exactly distinguishes them.
Drawing the frequency polygons of the two distributions makes the difference clear.

The frequency polygons for the two suburbs have the same shape but differ in position relative to the X-axis.
The mean number of rooms per house is 6.67 in suburb A and 7.67 in suburb B, indicating that, on average, houses in suburb B are larger than those in suburb A by one room.

Various Types of Averages

The arithmetic mean
The geometric mean
The harmonic mean
The median
The mode

The Arithmetic, Geometric, and Harmonic means are mathematical averages that reflect the magnitude of observed values.
The median marks the middle position of the ordered data.
The mode is the value that occurs most often in the data set, that is, the most common result.

Example

Suppose that the marks of eight students in a particular test are as follows: 2, 7, 9, 5, 8, 9, 10, 9.
Obviously, the most common mark is 9.
In other words,

\[\text{Mode} = 9\]
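
The same count can be checked with a few lines of Python, using the standard library's `collections.Counter`. This is a sketch for illustration, not part of the lecture.

```python
from collections import Counter

marks = [2, 7, 9, 5, 8, 9, 10, 9]

# Tally how often each mark occurs, then take the most frequent one.
counts = Counter(marks)
mode, frequency = counts.most_common(1)[0]
print(mode, frequency)
```

Here the mark 9 occurs three times and every other mark occurs once, so the mode is unambiguous.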

Mode in case of Raw Data of a Continuous Variable

For ungrouped raw data of a continuous variable, the mode is determined by counting the frequency of each value.

Example

Suppose that the government of a country collected data on the percentage of revenue spent on Research and Development by 49 different companies, and obtained the following figures:

Company	Percentage	Company	Percentage
1	13.5	14	9.5
2	8.4	15	8.1
3	10.5	16	13.5
4	9.0	17	9.9
5	9.2	18	6.9
6	9.7	19	7.5
7	6.6	20	11.1
8	10.6	21	8.2
9	10.1	22	8.0
10	7.1	23	7.7
11	8.0	24	7.4
12	7.9	25	6.5
13	6.8	26	9.5
27	8.2	39	6.5
28	6.9	40	7.5
29	7.2	41	7.1
30	8.2	42	13.2
31	9.6	43	7.7
32	7.2	44	5.9
33	8.8	45	5.2
34	11.3	46	5.6
35	8.5	47	11.7
36	9.4	48	6.0
37	10.5	49	7.8
38	6.9

Dot Plot

A dot plot uses a horizontal axis to represent a quantitative variable, with each data measurement indicated by a dot.
Repeated values result in stacked dots at the corresponding numerical position.
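
A rough text version of a dot plot can be sketched in Python. To keep it short, this illustration reuses the eight test marks from the earlier example instead of the R&D data, draws the plot sideways (one row per value, with repeated values extending the row), and uses a hypothetical helper name, `text_dot_plot`.

```python
from collections import Counter


def text_dot_plot(values):
    """Return one text row per distinct value, with one '*' per observation."""
    counts = Counter(values)
    return [f"{value:>4} | " + "*" * counts[value] for value in sorted(counts)]


marks = [2, 7, 9, 5, 8, 9, 10, 9]
lines = text_dot_plot(marks)
print("\n".join(lines))
```

The longest row marks the most frequent value, which is how the mode is read off a dot plot.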

The dot plot of the R&D percentages also shows that almost all of the percentages fall between 6% and 12%, and that most of them fall between 7% and 9%.

\[\hat X = 6.9\]
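
The counting can be sketched in Python with the 49 values copied from the table above, in company order; `find_modes` is a hypothetical helper name. In the table as printed here, 8.2 (companies 21, 27 and 30) occurs as often as 6.9 (companies 18, 28 and 38), so the sketch returns every value that attains the highest count rather than a single one.

```python
from collections import Counter

# R&D percentages for companies 1 to 49, copied from the table.
percentages = [
    13.5, 8.4, 10.5, 9.0, 9.2, 9.7, 6.6, 10.6, 10.1, 7.1,
    8.0, 7.9, 6.8, 9.5, 8.1, 13.5, 9.9, 6.9, 7.5, 11.1,
    8.2, 8.0, 7.7, 7.4, 6.5, 9.5, 8.2, 6.9, 7.2, 8.2,
    9.6, 7.2, 8.8, 11.3, 8.5, 9.4, 10.5, 6.9, 6.5, 7.5,
    7.1, 13.2, 7.7, 5.9, 5.2, 5.6, 11.7, 6.0, 7.8,
]


def find_modes(values):
    """Return (sorted list of most frequent values, their frequency)."""
    counts = Counter(values)
    highest = max(counts.values())
    modes = sorted(v for v, c in counts.items() if c == highest)
    return modes, highest


modes, highest = find_modes(percentages)
print(modes, highest)
```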

Mode in case of Discrete Frequency Distribution

In case of a discrete frequency distribution, identification of the mode is immediate; one simply finds that value which has the highest frequency.

Example

No. of Passengers X	No. of Flights f
28	1
33	1
34	2
35	3
36	5
37	7
38	10
39	13
40	8
Total	50

\[\text{Highest Frequency } f_m = 13\]

\[\text{Occurs against the } X \text{ value} = 39\]

\[\text{Mode } = \hat X = 39\]
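
For a discrete frequency distribution the lookup is a single maximum over the frequencies. A minimal Python sketch, with the table stored as a dictionary (a representation chosen here for illustration):

```python
# Number of passengers X -> number of flights f, copied from the table.
flights = {28: 1, 33: 1, 34: 2, 35: 3, 36: 5, 37: 7, 38: 10, 39: 13, 40: 8}

# The mode is the X value whose frequency is highest.
mode = max(flights, key=flights.get)
highest_frequency = flights[mode]
total_flights = sum(flights.values())
print(mode, highest_frequency, total_flights)
```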

Mode in case of the Frequency Distribution of a Continuous Variable

\[\text{Mode } = \hat X = l + \frac{f_m - f_1}{(f_m - f_1) + (f_m - f_2)} \times h\]

\[\text{Where}\]

\[l = \text{ lower class boundary of the modal class}\]

\[f_m = \text{ frequency of the modal class}\]

\[f_1 = \text{ frequency of the class preceding the modal class}\]

\[f_2 = \text{ frequency of the class following the modal class}\]

\[h = \text{ length of the class interval of the modal class}\]

Mileage Rating	Class Boundaries	No. of Cars
30.0 - 32.9	29.95 - 32.95	2
33.0 - 35.9	32.95 - 35.95	\(4 = f_1\)
36.0 - 38.9	35.95 - 38.95	\(14 = f_m\)
39.0 - 41.9	38.95 - 41.95	\(8 = f_2\)
42.0 - 44.9	41.95 - 44.95	2

It is evident that the third class is the modal class. The mode lies somewhere between 35.95 and 38.95.

\[f_m = 14\]

\[f_1 = 4\]

\[f_2 = 8\]

\[\hat X = 35.95 + \frac{14 - 4}{(14 - 4) + (14 - 8)} \times 3\]

\[= 35.95 + \frac{10}{16} \times 3 = 35.95 + 1.875 = 37.825\]
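
The calculation can be reproduced with a short Python sketch. `grouped_mode` is a hypothetical helper name. Treating a missing neighbouring class as having frequency 0 and taking the first class when several share the highest frequency are conventions assumed here, not taken from the lecture, and the sketch does not handle the case where the modal class and both neighbours have equal frequencies.

```python
def grouped_mode(boundaries, frequencies):
    """Mode of grouped data: l + (fm - f1) / ((fm - f1) + (fm - f2)) * h."""
    # Index of the modal class (first class with the highest frequency).
    m = max(range(len(frequencies)), key=frequencies.__getitem__)
    lower, upper = boundaries[m]
    h = upper - lower                      # length of the modal class interval
    fm = frequencies[m]
    f1 = frequencies[m - 1] if m > 0 else 0
    f2 = frequencies[m + 1] if m + 1 < len(frequencies) else 0
    return lower + (fm - f1) / ((fm - f1) + (fm - f2)) * h


# Class boundaries and numbers of cars, copied from the mileage table.
boundaries = [(29.95, 32.95), (32.95, 35.95), (35.95, 38.95),
              (38.95, 41.95), (41.95, 44.95)]
cars = [2, 4, 14, 8, 2]

mode = grouped_mode(boundaries, cars)
print(round(mode, 3))
```

Because the class preceding the modal class has the lower frequency (4 against 8), the mode is pulled above the midpoint of the modal class, 37.45, towards the class that follows it.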

Histogram

Frequency Polygon

\[\hat X = 37.825\]