Lecture No. 2
Dated: 04-11-2024
Steps Involved in Statistical Research Project
- Topic and significance of the study
- Objective of your study
- Methodology for data collection
- Source of your data
- Sampling methodology
- Instrument for collecting data
For the study objective
, it should be clear that what is it we are trying to find out.
For the data collection
, we need to consider:
- Source of your data (the statistical population)
- Sampling Methodology
- Instrument for collecting data
Collection of Data
This is the most important part of statistical work.
There are two methods
Complete Enumeration
This is also called census
.
It is the collection of whole field and is costly and time consuming because it requires a large number of enumerators and supervisory staff.
Partial Enumeration
This is associated with a sample.
Primary Data
The data
which has been collected and has not gone into any statistical treatment is called primary data
.
Collection of Primary Data
Direct Personal Investigation
The investigator personally interviews the people of concern and the data therefore, is considered accurate.
It can be costly due to the vast area but is beneficial for laboratory experiments.
Errors
can be introduced due to personal bias of the investigator.
Indirect Investigation
It is adopted for extensive inquiries.
The informants might hesitate to share information for any reason.
Therefore, the people who have the information are interviewed.
Collection through Questionnaires
These are a list of inquiry questions with some space for the information.
These are sent through email and are good for extensive inquiries because they are cheap.
However, the down side is that the respondents don't really care and provide information that is either incomplete or full of errors.
The questions should be brief and easy to understand.
Collection through Enumerators
This method involves trained enumerators assisting informants in accurately completing schedules or questionnaires.
It provides the most reliable data when the enumerator is skilled and experienced.
Ideal for large-scale government inquiries, it is typically too costly for private individuals or institutions.
Collection through Local Sources
This method relies on agents or local correspondents to gather and report information using their own judgment.
It is cost-effective and quick but provides only estimates.
Secondary Data
The data
that have been collected, classified, tabulated or presented in some form for a certain purpose, are called secondary
data.
Collection of Secondary Data
- Officials e.g. the publications of the Statistical Division, Ministry of Finance, the Federal and Provincial Bureaus of Statistics, Ministries of Food, Agriculture, Industry, Labour, etc.
- Semi-Official, e.g., State Bank of Pakistan, Railway Board, Central Cotton Committee, Boards of Economic Inquiry, District Councils, Municipalities, etc.
- Publications of Trade Associations, Chambers of Commerce, etc
- Technical and Trade Journals and Newspapers
- Research Organizations such as universities, and other institutions
Sampling
Why sample? Because accessing the entire population is rare.
Even annual censuses
are impractical for most countries.
Thus, we use samples to infer conclusions about larger populations.
Population
A statistical population
is a group sharing a characteristic, with variations among members.
Examples
Finite
- IQ's of all children in a school.
Infinite
- Barometric pressure
- A flight of migrating ducks in Canada
Frame
The target population is all college students in Punjab, while the sampled population is students from five surveyed colleges.
A clear definition of the population is crucial.
To sample randomly from a finite population, we need a complete list of its elements, called the frame
.
Sampling Frame
A sampling frame
is a complete list of all elements in a population, such as the BCS students of Virtual University of Pakistan on February 15, 2003.
It should be free from defects, including inaccuracies, incompleteness, duplication, and outdated information.
A sample, which is a part of the population, can represent it only to a limited extent; larger samples are more likely to reflect the population accurately.
As sample size approaches population size, they become more identical.
Statistical sampling aims to assess how accurately the sample describes the population, acknowledging potential compromises in accuracy while recognizing its value in data-based research.
Advantages of Sampling
- Savings in time and money
- Although unit costs are higher in a sample, total costs are lower due to the smaller sample size.
- Sample surveys are quicker than full investigations, minimizing variations over time.
- Fewer data points allow for faster and more precise processing and analysis.
- More detailed information can be obtained from each sample unit.
- Possibility of follow ups
- Queries and omissions can be addressed through follow-ups, which may not be possible in complete surveys.
- Sampling is essential for destructive tests or when dealing with an effectively infinite population.
Errors
Sampling Errors
It is the difference between the estimate derived from the sample
(the statistic) and the true population value (the parameter).
They arise because they cannot exactly represent the population.
Non Sampling Errors
In addition to sampling errors, non-sampling errors can occur during data collection, even in complete counts. Main sources of non-sampling errors include
- Defective sampling frames.
- Faulty reporting due to personal bias.
- Investigator negligence or indifference.
- Non-response to mail questionnaires.
These errors can be minimized by
- Following up on non-responses.
- Providing proper investigator training.
- Correctly processing collected data.
Sampling errors
include partial non-response (respondent answers some questions) and total non-response (respondent answers none).
Non-response issues are prevalent in human populations, especially in mail surveys, where response rates can drop to 15-20%.
Providing information about the survey's purpose can boost interest and response rates, particularly if benefits for respondents are highlighted.
Sending a prepaid addressed envelope with the questionnaire can also encourage responses. Despite these efforts, some non-response is inevitable, necessitating decisions on the number of recalls, approaching respondents multiple times to secure answers.
Finally, it’s important to determine the duration of data collection; extending the survey over time increases the potential for variations in respondents' attitudes. A defined cut-off date is essential. The next step is to explore sampling methods, focusing on the distinction between non-random and random sampling.
Random Sampling
Statistical sampling theory assumes that sample units are selected randomly, typically using a lottery method for random sampling.
Types of Random Samples
- Simple Random Sampling
- Stratified Random Sampling
- Systematic Sampling
- Cluster Sampling
- Multi-stage Sampling, Etc
Simple Random Sampling
In this sampling method, every element of the parent population has an equal chance of being included in the sample, making simple random sampling equally likely for all samples.
There is often confusion around random sampling, with haphazard selection mistakenly viewed as equivalent.
For instance, a market researcher may stop women shoppers at random in a busy area, believing he is conducting simple random sampling.
However, this approach can introduce bias, as he might preferentially question young, attractive women or those carrying brand X products.
Even without intentional bias, human selection is typically poor and subjective due to inherent psychological traits that hinder objectivity.
While traditional random sampling involves drawing names from a hat, which can be cumbersome, a more efficient method is using random number tables.
These tables contain digits from 0 to 9 arranged randomly, ensuring a lack of systematic patterns.
2315754859018372599376249708869523036744
0554555043105374350890611837441096221343
1487160350324043622350051003221154380834
3897674951940517585378805901943242871695
9731261718997553087094251258415488210513
1174269381443393087232797331182264706850
4336128859110164562393009004994364074036
9380620478382680449155751189325847552571
4954013181084298418769538296617773809527
3676872633379482156941959686704527483880
0709252392246271260706558453446733845320
4331001081448638030752555161488974294647
6157006360061736377563148951233501746993
3135283799107791894131579764486258486919
5704886526277959368290529565463506532254
0924344200687210713730729757360929827650
9795535018408948832952230825212253261587
9373259570437819888556671668269599644569
7262111225009226826435666594347168751867
6102074418453712079495917378669953619378
9783985474330559171845473541442203423000
8916097192222329063735055454898843816361
2596688220628717926502823528628491954883
8144331719050495480674690075676501716545
1132254931423623438608624976674224523245
These are generated according to certain mathematical principles so the each digit has the same chance of selection.
Example
The following frequency table of distribution gives the ages of a population of 1000 teen-age college students in a particular country.
Select a sample of 10 students using the random numbers table. Find the sample mean age and compare with the population mean age.
Age(X) | No. of Students (f) |
---|---|
13 | 6 |
14 | 61 |
15 | 270 |
16 | 491 |
17 | 153 |
18 | 15 |
19 | 4 |
1000 |
How will we proceed to select our sample of size 10 from this population of size 1000?
Step 1: Allocate a sampling number to each student.
AGE | No. of Students (f) | Cumulative Frequency (cf) |
---|---|---|
13 | 6 | 6 |
14 | 61 | 67 |
15 | 270 | 337 |
16 | 491 | 828 |
17 | 153 | 981 |
18 | 15 | 996 |
19 | 4 | 1000 |
1000 |
AGE | No. of Students (f) | Cumulative Frequency (cf) | Sampling Numbers |
---|---|---|---|
13 | 6 | 6 | 000 – 005 |
14 | 61 | 67 | 006 – 066 |
15 | 270 | 337 | 067 – 336 |
16 | 491 | 828 | 337 – 827 |
17 | 153 | 981 | 828 – 980 |
18 | 15 | 996 | 981 – 995 |
19 | 4 | 1000 | 996 – 999 |
1000 |
The first student is assigned the sampling number 000, the second 001, and the last student (1000th) 999.
This system uses three-digit numbers instead of four-digit ones for simplicity.
To select 10 random numbers, one can close their eyes and let their finger point to a spot on the random number table, reading three adjacent digits.
In this example, the random numbers selected are 041, 103, 374, 171, 508, 652, 880, 066, 715, 471, which correspond to ages: 14, 15, 16, 15, 16, 16, 17, 15, 16, 16.
Explanation
Age(x) | No. of Students (f) | \(f \times x\) |
---|---|---|
13 | 6 | 78 |
14 | 61 | 854 |
15 | 270 | 4050 |
16 | 491 | 7856 |
17 | 153 | 2601 |
18 | 15 | 270 |
19 | 4 | 76 |
1000 | 15785 |
The population mean
is
Ages selected in the sample
are
- 14
- 15
- 16
- 15
- 16
- 16
- 17
- 15
- 16
- 16
The small error results from using the random sampling method, which increases the likelihood that the sample accurately represents the population.
This means that estimates derived from the sample closely reflect those from the population, as a sample is meant to be a miniature replica of it.
There are also various other types of random sampling.
Other Sampling Types
- Stratified sampling (if the population is heterogeneous)
- Systematic sampling (practically, more convenient than simple random sampling)
- Cluster sampling (sometimes the sampling units exist in natural clusters)
- Multi-stage sampling
All these designs utilize random or quasi-random sampling, which is a form of probability sampling where each unit has a known (but not equal) chance of selection.
This allows for objective calculation of estimate precision and reliability.
Typically, multiple sampling techniques are combined in survey designs, and simple random samples or multi-stage designs are seldom used without stratification.
It's essential to exercise care at every step to ensure the reliability of the results.