




Introduction to Biostatistics 

Statistics are simply a collection of tools that researchers employ to help answer research questions.


DESCRIPTIVE STATISTICS
MEASURES OF CENTRAL TENDENCY
 A measure of central tendency is a single number used to represent the centre of a set of data.
 The basic measures are the mean, the median, and the mode.
 For any symmetrical distribution with a single peak, the mean, median, and mode will be identical.
 Each measure is designed to represent a typical score.
 The choice of which measure to use depends on:
 the shape of the distribution (whether normal or skewed), and
 the variable’s “level of measurement” (data are nominal, ordinal or interval).
Mean
 The mean (or average) is found by adding all the numbers and then dividing by how many numbers you added together.
 Most common measure of central tendency.
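 Formula for calculation of mean: $\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n}$, where $x_i$ are the observations and $n$ is the number of observations.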
 Best for making predictions.
 Applicable under two conditions:
 scores are measured at the interval level, and
 distribution is more or less normal [symmetrical].
Example:
 3,4,5,6,7
 3+4+5+6+7= 25
 25 divided by 5 = 5
 The mean is 5
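
The arithmetic above can be sketched in Python. This is only an illustration: the variable names are chosen here, and `statistics.mean` from the standard library is used as a cross-check on the manual calculation.

```python
from statistics import mean

scores = [3, 4, 5, 6, 7]

# Add all the numbers, then divide by how many numbers were added.
total = sum(scores)
count = len(scores)
manual_mean = total / count

# The standard library gives the same value.
library_mean = mean(scores)

print("sum:", total, "count:", count, "mean:", manual_mean)
```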

 Advantages of mean
 Mathematical center of a distribution.
 Good for interval and ratio data.
 Does not ignore any information.
 Inferential statistics is based on mathematical properties of the mean.
 Disadvantages of mean
 Influenced by extreme scores and skewed distributions.
 May not exist in the data.
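
A minimal sketch of the first disadvantage, assuming a hypothetical extreme score of 70 replaces the 7 in the earlier example. The mean moves a long way, while the middle value of the ordered list stays where it was.

```python
from statistics import mean, median

scores = [3, 4, 5, 6, 7]
scores_with_outlier = [3, 4, 5, 6, 70]  # hypothetical extreme score

mean_before = mean(scores)
mean_after = mean(scores_with_outlier)

# The middle value of the ordered list is unaffected by the extreme score.
median_before = median(scores)
median_after = median(scores_with_outlier)

print("mean:", mean_before, "->", mean_after)
print("median:", median_before, "->", median_after)
```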
Median
 When the numbers are arranged in numerical order, the middle one is the median.
 50% of observations are above the Median, 50% are below it.
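 Formula: position of the median = (n + 1) / 2 in the ordered list, where n is the number of observations. When n is even, the median is the average of the two middle values.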
Example:
 3,6,2,5,7
 Arrange in order 2,3,5,6,7
 The number in the middle is 5
 The median is 5
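
The same steps can be sketched in Python. The helper name `median_by_position` is hypothetical. It sorts the values, finds the middle position (n + 1) / 2, and, for an even number of values, averages the two middle values, which is the usual convention.

```python
def median_by_position(values):
    """Return the median by locating the middle of the ordered values."""
    ordered = sorted(values)
    n = len(ordered)
    if n == 0:
        raise ValueError("median of an empty list is undefined")
    if n % 2 == 1:
        # Odd n: (n + 1) / 2 is a whole, 1-based position.
        position = (n + 1) // 2
        return ordered[position - 1]
    # Even n: (n + 1) / 2 falls between two positions, so average them.
    lower = ordered[n // 2 - 1]
    upper = ordered[n // 2]
    return (lower + upper) / 2


example_median = median_by_position([3, 6, 2, 5, 7])
print(example_median)
```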

 Advantages:
 Not influenced by extreme scores or skewed distribution.
 Good with ordinal data.
 Easier to compute than the mean.
 Considered as the typical observation.
 Disadvantages:
 May not exist in the data.
 Does not take actual values into account.
Mode
 The number that occurs most frequently is the mode.
 We usually find the mode by creating a frequency distribution in which we count how often each value occurs.
 If we find that every value occurs only once, the distribution has no mode.
 If we find that two or more values are tied as the most common, the distribution has more than one mode.
Example:
 2,2,2,4,5,6,7,7,7,7,8
 The number that occurs most frequently is 7
 The mode is 7
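
A minimal sketch of finding the mode from a frequency distribution, using `collections.Counter`. The helper name `modes` is hypothetical. It returns an empty list when every value occurs only once, and several values when there is a tie, following the rules stated above.

```python
from collections import Counter


def modes(values):
    """Return the list of most frequent values (empty if there is no mode)."""
    counts = Counter(values)  # frequency distribution: value -> count
    if not counts:
        return []
    highest = max(counts.values())
    if highest == 1:
        # Every value occurs only once: no mode.
        return []
    return sorted(v for v, c in counts.items() if c == highest)


data = [2, 2, 2, 4, 5, 6, 7, 7, 7, 7, 8]
example_modes = modes(data)
print(example_modes)
```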

 Advantages:
 Good with nominal data.
 A bimodal distribution might verify clinical observations (e.g. pre- and postmenopausal breast cancer).
 Easy to compute and understand.
 The score exists in the data set.
 Disadvantages:
 Ignores most of the information in a distribution.
 Small samples may not have a mode.
 More than one mode might exist.
Appropriate Measures of Central Tendency
 Nominal variables → Mode
 Ordinal variables → Median
 Interval level variables → Mean, if the distribution is normal (the median is better with a skewed distribution)
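
The selection rule above can be written as a small lookup. This is a sketch only: the function name `suggest_measure` and its `skewed` flag are hypothetical, and ratio data is treated like interval data, as in the advantages of the mean listed earlier.

```python
def suggest_measure(level, skewed=False):
    """Suggest a measure of central tendency for a level of measurement."""
    level = level.lower()
    if level == "nominal":
        return "mode"
    if level == "ordinal":
        return "median"
    if level in ("interval", "ratio"):
        # The mean suits a roughly normal distribution; otherwise use the median.
        return "median" if skewed else "mean"
    raise ValueError(f"unknown level of measurement: {level}")


print(suggest_measure("nominal"))
print(suggest_measure("interval", skewed=True))
```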

MEASURES OF VARIABILITY
“If there is no variability within populations there would be no need for statistics.”
 Three indices are used to measure variation or dispersion among scores:
 range
 variance, and
 standard deviation (Cozby, 2000).
 These indices answer the question: How spread out is the distribution?
 Dispersion/Deviation/Spread tells us a lot about how a variable is distributed.
Range
 Range is the simplest method of examining variation among scores.
 It refers to the difference between the highest and lowest values produced.
 For continuous variables, the range is the arithmetic difference between the highest and lowest observations in the sample. In the case of counts or measurements, 1 should be added to the difference because the range is inclusive of the extreme observations.
 Another statistic, known as the interquartile range, describes the interval of scores bounded by the 25th and 75th percentiles; it spans the middle 50 percent of the distribution.
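
A minimal sketch of the range and the interquartile range, using hypothetical scores from 1 to 11. Quartiles are taken from `statistics.quantiles` with its default method. Other software may use slightly different quartile conventions, so small differences between tools are to be expected.

```python
from statistics import quantiles

scores = list(range(1, 12))  # hypothetical scores: 1, 2, ..., 11

# Range: difference between the highest and lowest values.
value_range = max(scores) - min(scores)

# Inclusive range for counts: add 1 so both extremes are included.
inclusive_range = value_range + 1

# Quartiles: cut points at the 25th, 50th and 75th percentiles.
q1, q2, q3 = quantiles(scores, n=4)
iqr = q3 - q1  # spread of the middle 50 percent of the scores

print("range:", value_range, "inclusive range:", inclusive_range)
print("Q1:", q1, "Q3:", q3, "IQR:", iqr)
```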
Percentiles (or quartiles)
 The First quartile is the 25th percentile (noted Q1),
 the Median value is the 50th percentile (noted Median), and
 the Third quartile is the 75th percentile (noted Q3).
 “A percentile is a value at or below which a given percentage or fraction of the variable values lie.”
 The pth percentile is the value that has p% of the measurements below it and (100 − p)% above it.
 Thus, the 20th percentile is the value such that one fifth of the data lie below it. It is higher than 20% of the data values and lower than 80% of the data values.
 E.g. if you are in the 80th percentile on a section of the GMAT, you scored better on that section than 80% of the students taking the GMAT.
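
The idea of a percentile rank can be sketched by counting how many scores fall below a given score. The helper name `percent_below` and the ten section scores are hypothetical and are not real GMAT data.

```python
def percent_below(values, score):
    """Return the percentage of values that are strictly below score."""
    if not values:
        raise ValueError("cannot rank against an empty list")
    below = sum(1 for v in values if v < score)
    return 100 * below / len(values)


# Hypothetical section scores for ten test takers.
section_scores = [50, 55, 60, 65, 70, 75, 80, 85, 90, 95]

rank = percent_below(section_scores, 90)
print(rank)
```

A score of 90 is higher than 8 of the 10 scores here, so it sits at the 80th percentile of this small group.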
Standard deviation
 The standard deviation is the most widely applied measure of variability.
 It shows how much variation there is from the "average" (mean).
 Large standard deviations suggest that scores are probably widely scattered.
 Small standard deviations suggest that there is very little difference among scores.
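 Formula for the population S.D.: $\sigma = \sqrt{\frac{1}{N}\sum_{i=1}^{N}(x_i - \mu)^2}$, where $\mu$ is the population mean and $N$ is the number of values.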
Example: (Adapted from Wikipedia)
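 Consider a population consisting of the following values: 2, 4, 4, 4, 5, 5, 7, 9.
 There are eight data points in total, with a mean (or average) value of 5: $\mu = \frac{2 + 4 + 4 + 4 + 5 + 5 + 7 + 9}{8} = \frac{40}{8} = 5$.
 To calculate the population standard deviation, first compute the difference of each data point from the mean, and square the result: $(2-5)^2 = 9$, $(4-5)^2 = 1$, $(4-5)^2 = 1$, $(4-5)^2 = 1$, $(5-5)^2 = 0$, $(5-5)^2 = 0$, $(7-5)^2 = 4$, $(9-5)^2 = 16$.
 Next divide the sum of these values by the number of values and take the square root to give the standard deviation: $\sigma = \sqrt{\frac{9 + 1 + 1 + 1 + 0 + 0 + 4 + 16}{8}} = \sqrt{\frac{32}{8}} = \sqrt{4} = 2$.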
 Therefore, the above has a population standard deviation of 2.
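
The steps of this example can be sketched in Python. The eight values used here (2, 4, 4, 4, 5, 5, 7, 9) are assumed to be those of the Wikipedia example cited above, which has a mean of 5 and a population standard deviation of 2. `statistics.pstdev` serves as a cross-check.

```python
import math
from statistics import pstdev

# Assumed to be the eight values of the Wikipedia example.
values = [2, 4, 4, 4, 5, 5, 7, 9]

# Step 1: the mean of the population.
mu = sum(values) / len(values)

# Step 2: squared difference of each data point from the mean.
squared_deviations = [(x - mu) ** 2 for x in values]

# Step 3: divide the sum by the number of values, then take the square root.
population_sd = math.sqrt(sum(squared_deviations) / len(values))

# The standard library's population standard deviation should agree.
library_sd = pstdev(values)

print("mean:", mu, "population SD:", population_sd)
```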
Variance
 The square of the standard deviation is the variance.
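
A short check of this relationship, assuming the same eight values as the standard deviation example (2, 4, 4, 4, 5, 5, 7, 9). `statistics.pvariance` and `statistics.pstdev` are the population versions from the standard library.

```python
import math
from statistics import pstdev, pvariance

# Assumed values, as in the standard deviation example.
values = [2, 4, 4, 4, 5, 5, 7, 9]

sd = pstdev(values)           # population standard deviation
variance = pvariance(values)  # population variance

# The variance equals the standard deviation squared.
relationship_holds = math.isclose(sd ** 2, variance)

print("SD:", sd, "variance:", variance, "SD squared equals variance:", relationship_holds)
```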

















