Descriptive statistics refers to the brief descriptive coefficients that summarize a given data set, which can represent either an entire population or a sample of a population. Descriptive statistics are divided into measures of central tendency and measures of variability (dispersion). Measures of central tendency include the mean, median, and mode, while measures of variability include the standard deviation, variance, minimum and maximum values, kurtosis, and skewness.
Descriptive statistics are used to describe the basic characteristics of a study’s data. They provide simple summaries about the sample and the measures. Along with simple graphical analysis, they form the basis of virtually all quantitative data analysis.
Differences from Inferential Statistics
Descriptive statistics is usually distinguished from inferential statistics. With descriptive statistics, you simply describe what the data is or shows. With inferential statistics, you try to reach conclusions that go beyond the immediate data. For example, we use inferential statistics to try to infer from the sample data what the population might think. Or, we use inferential statistics to make judgments about the probability that an observed difference between groups is reliable or occurred by chance in this study. Thus, we use inferential statistics to make inferences from our data to more general conditions; we use descriptive statistics simply to describe what is happening in our data.
Uses of Descriptive Statistics
Descriptive statistics is used to present quantitative descriptions in a manageable way. In a research study we may have many measures, or we may measure a large number of people on any one measure. Descriptive statistics help us simplify large amounts of data in a sensible way.
Descriptive statistics reduce a lot of data to a simpler summary. For example, consider a single number used to summarize a hitter’s performance in baseball, the batting average. This number is simply the number of hits divided by the number of times at bat (expressed to three decimal places). A batter hitting .333 gets a hit once every three times at bat; one hitting .250 gets a hit once in four. The single number describes a large number of discrete events. Or consider the scourge of many students, the grade point average (GPA). This single number describes a student’s overall performance across a potentially wide range of course experiences.
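As a quick illustration, here is a minimal Python sketch of both summaries; the hits, at-bats, and grade figures are made up for the example.

```python
# Batting average: hits divided by at-bats, reported to three decimal places
hits, at_bats = 50, 150
batting_average = hits / at_bats
print(f"Batting average: {batting_average:.3f}")   # 0.333

# GPA: a simple (unweighted) average of grade points across courses
grade_points = [4.0, 3.3, 3.7, 2.7, 3.0]
gpa = sum(grade_points) / len(grade_points)
print(f"GPA: {gpa:.2f}")                           # 3.34
```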
Any time you try to describe a large set of observations with a single indicator, you run the risk of distorting the original data or missing important details. The batting average does not indicate whether the batter is hitting home runs or singles. It doesn’t tell you whether he has been in a slump or on a hot streak. The GPA does not say whether the student took hard or easy courses, or whether the courses were in the student’s major field or in other disciplines. Even taking these limitations into account, descriptive statistics offer a powerful summary that allows comparisons across people or other units.
Why do we need statistics that simply describe the data?
Descriptive statistics is used to describe or summarize the characteristics of a sample or data set, such as the mean, standard deviation, or frequency of a variable. Descriptive statistics can help us understand the collective properties of the elements in a data sample. Knowing the sample mean, variance, and distribution of a variable can help us understand the world around us.
Understanding Descriptive Statistics
Descriptive statistics, in short, help describe and understand the characteristics of a specific data set by providing brief summaries about the data sample and its measures. The most recognized types of descriptive statistics are the measures of center: the mean, median, and mode, which are used at almost every level of mathematics and statistics. The mean, or average, is calculated by adding all the values in the data set and dividing by the number of values in the set.
For example, the sum of the following data set is 20: (2, 3, 4, 5, 6). The mean is 4 (20/5). The mode of a data set is the value that appears most often, and the median is the value in the middle of the ordered data set; it is the figure that separates the higher values from the lower ones. However, there are less common types of descriptive statistics that are still very important.
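A minimal Python sketch of these three measures, using the standard-library statistics module; the second list is introduced only so that mode() has a repeated value to find.

```python
from statistics import mean, median, mode

data = [2, 3, 4, 5, 6]
print(mean(data))            # 20 / 5 = 4
print(median(data))          # middle value of the ordered data: 4

# mode() is only meaningful when some value repeats, so use a second list
print(mode([2, 3, 3, 5, 6])) # 3 appears most often
```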
Practitioners use descriptive statistics to recast hard-to-understand quantitative insights from a large data set into bite-size descriptions. A student’s grade point average (GPA), for example, provides a good insight into descriptive statistics. The idea of a GPA is that it takes data points from a wide range of tests, classes, and grades, and averages them together to provide a general understanding of a student’s overall academic performance. A student’s personal GPA reflects the student’s average academic performance.
Types of Measures in Descriptive Statistics
All measures in descriptive statistics are measures of central tendency or measures of variability, also known as measures of dispersion.
Central Tendency
Measures of central tendency focus on the mean values of the data sets, while measures of variability focus on the spread of the data. These two measures use graphs, tables, and general discussions to help understand the meaning of the analyzed data.
Measures of central tendency describe the central position of a distribution for a set of data. An analyst examines the frequency of each data point in the distribution and describes it using the mean, median, or mode, which capture the most common pattern in the data set being analyzed.
Measures of Variability
Variability measures (or measures of dispersion) help analyze the dispersion of the distribution of a data set. For example, although measures of central tendency can give a person the mean of a data set, they do not describe how the data is distributed within the set.
Thus, while the mean of the data may be 65 out of 100, there may still be data points at both 1 and 100. Measures of variability help communicate this by describing the shape and spread of the data set. The range, quartiles, absolute deviation, and variance are all examples of measures of variability.
Consider the following data set: 5, 19, 24, 62, 91, 100. The range of that data set is 95, which is calculated by subtracting the lowest number (5) from the highest (100).
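As a rough sketch in Python (standard library, Python 3.8+), the range and a couple of the other spread measures mentioned above can be computed like this:

```python
import statistics

data = [5, 19, 24, 62, 91, 100]

data_range = max(data) - min(data)            # 100 - 5 = 95
quartiles = statistics.quantiles(data, n=4)   # quartile cut points
variance = statistics.pvariance(data)         # population variance

print(data_range, quartiles, variance)
```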
What are the mean and standard deviation?
They are two commonly used descriptive statistics tools. The mean is the average level observed in some data, while the standard deviation describes how spread out the observed values of that variable are around the mean.
Can descriptive statistics be used to make inferences or predictions?
No. Descriptive statistics can be useful for two purposes: 1) to provide basic information about the variables in a data set and 2) to highlight potential relationships between variables. The most common methods are graphical or pictorial:
Graphic/Pictorial Methods
There are several graphical and pictorial methods that enhance researchers’ understanding of individual variables and the relationships between variables. Graphical and pictorial methods provide a visual representation of the data. Some of these methods are:
Histograms
Scatter plots
Geographic Information Systems (GIS)
Sociograms
Histograms
They show the distribution of a single numerical variable by grouping its values into intervals (bins) and displaying how many observations fall into each bin.
Scatter plots
They show the relationship between two quantitative or numerical variables by plotting the values of one variable against the values of the other.
For example, one axis of a scatter plot might represent height and the other might represent weight. Each person in the data would receive a data point on the scatter plot that corresponds to their height and weight.
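A minimal sketch of such a plot, assuming the matplotlib package is available; the height and weight values below are made up for illustration.

```python
import matplotlib.pyplot as plt

# Hypothetical heights (cm) and weights (kg), one pair per person
heights = [160, 165, 170, 175, 180, 185]
weights = [55, 62, 68, 74, 80, 88]

plt.scatter(heights, weights)   # one point per person
plt.xlabel("Height (cm)")
plt.ylabel("Weight (kg)")
plt.title("Height vs. weight")
plt.show()
```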
Geographic Information Systems (GIS)
A GIS is a computer system capable of capturing, storing, analyzing and displaying geographically referenced information, that is, data identified according to its location.
Using a GIS program, a researcher can create a map to visually represent relationships in the data.
Sociograms
They show networks of relationships between variables, allowing researchers to identify the nature of relationships that would otherwise be too complex to conceptualize.
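A minimal sketch of a sociogram, assuming the networkx and matplotlib packages are available; the names and ties below are invented for the example.

```python
import matplotlib.pyplot as plt
import networkx as nx

# Hypothetical network: who reports collaborating with whom
ties = [("Ana", "Ben"), ("Ben", "Cai"), ("Ana", "Cai"), ("Cai", "Dee")]

G = nx.Graph(ties)
nx.draw(G, with_labels=True, node_color="lightblue")  # simple network drawing
plt.show()
```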
Measures of central tendency
Measures of central tendency are the most basic and often the most informative description of the characteristics of a population. They describe the “middle” member of the population of interest. There are three measures of central tendency:
Mean: The sum of the values of a variable divided by the total number of values.
Median: the middle value of a variable, when its values are put in order
Mode: the value that occurs most often
Example:
The incomes of five randomly selected people in the United States are $10,000, $10,000, $45,000, $60,000, and $1,000,000.
Mean income = (10,000 + 10,000 + 45,000 + 60,000 + 1,000,000) / 5 = $225,000
Median income = $45,000
Modal income = $10,000
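The same three figures can be checked with a short Python snippet using the standard-library statistics module:

```python
from statistics import mean, median, mode

incomes = [10_000, 10_000, 45_000, 60_000, 1_000_000]

print(mean(incomes))    # 225000 -> mean income of $225,000
print(median(incomes))  # 45000  -> median income of $45,000
print(mode(incomes))    # 10000  -> modal income of $10,000
```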
The mean is the most widely used measure of central tendency. The median is generally used when a few values are extremely different from the rest (this is called a skewed distribution). For example, the median is often the best measure of typical earnings because, while most individuals earn between $0 and $200,000, a handful of individuals earn millions.
Measures of dispersion
Dispersion measures provide information about the dispersion of the values of a variable. There are four key measures of dispersion:
Range
Variance
Standard deviation
Skewness
The range is simply the difference between the smallest and largest values in the data. The interquartile range is the difference between the 75th percentile and the 25th percentile values of the data.
Variance is the most widely used measure of dispersion. It is calculated by taking the mean of the squared differences between each value and the mean.
The standard deviation, another commonly used statistic, is the square root of the variance.
Skewness is a measure of whether some values of a variable are extremely different from the majority of the values. For example, earnings are skewed in that most people earn between $0 and $200,000, but a handful of people earn millions. A variable is positively skewed if the extreme values are higher than the majority of the values, and negatively skewed if the extreme values are lower than the majority of the values.
Example:
The incomes of five randomly selected people in the United States are $10,000, $10,000, $45,000, $60,000, and $1,000,000:
Range = 1,000,000 – 10,000 = 990,000
Variance = [(10,000 − 225,000)² + (10,000 − 225,000)² + (45,000 − 225,000)² + (60,000 − 225,000)² + (1,000,000 − 225,000)²] / 5 = 150,540,000,000
Standard deviation = √150,540,000,000 ≈ 387,995
Skewness: income is positively skewed
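These figures can be reproduced with the standard-library statistics module; note that, as in the worked example above, the population variance (dividing by n = 5) is used, and skewness is checked informally by comparing the mean with the median.

```python
from statistics import mean, median, pstdev, pvariance

incomes = [10_000, 10_000, 45_000, 60_000, 1_000_000]

income_range = max(incomes) - min(incomes)  # 990000
variance = pvariance(incomes)               # population variance (divides by n = 5)
std_dev = pstdev(incomes)                   # square root of the variance, ~387995

print(income_range, variance, round(std_dev))

# Mean far above the median is a quick sign of positive skew
print(mean(incomes) > median(incomes))      # True
```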
Association Measures
Association measures indicate whether two variables are related. Two measures are commonly used:
Chi squared
As a measure of association between variables, chi-square tests are used with nominal data (that is, data sorted into categories: for example, sex [male, female] and type of work [unskilled, semi-skilled, skilled]) to determine whether the variables are associated*.
A chi-square is called significant if there is an association between two variables, and not significant if there is no association.
To check for associations, the chi-square is calculated as follows: Suppose a researcher wants to know if there is a relationship between gender and two types of jobs, construction worker and administrative assistant. To perform a chi-square test, the researcher counts the number of female administrative assistants, the number of female construction workers, the number of male administrative assistants, and the number of male construction workers in the data.
These counts are compared to the number that would be expected in each category if there were no association between job type and gender (this expected count is based on statistical calculations). If there is a large difference between the observed values and the expected values, the chi-square test is significant, indicating that there is an association between the two variables.
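A minimal sketch of this kind of test, assuming the scipy package is available; the counts in the table below are invented for illustration.

```python
from scipy.stats import chi2_contingency

# Hypothetical observed counts
#             construction  administrative
observed = [[40, 10],   # male
            [ 5, 45]]   # female

chi2, p_value, dof, expected = chi2_contingency(observed)

print(expected)         # counts expected if gender and job type were independent
print(chi2, p_value)    # a small p-value indicates a significant association
```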
*The chi-square test can also be used as a measure of goodness of fit, to check whether the data in a sample come from a population with a specific distribution, as an alternative to the Anderson-Darling and Kolmogorov-Smirnov goodness-of-fit tests. As such, the chi-square test is not limited to nominal data; however, with data that are not already in categories, the results depend on how the intervals or classes are created and on the size of the sample.
Correlation
The correlation coefficient is used to measure the strength of the relationship between numerical variables (for example, weight and height).
The most common correlation coefficient is Pearson’s r, which can range from -1 to +1.
If the coefficient is between 0 and +1, then as one variable increases, the other tends to increase as well. This is called a positive correlation. For example, height and weight are positively correlated because taller people tend to weigh more.
If the correlation coefficient is between -1 and 0, then as one variable increases, the other tends to decrease. This is called a negative correlation. For example, age and hours of sleep per night are negatively correlated because older people tend to sleep fewer hours per night.
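A minimal sketch of computing Pearson’s r, assuming the scipy package is available; the height and weight pairs are made up for the example.

```python
from scipy.stats import pearsonr

# Hypothetical paired measurements for six people
heights = [160, 165, 170, 175, 180, 185]   # cm
weights = [55, 62, 68, 74, 80, 88]         # kg

r, p_value = pearsonr(heights, weights)
print(round(r, 2))   # close to +1: in this made-up sample, taller people weigh more
```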