Trend analysis is a statistical procedure used to evaluate hypothesized linear and nonlinear relationships between two quantitative variables. It is typically applied as an analysis of variance (ANOVA) for quantitative variables or as a regression analysis. This article describes trend analysis.
It is often used in situations where data have been collected over time or at different levels of a variable, especially when a single independent variable or factor has been manipulated to observe its effects on a dependent (response) variable, as in experimental studies. In particular, the means of the dependent variable are observed across the conditions, levels, or points of the manipulated independent variable to statistically determine the shape or trend of that relationship.
Trend analysis quantifies and explains trends and patterns in “noisy” data over time. A trend is an upward or downward change in a data set over time.
Need to Understand and Quantify Change
The need to understand and quantify change is central to all sciences. It may involve describing past variation, understanding the mechanisms underlying observed changes, projecting possible future changes, or monitoring the effect of an intervention on some system.
Practitioners will be familiar with many classical statistical procedures for trend detection and estimation. However, the increasing capacity to collect and process large amounts of information has led to the realization that such procedures are limited in terms of the insights they can provide.
Whether we want to predict the trend of financial markets or electricity consumption, time is an important factor to take into account in our models. For example, it would be interesting to predict at what time of day a peak in electricity consumption will occur in order to adjust the price or production of electricity.
Time series
A time series is simply a series of data points ordered in time. In a time series, time is usually the independent variable, and the goal is usually to forecast the future.
However, there are other aspects that come into play when it comes to time series.
Is it stationary?
Is there a seasonality?
Is the target variable autocorrelated?
Autocorrelation
Informally, autocorrelation is the similarity between observations as a function of the time lag between them.
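As a minimal sketch of how this can be checked in practice (assuming Python with pandas; the weekly sine pattern below is purely illustrative synthetic data):

import numpy as np
import pandas as pd

# Illustrative series: a weekly cycle plus Gaussian noise
t = np.arange(365)
series = pd.Series(np.sin(2 * np.pi * t / 7) + np.random.normal(0, 0.3, 365))

# Autocorrelation at lag 1 and at lag 7 (one week)
print(series.autocorr(lag=1))
print(series.autocorr(lag=7))

A value near 1 at lag 7 would indicate that observations one week apart are very similar.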
Seasonality
Seasonality refers to periodic fluctuations. For example, electricity consumption is high during the day and low at night, or online sales increase during Christmas before falling again.
Stationarity
A time series is said to be stationary if its statistical properties do not change over time. In other words, it has a constant mean and variance, and the covariance is independent of time.
Trend Search
There are several tools for analyzing trends in data. They range from relatively simple ones (like linear regression) to more complex tools like the Mann-Kendall test, which can be used to look for monotonic trends that need not be linear. Other popular tools are as follows:
Autocorrelation analysis: Autocorrelation occurs when the error terms of a time series carry over from one period to another.
Curve fitting: This is useful for modeling specific trends. For example, you can try to fit a growth curve such as a Gompertz function to your data.
Filtering or smoothing: Filtering extracts a trend from a noisy data set, while smoothing assigns a weight (i.e., a higher priority) to more recent data.
The Kwiatkowski-Phillips-Schmidt-Shin (KPSS) test determines whether a time series is stationary around a mean or linear trend, or non-stationary due to a unit root (see the sketch after this list).
MANCOVA (multivariate analysis of covariance) is the multivariate equivalent of ANCOVA. It indicates whether the group differences are likely to have occurred by chance or whether there is a repeatable trend.
The seasonal Kendall test (SK test) analyzes the data looking for monotone trends in the seasonal data.
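As an illustration of the KPSS test mentioned above, here is a minimal sketch using the statsmodels implementation (the trend-plus-noise series is synthetic and the parameters are illustrative):

import numpy as np
from statsmodels.tsa.stattools import kpss

# Synthetic series with a linear trend plus Gaussian noise
y = 0.05 * np.arange(200) + np.random.normal(0, 1, 200)

# Null hypothesis: the series is stationary around a constant mean
statistic, p_value, n_lags, critical_values = kpss(y, regression="c")
print(statistic, p_value)

A small p-value here argues against stationarity; passing regression="ct" instead tests stationarity around a linear trend.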
Testing for Trends
Various statistical hypothesis tests have been developed to explore whether there is something more interesting in one or more data sets than what might be expected from random fluctuations in Gaussian noise. The simplest of these tests is known as linear regression or ordinary least squares.
The basic idea is that we test an alternative hypothesis that posits a linear relationship between the independent variable (we’ll call it x) and the dependent variable (we’ll use the generic variable y).
The underlying statistical model for the data is:
y_i = a + b·x_i + ε_i
Where i goes from 1 to N, a is the intercept of the linear relationship between y and x, b is the slope of that relationship, and ε is a random noise sequence. The simplest assumption is that ε is Gaussian white noise, but sometimes we will be forced to relax that assumption.
Linear regression
Linear regression determines the values of a and b that best fit the given data by minimizing the sum of the squared differences between the observations y and the values predicted by the linear model ŷ = a + b·x. The residuals are our estimate of the variation in the data that is not explained by the linear relationship and are defined as:
ε_i = y_i − ŷ_i
For simple linear regression, that is, ordinary least squares, the estimates of a and b are easily obtained:
b = [N·Σ(x_i·y_i) − Σx_i·Σy_i] / [N·Σx_i² − (Σx_i)²]
and
a = (1/N)·Σy_i − (b/N)·Σx_i
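As a minimal sketch of these closed-form estimates (using synthetic data drawn from the model above; the true values a = 1.0 and b = 0.5 are arbitrary choices):

import numpy as np

# Synthetic data from y_i = a + b*x_i + eps_i with Gaussian white noise
N = 100
x = np.arange(N, dtype=float)
y = 1.0 + 0.5 * x + np.random.normal(0, 2.0, N)  # true a = 1.0, true b = 0.5

# Closed-form ordinary least squares estimates of slope and intercept
b_hat = (N * np.sum(x * y) - np.sum(x) * np.sum(y)) / (N * np.sum(x**2) - np.sum(x)**2)
a_hat = np.mean(y) - b_hat * np.mean(x)
print(a_hat, b_hat)

Note that the expression for a reduces to the mean of y minus b times the mean of x.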
The parameter that interests us most is b, since it is the one that determines whether or not there is a significant linear relationship between y and x.
Sample Uncertainty
The sample uncertainty in b can also be easily obtained:
σ_b = std(ε) / [Σ(x_i − μ(x))²]^(1/2)
Where std(ε) is the standard deviation of ε and μ(x) is the mean of x. A statistically significant trend is equivalent to the finding that b is significantly different from zero. The 95% confidence interval for b is given by b ± 2σ_b. If this interval does not cross zero, then it can be concluded that b is significantly different from zero. Alternatively, we can measure significance in terms of the linear correlation coefficient, r, between the independent and dependent variables, which is related to b by:
r = b · std(x) / std(y)
r is calculated directly from the data:
r = [1/(N − 1)] · Σ(x_i − x̄)(y_i − ȳ) / [std(x)·std(y)]
Where the bar indicates the mean. Unlike b, which has dimensions (for example, °C per year when y is temperature and x is time), r is conveniently a dimensionless number whose absolute value is between 0 and 1. The larger the absolute value of r (whether positive or negative), the more significant the trend. In fact, the square of r (r²) is a measure of the fraction of the variation in the data that is explained by the trend.
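Continuing the earlier sketch, the slope uncertainty, confidence interval, and correlation coefficient follow directly from these formulas (again with synthetic data, so the exact numbers will vary from run to run):

import numpy as np

# Same synthetic trend-plus-noise data as in the previous sketch
N = 100
x = np.arange(N, dtype=float)
y = 1.0 + 0.5 * x + np.random.normal(0, 2.0, N)
b_hat = (N * np.sum(x * y) - np.sum(x) * np.sum(y)) / (N * np.sum(x**2) - np.sum(x)**2)
a_hat = np.mean(y) - b_hat * np.mean(x)

# Sample uncertainty in the slope, computed from the residuals
residuals = y - (a_hat + b_hat * x)
sigma_b = np.std(residuals) / np.sqrt(np.sum((x - np.mean(x))**2))
print(b_hat - 2 * sigma_b, b_hat + 2 * sigma_b)  # approximate 95% interval

# Correlation coefficient and fraction of variance explained
r = b_hat * np.std(x) / np.std(y)
print(r, r**2)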
p-value
We measure the significance of any detected trend in terms of a p-value. The p-value estimates the probability of observing a trend at least as large as the one found in the data if the null hypothesis of no trend were true; it tells us how much we risk erroneously rejecting that null hypothesis in favor of the alternative hypothesis that there is a linear trend in the data, the signal we are looking for in this case.
Therefore, the smaller the p-value, the less likely it is that a trend as large as the one found in the data would arise from random fluctuations alone. By convention, p < 0.05 is often required to conclude that there is a significant trend (i.e., such a trend would have occurred by chance alone only 5% of the time), but it is not a magic number.
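In practice, the slope, correlation coefficient, and p-value can be obtained in a single call; here is a minimal sketch with scipy, reusing the same kind of synthetic data:

import numpy as np
from scipy import stats

# Synthetic trend-plus-noise data
x = np.arange(100, dtype=float)
y = 1.0 + 0.5 * x + np.random.normal(0, 2.0, 100)

# linregress tests the null hypothesis that the slope is zero
result = stats.linregress(x, y)
print(result.slope, result.rvalue, result.pvalue)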
Moving average
The moving average model is probably the most naive approach to time series modelling. This model simply states that the next observation is the mean of all past observations.
Although simple, this model can be surprisingly good and represents a good starting point.
On the other hand, the moving average can be used to identify interesting trends in the data. We can define a window to apply the moving average model to smooth the time series and highlight the different trends.
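As a minimal sketch of using a moving average to expose a trend (assuming pandas; the daily index and the 30-day window are arbitrary choices):

import numpy as np
import pandas as pd

# Hypothetical daily series: a slow trend buried in noise
values = 0.05 * np.arange(365) + np.random.normal(0, 1, 365)
series = pd.Series(values, index=pd.date_range("2023-01-01", periods=365, freq="D"))

# A 30-day moving average smooths out short-term fluctuations
trend = series.rolling(window=30).mean()
print(trend.dropna().head())

Widening the window gives a smoother curve at the cost of responsiveness to recent changes.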
Exponential smoothing
Exponential smoothing uses a logic similar to the moving average, but this time a different, decreasing weight is assigned to each observation. In other words, observations are given less importance as we move away from the present.
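A minimal sketch of the recursion (the smoothing factor alpha = 0.3 is an arbitrary choice; values closer to 1 track the data more closely):

import numpy as np

def exponential_smoothing(y, alpha):
    # Each smoothed value blends the newest observation with the
    # previous smoothed value, so older data decays geometrically.
    smoothed = [y[0]]
    for value in y[1:]:
        smoothed.append(alpha * value + (1 - alpha) * smoothed[-1])
    return np.array(smoothed)

y = 0.05 * np.arange(100) + np.random.normal(0, 1, 100)
print(exponential_smoothing(y, alpha=0.3)[-5:])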
Double Exponential Smoothing
Double exponential smoothing is used when there is a trend in the time series. In that case, we use this technique, which is simply exponential smoothing applied twice: once to the level of the series and once to its trend.
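A minimal sketch of that double recursion (often called Holt's method; alpha and beta are arbitrary smoothing factors for the level and trend respectively):

import numpy as np

def double_exponential_smoothing(y, alpha, beta):
    # Smooth the level and the trend with separate factors
    level, trend = y[0], y[1] - y[0]
    result = [y[0]]
    for value in y[1:]:
        previous_level = level
        level = alpha * value + (1 - alpha) * (level + trend)
        trend = beta * (level - previous_level) + (1 - beta) * trend
        result.append(level)
    return np.array(result)

y = 0.5 * np.arange(120) + np.random.normal(0, 3, 120)
print(double_exponential_smoothing(y, alpha=0.5, beta=0.3)[-5:])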
Triple exponential smoothing
This method extends the double exponential smoothing by adding a seasonal smoothing factor. Of course, this is useful if seasonality is observed in the time series.
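One way to sketch triple exponential smoothing is with the Holt-Winters implementation in statsmodels (the monthly series, 12-month season, and additive components are all illustrative assumptions):

import numpy as np
from statsmodels.tsa.holtwinters import ExponentialSmoothing

# Synthetic monthly series: trend plus a 12-month seasonal cycle
t = np.arange(120)
y = 0.2 * t + 5 * np.sin(2 * np.pi * t / 12) + np.random.normal(0, 1, 120)

# Additive trend and additive seasonality (Holt-Winters)
fit = ExponentialSmoothing(y, trend="add", seasonal="add", seasonal_periods=12).fit()
print(fit.forecast(12))  # one full seasonal cycle ahead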
Seasonal Autoregressive Integrated Moving Average (SARIMA) model
SARIMA combines simpler building blocks (autoregression, differencing, and moving averages, together with their seasonal counterparts) into a single model that can handle time series with non-stationary and seasonal properties.
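A minimal sketch with the SARIMAX implementation in statsmodels (the orders (1, 1, 1) x (1, 1, 1, 12) are illustrative placeholders, not tuned values):

import numpy as np
from statsmodels.tsa.statespace.sarimax import SARIMAX

# Synthetic monthly series with trend and yearly seasonality
t = np.arange(120)
y = 0.2 * t + 5 * np.sin(2 * np.pi * t / 12) + np.random.normal(0, 1, 120)

# (p, d, q) x (P, D, Q, s): non-seasonal and seasonal orders
model = SARIMAX(y, order=(1, 1, 1), seasonal_order=(1, 1, 1, 12))
fit = model.fit(disp=False)
print(fit.forecast(steps=12))

In practice the orders are usually chosen by inspecting autocorrelation plots or by an information-criterion search.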
Potential weaknesses of trend analysis
Although trend analysis can be extremely useful in many applications, from climate change to sociological analysis, it is important to note that it is not foolproof. In particular:
All data (unless collected through a population census) is subject to sampling error. The extent of this problem increases when coarse sampling methods (e.g., convenience sampling) are used.
The data is likely to be subject to random, systematic, or external measurement error. Trends in this error can be confused with trends in the actual data.
Short-term “ghost” trends exist even in the most random of number sequences, so trends should be followed for as long as possible.
Also, finding no trend can mean there is no trend, but it can also mean that the data is insufficient to reveal a trend that does exist.
We hope this article has helped you understand the concept of trend analysis.