Stationarity in time series analysis

A review of the concept and types of stationarity

Shay Palachy Affek
Towards Data Science


This post is meant to provide a concise but comprehensive overview of the concept of stationarity and of the different types of stationarity defined in academic literature dealing with time series analysis.

Future posts will aim to provide similarly concise overviews of detection of non-stationarity in time series data and of the different ways to transform non-stationary time series into stationary ones.¹

Why is stationarity important?

Before diving into formal definitions of stationarity, and the related concepts upon which it builds, it is worth considering why the concept of stationarity has become important in time series analysis and its various applications.

In the most intuitive sense, stationarity means that the statistical properties of a process generating a time series do not change over time. It does not mean that the series itself does not change over time, only that the way it changes does not itself change over time. A rough algebraic analogue is a linear function rather than a constant one: the value of a linear function changes as 𝒙 grows, but the way it changes remains constant; it has a constant slope, a single value that captures that rate of change.

Figure 1: Time series generated by a stationary (top) and a non-stationary (bottom) process.

Why is this important? First, because stationary processes are easier to analyze. Without a formal definition for processes generating time series data (yet; they are called stochastic processes and we will get to them in a moment), it is already clear that stationary processes are a sub-class of a wider family of possible models of reality. This sub-class is much easier to model and investigate. The above informal definition also hints that such processes should be possible to predict, as the way they change is predictable.

Although it sounds a bit streetlight effect-ish that simpler theories or models should become more prominent, it is actually quite a common pattern in science, and for good reason. In many cases simple models can be surprisingly useful, either as building blocks in constructing more elaborate ones, or as helpful approximations to complex phenomena. As it turns out, this is also true for stationary processes.

Due to these properties, stationarity has become a common assumption for many practices and tools in time series analysis. These include trend estimation, forecasting and causal inference, among others.

The final reason, thus, for stationarity’s importance is its ubiquity in time series analysis, making the ability to understand, detect and model it necessary for the application of many prominent tools and procedures in time series analysis. Indeed, for many cases involving time series, you will find that you have to be able to determine if the data was generated by a stationary process, and possibly to transform it so it has the properties of a sample generated by such a process.

Hopefully, I have convinced you by now that understanding stationarity is important if you want to deal with time series data, and we can proceed to introducing the subject more formally.

A formal definition for stochastic processes

Before introducing more formal notions for stationarity, a few precursory definitions are required. This section is meant to provide a quick overview of basic concepts in time series analysis and stochastic process theory required for further reading. Feel free to skip ahead if you are familiar with them.

Time series: Commonly, a time series (x₁, …, xₑ) is assumed to be a sequence of real values taken at successive equally spaced points in time, from time t=1 to time t=e.

Lag: For some specific time point r, the observation xᵣ₋ᵢ (i periods back) is called the i-th lag of xᵣ. A time series Y generated by back-shifting another time series X by i time steps is also sometimes called the i-th lag of X, or an i-lag of X. This transformation is called both the backshift operator, commonly denoted as B(∙), and the lag operator, commonly denoted as L(∙); thus, L(Xᵣ)=Xᵣ₋₁. Powers of the operator are defined as Lⁱ(Xᵣ)=Xᵣ₋ᵢ.
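As a small illustration (assuming pandas is available; the dates and values below are made up), the lag operator corresponds to shifting a series, which pandas exposes as shift:

```python
import pandas as pd

# A toy daily time series; the values are arbitrary, for illustration only.
x = pd.Series([3.0, 5.0, 4.0, 6.0, 7.0],
              index=pd.date_range("2020-01-01", periods=5, freq="D"))

# The lag operator L(x) aligns each time point t with the value observed at t-1.
lag_1 = x.shift(1)   # first lag, L(x)
lag_2 = x.shift(2)   # second lag, L²(x)

print(pd.DataFrame({"x": x, "L(x)": lag_1, "L2(x)": lag_2}))
```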

Stochastic Processes

A common approach in the analysis of time series data is to consider the observed time series as part of a realization of a stochastic process. Two cursory definitions are required before defining stochastic processes.

Probability Space: A probability space is a triple (Ω, F, P), where
(i) Ω is a nonempty set, called the sample space.
(ii) F is a σ-algebra of subsets of Ω, i.e. a family of subsets closed with respect to countable union and complement with respect to Ω.
(iii) P is a probability measure defined for all members of F.

Random Variable: A real random variable or real stochastic variable on (Ω,F,P) is a function x:Ω→ℝ, such that the inverse image of any interval (-∞,a] belongs to F; i.e. a measurable function.

We can now define what a stochastic process is.

Stochastic Process: A real stochastic process is a family of real random variables 𝑿={xᵢ(ω); i∈T}, all defined on the same probability space (Ω, F, P). The set T is called the index set of the process. If T⊂ℤ, then the process is called a discrete stochastic process. If T is an interval of ℝ, then the process is called a continuous stochastic process.

Finite Dimensional Distribution: For a finite set of integers T={t₁, …, tₙ}, the joint distribution function of 𝑿={xᵢ(ω); i∈T} is defined by

F_{t_1,\dots,t_n}(a_1,\dots,a_n) = P\left(x_{t_1} \le a_1, \dots, x_{t_n} \le a_n\right)

Equation 1: The joint distribution function.

Which for a stochastic process 𝑿 is also commonly denoted as:

F_{\boldsymbol{X}}(x_{t_1}, \dots, x_{t_n})

The finite dimensional distribution of a stochastic process is then defined to be the set of all such joint distribution functions for all such finite integer sets T of any size n. For a discrete process it is thus the set:

\left\{\, F_{t_1,\dots,t_n} : n \in \mathbb{N},\; t_1,\dots,t_n \in \mathbb{Z} \,\right\}

Equation 2: Finite dimensional distribution for a discrete stochastic process.

Intuitively, this represents a projection of the process onto a finite-dimensional vector space (in this case, a finite set of time points).

Definitions of stationarity

Having a basic definition of stochastic processes to build on, we can now introduce the concept of stationarity.

Intuitively, stationarity means that the statistical properties of the process do not change over time. However, several different notions of stationarity have been suggested in econometric literature over the years.

An important distinction to make before diving into these definitions is that stationarity — of any kind — is a property of a stochastic process, and not of any finite or infinite realization of it (i.e. a time series of values).

Strong stationarity

Strong stationarity requires the shift-invariance (in time) of the finite-dimensional distributions of a stochastic process. This means that the distribution of a finite sub-sequence of random variables of the stochastic process remains the same as we shift it along the time index axis. For example, all i.i.d. stochastic processes are stationary.³

Formally, the discrete stochastic process 𝑿={xᵢ; i∈ℤ} is stationary if

F_{\boldsymbol{X}}(x_{t_1+\tau}, \dots, x_{t_n+\tau}) = F_{\boldsymbol{X}}(x_{t_1}, \dots, x_{t_n})

Equation 3: The stationarity condition.

for any finite T={t₁, …, tₙ} with n∈ℕ and any τ∈ℤ. [Cox & Miller, 1965] For continuous stochastic processes the condition is similar, with T⊂ℝ, n∈ℕ and any τ∈ℝ instead.

This is the most common definition of stationarity, and it is commonly referred to simply as stationarity. It is sometimes also referred to as strict-sense stationarity or strong-sense stationarity.

Note: This definition does not assume the existence/finiteness of any moment of the random variables composing the stochastic process!

Weak stationarity

Weak stationarity only requires the shift-invariance (in time) of the first moment and the cross moment (the auto-covariance). This means the process has the same mean at all time points, and that the covariance between the values at any two time points, t and t−k, depends only on k, the difference between the two times, and not on the location of the points along the time axis.

Formally, the process {xᵢ; i∈ℤ} is weakly stationary if:
1. The first moment of xₜ is constant; i.e. ∀t, E[xₜ]=𝜇
2. The second moment of xₜ is finite for all t; i.e. ∀t, E[xₜ²]<∞ (which of course also implies E[(xₜ-𝜇)²]<∞; i.e. that the variance is finite for all t)
3. The cross moment (the auto-covariance) depends only on the difference u−v; i.e. ∀u,v,a, cov(xᵤ, xᵥ)=cov(xᵤ₊ₐ, xᵥ₊ₐ)

The third condition implies that every lag 𝜏∈ℕ has a constant covariance value associated with it:

\mathrm{cov}(x_t, x_{t+\tau}) = \gamma_\tau \quad \text{for all } t

Note that this directly implies that the variance of the process is also constant, since for all t we get:

\mathrm{Var}(x_t) = \mathrm{cov}(x_t, x_t) = \gamma_0

This paints a specific picture of weakly stationary processes as those with constant mean and variance. Their properties are contrasted nicely with those of their counterparts in Figure 2 below.

Figure 2: Constancy in mean and variance.

Other common names for weak stationarity are wide-sense stationarity, weak-sense stationarity, covariance stationarity and second order stationarity². Confusingly enough, it is also sometimes referred to simply as stationarity, depending on context (see [Boshnakov, 2011] for an example); in geo-statistical literature, for example, this is the dominant notion of stationarity. [Myers, 1989]

Note: Strong stationarity does not imply weak stationarity, nor does the latter imply the former (see example here)! An exception is Gaussian processes, for which weak stationarity does imply strong stationarity.
The reason strong stationarity does not imply weak stationarity is that it does not mean the process necessarily has a finite second moment; e.g. an IID process with standard Cauchy distribution is strictly stationary but has no finite second moment (see [Myers, 1989]). Indeed, having a finite second moment is a necessary and sufficient condition for the weak stationarity of a strongly stationary process.
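To make the Cauchy example concrete, here is a minimal numpy sketch (the sample sizes are arbitrary); because the distribution has no finite second moment, the sample variance does not settle down as the sample grows:

```python
import numpy as np

rng = np.random.default_rng(0)

# An i.i.d. standard-Cauchy process is strongly stationary (its joint
# distributions are shift-invariant), but it has no finite second moment,
# so it cannot be weakly stationary.
for n in (10**3, 10**5, 10**7):
    sample = rng.standard_cauchy(n)
    # The sample variance keeps jumping around instead of converging,
    # because the theoretical variance is undefined.
    print(n, sample.var())
```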

White Noise Process: A white noise process is a serially uncorrelated stochastic process with a mean of zero and a constant and finite variance.

Formally, the process {xᵢ; i∈ℤ} is a white noise process if:
1. The first moment of xₜ is always zero; i.e. ∀t, E[xₜ]=0
2. The second moment of xₜ is finite and constant for all t; i.e. ∀t, E[xₜ²]=σ²<∞
3. The cross moment E[xᵤxᵥ] is zero when u≠v; i.e. ∀u,v with u≠v, cov(xᵤ, xᵥ)=0

Note that this implies that every white noise process is a weakly stationary process. If, additionally, every variable xᵢ follows a normal distribution with zero mean and the same variance σ², then the process is said to be a Gaussian white noise process.
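As a quick empirical illustration (a minimal numpy sketch; the sample size and variance are arbitrary), we can simulate a Gaussian white noise process and check the three conditions above on the sample:

```python
import numpy as np

rng = np.random.default_rng(42)

# Gaussian white noise: zero mean, constant finite variance,
# and zero covariance between distinct time points.
n, sigma = 100_000, 1.5
x = rng.normal(loc=0.0, scale=sigma, size=n)

print("sample mean    :", x.mean())   # close to 0
print("sample variance:", x.var())    # close to sigma**2

# The empirical autocovariance should be close to 0 for every lag > 0,
# and depend only on the lag, not on the position along the time axis.
for lag in (1, 5, 20):
    autocov = np.cov(x[:-lag], x[lag:])[0, 1]
    print(f"autocovariance at lag {lag:>2}:", autocov)
```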

N-th order stationarity

Very close to the definition of strong stationarity, N-th order stationarity demands the shift-invariance (in time) of the distribution of any n samples of the stochastic process, for all n up to order N.

Thus, the same condition is required:

F_{\boldsymbol{X}}(x_{t_1+\tau}, \dots, x_{t_n+\tau}) = F_{\boldsymbol{X}}(x_{t_1}, \dots, x_{t_n})

Equation 4: The N-th order stationarity condition.

for any finite T={t₁, …, tₙ} with n∈{1, …, N} and any τ∈ℤ.

Naturally, stationarity of a certain order N does not imply stationarity of any higher order (but the converse is true). An interesting thread on MathOverflow showcases both an example of a 1st order stationary process that is not 2nd order stationary, and an example of a 2nd order stationary process that is not 3rd order stationary.

Note that stationarity of the N-th order for N=2 is surprisingly not equivalent to weak stationarity, even though the latter is sometimes referred to as second-order stationarity. [Myers, 1989] Like with strong stationarity, the condition which 2nd order stationarity sets for the distribution of any two samples of 𝑿 does not imply that 𝑿 has finite moments. And similarly, having a finite second moment is a sufficient and necessary condition for a 2nd order stationary process to also be a weakly stationary process.

First-order stationarity

The term first-order stationarity is sometimes used to describe a series whose mean never changes with time, but for which any other moment (like variance) can change. [Boshnakov, 2011]

Again, note that this definition is not equivalent to N-th order stationarity for N=1, as the latter entails that the xᵢ are all identically distributed for a process 𝑿={xᵢ; i∈ℤ}. For example, a process where xᵢ~𝓝(𝜇, f(i)), where f(i)=1 for even values of i and f(i)=2 for odd values, has a constant mean over time, but the xᵢ are not identically distributed. As a result, such a process satisfies this specific definition of first-order stationarity, but not N-th order stationarity for N=1.
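A minimal numpy sketch of the alternating-variance example above (the mean value and the two variances are arbitrary) shows a constant mean alongside a variance that depends on the time index:

```python
import numpy as np

rng = np.random.default_rng(0)

mu, n = 3.0, 100_000
i = np.arange(n)

# Variance 1 at even indices, variance 2 at odd indices, constant mean mu.
variance = np.where(i % 2 == 0, 1.0, 2.0)
x = rng.normal(loc=mu, scale=np.sqrt(variance))

print("mean, even indices    :", x[0::2].mean())   # close to mu
print("mean, odd indices     :", x[1::2].mean())   # close to mu
print("variance, even indices:", x[0::2].var())    # close to 1
print("variance, odd indices :", x[1::2].var())    # close to 2
```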

Cyclostationarity

A stochastic process is cyclostationary if the joint distribution of any set of samples is invariant over a time shift of mP, where m∈ℤ and P∈ℕ is the period of the process:

F_{\boldsymbol{X}}(x_{t_1+mP}, \dots, x_{t_n+mP}) = F_{\boldsymbol{X}}(x_{t_1}, \dots, x_{t_n})

Equation 5: The cyclostationarity condition.

Cyclostationarity is prominent in signal processing.

Figure 3: A white noise process 𝑛(𝑡) modulated by sin(2𝜔𝑡) produces the cyclostationary process 𝑥(𝑡)=𝑛(𝑡)·sin(2𝜔𝑡)
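A minimal numpy sketch of the construction in Figure 3 (the frequency ω and the sampling grid are arbitrary choices for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)

# White noise n(t) modulated by sin(2*omega*t), as in Figure 3.
omega = np.pi
t = np.arange(0, 8, 0.01)
n_t = rng.normal(size=t.shape)
x = n_t * np.sin(2 * omega * t)

# The variance of x at time t is sin(2*omega*t)**2, which is periodic in t,
# so the statistical properties of the process repeat over time rather than
# staying constant: the process is cyclostationary, not stationary.
```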

Trend stationarity

A stochastic process is trend stationary if an underlying trend (function solely of time) can be removed, leaving a stationary process. Meaning, the process can be expressed as yᵢ=f(i)+εᵢ, where f(i) is any function f:ℝ→ℝ and εᵢ is a stationary stochastic process with a mean of zero.

In the presence of a shock (a significant and rapid one-off change to the value of the series), trend-stationary processes are mean-reverting; i.e. over time, the series will converge again towards the growing (or shrinking) mean, which is not affected by the shock.

Figure 4: Trend stationary processes revert to their mean after a shock is applied.
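As a minimal numpy sketch (the trend coefficients and noise level are arbitrary), we can generate a trend-stationary series, estimate the deterministic trend, and remove it to recover a stationary residual:

```python
import numpy as np

rng = np.random.default_rng(7)

# A trend-stationary series: deterministic linear trend f(i) = a + b*i
# plus zero-mean stationary noise eps_i.
n = 500
i = np.arange(n)
y = 2.0 + 0.05 * i + rng.normal(scale=1.0, size=n)

# Estimate the trend by least squares and subtract it.
slope, intercept = np.polyfit(i, y, deg=1)
residual = y - (intercept + slope * i)

print("residual mean:", residual.mean())  # close to 0
print("residual std :", residual.std())   # close to the noise level (1.0)
```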

Joint stationarity

Intuitive extensions of all of the above types of stationarity exist for pairs of stochastic processes. For example, for a pair of stochastic processes 𝑿 and 𝒀, joint strong stationarity is defined by the same condition as strong stationarity, simply imposed on the joint cumulative distribution function of the two processes. Weak stationarity and N-th order stationarity can be extended in the same way (the latter to M-N-th order joint stationarity).

The intrinsic hypothesis

The intrinsic hypothesis is a weaker form of weak stationarity, prominent in geostatistical literature (see [Myers 1989] and [Fischer et al. 1996], for example). It holds for a stochastic process 𝑿={Xᵢ} if:

  1. The expected difference between values at any two places separated by distance r is zero: E[xᵢ-xᵢ₊ᵣ]=0
  2. The variance of differences, given by Var[xᵢ-xᵢ₊ᵣ], exists (i.e. it is finite) and depends only on the distance r.

This notion implies weak stationarity of the difference Xᵢ-Xᵢ₊ᵣ, and was extended with a definition of N-th order intrinsic hypothesis.

Locally stationary stochastic processes

An important class of non-stationary processes are locally stationary (LS) processes. One intuitive definition for LS processes, given in [Cardinali & Nason, 2010], is that their statistical properties change slowly over time. Alternatively, [Dahlhaus, 2012] defines them (informally) as processes which locally at each time point are close to a stationary process but whose characteristics (covariances, parameters, etc.) are gradually changing in an unspecific way as time evolves. A formal definition can be found in [Vogt, 2012], and [Dahlhaus, 2012] provides a rigorous review of the subject.

LS processes are of importance because they somewhat bridge the gap between the thoroughly explored sub-class of parametric non-stationary processes (see the following section) and the uncharted waters of the wider family of non-parametric processes, in that they have received rigorous treatment and a corresponding set of analysis tools akin to those enjoyed by parametric processes. A great online resource on the topic is the home page of Prof. Guy Nason, who names LS processes as his main research interest.

The typology of notions of stationarity

The following typology figure, partial as it may be, can help understand the relations between the different notions of stationarity we just went over:

Figure 5: Types of non-stationary processes

Parametric notions of non-stationarity

The definitions of stationarity presented so far have been non-parametric; i.e., they did not assume a model for the data-generating process, and thus apply to any stochastic process. The related concepts of difference stationarity and unit root processes, however, require a brief introduction to stochastic process modeling.

The topic of stochastic modeling is also relevant insofar as various simple models can be used to create stochastic processes (see Figure 6).

Figure 6: Various non-stationary processes (the purple white noise process is an exception).

Basic concepts in stochastic process modeling

The forecasting of future values is a common task in the study of time series data. To make forecasts, some assumptions need to be made regarding the Data Generating Process (DGP), the mechanism generating the data. These assumptions often take the form of an explicit model of the process, and are also often used when modeling stochastic processes for other tasks, such as anomaly detection or causal inference. We will go over the three most common such models.

The autoregressive (AR) model: A time series modeled using an AR model is assumed to be generated as a linear function of its past values, plus a random noise/error:

x_t = c + \sum_{i=1}^{p} \phi_i \, x_{t-i} + \varepsilon_t

Equation 6: The autoregressive model.

This is a memory-based model, in the sense that each value is correlated with the p preceding values; an AR model of order p is denoted AR(p). The coefficients 𝜙ᵢ are weights measuring the influence of these preceding values on the value x[t], c is a constant intercept and ε is a univariate white noise process (commonly assumed to be Gaussian).
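As a minimal sketch (assuming numpy; the intercept and coefficients are arbitrary but chosen so that the process is stationary), an AR(2) process can be simulated by direct recursion:

```python
import numpy as np

rng = np.random.default_rng(0)

# x_t = c + phi1*x_{t-1} + phi2*x_{t-2} + eps_t
c, phi1, phi2 = 0.5, 0.6, 0.2
n, burn_in = 1_000, 200

eps = rng.normal(size=n + burn_in)
x = np.zeros(n + burn_in)
for t in range(2, n + burn_in):
    x[t] = c + phi1 * x[t - 1] + phi2 * x[t - 2] + eps[t]

# Discard the burn-in so the retained sample is close to the
# stationary distribution of the process.
x = x[burn_in:]
```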

The vector autoregressive (VAR) model generalizes the univariate case of the AR model to the multivariate case; now each element of the vector x[t] of length k can be modeled as a linear function of all the elements of the past p vectors:

\mathbf{x}_t = \mathbf{c} + A_1 \mathbf{x}_{t-1} + \dots + A_p \mathbf{x}_{t-p} + \mathbf{e}_t

Equation 7: The vector autoregressive model.

where c is a vector of k constants (the intercepts), Aᵢ are time-invariant k×k matrices and e={eᵢ; i∈ℤ} is a multivariate white noise process of k variables.

The moving average (MA) model: A time series modeled using a moving average model, denoted with MA(q), is assumed to be generated as a linear function of the last q+1 random shocks generated by εᵢ, a univariate white noise process:

x_t = \mu + \varepsilon_t + \theta_1 \varepsilon_{t-1} + \dots + \theta_q \varepsilon_{t-q}

Equation 8: The moving average model.

Like for autoregressive models, a vector generalization, VMA, exists.

The autoregressive moving average (ARMA) model: A time series modeled using an ARMA(p,q) model is assumed to be generated as a linear function of the last p values and the last q+1 random shocks generated by εᵢ, a univariate white noise process:

x_t = c + \varepsilon_t + \sum_{i=1}^{p} \phi_i \, x_{t-i} + \sum_{j=1}^{q} \theta_j \, \varepsilon_{t-j}

Equation 9: The ARMA model.

The ARMA model can be generalized in a variety of ways, for example to deal with non-linearity or with exogenous variables, to the multivariate case (VARMA) or to deal with (a specific type of) non-stationary data (ARIMA).
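As a sketch of how such models can be simulated in practice (assuming the statsmodels library is available; the coefficient values are arbitrary), the ArmaProcess helper takes the AR and MA lag polynomials, including the zero-lag coefficient and with the AR signs flipped:

```python
import numpy as np
from statsmodels.tsa.arima_process import ArmaProcess

# ARMA(2, 1) with phi1=0.6, phi2=0.2, theta1=0.4.
# Note the convention: the AR polynomial is [1, -phi1, -phi2],
# the MA polynomial is [1, theta1].
ar = np.array([1.0, -0.6, -0.2])
ma = np.array([1.0, 0.4])

process = ArmaProcess(ar, ma)
print("stationary:", process.isstationary)

# Draw a realization of the process.
sample = process.generate_sample(nsample=1_000)
```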

Difference stationary processes

With a basic understanding of common stochastic process models, we can now discuss the related concept of difference stationary processes and unit roots. This concept relies on the assumption that the stochastic process in question can be written as an autoregressive process of order p, denoted as AR(p):

x_t = \phi_1 x_{t-1} + \phi_2 x_{t-2} + \dots + \phi_p x_{t-p} + \varepsilon_t

Equation 10: An autoregressive process of order p, or AR(p).

where εₜ is a white noise process, i.e. the error terms are uncorrelated across all times t. We can write the same process as:

\left(1 - \phi_1 L - \phi_2 L^2 - \dots - \phi_p L^p\right) x_t = \varepsilon_t

Equation 11: An AR(p) model written using lag operators.

The expression inside the parentheses on the left is called the characteristic polynomial of the process; setting it to zero gives the characteristic equation, whose roots we can consider:

1 - \phi_1 m - \phi_2 m^2 - \dots - \phi_p m^p = 0

Equation 12: The characteristic equation of an AR(p) model.

If m=1 is a root of the equation then the stochastic process is said to be a difference stationary process, or integrated. This means that the process can be transformed into a weakly-stationary process by applying a certain type of transformation to it, called differencing.

Figure 7: A time series (left) and the series after differencing (right)

Difference stationary processes have an order of integration, which is the number of times the differencing operator must be applied to it in order to achieve weak stationarity. A process that has to be differenced r times is said to be integrated of order r, denoted by I(r). This coincides exactly with the multiplicity of the root m=1; meaning, if m=1 is a root of multiplicity r of the characteristic equation, then the process is integrated of order r.
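As a small numpy sketch, a simulated random walk (which is integrated of order 1) becomes stationary after a single application of the differencing operator:

```python
import numpy as np

rng = np.random.default_rng(3)

# Random walk: x_t = x_{t-1} + eps_t, an I(1) process.
eps = rng.normal(size=1_000)
x = np.cumsum(eps)

# First difference: x_t - x_{t-1}; this recovers the white noise shocks.
dx = np.diff(x)
print(np.allclose(dx, eps[1:]))   # True
```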

Unit root processes

A common sub-type of difference stationary processes are processes integrated of order 1, also called unit root processes. The simplest example of such a process is the following autoregressive model, a random walk:

x_t = x_{t-1} + \varepsilon_t

Unit root processes, and difference stationary processes generally, are interesting because they are non-stationary processes that can be easily transformed into weakly stationary processes. As a result, while the term is not interchangeable with non-stationarity, questions about the two are sometimes conflated.

I thought it worth mentioning here, as sometimes tests and procedures that check whether a process has a unit root (a common example is the Dickey-Fuller test) are mistakenly thought of as procedures for testing non-stationarity (as a later post in this series touches upon). It is thus important to remember that these are distinct notions, and that while every process with a unit root is non-stationary, and so is every process integrated of order r>1, the opposite is far from true.
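As an illustration (assuming statsmodels is installed; the series are simulated and the exact numbers vary with the seed), the augmented Dickey-Fuller test checks the null hypothesis of a unit root, which is not the same thing as a general test for non-stationarity:

```python
import numpy as np
from statsmodels.tsa.stattools import adfuller

rng = np.random.default_rng(5)

random_walk = np.cumsum(rng.normal(size=500))   # unit root (I(1)) process
white_noise = rng.normal(size=500)              # weakly stationary process

for name, series in (("random walk", random_walk), ("white noise", white_noise)):
    # The null hypothesis of the ADF test is that a unit root is present;
    # a small p-value is evidence against a unit root, not a general
    # certificate of stationarity.
    adf_stat, p_value = adfuller(series)[:2]
    print(f"{name}: ADF statistic = {adf_stat:.2f}, p-value = {p_value:.3f}")
```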

Semi-parametric unit root processes

Another definition of interest is a wider, and less parametric, sub-class of non-stationary processes, which can be referred to as semi-parametric unit root processes. The definition was introduced in [Davidson, 2002], but a concise overview of it can be found in [Breitung, 2002].

If you are interested in the concept of stationarity, or have stumbled into the topic while working with time series data, then I hope you have found this post a good introduction to the subject. Some references and useful links are found below.

As I have mentioned, a later post in this series provides a similar overview of methods for detecting non-stationarity, and another will do the same for transforming non-stationary time series data.

Also, please feel free to get in touch with me with any comments and thoughts on the post or the topic.

References

Academic Literature

Online References

Footnotes

  1. The phrasing here is not strictly accurate, since — as we will soon see — time series cannot be stationary themselves, rather only the processes generating them can. I have used it, however, so as not to assume any knowledge for the opening paragraphs.
  2. The common synonym of weak-sense stationarity as second order stationarity is probably related to (but should not be confused with) the concept of second order stochastic processes, which are defined as stochastic processes that have a finite second moment (i.e. variance).
  3. Note that the opposite is not true. Not every stationary process is composed of IID variables; Stationarity means that the joint distribution of variables doesn’t depend on time, but they may still depend on each other.
  4. This is also a good example of the fact that IID does not imply weak stationarity; since it does imply strong stationarity, however, the same necessary and sufficient condition applies for it to also imply weak stationarity: having a finite second moment.
  5. One minor but interesting notion of stationarity is p-stationary processes.
  6. There are also formal ways to treat time series whose samples are not equally spaced.
