Experiencing SAX: a Novel Symbolic Representation of Time Series

JESSICA LIN jessica@ise.gmu.edu
Information and Software Engineering Department, George Mason University, Fairfax, VA 22030

EAMONN KEOGH eamonn@cs.ucr.edu
LI WEI wli@cs.ucr.edu
STEFANO LONARDI stelo@cs.ucr.edu
Computer Science & Engineering Department, University of California - Riverside, Riverside, CA 92521

Abstract. Many high level representations of time series have been proposed for data mining, including Fourier transforms, wavelets, eigenwaves, piecewise polynomial models, etc. Many researchers have also considered symbolic representations of time series, noting that such representations would potentially allow researchers to avail of the wealth of data structures and algorithms from the text processing and bioinformatics communities. While many symbolic representations of time series have been introduced over the past decades, they all suffer from two fatal flaws. Firstly, the dimensionality of the symbolic representation is the same as that of the original data, and virtually all data mining algorithms scale poorly with dimensionality. Secondly, although distance measures can be defined on the symbolic approaches, these distance measures have little correlation with distance measures defined on the original time series. In this work we formulate a new symbolic representation of time series. Our representation is unique in that it allows dimensionality/numerosity reduction, and it also allows distance measures to be defined on the symbolic approach that lower bound corresponding distance measures defined on the original series. As we shall demonstrate, this latter feature is particularly exciting because it allows one to run certain data mining algorithms on the efficiently manipulated symbolic representation, while producing identical results to the algorithms that operate on the original data.
In particular, we will demonstrate the utility of our representation on the data mining tasks of clustering, classification, query by content, anomaly detection, motif discovery, and visualization.

Keywords: Time Series, Data Mining, Symbolic Representation, Discretize

1. Introduction

Many high level representations of time series have been proposed for data mining. Figure 1 illustrates a hierarchy of all the various time series representations in the literature [3, 11, 20, 24, 27, 30, 36, 53, 54, 63]. One representation that the data mining community has not considered in detail is the discretization of the original data into symbolic strings. At first glance this seems a surprising oversight. There is an enormous wealth of existing algorithms and data structures that allow the efficient manipulation of strings. Such algorithms have received decades of attention in the text retrieval community, and more recent attention from the bioinformatics community [5, 19, 25, 52, 57, 60]. Some simple examples of "tools" that are not defined for real-valued sequences but are defined for symbolic approaches include hashing, Markov models, suffix trees, decision trees, etc.

There is, however, a simple explanation for the data mining community's lack of interest in string manipulation as a supporting technique for mining time series. If the data are transformed into virtually any of the other representations depicted in Figure 1, then it is possible to measure the similarity of two time series in that representation space, such that the distance is guaranteed to lower bound the true distance between the time series in the original space¹. This simple fact is at the core of almost all algorithms in time series data mining and indexing [20]. However, in spite of the fact that there are dozens of techniques for producing different variants of the symbolic representation [3, 15, 27], there is no known method for calculating the distance in the symbolic space while providing the lower bounding guarantee.

¹ The exceptions are random mappings, which are only guaranteed to be within an epsilon of the true distance with a certain probability, trees, interpolation and natural language.

In addition to allowing the creation of lower bounding distance measures, there is one other highly desirable property of any time series representation, including a symbolic one. Almost all time series datasets are very high dimensional. This is a challenging fact because all non-trivial data mining and indexing algorithms degrade exponentially with dimensionality. For example, above 16-20 dimensions, index structures degrade to sequential scanning [26]. None of the symbolic representations that we are aware of allow dimensionality reduction [3, 15, 27]. There is some reduction in the storage space required, since fewer bits are required for each value; however, the intrinsic dimensionality of the symbolic representation is the same as that of the original data.

There is no doubt that a new symbolic representation that remedies all the problems mentioned above would be highly desirable. More specifically, the symbolic representation should meet the following criteria: space efficiency, time efficiency (fast indexing), and correctness of answer sets (no false dismissals).

In this work we formulate a novel symbolic representation and show its utility on other time series tasks². Our representation is unique in that it allows dimensionality/numerosity reduction, and it also allows distance measures to be defined on the symbolic representation that lower bound corresponding popular distance measures defined on the original data.
As we shall demonstrate, the latter feature is particularly exciting because it allows one to run certain data mining algorithms on the efficiently manipulated symbolic representation, while producing identical results to the algorithms that operate on the original data. In particular, we will demonstrate the utility of our representation on the classic data mining tasks of clustering [29], classification [24], indexing [2, 20, 30, 63], and anomaly detection [14, 34, 54].

The rest of this paper is organized as follows. Section 2 briefly discusses background material on time series data mining and related work. Section 3 introduces our novel symbolic approach, and discusses its dimensionality reduction, numerosity reduction and lower bounding abilities. Section 4 contains an experimental evaluation of the symbolic approach on a variety of data mining tasks. The impact of the symbolic approach is also discussed. Finally, Section 5 offers some conclusions and suggestions for future work.

2. Background and Related Work

Time series data mining has attracted enormous attention in the last decade. The review below is necessarily brief; we refer interested readers to [32, 53] for a more in-depth review.

² A preliminary version of this paper appears in [41].

Figure 1: A hierarchy of all the various time series representations in the literature. The leaf nodes refer to the actual representation, and the internal nodes refer to the classification of the approach. The contribution of this paper is to introduce a new representation, the lower bounding symbolic approach.

[Figure 1 node labels: Time Series Representations; Data Adaptive; Non Data Adaptive; Spectral; Wavelets; Piecewise Aggregate Approximation; Piecewise Polynomial; Symbolic; Singular Value Decomposition; Random Mappings; Piecewise Linear Approximation; Adaptive Piecewise Constant Approximation; Discrete Fourier Transform; Discrete Cosine Transform; Haar; Daubechies dbn, n > 1; Coiflets; Symlets; Sorted Coefficients; Orthonormal; Bi-Orthonormal; Interpolation; Regression; Trees; Natural Language; Strings; Lower Bounding; Non-Lower Bounding]
2.1 Time Series Data Mining Tasks

While making no pretence to be exhaustive, the following list summarizes the areas that have seen the majority of research interest in time series data mining.

• Indexing: Given a query time series Q, and some similarity/dissimilarity measure D(Q,C), find the most similar time series in database DB [2, 11, 20, 30, 63].
• Clustering: Find natural groupings of the time series in database DB under some similarity/dissimilarity measure D(Q,C) [29, 36].
• Classification: Given an unlabeled time series Q, assign it to one of two or more predefined classes [24].
• Summarization: Given a time series Q containing n datapoints where n is an extremely large number, create a (possibly graphic) approximation of Q which retains its essential features but fits on a single page, computer screen, executive summary, etc. [43].
• Anomaly Detection: Given a time series Q, and some model of "normal" behavior, find all sections of Q which contain anomalies or "surprising/interesting/unexpected/novel" behavior [14, 34, 54].

Since the datasets encountered by data miners typically don't fit in main memory, and disk I/O tends to be the bottleneck for any data mining task, a simple generic framework for time series data mining has emerged [20]. The basic approach is outlined in Table 1.

Table 1: A generic time series data mining approach
1. Create an approximation of the data, which will fit in main memory, yet retains the essential features of interest.
2. Approximately solve the task at hand in main memory.
3. Make (hopefully very few) accesses to the original data on disk to confirm the solution obtained in Step 2, or to modify the solution so it agrees with the solution we would have obtained on the original data.

It should be clear that the utility of this framework depends heavily on the quality of the approximation created in Step 1.
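As a rough illustration of the three steps in Table 1, the following Python sketch runs a one-nearest-neighbor query. It is only a sketch under stated assumptions: the approximation is taken to be PAA (segment means), candidates are pruned with the standard PAA lower bound sqrt(n/w) times the Euclidean distance between the reduced vectors, and the "disk" is simulated by an in-memory list. The function names (paa, lower_bound, nearest_neighbor) are hypothetical and are not taken from any library or from the authors' code.

```python
import math

def paa(series, w):
    """Mean of each of w equal-sized frames (assumes len(series) is divisible by w)."""
    frame = len(series) // w
    return [sum(series[i * frame:(i + 1) * frame]) / frame for i in range(w)]

def euclidean(x, y):
    """Euclidean distance between two equal-length sequences."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def lower_bound(q_approx, c_approx, n):
    """PAA distance: never exceeds the Euclidean distance between the raw series."""
    w = len(q_approx)
    return math.sqrt(n / w) * euclidean(q_approx, c_approx)

def nearest_neighbor(query, raw_db, w):
    """Return (index, distance, raw_reads) of the true nearest neighbor of query."""
    n = len(query)
    # Step 1: build the compact approximation that is kept in main memory.
    approx_db = [paa(c, w) for c in raw_db]
    q_approx = paa(query, w)
    # Step 2: solve approximately by ranking candidates on the lower bound.
    bounds = [lower_bound(q_approx, c, n) for c in approx_db]
    order = sorted(range(len(raw_db)), key=lambda i: bounds[i])
    # Step 3: confirm on the raw data, visiting candidates in lower-bound order.
    best_idx, best_dist, raw_reads = -1, math.inf, 0
    for idx in order:
        if bounds[idx] >= best_dist:
            break  # every remaining candidate is at least this far away
        raw_reads += 1  # stands in for one disk access (assumption)
        d = euclidean(query, raw_db[idx])
        if d < best_dist:
            best_idx, best_dist = idx, d
    return best_idx, best_dist, raw_reads
```

Because the bound never overestimates the true distance, stopping early cannot discard the true nearest neighbor; a tighter approximation simply means fewer raw reads in Step 3.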
If the approximation is very faithful to the original data, then the solution obtained in main memory is likely to be the same as, or very close to, the solution we would have obtained on the original data. The handful of disk accesses made in Step 3 to confirm or slightly modify the solution will be inconsequential compared to the number of disk accesses required had we worked on the original data. With this in mind, there has been great interest in approximate representations of time series, which we consider below.

2.2 Time Series Representations

As with most problems in computer science, the suitable choice of representation greatly affects the ease and efficiency of time series data mining. With this in mind, a great number of time series representations have been introduced, including the Discrete Fourier Transform (DFT) [20], the Discrete Wavelet Transform (DWT) [11], Piecewise Linear models, Piecewise Constant models such as PAA [30] and APCA [24, 30], and Singular Value Decomposition (SVD) [30]. Figure 2 illustrates the most commonly used representations.

Recent work suggests that there is little to choose between the above in terms of indexing power [32]; however, the representations have other features that may act as strengths or weaknesses. As a simple example, wavelets have the useful multiresolution property, but are only defined for time series that are an integer power of two in length [11].

One important feature of all the above representations is that they are real valued. This limits the algorithms, data structures and definitions available for them. For example, in anomaly detection we cannot meaningfully define the probability of observing any particular set of wavelet coefficients, since the probability of observing any real number is zero [38]. Such limitations have led researchers to consider using a symbolic representation of time series.

Figure 2: The most common representations for time series data mining. Each can be visualized as an attempt to approximate the signal with a linear combination of basis functions.

While there are literally hundreds of papers on discretizing (symbolizing, tokenizing, quantizing) time series [3, 27] (see [15] for an extensive survey), none of the techniques allows a distance measure that lower bounds a distance measure defined on the original time series. For this reason, the generic time series data mining approach illustrated in Table 1 is of little utility, since the approximate solution to the problem created in main memory may be arbitrarily dissimilar to the true solution that would have been obtained on the original data. If, however, one had a symbolic approach that allowed lower bounding of the true distance, one could take advantage of the generic time series data mining model, and of a host of other algorithms, definitions and data structures which are only defined for discrete data, including hashing, Markov models, and suffix trees. This is exactly the contribution of this paper. We call our symbolic representation of time series SAX (Symbolic Aggregate approXimation), and define it in the next section.

3. SAX: Our Symbolic Approach

SAX allows a time series of arbitrary length n to be reduced to a string of arbitrary length w (w < n, typically w << n). The alphabet size is also an arbitrary integer a, where a > 2. Table 2 summarizes the major notation used in this and subsequent sections.
Table 2: A summarization of the notation used in this paper

  Symbol       Meaning
  $C$          A time series $C = c_1, \ldots, c_n$
  $\bar{C}$    A Piecewise Aggregate Approximation of a time series, $\bar{C} = \bar{c}_1, \ldots, \bar{c}_w$
  $\hat{C}$    A symbol representation of a time series, $\hat{C} = \hat{c}_1, \ldots, \hat{c}_w$
  $w$          The number of PAA segments representing time series $C$
  $a$          Alphabet size (e.g., for the alphabet = {a,b,c}, a = 3)

Our discretization procedure is unique in that it uses an intermediate representation between the raw time series and the symbolic strings. We first transform the data into the Piecewise Aggregate Approximation (PAA) representation and then symbolize the PAA representation into a discrete string. There are two important advantages to doing this:

• Dimensionality Reduction: We can use the well-defined and well-documented dimensionality reduction power of PAA [30, 63], and the reduction is automatically carried over to the symbolic representation.
• Lower Bounding: Proving that a distance measure between two symbolic strings lower bounds the true distance between the original time series is non-trivial. The key observation that allowed us to prove lower bounds was to concentrate on proving that the symbolic distance measure lower bounds the PAA distance measure. Then we can prove the desired result by transitivity, by simply pointing to the existing proofs for the PAA representation itself [31, 63].

[Figure 2 panels, each plotted over the range 0 to 100: Discrete Fourier Transform; Piecewise Linear Approximation; Haar Wavelet; Adaptive Piecewise Constant Approximation]

We will briefly review the PAA technique before considering the symbolic extension.

3.1 Dimensionality Reduction Via PAA

A time series C of length n can be represented in a w-dimensional space by a vector $\bar{C} = \bar{c}_1, \ldots, \bar{c}_w$. The ith element of $\bar{C}$ is calculated by the following equation:

$$\bar{c}_i = \frac{w}{n} \sum_{j=\frac{n}{w}(i-1)+1}^{\frac{n}{w} i} c_j \qquad (1)$$

Simply stated, to reduce the time series from n dimensions to w dimensions, the data is divided into w equal sized "frames". The mean value of the data falling within a frame is calculated and a vector of these values becomes the data-reduced representation. The representation can be visualized as an attempt to approximate the original time series with a linear combination of box basis functions as shown in Figure 3. For simplicity and clarity, we assume that n is divisible by w. We will relax this assumption in Section 3.5.

Figure 3: The PAA representation can be visualized as an attempt to model a time series with a linear combination of box basis functions. In this case, a sequence of length 128 is reduced to 8 dimensions.

The PAA dimensionality reduction is intuitive and simple, yet has been shown to rival more sophisticated dimensionality reduction techniques like Fourier transforms and wavelets [30, 32, 63]. We normalize each time series to have a mean of zero and a standard deviation of one before converting it to the PAA representation, since it is well understood that it is meaningless to compare time series with different offsets and amplitudes [32].

3.2 Discretization

Having transformed a time series database into the PAA we can apply a further transformation to obtain a discrete representation.
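To make the PAA step of Section 3.1 concrete before moving on, the following Python sketch normalizes a series and then applies Equation 1. It is a minimal illustration rather than the authors' implementation: the function names z_normalize and paa are hypothetical, the population standard deviation is an assumption (the text does not say which estimator is used), and n is assumed to be divisible by w as in Section 3.1.

```python
import math

def z_normalize(series):
    """Rescale a series to mean 0 and standard deviation 1.

    Assumption: population standard deviation (divide by n).
    """
    n = len(series)
    mean = sum(series) / n
    std = math.sqrt(sum((x - mean) ** 2 for x in series) / n)
    if std == 0:
        raise ValueError("a constant series cannot be normalized")
    return [(x - mean) / std for x in series]

def paa(series, w):
    """Equation 1: the i-th output is the mean of the i-th of w equal-sized frames."""
    n = len(series)
    if n % w != 0:
        raise ValueError("this sketch assumes n is divisible by w")
    frame = n // w  # number of points per frame, n/w
    # 0-based frame i covers points frame*i .. frame*(i+1)-1, which matches
    # j = (n/w)(i-1)+1 .. (n/w)i in the 1-based indexing of Equation 1.
    return [sum(series[frame * i:frame * (i + 1)]) / frame for i in range(w)]

# Usage: reduce 8 points to 4 dimensions.
raw = [1, 1, 3, 3, 5, 5, 7, 7]
reduced = paa(raw, 4)                      # [1.0, 3.0, 5.0, 7.0]
reduced_normalized = paa(z_normalize(raw), 4)
```

Dividing each frame sum by n/w is the same as multiplying by w/n, which is the leading factor in Equation 1.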
It is desirable to have a discretization technique that will produce symbols with equiprobability [5, 45]. This is easily achieved since normalized time series have a Gaussian distribution [38]. To illustrate this, we extracted subsequences of length 128 from 8 different time series and plotted normal probability plots of the data as shown in Figure 4. A normal probability plot is a graphical technique that shows whether the data is approximately normally distributed [1]: an approximately straight line indicates that the data is approximately normally distributed. As the figure shows, the highly linear nature of the plots suggests that the data is approximately normal. For a large family of the time series data at our disposal, we found that the Gaussian assumption indeed holds. For the small subset of data where the assumption is not