Transformer-based models, highly successful in natural language processing and computer vision, are increasingly being applied to time series analysis tasks such as forecasting and anomaly detection [1]. However, the unique structure of temporal data (cycles, seasonality) differs significantly from that of language, raising questions about whether standard positional encoding mechanisms are truly suitable for this domain.
Previous research already suggests that their effectiveness in time series applications may be limited [2].
This project aims to analyze published experiments to validate (or refute) these findings across different application domains — such as finance, energy, and industrial monitoring — using multiple accuracy and anomaly detection metrics.
We also explore an alternative form of positional encoding specifically designed for time series applications. Various approaches will be studied, including conventional positional encodings and specialized architectures that integrate these principles. Their impact on prediction quality and early anomaly detection will be assessed, comparing them against standard methods while analyzing advantages and limitations in terms of interpretability, generalization capacity, and computational cost.
- Conduct an exhaustive review of the state of the art on positional encoding and its application in Transformer models for time series.
- Systematically evaluate the effectiveness of standard positional encodings in forecasting and anomaly detection tasks using benchmark time series datasets.
- Study novel proposals for positional encodings or complete architectures adapted to time series, exploring alternative approaches beyond the current state of the art.
To compare the different methods, a Python script, run_exp.py, has been created. It modifies the behavior of a base Informer model [3] and accepts multiple input parameters to configure the model as desired, including the type of positional encoding (PE) and its associated hyperparameters. The available parameters are listed below, followed by an illustrative invocation.
| Parameter | Description |
|---|---|
| --model | Type of model to use (informer) |
| --ex_name | Experiment name |
| --folder | Directory where the model (e.g., InformerVanilla or InformerRope) is located |
| --data | Dataset name |
| --root_path | Root path where the dataset is located |
| --data_path | Data file name |
| --features | Type of prediction: M (multi→multi), S (uni→uni), MS (multi→uni) |
| --target | Target variable for S or MS tasks |
| --freq | Temporal frequency for encoding (hours: h; minutes: t; seconds: s) |
| --checkpoints | Path to save model checkpoints |
| --seq_len | Input sequence length for the encoder |
| --label_len | Length of the decoder's start token |
| --pred_len | Length of the sequence to predict |
| --enc_in | Number of input variables to the encoder |
| --dec_in | Number of input variables to the decoder |
| --c_out | Number of model outputs |
| --d_model | Model dimension |
| --n_heads | Number of attention heads |
| --e_layers | Number of encoder layers |
| --d_layers | Number of decoder layers |
| --s_layers | Stacked encoder layers (stack mode only) |
| --d_ff | Inner feed-forward dimension |
| --factor | Reduction factor for probabilistic attention |
| --padding | Padding type (0: none, 1: same) |
| --distil | Disable distilling if included |
| --dropout | Dropout rate |
| --attn | Type of attention in the encoder (use 'full' to avoid information loss) |
| --time_encoding | Type of positional/temporal encoding (see the table below) |
| --embed | Type of temporal embedding (timeF, fixed, learned) |
| --activation | Activation function (e.g., gelu, relu) |
| --window | Window size for statistics |
| --output_attention | Display encoder-generated attention |
| --cols | Specific dataset columns to use |
| --num_workers | Number of DataLoader workers |
| --itr | Number of experiment repetitions |
| --train_epochs | Number of training epochs |
| --batch_size | Batch size for training (default: 32) |
| --patience | Patience for early stopping (default: 3) |
| --learning_rate | Learning rate |
| --des | Experiment description |
| --loss | Loss function (mse, mae, etc.) |
| --lradj | Learning rate adjustment strategy |
| --use_amp | Use mixed-precision training (AMP) |
| --inverse | Invert output transformation |
| --shuffle_decoder_input | Shuffle decoder inputs during testing |
| --use_gpu | Use GPU if available |
| --gpu | GPU index to use |
| --use_multi_gpu | Enable multi-GPU support |
| --devices | IDs of GPUs to use |
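For reference, a minimal invocation might look like the sketch below. The dataset name, paths, and hyperparameter values are illustrative placeholders, not the settings used in the reported experiments:

```bash
# Illustrative run of run_exp.py; all values below are placeholders.
python run_exp.py \
  --model informer \
  --ex_name example_run \
  --folder InformerVanilla \
  --data ETTh1 \
  --root_path ./data/ \
  --data_path ETTh1.csv \
  --features M \
  --freq h \
  --seq_len 96 --label_len 48 --pred_len 24 \
  --enc_in 7 --dec_in 7 --c_out 7 \
  --attn full \
  --time_encoding stats \
  --window 24 \
  --train_epochs 6 --batch_size 32 --patience 3 \
  --itr 1
```

The accepted values for --time_encoding are listed in the following table.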
| Value | Description |
|---|---|
| no_pe | No positional encoding; only raw input data are used. |
| informer | Original temporal encoding from Informer. |
| stats | WinStat base: encoding based on sliding-window statistics, computing mean, standard deviation, and extrema. |
| stats_lags | WinStatLag: same as stats, but includes lag features as local context. |
| all_pe_weighted | WinStatFlex: weighted combination of the above, plus fixed and learnable PEs (LPE), normalized via Softmax. |
| tpe | WinStatTPE: Temporal Positional Encoding (t-PE), integrating lag, window, and fixed-PE information with learned, Softmax-normalized weights. |
| tupe | Transformer with Untied Positional Encoding (TUPE), which decouples word and positional correlations in the self-attention module, enhancing model expressiveness. |
| rope | Rotary Positional Encoding (RoPE), which encodes positional information by applying rotation matrices to the query and key vectors in self-attention, preserving relative distances and improving sequence modeling. The --folder parameter must be set to InformerRope. |
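For intuition about the last entry, the snippet below is a minimal, generic sketch of the RoPE rotation (half-split variant) applied to query/key tensors; it is not the repository's InformerRope implementation, whose details may differ:

```python
import torch

def apply_rope(x: torch.Tensor) -> torch.Tensor:
    """Generic RoPE rotation (half-split variant) for query/key tensors.

    x: (batch, seq_len, n_heads, head_dim) with an even head_dim.
    Each channel pair is rotated by an angle proportional to the sequence
    position, so dot products between rotated queries and keys depend only
    on the relative offset between positions.
    """
    _, seq_len, _, head_dim = x.shape
    half = head_dim // 2
    # Frequencies as in the RoPE paper: theta_i = 10000^(-i / (d/2)), i = 0..d/2-1
    freqs = 10000.0 ** (-torch.arange(half, dtype=torch.float32) / half)
    angles = torch.arange(seq_len, dtype=torch.float32)[:, None] * freqs[None, :]
    cos = angles.cos()[None, :, None, :]   # (1, seq_len, 1, half)
    sin = angles.sin()[None, :, None, :]
    x1, x2 = x[..., :half], x[..., half:]  # channel pairs rotated together
    return torch.cat([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1)

# In self-attention, both queries and keys are rotated before computing
# the attention scores: q, k = apply_rope(q), apply_rope(k)
```

Because dot products between vectors rotated in this way depend only on the difference of their positions, the attention scores become a function of relative distance, which is the property highlighted in the table above.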
PE variants whose performance fell below the no_pe baseline were omitted, as were those, such as SPE, that disrupted the attention mechanism and performed poorly on the datasets used [4].
An example execution can be found in the slurm_task.sh file.
To run this project, an up-to-date environment with PyTorch (Python 3.12) is required.
You can create it from the requirements.txt file:
conda create --name <env_name> --file requirements.txt
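For example, with a hypothetical environment name of pe-informer:

```bash
# The environment name is a placeholder; choose any name you like.
conda create --name pe-informer --file requirements.txt
conda activate pe-informer
```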
[1] Wen, Q., Zhou, T., Zhang, C., Chen, W., Ma, Z., Yan, J., & Sun, L. (2022). Transformers in time series: A survey. arXiv preprint arXiv:2202.07125.
[2] Zeng, A., Chen, M., Zhang, L., & Xu, Q. (2023, June). Are transformers effective for time series forecasting? In Proceedings of the AAAI Conference on Artificial Intelligence (Vol. 37, No. 9, pp. 11121–11128).
[3] Zhou, H., Zhang, S., Peng, J., Zhang, S., Li, J., Xiong, H., & Zhang, W. (2021). Informer: Beyond efficient transformer for long sequence time-series forecasting. In Proceedings of the AAAI Conference on Artificial Intelligence (Vol. 35, No. 12, pp. 11106–11115).
[4] Irani, H., & Metsis, V. (2025). Positional Encoding in Transformer-Based Time Series Models: A Survey. arXiv preprint arXiv:2502.12370. https://arxiv.org/abs/2502.12370