publications
2025
- [Preprint] Staying on the Manifold: Geometry-Aware Noise Injection. Albert Kjøller Jacobsen, Johanna Marie Gegenfurtner, and Georgios Arvanitidis. arXiv preprint arXiv:2509.20201, 2025.
It has been shown that perturbing the input during training implicitly regularizes the gradient of the learned function, leading to smoother models and improved generalization. However, previous research mostly considered the addition of ambient noise in the input space, without considering the underlying structure of the data. In this work, we propose several methods of adding geometry-aware input noise that accounts for the lower-dimensional manifold that the data inhabits within the input space. We start by projecting ambient Gaussian noise onto the tangent space of the manifold. In a second step, the noise sample is mapped onto the manifold via the associated geodesic curve. We also consider Brownian motion noise, which moves in random steps along the manifold. We show that geometry-aware noise leads to improved generalization and robustness to hyperparameter selection on highly curved manifolds, while performing at least as well as training without noise on simpler manifolds. Our proposed framework extends to learned data manifolds. (A toy sketch of the tangent-space projection and geodesic mapping follows the BibTeX entry below.)
@article{jacobsen2025stayingmanifoldgeometryawarenoise,
  title   = {Staying on the Manifold: Geometry-Aware Noise Injection},
  author  = {Jacobsen, Albert Kj{\o}ller and Gegenfurtner, Johanna Marie and Arvanitidis, Georgios},
  journal = {arXiv preprint arXiv:2509.20201},
  year    = {2025},
  venue   = {Preprint},
}
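The tangent-space construction has a simple closed form when the manifold is known analytically. Below is a minimal, hypothetical sketch for the unit sphere, where both the tangent projection and the geodesic (exponential) map are available in closed form; the paper's framework additionally covers Brownian-motion noise and learned data manifolds, which are not shown here.

```python
import numpy as np

def tangent_noise_sphere(x, sigma=0.1, rng=None):
    """Sample ambient Gaussian noise and project it onto the tangent
    space of the unit sphere at point x (assumes ||x|| = 1)."""
    rng = np.random.default_rng() if rng is None else rng
    eps = sigma * rng.standard_normal(x.shape)  # ambient Gaussian noise
    return eps - np.dot(eps, x) * x             # remove the normal component

def exp_map_sphere(x, v):
    """Map a tangent vector v at x back onto the sphere along the geodesic
    (great circle) starting at x with initial velocity v."""
    norm_v = np.linalg.norm(v)
    if norm_v < 1e-12:
        return x
    return np.cos(norm_v) * x + np.sin(norm_v) * (v / norm_v)

# Example: perturb a training input that lives on the unit sphere.
x = np.array([1.0, 0.0, 0.0])
x_noisy = exp_map_sphere(x, tangent_noise_sphere(x, sigma=0.05))
print(np.linalg.norm(x_noisy))  # still ~1.0: the perturbed point stays on the manifold
```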
- [Preprint] Monge SAM: Robust Reparameterization-Invariant Sharpness-Aware Minimization Based on Loss Geometry. Albert Kjøller Jacobsen and Georgios Arvanitidis. arXiv preprint arXiv:2502.08448, 2025.
Recent studies on deep neural networks show that flat minima of the loss landscape correlate with improved generalization. Sharpness-aware minimization (SAM) efficiently finds flat regions by updating the parameters according to the gradient at an adversarial perturbation. The perturbation depends on the Euclidean metric, making SAM non-invariant under reparametrizations, which blurs the link between sharpness and generalization. We propose Monge SAM (M-SAM), a reparametrization-invariant version of SAM obtained by considering a Riemannian metric in the parameter space induced naturally by the loss surface. Compared to previous approaches, M-SAM works under any modeling choice and relies only on mild assumptions, while being as computationally efficient as SAM. We theoretically argue that M-SAM interpolates between SAM and gradient descent (GD), which increases robustness to hyperparameter selection and reduces attraction to suboptimal equilibria such as saddle points. We demonstrate this behavior both theoretically and empirically on a multi-modal representation alignment task. (An illustrative sketch of a metric-preconditioned SAM update follows the BibTeX entry below.)
@article{jacobsen2025monge,
  title   = {Monge SAM: Robust Reparameterization-Invariant Sharpness-Aware Minimization Based on Loss Geometry},
  author  = {Jacobsen, Albert Kj{\o}ller and Arvanitidis, Georgios},
  journal = {arXiv preprint arXiv:2502.08448},
  year    = {2025},
  venue   = {Preprint},
}
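To make the geometric idea concrete, the following is a minimal sketch of a SAM-style step whose ascent direction is preconditioned by the inverse of a Monge-style metric g = I + ∇L ∇L^T induced by the loss graph; by the Sherman-Morrison identity, g^{-1}∇L = ∇L / (1 + ||∇L||^2). This is an assumption-laden illustration of the general mechanism (interpolating between SAM and GD), not the exact M-SAM update from the paper, and the toy quadratic loss is purely for demonstration.

```python
import numpy as np

def sam_step(theta, loss_grad, lr=0.01, rho=0.05, monge=True):
    """One sharpness-aware update on parameters theta.

    monge=False: standard SAM ascent direction (Euclidean normalization).
    monge=True : gradient preconditioned by the inverse of the metric
                 g = I + grad grad^T, i.e. grad / (1 + ||grad||^2).
    Illustrative sketch only; the exact M-SAM normalization may differ.
    """
    g = loss_grad(theta)
    if monge:
        direction = g / (1.0 + g @ g)              # shrinks towards plain GD when ||g|| is large
    else:
        direction = g / (np.linalg.norm(g) + 1e-12)
    theta_adv = theta + rho * direction            # adversarial perturbation
    return theta - lr * loss_grad(theta_adv)       # descend using the perturbed gradient

# Toy quadratic loss L(theta) = 0.5 * ||theta||^2, so grad L(theta) = theta.
theta = np.array([1.0, -2.0])
for _ in range(500):
    theta = sam_step(theta, loss_grad=lambda t: t)
print(theta)  # approaches the minimum at zero
```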
- [ICASSP] How Redundant Is the Transformer Stack in Speech Representation Models? Teresa Dorszewski, Albert Kjøller Jacobsen, Lenka Tětková, and Lars Kai Hansen. In ICASSP 2025-2025 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2025.
Self-supervised speech representation models, particularly those leveraging transformer architectures, have demonstrated remarkable performance across various tasks such as speech recognition, speaker identification, and emotion detection. Recent studies on transformer models have revealed high redundancy between layers and the potential for significant pruning, which we investigate here for transformer-based speech representation models. We perform a detailed analysis of layer similarity in speech representation models using three similarity metrics: cosine similarity, centered kernel alignment, and mutual nearest-neighbor alignment. Our findings reveal a block-like structure of high similarity, suggesting two main processing steps and significant redundancy of layers. We demonstrate the effectiveness of pruning transformer-based speech representation models without the need for post-training, achieving up to a 40% reduction in transformer layers while maintaining over 95% of the model’s predictive capacity. Furthermore, we employ a knowledge distillation method to substitute the entire transformer stack with mimicking layers, reducing the network size by 95-98% and the inference time by up to 94%. This substantial decrease in computational load occurs without considerable performance loss, suggesting that the transformer stack is almost completely redundant for downstream applications of speech representation models. (A sketch of the linear CKA computation follows the BibTeX entry below.)
@inproceedings{dorszewski2025redundant,
  title        = {How Redundant Is the Transformer Stack in Speech Representation Models?},
  author       = {Dorszewski, Teresa and Jacobsen, Albert Kj{\o}ller and T{\v{e}}tkov{\'a}, Lenka and Hansen, Lars Kai},
  booktitle    = {ICASSP 2025-2025 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)},
  pages        = {1--5},
  year         = {2025},
  organization = {IEEE},
}
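One of the three similarity metrics, linear centered kernel alignment (CKA), is straightforward to compute from extracted layer activations. The sketch below uses random stand-in activations purely as placeholders; applying it to real hidden states from a speech representation model is what would reveal the block-like structure described in the abstract.

```python
import numpy as np

def linear_cka(X, Y):
    """Linear centered kernel alignment between two layer representations,
    each of shape (n_samples, n_features). Values near 1 indicate highly
    similar (and hence potentially redundant) layers."""
    X = X - X.mean(axis=0, keepdims=True)   # center each feature over the batch
    Y = Y - Y.mean(axis=0, keepdims=True)
    hsic = np.linalg.norm(Y.T @ X, 'fro') ** 2
    return hsic / (np.linalg.norm(X.T @ X, 'fro') * np.linalg.norm(Y.T @ Y, 'fro'))

# Hypothetical usage: compare all pairs of transformer-layer outputs
# extracted from a speech model (random matrices stand in for activations).
rng = np.random.default_rng(0)
layers = [rng.standard_normal((128, 768)) for _ in range(4)]
similarity = np.array([[linear_cka(a, b) for b in layers] for a in layers])
print(np.round(similarity, 2))  # with real activations, blocks of high similarity appear
```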