Accepted author manuscript, 1.05 MB, PDF document
Final published version
Licence: CC BY: Creative Commons Attribution 4.0 International License
Research output: Contribution to Journal/Magazine › Journal article › peer-review
TY - JOUR
T1 - Hybrid Safe Reinforcement Learning
T2 - Tackling Distribution Shift and Outliers with the Student-t’s Process
AU - Hickman, Xavier
AU - Lu, Yang
AU - Prince, Daniel
PY - 2025/6/14
Y1 - 2025/6/14
N2 - Safe reinforcement learning (SRL) aims to optimize control policies that maximize long-term reward while adhering to safety constraints. SRL has many real-world applications, such as autonomous vehicles, industrial robotics, and healthcare. Recent advances in offline reinforcement learning (RL) - where agents learn policies from static datasets without interacting with the environment - have made it a promising approach to derive safe control policies. However, offline RL faces significant challenges, such as covariate shift and outliers in the data, which can lead to suboptimal policies. Similarly, online SRL, which derives safe policies through real-time environment interaction, struggles with outliers and often relies on unrealistic regularity assumptions, limiting its practicality. This paper addresses these challenges by proposing a hybrid offline-online approach. First, prior knowledge from offline learning guides online exploration. Then, during online learning, we replace the popular Gaussian Process (GP) with the Student-t's Process (TP) to enhance robustness to covariate shift and outliers.
AB - Safe reinforcement learning (SRL) aims to optimize control policies that maximize long-term reward while adhering to safety constraints. SRL has many real-world applications, such as autonomous vehicles, industrial robotics, and healthcare. Recent advances in offline reinforcement learning (RL) - where agents learn policies from static datasets without interacting with the environment - have made it a promising approach to derive safe control policies. However, offline RL faces significant challenges, such as covariate shift and outliers in the data, which can lead to suboptimal policies. Similarly, online SRL, which derives safe policies through real-time environment interaction, struggles with outliers and often relies on unrealistic regularity assumptions, limiting its practicality. This paper addresses these challenges by proposing a hybrid offline-online approach. First, prior knowledge from offline learning guides online exploration. Then, during online learning, we replace the popular Gaussian Process (GP) with the Student-t's Process (TP) to enhance robustness to covariate shift and outliers.
U2 - 10.1016/j.neucom.2025.129912
DO - 10.1016/j.neucom.2025.129912
M3 - Journal article
VL - 634
JO - Neurocomputing
JF - Neurocomputing
SN - 0925-2312
M1 - 129912
ER -