Cloud computing is a relatively recent model where scalable and elastic resources are provided as optimized, cost-effective and on-demand utility-like services to customers. As one of the major trends in the IT industry in recent years, cloud computing has gained momentum and started to revolutionise the way enterprises create and deliver IT solutions. Motivated primarily due to cost reduction, these cloud environments are also being used by Information and Communication Technologies (ICT) operating Critical Infrastructures (CI). However, due to the complex nature of underlying infrastructures, these environments are subject to a large number of challenges, including mis-configurations, cyber attacks and malware instances, which manifest themselves as anomalies. These challenges clearly reduce the overall reliability and availability of the cloud, i.e., it is less resilient to challenges. Resilience is intended to be a fundamental property of cloud service provisioning platforms. However, a number of significant challenges in the past demonstrated that cloud environments are not as resilient as one would hope. There is also limited understanding about how to provide resilience in the cloud that can address such challenges. This implies that it is of utmost importance to clearly understand and define what constitutes the correct, normal behaviour so that deviation from it can be detected as anomalies and consequently higher resilience can be achieved. Also, for characterising and identifying challenges, anomaly detection techniques can be used and this is due to the fact that the statistical models embodied in these techniques allow the robust characterisation of normal behaviour, taking into account various monitoring metrics to detect known and unknown patterns. These anomaly detection techniques can also be applied within a resilience framework in order to promptly provide indications and warnings about adverse events or conditions that may occur. However, due to the scale and complexity of cloud, detection based on continuous real time infrastructure monitoring becomes challenging. Because monitoring leads to an overwhelming volume of data, this adversely affects the ability of the underlying detection mechanisms to analyse the data. The increasing volume of metrics, compounded with complexity of infrastructure, may also cause low detection accuracy. In this thesis, a comprehensive evaluation of anomaly detection techniques in cloud infrastructures is presented under typical elastic behaviour. More specifically, an investigation of the impact of live virtual machine migration on state of the art anomaly detection techniques is carried out, by evaluating live migration under various attack types and intensities. An initial comparison concludes that, whilst many detection techniques have been proposed, none of them is suited to work within a cloud operational context. The results suggest that in some configurations anomalies are missed and some configuration anomalies are wrongly classified. Moreover, some of these approaches have been shown to be sensitive to parameters of the datasets such as the level of traffic aggregation, and they suffer from other robustness problems. In general, anomaly detection techniques are founded on specific assumptions about the data, for example the statistical distributions of events. If these assumptions do not hold, an outcome can be high false positive rates. Based on this initial study, the objective of this work is to establish a light-weight real time anomaly detection technique which is more suited to a cloud operational context by keeping low false positive rates without the need for prior knowledge and thus enabling the administrator to respond to threats effectively. Furthermore, a technique is needed which is robust to the properties of cloud infrastructures, such as elasticity and limited knowledge of the services, and such that it can support other resilience supporting mechanisms. From this formulation, a cloud resilience management framework is proposed which incorporates the anomaly detection and other supporting mechanisms that collectively address challenges that manifest themselves as anomalies. The framework is a holistic endto-end framework for resilience that considers both networking and system issues, and spans the various stages of an existing resilience strategy, called (D2R 2+DR). In regards to the operational applicability of detection mechanisms, a novel Anomaly Detection-as-a-Service (ADaaS) architecture has been modelled as the means to implement the detection technique. A series of experiments was conducted to assess the effectiveness of the proposed technique for ADaaS. These aimed to improve the viability of implementing the system in an operational context. Finally, the proposed model is deployed in a European Critical Infrastructure provider’s network running various critical services, and validated the results in real time scenarios with the use of various test cases, and finally demonstrating the advantages of such a model in an operational context. The obtained results show that anomalies are detectable with high accuracy with no prior-knowledge, and it can be concluded that ADaaS is applicable to cloud scenarios for a flexible multi-tenant detection systems, clearly establishing its effectiveness for cloud infrastructure resilience.