How we use Self-Supervised Learning in Personalised Medicine
Capitalising on exceedingly large and real-time physiological data in the absence of sufficient labels
Operating in the realm of physiological and behavioural data has never been more fruitful, nor more challenging. With the widespread adoption of wearables, coupled with the ability to decode behavioural information from apps and wearables, users effectively generate data continuously.
While the associated software engineering challenges of storing and querying this information are largely circumvented by accessible cloud infrastructure APIs, integrating the data into efficient machine learning workflows is non-trivial. Framing models in a Bayesian or reinforcement learning paradigm allows for online posterior updates in light of additional information, but this does not help us unlock new unlabeled data sources.
How do we efficiently leverage the acquisitions of novel and large data sources without rebuilding our entire data pipeline?
In the medical machine learning world our data often constitutes:
- Clinical & academic studies: small, highly reliable, explicit datasets that investigate the relationships between physiological, psychological and behavioural phenomena. Often addressing a specific question of interest.
- Realtime user-generated data: large, messy, unlabeled data that is regularly generated during the deployment of products.
Some of our applications are concerned with stress management. The premise and initial machine learning models are built on data from validated clinical studies. These studies allow us to make assumptions about behaviours that both cause and follow periods of high stress.
Consider a product that uses EEG, EKG and behavioural data (extracted from a mobile device) to monitor levels of stress and anxiety. Behavioural data may include where somebody spends their time (work, outdoors, social, etc.), their use of social media, spending habits, and so on. The product can nudge behaviour to manage stress, and detect signs of mental health decline.
Clinical studies may be used to build the initial product. After launch, the subsequent data flow may be inconsistent and diverse.
Assume this product requires EEG readings and provides diagnostics and long-term monitoring of stress levels. New users generate large amounts of unlabeled additional EEG data.
Can we leverage the large (unlabeled) EEG datasets to enhance the product offering? To this end we rely on Self-Supervised Learning (SSL).
Self-Supervised Learning (SSL)
For context, here is an SSL primer, to introduce the broad language, techniques and ideas used in the literature.
The success of human and animal learning in sparse data environments — sometimes able to learn from a single example — is often attributed to common sense. Object permanence and an intuitive understanding of gravity are two of the most widely cited examples of how humans parse information about the world in ways that prime subsequent learning. Humans, even in infancy, develop an intuition for physics that shapes their understanding of reality.
Statistically, one can consider this common sense as a function that:
- Regularises the parameter space when learning a new task: allowing the individual to approximate generalisable functions despite data sparsity.
- Maps the new task to previously learnt ideas by some distance metric: capturing the relationship between ideas.
The idea behind self-supervised learning is an attempt to emulate the acquisition of background knowledge (analogous to common sense) to greatly improve the efficiency of learning new tasks.
Most SSL applications entail learning latent, generalisable, structures from large unlabeled data, and then subsequently exploiting this structure to boost performance on downstream tasks.
Text vs Image Data
Recent developments have seen much greater success in natural language processing (NLP) SSL models than in comparable computer vision (CV) models. The prevailing community view is that this stems from the difficulty of adequately representing uncertainty in CV: a language model can place a distribution over a finite vocabulary of missing words, whereas there is no comparably tractable way to express a distribution over the space of plausible missing images.
Two broad families of SSL methods are commonly distinguished:
Contrastive SSL: uses both positive and negative examples to learn the energy function.
Non-contrastive SSL: uses only positive examples in the data — primarily used to extract predictors/feature engineering.
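To make the contrastive idea concrete, here is a minimal numpy sketch of a Hadsell-style margin loss: positive pairs are pulled together by minimising their energy, while negative pairs are pushed at least a margin apart. The embeddings, the squared-distance energy function and the margin value are illustrative assumptions, not taken from any specific paper.

```python
import numpy as np

def energy(z1, z2):
    """Energy = squared Euclidean distance between two embeddings."""
    return float(np.sum((z1 - z2) ** 2))

def contrastive_loss(z_a, z_b, same_pair, margin=1.0):
    """Margin-based contrastive loss: pull positive pairs together,
    push negative pairs at least `margin` apart."""
    d = energy(z_a, z_b)
    if same_pair:                 # positive pair: minimise the energy
        return d
    # negative pair: only penalised while closer than the margin
    return max(0.0, margin - np.sqrt(d)) ** 2

# toy embeddings
anchor   = np.array([0.0, 0.0])
positive = np.array([0.1, 0.0])   # e.g. an augmented view of the same signal
negative = np.array([2.0, 0.0])   # embedding of an unrelated signal

pos_loss = contrastive_loss(anchor, positive, same_pair=True)   # small
neg_loss = contrastive_loss(anchor, negative, same_pair=False)  # zero: already far apart
```

A non-contrastive method would drop the negative branch entirely and rely on other mechanisms (architectural asymmetry, redundancy reduction) to avoid collapsed representations.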
Now we return to our problem: scaling medical machine learning.
Physiological data regularly constitutes high-frequency signals: EEG, EKG or MEG; or high-resolution images (fMRI). fMRI more readily fits into the existing SSL literature — which is heavily centred around images — whereas neurological and physiological signals require more tailoring.
The SSL pipeline is relatively straightforward:
1. Latent Space Mapping: z ~ f(x)
Use large (unstructured) datasets, in conjunction with the primary (often labelled) dataset, to learn robust, generalisable, latent feature representations.
2. Downstream Tasks: y ~ g(z)
Leverage the transformed space to perform downstream tasks that are applicable to the specific use case.
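The two stages can be sketched end-to-end. In this hedged example PCA stands in for the learnt encoder f and the data are synthetic; in practice the representation would come from a self-supervised model trained on the pooled data.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(42)

# Stage 1 - latent space mapping z ~ f(x):
# fit the representation on ALL data, labelled and unlabeled pooled together.
X_labelled = rng.normal(size=(100, 64))      # small clinical dataset
X_unlabeled = rng.normal(size=(5000, 64))    # large user-generated dataset
X_all = np.vstack([X_labelled, X_unlabeled])

encoder = PCA(n_components=8).fit(X_all)     # stand-in for a learned encoder

# Stage 2 - downstream task y ~ g(z):
# train the supervised model on the transformed labelled data only.
y = rng.integers(0, 2, size=100)             # synthetic stress labels
z = encoder.transform(X_labelled)
clf = LogisticRegression().fit(z, y)
```

The key design point is that the encoder never sees the labels, so it can absorb every new unlabeled batch without touching the supervised pipeline.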
EEG SSL Applications
Returning to our example product, how exactly do we apply SSL to learn invariant EEG feature transformations?
One promising use of SSL in EEG is to detect and remove artifacts.
EEG datasets generally contain a series of montages, which may be offset against one another to compute relative frequency activity. Nonetheless, a plethora of noisy occurrences distort the observable frequencies. The most common include: cardiac, pulse, respiratory, sweat, glossokinetic and eye-movement artifacts (blinks, lateral rectus spikes from lateral eye movement), as well as muscle and movement artifacts (EEG Intro).
Signal Types: From a signal processing perspective it is useful to categorise artifacts by oscillatory properties:
a) Spikes: Sporadic high-frequency changes in the signal. Often a consequence of movement, muscular impulses, or other irregular activity. Observable by substantial changes in signal variation.
b) Prolonged disturbances: Sustained electrical interference due to auxiliary cortical regions, external machinery, myogenic cardiac signals, etc.
Artifact Detection, Smoothing and Removal: We employ a number of artifact removal strategies, the most effective of which include:
- Wavelet transforms, GAMs and splines.
- Variance models.
- Fast Fourier Transforms (FFT).
- Dynamic Time Warping (DTW).
- Independent Component Analysis (ICA).
- Matrix Profiles (MP).
- Slow Feature Analysis (SFA).
Applied to EEG-SSL: These strategies — largely unsupervised — naturally improve with additional data. The latent space mapping can therefore include all available data, after which the learnt feature transformation is applied to downstream tasks.
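As an illustration of the variance-model idea, here is a hedged sketch that flags spike artifacts by their rolling variance. The window size, threshold multiplier and the synthetic trace are all illustrative choices, not tuned clinical values.

```python
import numpy as np

def flag_spikes(signal, window=50, k=5.0):
    """Flag samples whose trailing rolling variance exceeds k times the
    median rolling variance - a crude detector for movement/muscle spikes."""
    n = len(signal)
    var = np.array([signal[max(0, i - window):i + 1].var() for i in range(n)])
    threshold = k * np.median(var)
    return var > threshold

# synthetic trace: low-amplitude background with one injected movement burst
rng = np.random.default_rng(0)
eeg = rng.normal(0, 1.0, size=2000)
eeg[1200:1220] += rng.normal(0, 15.0, size=20)   # artifact burst

mask = flag_spikes(eeg)   # True around the burst, False elsewhere
```

Flagged segments can then be masked, interpolated, or handed to one of the heavier methods above (e.g. ICA) for removal rather than simple excision.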
Another application of EEG-SSL concerns hyperarousal. Hyperarousal (observable in raw EEG data) provides insight into an individual’s resting state, and large-data-based hyperarousal detection naturally lends itself to downstream models.
Particularly relevant in the realm of psychiatry, hyperarousal and spindles present as prolonged, elevated electrical signals. This phenomenon reveals sustained states of agitated mental stimulation (preventing sufficient rest, and often present in individuals with anxiety and depression).
This is where nuanced (or simply large) datasets become very powerful: the distinction between hyperarousal and artifacts can involve substantial overlap.
Again, we employ a series of statistical learning algorithms to flag signals of hyperarousal. Many are identical to those above (though with different permissible ranges), coupled with fitting statistical distributions to the data to quantify significant deviations from expected cortical behaviour.
EEG-SSL example: We leverage large unlabeled data to fit Gamma distributions to quantify the variability in cortical activity, and then use the associated mappings to predict elevated prolonged anxiety.
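A hedged sketch of this idea with scipy: fit a Gamma distribution to (synthetic) per-epoch band power drawn from the unlabeled pool, then flag new epochs beyond a high quantile of the fitted distribution. All numbers here are illustrative; in practice `baseline_power` would be band power computed from real recordings and the cut-off quantile would be chosen clinically.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

# per-epoch band power from the large unlabeled pool: strictly positive
# and right-skewed, a natural fit for a Gamma distribution
baseline_power = rng.gamma(shape=2.0, scale=1.5, size=10_000)

# fit a Gamma distribution to the pooled data (location pinned at zero)
shape, loc, scale = stats.gamma.fit(baseline_power, floc=0)

# flag epochs whose power exceeds the 99th percentile of the fitted
# distribution - candidate hyperarousal episodes
cutoff = stats.gamma.ppf(0.99, shape, loc=loc, scale=scale)

new_epochs = np.array([1.2, 3.0, 14.5, 2.2])   # hypothetical new recordings
flagged = new_epochs > cutoff                   # only the 14.5 epoch stands out
```

Because the distribution is fitted on unlabeled data alone, every new user tightens the estimate of "expected" cortical variability without any annotation effort.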
Big Data Domain
As datasets scale, many traditional statistical time-series techniques suffer from severe computational limitations. EEG-SSL is scaled by leveraging deep learning, using neural networks as efficient universal function approximators in data-rich environments.
The aforementioned techniques are able to “encode” medical and psychological information through theoretically appropriate hyper-parameter configurations. Much of this is lost in the deep learning domain. There is increasing evidence, however, that the missing medical and psychological knowledge can be compensated for by the nuanced information extracted from the raw data as datasets scale.
The conceptual methodology remains unchanged: we wish to extract meaningful (reduced) representations of the data, providing a suitable transformation to improve performance on downstream tasks.
Neural Network-based EEG-SSL Methods
Three broad approaches are generally employed:
- Data Compression
- Signal Segmentation
- Temporal Dependence
1. Data Compression
Recurrent neural nets (usually LSTMs) are regularly coupled with autoencoders to extract meaningful low dimensionality representations of variable-length sequences.
RNNs are used out of necessity, as traditional feed-forward networks require fixed input dimensions. LSTMs (a special case of RNNs) can represent both long- and short-range temporal dependencies, capturing the flow of information through time.
The autoencoder is then used to compress the data to its essential latent constituents. An autoencoder is comparable to PCA (which learns orthogonal linear projections), but with non-linear activation functions and far greater representational capacity.
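To make the PCA analogy concrete, here is a small sketch comparing a linear PCA reconstruction with a non-linear autoencoder (an MLP trained to reproduce its own input through a 4-unit bottleneck). The toy data and architecture are illustrative assumptions; a production model would be a sequence autoencoder as described above.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(0)

# toy multichannel data lying near a non-linear 4-D manifold
latent = rng.normal(size=(500, 4))
X = np.tanh(latent @ rng.normal(size=(4, 32))) + 0.05 * rng.normal(size=(500, 32))

# PCA: the best *linear* orthogonal projection onto 4 components
pca = PCA(n_components=4).fit(X)
X_pca = pca.inverse_transform(pca.transform(X))

# autoencoder: the same 4-D bottleneck, but non-linear encoder/decoder layers
ae = MLPRegressor(hidden_layer_sizes=(16, 4, 16), activation="tanh",
                  max_iter=5000, random_state=0)
ae.fit(X, X)                      # learn to reconstruct the input from itself
X_ae = ae.predict(X)

mse_pca = float(np.mean((X - X_pca) ** 2))
mse_ae = float(np.mean((X - X_ae) ** 2))
```

The middle hidden layer's activations serve as the compressed representation z; the comparison of reconstruction errors shows both methods capture far more structure than a mean predictor would.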
Here are some great references deploying these techniques:
- Audio Deep RNNs for feature extraction.
- Adaptive (variable length) Piecewise RNNs Autoencoders.
- Sequence Representations by Autoencoders.
2. Signal Segmentation
Another emerging trend is the reliance on generative models. Generative models (which include many autoencoder variants) allow us to capture assumptions about the data generating process.
We regularly assume that signals are generated by a series of basis functions (some global long-term trends; some local sporadic movements and some noise). Signal segmentation techniques aim to decompose signals into their constituents by learning the latent data generating process.
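A hedged sketch of this generative assumption on a synthetic signal: a slow global trend, a fast local oscillation and noise, separated here by a simple moving average as a crude stand-in for a learnt decomposition. The frequencies, window size and noise level are illustrative.

```python
import numpy as np

def decompose(signal, window=101):
    """Crude two-way decomposition: a slow global trend (moving average)
    and a local residual that contains fast structure plus noise."""
    kernel = np.ones(window) / window
    trend = np.convolve(signal, kernel, mode="same")
    local = signal - trend
    return trend, local

# synthetic signal = slow trend + fast local oscillation + noise
t = np.linspace(0, 10, 2000)
rng = np.random.default_rng(3)
signal = np.sin(0.5 * t) + 0.3 * np.sin(40 * t) + 0.1 * rng.normal(size=t.size)

trend, local = decompose(signal)   # trend tracks the slow component
```

A generative model replaces the fixed moving-average kernel with learnt basis functions, so the decomposition itself adapts as more unlabeled signal arrives.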
3. Temporal Dependence
Another, more novel, EEG-SSL implementation attempts to uncover temporal dependence in raw EEG data by pseudo-labelling sequences with a simple distance metric. Banville et al. propose an SSL algorithm that randomly samples EEG windows and labels pairs of them binarily according to their temporal distance. A (CNN) classifier is then trained to estimate which signals are close to one another.
The idea is that nearby signals share certain similarities. For example, when using EEG for sleep staging, temporal dependence exists because individuals gradually transition between states (Wake → N1 → N2 → N3 → REM, etc.). The SSL step can therefore be considered preprocessing: learning a latent representation that measures the ‘closeness’ of signal segments.
These features are used for downstream tasks (in this example, sleep staging), showing promising results.
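The pseudo-labelling step can be sketched in a few lines. This is a simplified version of the relative positioning task described above: the window count, pair count and positive threshold `tau_pos` are illustrative, and the full method also discards ambiguous mid-range pairs.

```python
import numpy as np

def relative_positioning_pairs(n_windows, tau_pos=3, n_pairs=1000, seed=0):
    """Sample random pairs of EEG window indices and pseudo-label them:
    1 if the two windows lie within tau_pos steps of each other (assumed
    'similar'), 0 otherwise. No human labels are required."""
    rng = np.random.default_rng(seed)
    i = rng.integers(0, n_windows, size=n_pairs)   # anchor window indices
    j = rng.integers(0, n_windows, size=n_pairs)   # paired window indices
    labels = (np.abs(i - j) <= tau_pos).astype(int)
    return i, j, labels

anchors, others, labels = relative_positioning_pairs(n_windows=500)
# the (anchor, other, label) triples then supervise a classifier (a CNN in
# the original work) that learns an embedding where temporal neighbours
# end up close together
```

Because positives are rare under random sampling, practical implementations typically balance the two classes before training the classifier.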
There’s a lot of information here! Let’s see if we can compress it to its latent constructs ;).
It is our belief, and experience, that combining purely data-driven interpolation with medically and psychologically informed inference produces the best results.
We wish to use additional unlabeled data to boost the performance of EEG based algorithms.
This is achieved in three steps:
1. Start with a labelled dataset X and additional unlabeled data D.
2. Learn medically relevant feature transformations z ~ f(X, D).
3. Apply the learnt transformation to the original set X to perform downstream tasks y ~ g(z(X)).
This is achieved by learning latent representations on ALL the available data. The additional information: (1) regularises the smaller labelled set, (2) produces more domain-general results, and (3) can be updated in light of new datasets.
Medical or Psychological Domain Expertise
We often use medical and psychological domain knowledge to define the context of the pre-training (SSL) tasks: an advantage seldom seen in the literature, as it is less applicable to CV and NLP applications. We also take advantage of the medical literature to guide hyper-parameterisation; hyperarousal detection is one example of this.
These methods allow us to maximise the utility of available data resources. Turning raw physiology into information.