Refactoring and randomness test

The first usable versions of our features detection code (findbeats.java) were full of hardwired constants and heuristics.
The code has now been modularized, it spreads several methods with clean condition of method exit.

We were proud that our design was able to look at each beat and heart sound, which is a far greater achievement than what ML code does usually. Something really interesting was how we used compression to detect heart sounds features automatically in each beat.
Now we introduce something similar in spirit: Until now sometimes our code was unable to find the correct heart rate if the sound file was heavily polluted by noise. Now we use a simple statistical test akin to standard deviation, to test the randomness of beats distribution. If it is distributed at random, then it means our threshold is too low: We detect noise in addition to the signal.
This helped us to improve the guessing of the heart rate.

In an unrelated area, we also started to work on multi-HMM, which means detecting several, concurrent features. An idea that we toy with, would be to use our compression trick, at beat level, whereas now it is used at heart sound level. This is tricky and interesting in the context of a multi-HMM. Indeed it makes multi-HMM​ more similar to unsupervised ML algorithms.

Multi HMM for heart sound observations

Up to now the feature detection has used something that I find funny, but it works really well. As we use Hidden Markov Models, we must create a list of “observations” for which the HMM infer a model (the hidden states). So creating trustable observations is really important, it is a design decision that those observations would be the “heart sounds” that cardiologists name S1, S1, etc..

In order to detect those events, we first have to find the heart beats, then find sonic events in each of them. In CINC/Physionet 2016 they use a FFT to find the the basic heart rate, and because a FFT cannot inform on heart rate variability, they compute various statistical indicators linked to heart rate variability.
And its not a very good approach as the main frequency of a FFT is not always the heart beat rate.
Furthermore this approach is useless at the heart beat level and indeed at heart sound level. So what we did, was to detect heart beats (which is harder that one could think) and from that point, we can detect heart sounds.

Having a series of observations that would consist only of four heart sounds, would not be useful at all. After all a Sn+1 heart sound, is simply the heart sound that comes after the Sn heart sound. We needed more information to capture and somehow pre-classify the heart sounds.

It was done (after much efforts) by computing a signature based somehow on a compressed heart sound. Compression is a much more funny thing that it might seem. To compress one has to reduce the redundant information as much as possible, which means that a perfectly compressed signal could be used as a token about this signal, and logical operations could be done with it.

Sometimes people in AI research fantasize that compression is the Graal of machine learning by making feature detection automatic. We are far from thinking that, as that in order to compress one has to understand how the information is structured, and automatic feature detection implies that we do not know its structure.

It is the same catch-22 problem that the Semantic Web met 10 years ago, it can reason on structured data but not on unstructured data, and the only thing that would have been a real breakthrough was reasonning on unstructured data. That is why now we have unsupervised Machine Learning with algorithms like Deep Forest. While Cinc 2016 submissions used heavily unsupervised ML, we used compression (Run Limited Length) to obtain a “signature” of each heart sound, and it works surprisingly well with our HMM.

The next step is to implement a Multi-HMM Approach​, because there are other possibilities to pre-categorize our heart sounds than its RLL signature, for example the heart sound might be early or late and that characteristic could be used to label it.