Heart beat detection and segmentation

This is a quick description of the feature identification and segmentation algorithm used in our early detection of heart failure project.

1) A sound file basically consists of a float array and a sampling rate.
2) The sound is normalized in amplitude (though we can do without it) and resampled to a common rate (2000 samples per second); a short sketch follows this list.
3) Contrary to what is done in Physionet 2016, there is no filtering or elimination of "spikes".
4) The cardiac rhythm is detected through events that are roughly the S1 events (FindBeats.java). This is not trivial, as there are noises, spikes, "plops", respiration, abnormal heart sounds, human speech and, in one case, even dog barks! Basically there are two ways to achieve beat detection: one is to find the frequency of the envelope of the signal, the other is to compute an FFT. But both approaches are subjective, as they imply deciding beforehand what an acceptable heart rate is. What does not help is that the heart rate can roughly go from 30 to 250 beats per minute, so when we detect three or four main frequencies in that range in the FFT of the sound file, which one is the correct one? One cannot decide in advance. What we do instead is a two-step, indirect approach that fails quickly if we make a wrong guess, so it can converge quickly toward a result:
* In the first step we try to estimate the heart beat duration independently of the heart rate (because there are noise sources that make heart rate counting unreliable).
* We use this first estimation to make a better guess of the heart rate in the second step. If the guess gives obviously wrong results, we change the threshold used for detecting the heart rate.
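A minimal sketch of step 2, assuming a simple peak normalization and linear-interpolation resampling (class and method names are illustrative, not the project's actual code):

public final class SoundPreprocessor {

    public static final int TARGET_RATE = 2000; // samples per second

    // Scales the signal so its maximum absolute amplitude is 1 (silent input is returned unchanged).
    public static float[] normalizeAmplitude(float[] samples) {
        float max = 0f;
        for (float s : samples) {
            max = Math.max(max, Math.abs(s));
        }
        if (max == 0f) {
            return samples.clone();
        }
        float[] out = new float[samples.length];
        for (int i = 0; i < samples.length; i++) {
            out[i] = samples[i] / max;
        }
        return out;
    }

    // Naive linear-interpolation resampling to TARGET_RATE; a real implementation
    // would low-pass filter before downsampling to avoid aliasing.
    public static float[] resample(float[] samples, int sourceRate) {
        int outLength = (int) ((long) samples.length * TARGET_RATE / sourceRate);
        float[] out = new float[outLength];
        for (int i = 0; i < outLength; i++) {
            double srcPos = (double) i * sourceRate / TARGET_RATE;
            int i0 = (int) srcPos;
            int i1 = Math.min(i0 + 1, samples.length - 1);
            double frac = srcPos - i0;
            out[i] = (float) ((1.0 - frac) * samples[i0] + frac * samples[i1]);
        }
        return out;
    }
}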

4-1) We seek the optimal detection level (threshold) that allows detecting the heart beat rate in this sound file.
4-1-1) We create a window on the sound file that is heavily low-pass filtered (currently at 200Hz)
4-1-2) We register the times of downward passages through the current threshold level
4-1-3) If a convincing heart rate (not necessarily exact) is obtained, the detection level is kept, otherwise it is lowered and the procedure is repeated.
4-2) With this threshold, we try to obtain a cardiac rhythm with roughly the same procedure as above, but with a window filtered at 1000Hz and in a wide interval around the heart rate obtained in the first step.
4-2-1) We create a sound window which is low-pass filtered (currently at 1000Hz)
4-2-2) A downward passage through the threshold is detected, and ad hoc margins are imposed between the start and the end of the event.
4-2-3) If a convincing cardiac rhythm is obtained, the heart rate is kept, otherwise the detection level is lowered and we start again (the threshold search is sketched just below)
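A hedged sketch of this threshold search, assuming the signal has already been low-pass filtered; the names (downwardCrossings, searchThreshold, SAMPLE_RATE) are illustrative and not the actual FindBeats.java API:

import java.util.ArrayList;
import java.util.List;

public final class BeatThresholdSearch {

    static final int SAMPLE_RATE = 2000;          // samples per second after resampling
    static final double MIN_BPM = 30, MAX_BPM = 250;

    // Times (in samples) where the filtered signal crosses the threshold downward.
    static List<Integer> downwardCrossings(float[] filtered, double threshold) {
        List<Integer> times = new ArrayList<>();
        for (int i = 1; i < filtered.length; i++) {
            if (filtered[i - 1] >= threshold && filtered[i] < threshold) {
                times.add(i);
            }
        }
        return times;
    }

    // Lowers the threshold until the median crossing interval implies a plausible heart rate.
    static double searchThreshold(float[] filtered, double startThreshold) {
        double threshold = startThreshold;
        while (threshold > 0.01) {
            List<Integer> times = downwardCrossings(filtered, threshold);
            if (times.size() >= 3) {
                double medianInterval = medianInterval(times);   // in samples
                double bpm = 60.0 * SAMPLE_RATE / medianInterval;
                if (bpm >= MIN_BPM && bpm <= MAX_BPM) {
                    return threshold;                            // convincing rate: keep this level
                }
            }
            threshold *= 0.9;                                    // otherwise lower it and retry
        }
        return -1;                                               // no plausible rate found
    }

    private static double medianInterval(List<Integer> times) {
        double[] intervals = new double[times.size() - 1];
        for (int i = 1; i < times.size(); i++) {
            intervals[i - 1] = times.get(i) - times.get(i - 1);
        }
        java.util.Arrays.sort(intervals);
        return intervals[intervals.length / 2];
    }
}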

So we have here an approach which is very progressive, yet delivers results in a short time. It is also quite insensitive to sound events that could derail the heart beat counting: the first step gives a good indication of the region where the real heart beat lies, so one spike may cause the closest S1 to be missed, but we still get a good estimate of the heart beat duration.
5) In addition to S1, many other events are detected. A priori we assume that these are the other events S2, S3, S4. Even when their number goes beyond four, this is useful for classifying unusual heart sounds.
6) These two sets of event detections (S1 and the other Sx) are then reconciled:
6-1) The list of S1 events is made more reliable, which makes it possible to deduce the events S2, S3, S4 (Segmentation.java).
6-2) A signature of the heart beat is computed, acknowledging that there is more to a heart beat than its time of arrival and duration. We tested several schemes and decided to use a Huffman compression of the heart beat (a rough sketch follows this list). We also had the idea of using this for yet another kind of feature detection without training, but it is not implemented at the moment for lack of resources.
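The signature in 6-2 can be sketched roughly as follows. As a stand-in for the actual Huffman compression, this computes the Shannon entropy of the quantized beat, which approximates the average Huffman code length in bits per sample; the class name and parameters are illustrative only:

public final class BeatSignature {

    // Quantizes the beat into 'levels' bins and returns the entropy in bits per sample,
    // an approximation of the average Huffman code length for those quantized values.
    public static double entropySignature(float[] beat, int levels) {
        int[] histogram = new int[levels];
        for (float s : beat) {
            // assumes the beat samples are normalized to [-1, 1]
            int bin = (int) ((s + 1.0) / 2.0 * (levels - 1));
            bin = Math.max(0, Math.min(levels - 1, bin));
            histogram[bin]++;
        }
        double entropy = 0.0;
        for (int count : histogram) {
            if (count > 0) {
                double p = (double) count / beat.length;
                entropy -= p * Math.log(p) / Math.log(2.0);
            }
        }
        return entropy;
    }
}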
7) From there one can either train an HMM or classify. This part is no longer cardiac-specific; it is just an HMM.
8) The classification made by the HMM is then interpreted, with a similarity score and comments on the Sx events.

Another difference with Physionet 2016 is that they add a second approach, the variability of the cardiac rhythm, for which they compute a lot of indicators. The HMM does this more accurately, with transition probabilities between internal states that can be explained, instead of a scalar computed over the whole file.

In a future version, I would like to work on the frequency peaks that can be identified with an FFT.
The general idea would be to look for peaks that are "apparently" harmonics but in fact indicate echoes along different paths.

Heart sound file analysis

Not much to show, but some news:
Sound files have problems that I did not anticipate. What I was expecting, from the analysis of the Physionet 2016 submissions, was noise, spikes, weird amplitudes and similar distortions of the signal.
What I found was different: there is little noise once you filter a bit, and there are few spikes.
However, the signal is sometimes biased (more negative values than positive), and it also appears to have little in common with textbooks: I can easily detect S1 and S2 events, but it is difficult to find S3 and S4.

When you listen to the sounds, half of them sound weird; I am not a cardiologist, but I find it difficult to recognize what I would hear in a "textbook" heart sound.
This makes me think again about Physionet 2016: successful submissions were mainly about heavy filtering, dealing with spikes with sophisticated algorithms, and finding characteristics (features, in ML slang) that encompass the whole file, such as RR variability, as in:

https://en.wikipedia.org/wiki/Heart_rate_variability

Clearly my approach is different: I focus on what identifies a heart beat, which is entirely new. But I still plan to implement the RR variability analysis and tie it to my HMM classifier, which will become quite hybrid in the process.
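A minimal sketch of the planned RR-variability features, assuming the S1 arrival times from the beat detection are available; SDNN and RMSSD are the classic time-domain indicators, and the class name is illustrative:

public final class RrVariability {

    // RR intervals in seconds, from S1 arrival times in seconds.
    public static double[] rrIntervals(double[] s1Times) {
        double[] rr = new double[s1Times.length - 1];
        for (int i = 1; i < s1Times.length; i++) {
            rr[i - 1] = s1Times[i] - s1Times[i - 1];
        }
        return rr;
    }

    // SDNN: standard deviation of the RR intervals.
    public static double sdnn(double[] rr) {
        double mean = 0;
        for (double v : rr) mean += v;
        mean /= rr.length;
        double variance = 0;
        for (double v : rr) variance += (v - mean) * (v - mean);
        return Math.sqrt(variance / rr.length);
    }

    // RMSSD: root mean square of successive RR differences.
    public static double rmssd(double[] rr) {
        double sum = 0;
        for (int i = 1; i < rr.length; i++) {
            double d = rr[i] - rr[i - 1];
            sum += d * d;
        }
        return Math.sqrt(sum / (rr.length - 1));
    }
}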

Early and low cost detection of Heart Failure

Six months ago we registered a new project on Hackaday.io and some other places.
https://hackaday.io/project/19685-early-and-low-cost-detection-of-heart-failure

The idea was to detect the heart failure condition early, because it is a condition that affects most of us as we age, and there is a lot of material online thanks to various challenges on this subject, like this one:
https://physionet.org/challenge/2016/

To create a proof of concept we used a low cost fetal Doppler ($50) and a Linux box, and were able to record heart sounds on an adult without using gel. So one of the requirements for medical devices was fulfilled: to be ready to use in seconds.

In most medical devices there is an implicit requirement: to make the output understandable, the device must offer an explanation of the medical statement. So using a black-box ML model à la Kaggle is out of the question.

In heart sound competitions like Physionet 2016, HMMs are trained to create a statistical model of the heart sounds of some condition. An HMM can "explain" its internal model by showing the probability of appearance of each state, for example the probability of arrival of an S2 sound at some time after an S1 sound in a particular sequence of heart sounds.
So an HMM model can be used to classify a new sequence of heart sounds as either quite similar to the trained model or not.
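A hedged illustration of why an HMM is readable in this way: with named states (S1, systole, S2, diastole), the transition matrix can be inspected directly, e.g. the expected dwell time in a state with self-transition probability p is 1 / (1 - p) time steps. The numbers and the 50 frames-per-second observation rate below are made up for illustration, not taken from the actual model:

public final class HmmExplanation {

    static final String[] STATES = { "S1", "systole", "S2", "diastole" };

    // transition[i][j] = P(next state = j | current state = i); each row sums to 1
    static final double[][] TRANSITION = {
        { 0.70, 0.30, 0.00, 0.00 },   // S1 -> S1 or systole
        { 0.00, 0.90, 0.10, 0.00 },   // systole -> systole or S2
        { 0.00, 0.00, 0.65, 0.35 },   // S2 -> S2 or diastole
        { 0.05, 0.00, 0.00, 0.95 },   // diastole -> diastole or back to S1
    };

    public static void main(String[] args) {
        double stepSeconds = 1.0 / 50;   // assumed observation rate of 50 frames per second
        for (int i = 0; i < STATES.length; i++) {
            double expectedSteps = 1.0 / (1.0 - TRANSITION[i][i]);
            System.out.printf("%-8s expected duration ~ %.0f ms%n",
                    STATES[i], expectedSteps * stepSeconds * 1000);
        }
    }
}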

One might ask why not use deep learning, as it seems to have made great strides recently and very nice software such as TensorFlow is available.
There is a big internal difference between ML using CNNs à la TensorFlow and ML using HMMs: in "an ideal world" a CNN finds its features without human intervention, whereas an HMM needs each observation to be "tagged" with some human knowledge through a Viterbi or similar function. The tagging is part of what makes the resulting model understandable; however, automatic tagging (as in unsupervised learning) is indeed hard.

In truth there is a similarity between the design of successful CNNs and HMMs: both rely on a cost function. However, CNN cost functions do not create meaning.
Designing the cost function of a CNN or the Viterbi function of an HMM is the most important part of any ML setup. All the claims we hear about the effectiveness of ML come from the design of those functions, not from some fancy ML algorithm.
It is a very hard job, far beyond the state of the art.

To circumvent this problem, most ML proposals use another ML setup to create the cost function, as in most Physionet 2016 entries or in a recent, highly regarded article in the domain of skin cancer detection: http://www.nature.com/nature/journal/v542/n7639/abs/nature21056.html.

Indeed, if one uses ML to create the cost function, the resulting model becomes highly opaque, and medical policy makers, scientists or specialists will find it useless or even dangerous.

In the long term this practice of using an ML-derived cost function will be discouraged, but I suppose this is part of the current hype curve around "deep learning". It is worse with small-signal ML such as the Deep Forest algorithm, where it becomes impossible (today) to reverse engineer the ML model by perturbing it. In addition, deep learning cannot be done with a $50 device; it requires huge computing facilities.

So we created our own Viterbi function for our HMM, and it is quite efficient while being quite simple. The next steps are to improve it, make it more informative, and move it from the Linux box to a microcontroller. Stay tuned.

AIDA as a tool to assist and audit the science process

DARPA’s AIDA
DARPA's new AIDA program may (also) help provide a better understanding of science publications and results, by helping to separate interesting from irrelevant data.

Information complexity has exceeded the capacity of scientists to glean meaningful or actionable insights, and doing science becomes more difficult as time passes. World-class experts in one field will not understand statements made by scientists even slightly outside their field.
The situation is even worse for other science stakeholders, such as scientific managers and policy makers, who have a long-standing interest in developing and maintaining a strategic understanding and evaluation of scientific activity, field landscapes, and trends. Information obtained from scientific publishing is often analyzed without its context. Often, because of the complexity and superabundance of information, independent analysis results in interpretations that may be inaccurate.
It would be interesting to overcome the noisy, and often conflicting, assertions made in today's scientific publishing environment through common tooling. Some efforts have already been made, for example the excellent Galaxy tool in biology and, to a lesser extent, notebook interfaces like Jupyter in coding. Another interesting trend is pre-print activity, which helps share with other scientists information unsuitable for publication.
DARPA's AIDA program aims to create technology capable of aggregating and mapping pieces of information. AIDA may provide a multi-hypothesis "semantic engine" that would automatically mine multiple publishing sources, extract their common foreground assertions and background knowledge, and then generate and explore multiple hypotheses that interrogate their true nature and implications.
The AIDA program hopes to determine a confidence level for each piece of information, as well as for each hypothesis generated by the semantic engine. The program will also endeavor to digest and make sense of information or data in its original form and then generate alternate contexts by adjusting or shifting variables and probabilities in order to enhance accuracy and resolve ambiguities in line with real-science expectations.
Even structured data can vary in the expressiveness, semantics, and specificity of their representations. AIDA has the potential to help scientists and science decision makers refine their analyses so that they are more in line with the larger and more complete overall context, and in doing so achieve a more thorough understanding of the elements and forces shaping science.

Low cost, non-invasive Continuous Glucose Monitoring utilizing Raman spectroscopy

A high quality, low cost, non-invasive Continuous Glucose Monitor (CGM), based mainly on Raman spectroscopy, is presented. In addition, a number of sensors provide information about the patient's context. The CGM re-calibrates itself automatically.

Designing non-invasive continuous glucose monitoring (CGM) is an incredibly complex problem that presents a number of challenging medical and technological hurdles. It is said that around 70 companies have tried to bring non-invasive glucose monitoring devices to market, without any success.

Quality in our CGM proposal comes from the number of technologies used to increase measurement precision. Understanding the biological operating context makes it possible to accurately predict glucose values.

More information here: glucose_monitor

Analysing eyes’ biomarkers at home with passive infrared radiation.

Currently there is no portable device that can check for diseases of the aging eye such as: glaucoma, age-related macular degeneration, diabetic retinopathy, Alzheimer's disease, cataract, clinically significant macular edema, keratoconjunctivitis sicca (dry eye disorder), Sjogren's syndrome, retinal hard exudates, ocular hypertension, and uveitis.

We propose a portable device which, when placed in front of one eye but without any physical contact, analyzes its natural infrared spectrum in order to detect molecules that reveal a potential medical condition. If a biomarker is detected, the device asks the user to consult a medical doctor, with an indication of urgency but without disclosing any medical information. The doctor, on the other hand, can securely access a wealth of information without needing a dedicated device.

The medical doctor proposes this tool to the patient, and remains in control of the device and of the relationship she has with her patient.

More information here: passive_eye_care

Mesosphere light scattering as a cell tower substitute.

Modern wireless technology cannot transmit energy and information with a good enough SNR over 80 km and beyond the Earth's curvature, in portable low cost devices, under current regulations.

We propose a very different approach based on astronomy technology: a laser emits light vertically and generates a luminous dot at high altitude (similar to astronomy's guide stars), and this light is detected at very long distance. By modulating the luminosity of this guide star, it is possible to transmit information. This technology works even if the sky is cloudy, and in daylight.

There is no need to build any network infrastructure. Each cell in a field can reach the base station even at 80 km. The cost per field station is less than $9,000. Field stations can be moved at will.

More information here: base_station_for_deserts

What becomes of old biology tools?

One of my colleagues remarked today that a lot of old biology software tools and libraries are designed by academics and abandoned as soon as their interest (and funding) switches to something else.
So it is very dangerous for a community or a business to rely on this kind of tool, as when something goes wrong there is no expert in sight ready to offer a helping hand.
I wondered whether something could be done to improve this situation. At the very least, a list of such abandoned biology software could be maintained.

Ontologies are not magic wands

Some 15 years ago, ontologies were the big thing. Getting an EU project financed was easy if ontologies and semantics were mentioned as primary goals.
Now that time is gone, except in biology, where ontologies are still used, often in a very different way from what they were originally intended for in the good old days of the "Semantic Web".

More specifically, a common biology research activity is to measure the expression of proteins in two situations, for example in healthy people and in patients. The difference between the two sets of measurements is then assessed, and the proteins and the genes that are activated in the illness situation are suspected to be possible targets for a new drug.

Gene differential expression is the biological counterpart of machine learning in CS: a one-size-fits-all problem-solving methodology.

In practice those differentially expressed genes are rarely possible targets for a new drug, as each protein and gene is implicated in so many pathways. So, instead of refining the experiment to find genes that are implicated in fewer pathways, a gene "enrichment" step is launched. "Enrichment" involves querying an ontology database to obtain a list of genes/proteins that are related to the differentially expressed genes, and that are hopefully easier targets for putative drugs.

Here there are two problems.
* The first is the choice of the ontology. For example, there is an excellent one named Uniprot, but there are some awful yet often preferred choices, like Gene Ontology, which gives dozens of results where Uniprot gives one. Indeed, if you have only one result after "enrichment" and you are in a hurry, you are not happy, so the incentive to switch to Gene Ontology is strong.
* The second problem arises when the result set comprises several hundred genes/proteins. Obviously this is useless, but instead of trying to design a better experiment, people thought that some statistical criterion would sort the result set and find a credible list of genes. This led to the design of parameter-free tools such as GSEA. Very roughly, these tools compare the statistical properties of the result set with those of a random distribution of genes (see the sketch after this list); if they are very different, the conclusion is that the result set is not random, which does not tell much more than that. This is similar and related to the criticism of Fisher tests, p-values and the null hypothesis, which is a complicated domain of knowledge.
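A hedged toy sketch of the permutation logic behind such tools: the observed score of a gene set (here simply its mean expression change) is compared with the scores of randomly drawn gene sets of the same size, and the fraction of random sets scoring at least as high gives an empirical p-value. This illustrates the principle only, not the actual GSEA statistic:

import java.util.Random;

public final class EnrichmentPermutationTest {

    // allScores: per-gene scores; setIndices: indices of the gene set of interest.
    public static double empiricalPValue(double[] allScores, int[] setIndices,
                                         int permutations, long seed) {
        double observed = meanOf(allScores, setIndices);
        Random random = new Random(seed);
        int atLeastAsHigh = 0;
        for (int p = 0; p < permutations; p++) {
            int[] randomSet = new int[setIndices.length];
            for (int i = 0; i < randomSet.length; i++) {
                randomSet[i] = random.nextInt(allScores.length);  // sampling with replacement, for brevity
            }
            if (meanOf(allScores, randomSet) >= observed) {
                atLeastAsHigh++;
            }
        }
        return (atLeastAsHigh + 1.0) / (permutations + 1.0);
    }

    private static double meanOf(double[] scores, int[] indices) {
        double sum = 0;
        for (int i : indices) sum += scores[i];
        return sum / indices.length;
    }
}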

These tools are very smart, but the best tool cannot provide meaningful answers from garbage, so disputes soon arose about the choice of the parameter-free methodology, instead of questioning the dubious practices that made it needed in the first place.

 

When PBPK simulators do not reflect modern physiology

PBPK simulators use a compartmental approach, where fluids are transferred between compartments and transformed inside them.

It is a very mechanistic approach, and a successful one, but it ignores many important aspects of mammalian biology, such as the influence of the genome on health, or the signaling between cells and throughout the organism, for example via the immune system.
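A minimal illustration of the compartmental idea: two well-mixed compartments exchanging a substance through first-order transfers, integrated with a simple Euler step. Real PBPK simulators use many organ compartments, physiological flow rates and partition coefficients; all values below are arbitrary and purely illustrative:

public final class TwoCompartmentModel {

    public static void main(String[] args) {
        double blood = 1.0, tissue = 0.0;   // amounts, arbitrary units
        double kBloodToTissue = 0.10;       // transfer rate constants, 1/min
        double kTissueToBlood = 0.05;
        double kElimination = 0.02;         // elimination from blood, 1/min
        double dt = 0.1;                    // time step in minutes

        for (int step = 0; step <= 600; step++) {   // simulate one hour
            double t = step * dt;
            if (step % 100 == 0) {
                System.out.printf("t=%3.0f min  blood=%.3f  tissue=%.3f%n", t, blood, tissue);
            }
            double toTissue = kBloodToTissue * blood * dt;
            double toBlood = kTissueToBlood * tissue * dt;
            double eliminated = kElimination * blood * dt;
            blood += toBlood - toTissue - eliminated;
            tissue += toTissue - toBlood;
        }
    }
}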

Even illness, or simply the unhealthy human, is not implemented in the models; rather, these are "cases" that are hard-wired in the software.

It is well known that the model needs to be separated from the simulator, in order to make it possible to change some parameters, or even the whole model, at will. Every CellML or SBML simulator offers that kind of functionality.

The same goes for genetic information: not only should it be taken into account, it should also be separated out and accessible as its own set of portable data. I do not know how the SBML format would make that possible.

Cell or organism signaling should also be assigned to a distinct set of portable data. We already have something similar for fluids in our current simulator's PoC: they are separated into a distinct XML file, something unfortunately not standardized.

Therefore we have to think about how fluids, genetic information (and variants), as well as signaling and health, will be taken into account in future versions of our simulator's PoC.

In addition we have to offer a multi-faceted GUI: for example, a human diabetic model and a dysfunction of insulin production are nearly the same thing, but they are different ways of discussing it and not exactly the same topic.