What is event-based surveillance?
The goal of event-based surveillance is to detect unusual events that might signal an outbreak. Event-based public health surveillance looks at reports, stories, rumors, and other information about health events that could be a serious risk to public health.
Main categories of public health surveillance
There are two main categories of public health surveillance.
• Event-based surveillance draws on sources such as reports in the media or rumors on an internet blog. Such information is unstructured.
• Indicator-based surveillance, in contrast, is the more traditional way of reporting diseases: health care providers report specific diseases to public health officials. Such information may be described as structured because it is standardized.
Examples of event-based health surveillance
WHO’s global surveillance system picks up public health threats 24 hours a day, 365 days a year. Once an event is verified, WHO assesses the level of risk and sounds the alarm. Within 48 hours of an emergency, WHO grades the severity of the event, activates the incident management system and deploys field teams.
The Global Public Health Intelligence Network (GPHIN) is an early warning system for potential public health threats worldwide. It is a 24/7 curated situational awareness service from Canada.
The Medical Information System (MedISys) is a fully automatic 24/7 public health surveillance system run and maintained by the Joint Research Centre (JRC) of the European Commission.
ProMED is a program of the non-profit International Society for Infectious Diseases.
Design of EventEpi at the Information Centre for International Health Protection
The Information Centre for International Health Protection (Informationsstelle für Internationalen Gesundheitsschutz, INIG) at the Robert Koch Institute (RKI) performs event-based surveillance to identify events relevant to public health in Germany. Its routine tasks include reading online articles from a defined set of sources, evaluating them for relevance, and then manually filling a spreadsheet with information from the relevant articles. This spreadsheet is called the Ereignisdatenbank (IDB). Reading the articles, discussing their relevance, and putting key information into a database is a time-consuming process.
To support event-based surveillance, but also to gain insights into what makes an article and the event it describes relevant, the authors of “EventEpi–A Natural Language Processing Framework for Event-Based Surveillance” developed a natural-language-processing framework for automated information extraction and relevance scoring.
Their approach consists of two complementary parts: key information extraction and relevance scoring. Both parts are integrated in a web application called EventEpi. With the exception of the convolutional neural network, for which they used Keras, the authors implemented the machine learning algorithms with the Python package scikit-learn.
The IDB had to be preprocessed before any application of NLP. It was not designed to be used with machine learning algorithms and thus contained some inconsistencies that might not disturb human users but had to be resolved before machine processing. For example, a case count could be stored as a string (such as a written-out number) instead of as numerical digits. Other entries had inconsistent naming schemes. In addition, entries in the IDB were written in German, but the output of EpiTator is in English, so the two could not be compared directly.
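The kind of cleanup described above can be sketched in a few lines of Python. This is only an illustration under assumptions: the function name `normalize_case_count`, the small table of German number words, and the sample inputs are hypothetical and not taken from the EventEpi code base.

```python
import re

# Hypothetical lookup of German number words; a real table would be much larger.
GERMAN_NUMBER_WORDS = {
    "null": 0, "eins": 1, "zwei": 2, "drei": 3, "vier": 4, "fünf": 5,
    "sechs": 6, "sieben": 7, "acht": 8, "neun": 9, "zehn": 10,
}


def normalize_case_count(raw):
    """Return the case count as an int, or None if it cannot be parsed."""
    text = raw.strip().lower()
    # Case 1: the whole entry is a written-out number word.
    if text in GERMAN_NUMBER_WORDS:
        return GERMAN_NUMBER_WORDS[text]
    # Case 2: digits, possibly with thousands separators ("1.500", "1 500").
    # Remove a separator only when it sits between a digit and a 3-digit group.
    digits = re.sub(r"(?<=\d)[.,\s](?=\d{3}\b)", "", text)
    match = re.search(r"\d+", digits)
    return int(match.group()) if match else None


# Hypothetical spreadsheet cells and their normalized values.
examples = ["1.500", "drei", "ca. 30", "unbekannt"]
normalized = [normalize_case_count(cell) for cell in examples]
```

Entries that cannot be parsed come back as `None`, so they can be reviewed by hand rather than silently guessed.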
The authors performed named entity recognition in two steps. EpiTator, an open-source epidemiological annotation tool, was first run on articles from relevant sources and suggested many different candidates for the following entities: disease, country, date, and confirmed-case count.
To accomplish the key information extraction, two problems needed to be solved:
- First, the output of EpiTator needed to be comparable to the entries in the IDB.
- Second, and more importantly, the output of EpiTator needed to be filtered. A naive approach to finding the key entity among all the entities returned by EpiTator is to pick the most frequent one. This approach worked well for detecting the key country and disease, but not for the key date and confirmed-case count. For those, the authors developed a learning-based approach.
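The most-frequent-entity heuristic mentioned above is simple enough to sketch directly. The helper name `most_frequent_entity` and the candidate lists are hypothetical; in practice the candidates would come from the annotation tool.

```python
from collections import Counter


def most_frequent_entity(candidates):
    """Return the most frequent candidate, or None if there are none."""
    if not candidates:
        return None
    # most_common(1) returns [(entity, count)]; ties go to the first one seen.
    return Counter(candidates).most_common(1)[0][0]


# Hypothetical candidates suggested for a single article.
countries = ["Nigeria", "Nigeria", "Germany", "Nigeria", "Ghana"]
diseases = ["Lassa fever", "malaria", "Lassa fever"]

key_country = most_frequent_entity(countries)
key_disease = most_frequent_entity(diseases)
```

A plausible reason the heuristic breaks down for dates and case counts is that an article often mentions many different dates and numbers only once each, so frequency carries little signal for them.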
For the key date and confirmed-case count, the authors trained a naive Bayes classifier to find the most likely key entities among the candidates suggested by EpiTator.
The second part of developing a framework to support event-based surveillance (EBS) was to estimate the relevance of epidemiological articles. The authors framed the relevance evaluation as a classification problem. For relevance scoring, they defined two classes to which any article might belong:
- The article is relevant if it is in the event-based surveillance database.
- Irrelevant otherwise.
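The labeling rule above amounts to a membership test. A minimal sketch, assuming articles are identified by their URL (an assumption made here for illustration; the function name and the example.org URLs are hypothetical):

```python
def label_articles(article_urls, idb_urls):
    """Map each article URL to 1 (relevant) if it appears in the IDB, else 0."""
    idb = set(idb_urls)  # a set makes each membership check O(1)
    return {url: int(url in idb) for url in article_urls}


# Hypothetical scraped articles and IDB entries.
scraped = [
    "https://example.org/outbreak-a",
    "https://example.org/outbreak-b",
    "https://example.org/outbreak-c",
]
in_idb = ["https://example.org/outbreak-b"]

labels = label_articles(scraped, in_idb)
```

The resulting labels can then serve as training targets for a binary classifier.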
Two sources stood out as being both relevant and easy to scrape:
- World Health Organization Disease Outbreak News (WHO DON)
- ProMED Mail.
The authors compared the performance of different classifiers, using document and word embeddings. State-of-the-art text classifiers tend to use word embeddings for vectorization rather than the tf-idf and bag-of-words approach. Word embeddings are vector representations of words that are learned on large amounts of text in an unsupervised manner. Proximity in the word embedding space tends to correspond to semantic similarity. The researchers compared six different classifiers for the relevance scoring task. Two of the tested algorithms stood out:
- The multilayer perceptron performed best overall.
- The support-vector machine, on the other hand, had the highest recall (0.88), which can be of greater interest to epidemiologists, since missing a relevant article is usually worse than reading an irrelevant one.
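To make the role of recall concrete, the sketch below computes precision and recall for two hypothetical classifiers on made-up labels. None of the numbers come from the paper; the names `cautious` and `sensitive` are illustrative only.

```python
def precision_recall(y_true, y_pred):
    """Return (precision, recall) for binary labels where 1 means relevant."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Made-up ground truth: 4 relevant articles out of 10.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
# A cautious classifier flags few articles and misses two relevant ones.
cautious = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
# A sensitive classifier finds all relevant ones at the cost of two false alarms.
sensitive = [1, 1, 1, 1, 1, 1, 0, 0, 0, 0]

cautious_scores = precision_recall(y_true, cautious)
sensitive_scores = precision_recall(y_true, sensitive)
```

Recall is the share of truly relevant articles that the classifier finds, so a high-recall model trades some extra reading for fewer missed events.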
Finally, the authors integrated these functionalities into a web application called EventEpi where relevant sources are automatically analyzed and put into a database. The same fundamental issues encountered in using machine learning in general apply here as well, in particular bias and explainability.
Tackling individual biases and personal preferences during labeling by experts is essential. It will also be important to show why EventEpi extracted certain information or computed a particular relevance score, so that epidemiologists can adopt it but also critically assess and improve it.
At the moment, EventEpi only presents results to the user. However, it could be expanded into a general interface to an event database that allows epidemiologists to note which articles were indeed relevant and to correct key information, an approach called active learning.
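One common way to decide which articles to show an expert first is uncertainty sampling, sketched below. This is a swapped-in illustration of the general idea, not a feature of EventEpi; the function name `select_for_review`, the article ids, and the scores are hypothetical.

```python
def select_for_review(scores, k):
    """Return the k article ids whose relevance score is closest to 0.5."""
    # A score near 0.5 means the model is least sure about the article.
    ranked = sorted(scores, key=lambda article_id: abs(scores[article_id] - 0.5))
    return ranked[:k]


# Hypothetical relevance scores in [0, 1] produced by a classifier.
scores = {"a": 0.97, "b": 0.52, "c": 0.08, "d": 0.41, "e": 0.75}

review_queue = select_for_review(scores, 2)

# Hypothetical expert feedback, to be added to the training data
# before the classifier is retrained.
expert_labels = {"b": 1, "d": 0}
```

Labels collected this way would be appended to the training set, so each round of expert review targets the cases where feedback is most informative.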
The overall framework can be used in production and promises improvements in event-based surveillance. The source code is publicly available at https://github.com/aauss/EventEpi