Simplify exploring preprint publications regarding the coronavirus.
Jonas Dippel, Moritz Pfister, Michael Perk, Yannic Lieder, Eike Niehs
Collabovid provides an easy-to-use interface to access, sort, and classify the large number of research articles about the coronavirus.
Researchers all over the world are working hard to investigate the SARS-CoV-2 virus and the impact of the disease it causes. As a result, many new publications appear every day as so-called preprints, e.g. on medRxiv or bioRxiv. The usual publication process includes a (possibly long) peer review, in which other experts examine the content in detail before official publication. However, time is short, so a good interface to access, sort, and classify the large number of preprints is needed.
What it does
Several times a day, Collabovid searches well-known preprint servers for newly published research articles. In addition to the metadata, the content of every article is extracted. Machine learning techniques are used to analyze the publications and make them semantically searchable and comparable.
Our website offers the following features:
- List and access all available preprints regarding SARS-CoV-2 from medRxiv and bioRxiv.
- Sort and filter the preprints by publishing date, author name, title, keywords, and category.
- Show statistics about papers that match a given search query.
- Classify papers into given topics taken from the COVID-19 Open Research Dataset Challenge.
- Select one of the predefined topics and obtain a list of related papers.
- List papers that are related to a user-entered question and rate the resulting publications' relevance.
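The filtering, sorting, and statistics features listed above can be illustrated with a small, self-contained sketch. This is not the site's actual code (which runs on Django and PostgreSQL); the `Preprint` record, its field names, the helper functions, and the three sample entries are all made up for illustration.

```python
from collections import Counter
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Preprint:
    # Hypothetical record; the field names are illustrative only.
    title: str
    authors: tuple
    published: date
    category: str
    source: str  # e.g. "medRxiv" or "bioRxiv"


def filter_preprints(preprints, author=None, category=None, keyword=None):
    """Keep the preprints that match every criterion that was given."""
    result = []
    for p in preprints:
        if author and not any(author.lower() in a.lower() for a in p.authors):
            continue
        if category and p.category != category:
            continue
        if keyword and keyword.lower() not in p.title.lower():
            continue
        result.append(p)
    return result


def newest_first(preprints):
    """Sort by publishing date, most recent first."""
    return sorted(preprints, key=lambda p: p.published, reverse=True)


def count_by_source(preprints):
    """A simple statistic: number of matching preprints per server."""
    return Counter(p.source for p in preprints)


# Made-up sample records, not real publications.
papers = [
    Preprint("Modelling SARS-CoV-2 transmission", ("A. Smith",),
             date(2020, 3, 20), "Epidemiology", "medRxiv"),
    Preprint("Spike protein structure", ("B. Lee", "C. Wong"),
             date(2020, 3, 25), "Biochemistry", "bioRxiv"),
    Preprint("Transmission in households", ("C. Wong",),
             date(2020, 3, 22), "Epidemiology", "medRxiv"),
]

matches = newest_first(filter_preprints(papers, keyword="transmission"))
print([p.title for p in matches])
print(count_by_source(matches))
```

In a Django backend the same idea would more likely be expressed as database queries, so that filtering and sorting happen in PostgreSQL rather than in Python.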
How we built it
We built our website with a Python backend using Django and a PostgreSQL database. It is deployed on Amazon Web Services via AWS Elastic Beanstalk. We use natural language processing techniques to measure how closely the content of a paper relates to a given query. For semantic analysis, we combine a pretrained BERT model with an LDA model that is based on the great work by Daniel Wolffram during the COVID-19 Open Research Dataset Challenge (CORD-19).
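One way to picture how two models can be combined for ranking is a weighted blend of two similarity scores. The sketch below is an assumption about the general idea, not our actual pipeline: the vectors are toy values standing in for BERT embeddings and LDA topic distributions, and `combined_score`, `rank`, and the `weight` parameter are hypothetical names.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def combined_score(query, paper, weight=0.5):
    """Blend embedding similarity and topic similarity.

    `weight` is a hypothetical tuning knob: 1.0 uses only the embedding,
    0.0 uses only the topic distribution.
    """
    embedding_sim = cosine(query["embedding"], paper["embedding"])
    topic_sim = cosine(query["topics"], paper["topics"])
    return weight * embedding_sim + (1.0 - weight) * topic_sim


def rank(query, papers, weight=0.5):
    """Return paper titles ordered from most to least similar to the query."""
    scored = sorted(papers, key=lambda p: combined_score(query, p, weight),
                    reverse=True)
    return [p["title"] for p in scored]


# Toy vectors; real ones would come from the BERT and LDA models.
query = {"embedding": [1.0, 0.0, 0.0], "topics": [0.9, 0.1]}
papers = [
    {"title": "Paper A", "embedding": [0.0, 1.0, 0.0], "topics": [0.1, 0.9]},
    {"title": "Paper B", "embedding": [1.0, 0.0, 0.0], "topics": [0.9, 0.1]},
    {"title": "Paper C", "embedding": [1.0, 1.0, 0.0], "topics": [0.5, 0.5]},
]
print(rank(query, papers))
```

A linear blend is only one possible design; it is simple to tune with a single parameter, but other combinations (for example, filtering by topic first and ranking by embedding afterwards) would be just as plausible.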
For all of us, deploying a web application using AWS Elastic Beanstalk was a new experience. Though we had some time-consuming issues when setting it up, we learned a lot and are proud of getting it to work on time.
What we learned
To serve relevant papers for a user-entered question and to process the abstracts of the papers for machine learning, we had to dive into natural language processing. We also improved our knowledge of general web development and gained experience with deploying on AWS.
What's next for COVID-19 Publications
As of now, a paper is assigned to topics based solely on its abstract. A future version may consider the complete text of the paper and thereby achieve higher accuracy.
Furthermore, we plan to allow verified experts to evaluate and review the papers informally. These reviews may consist of short annotations and a rating of the paper's quality. This could start a discussion before a paper is officially peer-reviewed and give an indication of the quality of the articles.
Another plan is to include articles from preprint servers beyond medRxiv and bioRxiv. This will result in a more complete overview of the literature regarding SARS-CoV-2.
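To illustrate why the input text matters for topic assignment, here is a deliberately simple stand-in classifier. It uses word-count cosine similarity instead of the BERT and LDA models, and the topic names, topic descriptions, sample text, and `threshold` value are all hypothetical.

```python
import math
import re
from collections import Counter


def bag_of_words(text):
    """Lower-case the text and count its alphabetic tokens."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two token-count Counters."""
    dot = sum(count * b[token] for token, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def classify(text, topics, threshold=0.2):
    """Return the names of all topics whose description is similar enough to the text."""
    words = bag_of_words(text)
    return sorted(name for name, description in topics.items()
                  if cosine(words, bag_of_words(description)) >= threshold)


# Hypothetical topics; the real ones come from the CORD-19 challenge tasks.
topics = {
    "transmission": "transmission incubation spread",
    "vaccines": "vaccines therapeutics treatment",
}

abstract = "We estimate the incubation period and spread of the virus."
full_text = abstract + " We also discuss treatment options and candidate vaccines."

print(classify(abstract, topics))
print(classify(full_text, topics))
```

In this toy case the second topic is only mentioned outside the abstract, so it is found only when the longer text is used. Whether the full text improves accuracy in practice would have to be measured.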