Resources
Some datasets (i.e., gold-standard corpora) that can be publicly distributed are available in the datasets
directory of this repository [1].
Alternatively, corpora can be publicly accessed at the following links:
Corpus | Text Genre | Standard | Entities | Publication |
---|---|---|---|---|
AZDC | Scientific Article | Gold | diseases | link |
BioInfer | Scientific Article | Gold | genes/proteins | link |
BioSemantics | Patent | Gold | chemicals, diseases | link |
CDR | Scientific Article | Gold | chemicals, diseases | link |
CellFinder | Scientific Article | Gold | species, genes/proteins, cells, anatomy | link |
CEMP | Patent | Gold | chemicals | link |
DECA | Scientific Article | Gold | genes/proteins | link |
FSU-PRGE | Scientific Article | Gold | genes/proteins | link |
Linnaeus | Scientific Article | Gold | species | link |
IEPA | Scientific Article | Gold | genes/proteins | link |
miRNA | Scientific Article | Gold | diseases, species, genes/proteins | link |
NCBI disease | Scientific Article | Gold | diseases | link |
S800 | Scientific Article | Gold | species | link |
The MLEE corpus [3] was obtained here. We used standoff2conll
to convert it to the IOB format, with the following command:
python2 standoff2conll.py path/to/original_format_corpora/MLEE-1.0.2-rev1/standoff/full -t Cell_proliferation Development Blood_vessel_development Death Breakdown Remodeling Growth Synthesis Gene_expression Transcription Catabolism Phosphorylation Dephosphorylation Localization Binding Regulation Positive_regulation Negative_regulation Planned_process -s IOB > MLEE_IOB.tsv
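Once converted, the IOB file can be read back for a quick sanity check. The sketch below assumes the output is tab-separated with one token per line, the token in the first column and the IOB tag in the last, and blank lines between sentences. These column positions are assumptions, so check them against your own `MLEE_IOB.tsv` and adjust `token_col` and `tag_col` if needed. `read_iob` and `entity_spans` are hypothetical helper names and are not part of standoff2conll.

```python
from typing import Iterable, List, Tuple

Sentence = List[Tuple[str, str]]


def read_iob(lines: Iterable[str], token_col: int = 0, tag_col: int = -1) -> List[Sentence]:
    """Group tab-separated token/tag lines into sentences.

    Assumes blank lines separate sentences; column positions are assumptions.
    """
    sentences: List[Sentence] = []
    current: Sentence = []
    for raw in lines:
        line = raw.rstrip("\n")
        if not line.strip():
            # A blank line closes the current sentence, if any.
            if current:
                sentences.append(current)
                current = []
            continue
        fields = line.split("\t")
        current.append((fields[token_col], fields[tag_col]))
    if current:
        sentences.append(current)
    return sentences


def entity_spans(sentence: Sentence) -> List[Tuple[str, str]]:
    """Collect (label, text) spans from the IOB tags of one sentence."""
    spans: List[Tuple[str, str]] = []
    label, tokens = None, []
    for token, tag in sentence:
        prefix, _, name = tag.partition("-")
        if prefix == "B" or (prefix == "I" and name != label):
            # Start a new span (an I- tag without a matching B- is treated leniently).
            if tokens:
                spans.append((label, " ".join(tokens)))
            label, tokens = name, [token]
        elif prefix == "I":
            tokens.append(token)
        else:
            # An "O" tag ends any open span.
            if tokens:
                spans.append((label, " ".join(tokens)))
            label, tokens = None, []
    if tokens:
        spans.append((label, " ".join(tokens)))
    return spans


# Tiny in-memory example using the assumed two-column layout.
sample = ["Cell\tB-Growth\n", "growth\tI-Growth\n", "stopped\tO\n", "\n", "Done\tO\n"]
sentences = read_iob(sample)
spans = [entity_spans(s) for s in sentences]
```

To read the real file, pass an open file handle instead of the sample list, for example `read_iob(open("MLEE_IOB.tsv", encoding="utf-8"))`.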
Word embeddings derived from a combination of PubMed and PMC texts and a recent English Wikipedia dump, which are well suited to sequence processing tasks in the biomedical domain, can be obtained here [2].
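As a rough illustration of loading such embeddings, the sketch below reads vectors from a file in the word2vec binary format: a text header giving vocabulary size and dimensionality, then each word followed by a space and its little-endian float32 values. That the downloaded file uses this format is an assumption, so verify it before relying on this. `read_word2vec_binary` is a hypothetical helper name, and the `limit` parameter is added here only to keep memory use manageable.

```python
import io
import struct
from typing import BinaryIO, Dict, List, Optional


def read_word2vec_binary(stream: BinaryIO, limit: Optional[int] = None) -> Dict[str, List[float]]:
    """Read up to `limit` word vectors from a word2vec binary stream (assumed format)."""
    header = stream.readline().decode("utf-8")
    vocab_size, dim = (int(x) for x in header.split())
    count = vocab_size if limit is None else min(limit, vocab_size)

    vectors: Dict[str, List[float]] = {}
    for _ in range(count):
        chars = bytearray()
        while True:
            ch = stream.read(1)
            if not ch:
                raise EOFError("unexpected end of stream while reading a word")
            if ch == b" ":
                break
            if ch != b"\n":  # some writers emit a newline after each vector
                chars.extend(ch)
        word = chars.decode("utf-8", errors="replace")

        data = stream.read(4 * dim)
        if len(data) != 4 * dim:
            raise EOFError("unexpected end of stream while reading a vector")
        vectors[word] = list(struct.unpack("<%df" % dim, data))
    return vectors


# Tiny in-memory example standing in for the real (much larger) file.
demo = io.BytesIO(
    b"2 3\n"
    + b"gene " + struct.pack("<3f", 1.0, 2.0, 0.5) + b"\n"
    + b"protein " + struct.pack("<3f", -1.5, 0.0, 4.0) + b"\n"
)
embeddings = read_word2vec_binary(demo)
```

For the real file, open it in binary mode (`open(path, "rb")`) and consider passing a `limit`. If gensim is installed, `gensim.models.KeyedVectors.load_word2vec_format(path, binary=True)` is a more efficient alternative for the same format.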
1. Many of these datasets were obtained from https://github.com/cambridgeltl/MTL-Bioinformatics-2016/
2. Sampo Pyysalo, Filip Ginter, Hans Moen, Tapio Salakoski and Sophia Ananiadou. Distributional semantics resources for biomedical text processing. In Proceedings of the 5th International Symposium on Languages in Biology and Medicine, Tokyo, Japan (2013), pp. 39-43.
3. Sampo Pyysalo, Tomoko Ohta, Makoto Miwa, Han-Cheol Cho, Jun'ichi Tsujii and Sophia Ananiadou. Event extraction across multiple levels of biological organization. Bioinformatics (2012) 28(18):i575-i581.