Releases: PaccMann/paccmann_datasets
Query SMILES in PubChem, handle/impute NaN in GeneExpressionDataset
Change to black formatter with configuration files (#93)
`conda.yml`: refers to `requirements.txt` (#92)
`GeneExpressionDataset`: delayed optional imputation of NaN until after statistics collection and transformation (#88)
`AugmentByReversing`: can now take a probability to perform the reversing (#85)
`read_smi`: raise an error when the wrong delimiter is used (#85)
Added `remove_pubchem_smiles`: filters out PubChem SMILES (#85)
Refactor introducing base datasets supporting key lookup
Many top level datasets in pytoda are "just" torch Datasets supporting `__len__` and `__getitem__`. Typically the dataset itself (possibly labels and) pairs keys of samples/entities, where the respective item data comes from specific sources.
So, in their implementations the datasets rely on other datasets specific to a datatype/entity, and on getting items via a hashable key rather than an integer index.
This PR introduces some base classes in `base_dataset.py` that provide an interface one can expect from such datasets.
New Base Classes
KeyDataset
Every `KeyDataset` can implement its own mechanism to store keys and match them with index/item, but minimally implements `get_key` and `get_index`; that's it (a minimal subclass sketch follows after the method list below).
The following methods are available:
get_key(index: int) -> Hashable
get_index(key: Hashable) -> int
get_item_from_key(key: Hashable) -> Any
keys() -> Iterator
has_duplicate_keys -> bool
__add__(other) -> _ConcatenatedDataset
(with default implementations that can be overloaded with more efficient methods)
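A minimal sketch of such a subclass, assuming the base class lives in `pytoda.datasets.base_dataset` and that the default `get_item_from_key` is derived from `get_index` and `__getitem__` (both assumptions; the real base class may differ):

```python
from typing import Any, Hashable

from pytoda.datasets.base_dataset import KeyDataset  # assumed import path


class DictKeyDataset(KeyDataset):
    """Toy dataset backed by a dict mapping key -> item."""

    def __init__(self, data: dict):
        super().__init__()
        self.data = data
        self._keys = list(data.keys())

    def __len__(self) -> int:
        return len(self._keys)

    def __getitem__(self, index: int) -> Any:
        return self.data[self._keys[index]]

    # the two methods a KeyDataset minimally implements
    def get_key(self, index: int) -> Hashable:
        return self._keys[index]

    def get_index(self, key: Hashable) -> int:
        return self._keys.index(key)


ds = DictKeyDataset({'CID-1': 'CCO', 'CID-2': 'c1ccccc1'})
assert ds.get_item_from_key('CID-2') == ds[ds.get_index('CID-2')]  # default method
```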
ConcatKeyDataset
Based on torch's `ConcatDataset`, it supports concatenation of multiple `KeyDataset`s. The key lookup through the source datasets was previously implemented in each top level dataset itself; now it is built in and refers to each dataset's own implementation of the lookup.
It also features the methods `get_index_pair` and `get_key_pair` to additionally get the index of the source dataset.
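A usage sketch, reusing the toy `DictKeyDataset` from above; the import path, the constructor (assumed to mirror torch's `ConcatDataset`) and the exact return values of the pair methods are assumptions:

```python
from pytoda.datasets.base_dataset import ConcatKeyDataset  # assumed import path

left = DictKeyDataset({'CID-1': 'CCO'})
right = DictKeyDataset({'CID-2': 'c1ccccc1', 'CID-3': 'CC(=O)O'})

combined = ConcatKeyDataset([left, right])  # constructor signature assumed

print(combined.get_index('CID-3'))   # key lookup across the source datasets
print(combined.get_index_pair(2))    # assumed: (source dataset index, index within it)
print(combined.get_key_pair(2))      # assumed: (source dataset index, key)
```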
DatasetDelegator
Often there are base classes implementing functionality for a datatype, with the setup of the datasource (e.g. eager vs lazy, filetype) left to child classes.
A `DatasetDelegator` with an assigned `self.dataset` behaves as if it were that dataset, delegating all method/attribute calls it does not implement itself to `self.dataset`. This provides "base" methods, saving reimplementation, while still allowing "overloading".
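A toy illustration of the delegation idea, reusing the `DictKeyDataset` from the sketch above (this is not pytoda's actual implementation; Python resolves special methods on the type, so the real class presumably forwards those itself):

```python
class DelegatorSketch:
    """Forwards failed attribute lookups and common special methods to self.dataset."""

    def __getattr__(self, name):
        # only reached when normal attribute lookup fails
        return getattr(self.dataset, name)

    def __len__(self):
        return len(self.dataset)

    def __getitem__(self, index):
        return self.dataset[index]


class UpperSmilesDataset(DelegatorSketch):
    """Child only sets up the source; overloads __getitem__, delegates the rest."""

    def __init__(self, rows: dict):
        self.dataset = DictKeyDataset(rows)  # toy source from the sketch above

    def __getitem__(self, index):
        return self.dataset[index].upper()   # "overloading" a base method


ds = UpperSmilesDataset({'cid-1': 'cco'})
assert ds[0] == 'CCO'             # overloaded here
assert ds.get_key(0) == 'cid-1'   # delegated to the wrapped dataset
```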
keyed and indexed
Once a dataset is fed to a dataloader that shuffles the data, it is hard to go back and investigate loaded samples without the item's index/key.
The `keyed` and `indexed` functions, called on a dataset, return a shallow copy of the dataset with changed indexing behaviour, also returning the index/key in addition to the item.
While `AnnotatedDataset` iterates through the samples in the annotation file, `keyed` and `indexed` in contrast allow iterating through the samples in the dataset while still looking up labels manually from some source.
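A usage sketch, again reusing the toy `DictKeyDataset`; the import location of `keyed`/`indexed` and the exact shape of the returned items are assumptions:

```python
from torch.utils.data import DataLoader

from pytoda.datasets import indexed, keyed  # assumed import location

ds = DictKeyDataset({'CID-1': 'CCO', 'CID-2': 'c1ccccc1'})

indexed_ds = indexed(ds)  # shallow copy whose items also carry their index
keyed_ds = keyed(ds)      # shallow copy whose items also carry their key

loader = DataLoader(indexed_ds, batch_size=2, shuffle=True)
for batch in loader:
    # even after shuffling, each loaded sample can be traced back via its index
    print(batch)
```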
Notes
In the case of duplicate keys, the behaviour is implementation specific (i.e. could raise or return first/last).
Many tests were refactored to test lazy/eager backends in the same file without code duplication, and tests for base methods were added where appropriate.
Datasets that filter items for availability from sources now usually define `masks_df`, which has column-wise masks for the original df to allow inspection of missing entities per item; a toy illustration follows below.
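A toy illustration of the idea, assuming `masks_df` is a pandas DataFrame with one boolean column per source (the column and index names here are made up):

```python
import pandas as pd

# hypothetical masks_df: rows aligned with the original df, one mask per source
masks_df = pd.DataFrame(
    {'gene_expression': [True, True, False], 'smiles': [True, False, True]},
    index=['sample-1', 'sample-2', 'sample-3'],
)

# inspect which items were filtered because an entity is missing in some source
print(masks_df[~masks_df.all(axis=1)])
```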
Refactor
SMILESLanguage and SMILESTokenizer
`SMILESLanguage` can translate SMILES to token indexes. Transforms of the SMILES and transforms of the encoded token indexes are separated and default to identity functions. Defining the transforms is the job of child implementations like `SMILESTokenizer`, or can be done at runtime on instances. There is a named choice of tokenization functions in `TOKENIZER_FUNCTIONS`, but a function can also be passed directly.
The instances can be used to load or build up vocabularies and remember the longest sequence.
The vocabulary can be stored to / loaded from .json. Additionally, an instance can be stored in / loaded from a directory of text files (with defined names). This is achieved similarly to the huggingface transformers (attribution header and licence added). A pretrained tokenizer is shipped in `pytoda.smiles.metadata`.
A new method `add_dataset` allows building up the vocabulary from an iterable (`list`, `SMILESDataset`, ...); it checks for invalid source SMILES, applies `transform_smiles` and passes the result to the tokenizer function to add new tokens.
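A usage sketch; `add_dataset` comes from these notes, while the import location, the default constructor, the encoding method name and the save/load helpers (mirroring huggingface) are assumptions:

```python
from pytoda.smiles import SMILESTokenizer  # assumed import location

tokenizer = SMILESTokenizer()  # child of SMILESLanguage with tokenization transforms

# build up the vocabulary from any iterable of SMILES (list, SMILESDataset, ...)
tokenizer.add_dataset(['CCO', 'c1ccccc1', 'CC(=O)O'])

# translate a SMILES string into token indexes (method name assumed)
token_indexes = tokenizer.smiles_to_token_indexes('CC(=O)O')

# store/restore the instance as a directory of text files (helper names assumed)
tokenizer.save_pretrained('smiles_tokenizer/')
restored = SMILESTokenizer.from_pretrained('smiles_tokenizer/')
```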
SMILESDataset and SMILESTokenizerDataset
`SMILESDataset` now merely returns SMILES strings, as one might expect from the name. This is a breaking change: users of the old `SMILESDataset` can now use `SMILESTokenizerDataset`.
`SMILESTokenizerDataset` uses `SMILESTokenizer` as the default SMILES language to transform items via a `SMILESDataset`.
`SMILESTokenizerDataset` can now optionally load a vocab and can optionally skip iterating the dataset.
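A sketch of the breaking change; the file name and constructor arguments are assumptions, only the class names and the described behaviour come from the notes above:

```python
from pytoda.datasets import SMILESDataset, SMILESTokenizerDataset  # assumed imports

# now merely returns the raw SMILES strings
smiles_dataset = SMILESDataset('molecules.smi')  # hypothetical .smi file
print(smiles_dataset[0])         # e.g. 'CCO'

# replacement for the old behaviour: items are encoded token indexes
tokenized_dataset = SMILESTokenizerDataset('molecules.smi')
print(tokenized_dataset[0])      # e.g. a tensor of token indexes
```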
Protein Language Modelling
Added various functionalities for protein language modelling, including a submodule `pytoda.proteins` with a `ProteinLanguage` class and two new types of datasets, available through `pytoda.datasets`, called `ProteinSequenceDataset` and `ProteinProteinInteractionDataset`.
Webservice migration
Several improvements were made, partially in response to the migration of our webservice (https://ibm.biz/paccmann-aas) to a pytorch-deployed model.
Extend SMILES functionalities
0.0.2 Major update on SMILES functionality