
Refactor introducing base datasets supporting key lookup


Many top-level datasets in pytoda are "just" torch Datasets supporting __len__ and __getitem__. Typically, the dataset itself pairs (and possibly labels) the keys of samples/entities, while the respective item data comes from specific sources.
So, in their implementations, these datasets rely on other datasets specific to a datatype/entity, and on retrieving items via a hashable key rather than an integer index.

This PR introduces some base classes in base_dataset.py that provide an interface one can expect from such datasets.

New Base Classes

KeyDataset

Every KeyDataset can implement its own mechanism to store keys and match them with index/item, but minimally implements get_key and get_index. That's it.
The following methods are available:

    get_key(index: int) -> Hashable
    get_index(key: Hashable) -> int
    get_item_from_key(key: Hashable) -> Any
    keys() -> Iterator
    has_duplicate_keys -> bool
    __add__(other) -> _ConcatenatedDataset

(with default implementations that can be overridden by more efficient methods)
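
For illustration, a minimal sketch of a custom KeyDataset that keeps keys and items in parallel lists. The import path pytoda.datasets.base_dataset and the class/attribute names are assumptions made for this sketch, not the library's API:

    from typing import Hashable

    from pytoda.datasets.base_dataset import KeyDataset  # import path assumed


    class ListKeyDataset(KeyDataset):
        """Toy KeyDataset keeping keys and items in parallel lists."""

        def __init__(self, items, keys):
            super().__init__()
            self.items = list(items)
            self.item_keys = list(keys)

        def __len__(self) -> int:
            return len(self.items)

        def __getitem__(self, index: int):
            return self.items[index]

        def get_key(self, index: int) -> Hashable:
            return self.item_keys[index]

        def get_index(self, key: Hashable) -> int:
            return self.item_keys.index(key)


    ds = ListKeyDataset(items=[0.1, 0.2], keys=['CHEMBL1', 'CHEMBL2'])
    ds.get_item_from_key('CHEMBL2')  # 0.2, via the default implementation
    list(ds.keys())                  # ['CHEMBL1', 'CHEMBL2']
    ds.has_duplicate_keys            # False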

ConcatKeyDataset

Based on torch's ConcatDataset, it supports concatenation of multiple KeyDatasets. The key lookup through the source datasets used to be implemented in each top-level dataset itself; it is now built in and defers to each dataset's own implementation of the lookup.
It also features the methods get_index_pair and get_key_pair, which additionally return the index of the source dataset.
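
A hedged usage sketch, continuing the toy ListKeyDataset from above; the ordering inside the returned pairs is an assumption:

    left = ListKeyDataset(items=[0.1, 0.2], keys=['CHEMBL1', 'CHEMBL2'])
    right = ListKeyDataset(items=[0.3], keys=['CHEMBL3'])

    combined = left + right          # __add__ returns a concatenated KeyDataset
    len(combined)                    # 3
    combined.get_index('CHEMBL3')    # 2, lookup deferred to the source dataset
    combined.get_key_pair(2)         # pairs the key with the index of its
                                     # source dataset, e.g. (1, 'CHEMBL3')
                                     # (pair ordering assumed)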

DatasetDelegator

Often there are base classes implementing functionality for a datatype, with the setup of the data source (e.g. eager vs. lazy, filetype) left to child classes.
A DatasetDelegator with an assigned self.dataset behaves as if it were that dataset, delegating all method/attribute calls it does not implement to self.dataset. This provides "base" methods that save reimplementation while still allowing overriding.
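
A sketch of the pattern, again reusing the toy ListKeyDataset; the constructor contract of DatasetDelegator (in particular when self.dataset has to be assigned) is an assumption here:

    from pytoda.datasets.base_dataset import DatasetDelegator  # import path assumed


    class EagerToyDataset(DatasetDelegator):
        """Child class only sets up the data source; everything else is delegated."""

        def __init__(self, items, keys):
            self.dataset = ListKeyDataset(items, keys)  # the delegate
            super().__init__()

        # No __len__, __getitem__, get_key, ...: calls not defined here are
        # forwarded to self.dataset, but any of them could be overridden.


    ds = EagerToyDataset(items=[0.1, 0.2], keys=['CHEMBL1', 'CHEMBL2'])
    ds.get_key(0)  # 'CHEMBL1', resolved by the delegated ListKeyDataset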

keyed and indexed

Once a dataset is fed to a dataloader that shuffles the data, it is hard to go back and investigate loaded samples without the item's index/key.
The keyed and indexed functions, called on a dataset, return a shallow copy of the dataset with changed indexing behaviour that also returns the index/key in addition to the item.
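
A usage sketch, reusing the toy ListKeyDataset from the first sketch; the import path and the ordering of the returned tuple are both assumptions:

    from torch.utils.data import DataLoader

    from pytoda.datasets import indexed, keyed  # import path assumed

    toy = ListKeyDataset(items=[0.1, 0.2], keys=['CHEMBL1', 'CHEMBL2'])

    tracked = indexed(toy)    # shallow copy; items are now paired with their index
    item, index = tracked[0]  # tuple ordering assumed

    loader = DataLoader(keyed(toy), batch_size=2, shuffle=True)
    for items, keys in loader:
        # the keys allow tracing shuffled samples back to the original entities
        ...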

While AnnotatedDataset iterates through the samples in the annotation file, using keyed or indexed instead allows iterating through the samples in the dataset while still being able to look up labels manually from some source.

Notes

In the case of duplicate keys, the behaviour is implementation specific (e.g. raising, or returning the first/last match).

Many tests were refactored to cover lazy/eager backends in the same file without code duplication, and tests for the base methods were added where appropriate.

Datasets that filter items by availability in their sources now usually define masks_df, which holds column-wise masks for the original df, allowing inspection of missing entities per item.
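
For example, assuming masks_df is a boolean pandas DataFrame aligned with the original df (dataset here stands for any such filtering dataset, and the usage is only a sketch):

    available = dataset.masks_df.all(axis=1)    # items present in every source
    missing = dataset.masks_df.loc[~available]  # rows with at least one missing entity
    missing.sum()                               # missing entities per column/source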

Refactor

SMILESLanguage and SMILESTokenizer

SMILESLanguage can translate SMILES into token indexes. Transforms of the SMILES and transforms of the encoded token indexes are separated and default to identity functions. Defining the transforms is the job of child implementations such as SMILESTokenizer, or can be done at runtime on instances. There is a named choice of tokenization functions in TOKENIZER_FUNCTIONS, but the function can also be passed directly.

Instances can be used to load or build up vocabularies and remember the longest sequence.
The vocabulary can be stored to/loaded from .json. Additionally, an instance can be stored in/loaded from a directory of text files (with defined names). This is done similarly to the huggingface transformers (attribution header and licence added). A pretrained tokenizer is shipped in pytoda.smiles.metadata.

A new method add_dataset allows building up the vocabulary from an iterable (list, SMILESDataset, ...); it checks for invalid source SMILES, applies transform_smiles and passes the result to the tokenization function to add new tokens.
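
A hedged sketch of building up a vocabulary. add_dataset and transform_smiles come from the description above, while the import path, the no-argument constructor and the name of the encoding method are assumptions:

    from pytoda.smiles import SMILESTokenizer  # import path assumed

    tokenizer = SMILESTokenizer()  # child class defines the default transforms

    # Build up the vocabulary from any iterable of SMILES
    # (a plain list here, but a SMILESDataset works as well):
    tokenizer.add_dataset(['CCO', 'c1ccccc1'])

    # Encode a SMILES into token indexes (method name assumed):
    token_indexes = tokenizer.smiles_to_token_indexes('CCO')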

SMILESDataset and SMILESTokenizerDataset

SMILESDataset now merely returns SMILES strings, as one might expect from the name. This is a breaking change; users of the old SMILESDataset can use SMILESTokenizerDataset now.
SMILESTokenizerDataset uses SMILESTokenizer as the default smiles language to transform the items of a SMILESDataset.

SMILESTokenizerDataset can now optionally load a vocabulary and choose whether or not to iterate the dataset.
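
A migration sketch; 'molecules.smi' is a placeholder file of SMILES strings, and the constructor signatures are assumptions:

    from pytoda.datasets import SMILESDataset, SMILESTokenizerDataset

    smiles_ds = SMILESDataset('molecules.smi')
    smiles_ds[0]     # now a plain SMILES string

    # Roughly what the old SMILESDataset used to return:
    tokenized_ds = SMILESTokenizerDataset('molecules.smi')
    tokenized_ds[0]  # token indexes produced by the default SMILESTokenizer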