
Releases: PaccMann/paccmann_datasets

Query SMILES in PubChem, handle/impute NaN in GeneExpressionDataset

04 Nov 09:32

Change to black formatter with configuration files (#93)

conda.yml: refers to requirements.txt (#92)

GeneExpressionDataset: delayed optional imputation of NaN until after statistics collection and transformation (#88)

AugmentByReversing: can now take a probability of performing the reversal; see the sketch after this list (#85)

read_smi: raise an error when the wrong delimiter is used (#85)

Added remove_pubchem_smiles for filtering out PubChem SMILES (#85)
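
As an illustration of what a probabilistic reversal transform does, here is a small, self-contained stand-in. The class below is not pytoda's AugmentByReversing (whose module and parameter names are not given in this note); it only mimics the described behaviour.

    import random

    class ReverseWithProbability:
        """Illustrative stand-in (not pytoda's AugmentByReversing): reverse
        the string with probability p, otherwise return it unchanged."""

        def __init__(self, p: float = 0.5):
            self.p = p

        def __call__(self, sequence: str) -> str:
            return sequence[::-1] if random.random() < self.p else sequence

    print(ReverseWithProbability(p=1.0)('CCO'))  # always reversed -> 'OCC'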

Refactor introducing base datasets supporting key lookup

09 Sep 10:48
1632e15

Many top-level datasets in pytoda are "just" torch Datasets supporting __len__ and __getitem__. Typically the dataset itself pairs (and possibly labels) keys of samples/entities, while the respective item data comes from specific sources.
In their implementations, these datasets therefore rely on other datasets specific to a datatype/entity, and retrieve items via a hashable key rather than an integer index.

This PR introduces some base classes in base_dataset.py that provide an interface one can expect from such datasets.

New Base Classes

KeyDataset

Every KeyDataset can implement its own mechanism to store keys and match them with an index/item, but minimally implements get_key and get_index. That's it.
The following methods are available:

    get_key(index: int) -> Hashable
    get_index(key: Hashable) -> int
    get_item_from_key(key: Hashable) -> Any
    keys() -> Iterator
    has_duplicate_keys -> bool
    __add__(other) -> _ConcatenatedDataset

(with default implementations that can be overridden by more efficient ones)
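
To make this contract concrete, here is an illustrative, self-contained dataset that keeps its keys in a list. It is a sketch of the expected behaviour, not pytoda's actual KeyDataset.

    from typing import Any, Hashable, Iterator
    from torch.utils.data import Dataset

    class InMemoryKeyedDataset(Dataset):
        """Illustrative sketch of the KeyDataset contract (not pytoda code):
        items are addressable by integer index and by hashable key."""

        def __init__(self, data: dict):
            self._keys = list(data.keys())
            self._items = list(data.values())

        def __len__(self) -> int:
            return len(self._items)

        def __getitem__(self, index: int) -> Any:
            return self._items[index]

        # the two methods a KeyDataset minimally implements itself
        def get_key(self, index: int) -> Hashable:
            return self._keys[index]

        def get_index(self, key: Hashable) -> int:
            return self._keys.index(key)

        # defaults the base class can derive from the two methods above
        def get_item_from_key(self, key: Hashable) -> Any:
            return self[self.get_index(key)]

        def keys(self) -> Iterator:
            return iter(self._keys)

        @property
        def has_duplicate_keys(self) -> bool:
            return len(set(self._keys)) < len(self._keys)

    ds = InMemoryKeyedDataset({'mol-1': 'CCO', 'mol-2': 'C1=CC=CC=C1'})
    assert ds.get_item_from_key('mol-1') == ds[0]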

ConcatKeyDataset

Based on torch's ConcatDataset, it supports concatenation of multiple KeyDatasets. The key lookup through the source datasets, previously implemented in each top-level dataset itself, is now built in and defers to each dataset's own lookup implementation.
It also features the methods get_index_pair and get_key_pair to retrieve the index of the source dataset together with the sample's index or key.
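
The pair lookup can be pictured as follows (reusing InMemoryKeyedDataset from the sketch above). This is a conceptual illustration of resolving a global index of a concatenation to the source dataset and its key, under the reading that get_key_pair returns the position of the source dataset together with the key; it is not ConcatKeyDataset's implementation.

    from bisect import bisect_right
    from itertools import accumulate

    def key_pair(datasets, global_index):
        """Conceptual sketch (not ConcatKeyDataset itself): map a global index
        over a concatenation to (source dataset position, key within it)."""
        cumulative = list(accumulate(len(d) for d in datasets))
        dataset_idx = bisect_right(cumulative, global_index)
        offset = cumulative[dataset_idx - 1] if dataset_idx else 0
        return dataset_idx, datasets[dataset_idx].get_key(global_index - offset)

    parts = [InMemoryKeyedDataset({'a': 1, 'b': 2}), InMemoryKeyedDataset({'c': 3})]
    print(key_pair(parts, 2))  # (1, 'c')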

DatasetDelegator

Often there are base classes implementing functionality for a datatype, with the setup of the data source (e.g. eager vs. lazy loading, file type) left to child classes.
A DatasetDelegator with an assigned self.dataset behaves as if it were that dataset, delegating all method/attribute calls it does not implement itself to self.dataset. This provides "base" methods without reimplementation while still allowing overriding.
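
The delegation idea can be sketched with plain attribute forwarding. This illustrates the pattern only and is not pytoda's DatasetDelegator code (it reuses InMemoryKeyedDataset from above).

    class DelegatingDataset:
        """Illustrative delegation pattern (not pytoda code): anything not
        defined here is forwarded to the wrapped self.dataset."""

        def __init__(self, dataset):
            self.dataset = dataset

        # dunder lookups bypass __getattr__, so they are defined explicitly
        def __len__(self):
            return len(self.dataset)

        def __getitem__(self, index):
            return self.dataset[index]

        def __getattr__(self, name):
            # only reached when normal attribute lookup on self fails
            return getattr(self.dataset, name)

    wrapped = DelegatingDataset(InMemoryKeyedDataset({'mol-1': 'CCO'}))
    print(wrapped.get_key(0))  # 'mol-1', resolved on the wrapped dataset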

keyed and indexed

Once a dataset is fed to a dataloader that shuffles the data, it is hard to trace loaded samples back without the item's index/key.
The keyed and indexed functions, called on a dataset, return a shallow copy of the dataset with changed indexing behaviour that also returns the key/index in addition to the item.

While AnnotatedDataset iterates through the samples in the annotation file, using keyed or indexed instead allows iterating through the samples in the dataset while still looking up labels manually from some source.
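
Conceptually, the effect is that every item comes back with its key (or index) attached, as in the following illustration (again reusing InMemoryKeyedDataset). The real functions return a shallow copy with changed indexing behaviour rather than a wrapper class.

    class KeyedView:
        """Illustrative effect of keyed()/indexed() (not pytoda code): items
        are returned together with their key (or index), so shuffled batches
        stay traceable."""

        def __init__(self, dataset, use_keys: bool = True):
            self.dataset = dataset
            self.use_keys = use_keys

        def __len__(self):
            return len(self.dataset)

        def __getitem__(self, index):
            tag = self.dataset.get_key(index) if self.use_keys else index
            return tag, self.dataset[index]

    view = KeyedView(InMemoryKeyedDataset({'mol-1': 'CCO'}))
    print(view[0])  # ('mol-1', 'CCO')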

Notes

In the case of duplicate keys, the behaviour is implementation-specific (i.e. it could raise or return the first/last match).

Many tests were refactored to cover lazy/eager backends in the same file without code duplication, and tests for the base methods were added where appropriate.

Datasets that filter items based on availability in their sources now usually define masks_df, which holds column-wise masks for the original DataFrame to allow inspection of missing entities per item.

Refactor

SMILESLanguage and SMILESTokenizer

SMILESLanguage can translate SMILES into token indexes. Transforms of the SMILES and transforms of the encoded token indexes are separated and default to identity functions. Defining the transforms is the job of child implementations like SMILESTokenizer, or can be done at runtime on instances. Tokenization functions can be selected by name from TOKENIZER_FUNCTIONS, or a function can be passed directly.

Instances can be used to load or build up vocabularies and keep track of the longest sequence.
The vocabulary can be stored to and loaded from .json. Additionally, an instance can be stored to and loaded from a directory of text files (with defined names). This is achieved similarly to the Hugging Face transformers (an attribution header and licence were added). A pretrained tokenizer is shipped in pytoda.smiles.metadata.

A new method add_dataset allows building up the vocabulary from an iterable (list, SMILESDataset, ...): it checks for invalid source SMILES, applies transform_smiles and passes the result to the tokenizer function to add new tokens.
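
A minimal sketch of building up a vocabulary from an iterable of SMILES follows. Only add_dataset is named above, so the import path and the encoding call smiles_to_token_indexes are assumptions about the API rather than details confirmed by this note.

    # Sketch under assumptions: the import path and smiles_to_token_indexes
    # are not confirmed by this release note.
    from pytoda.smiles.smiles_language import SMILESTokenizer  # assumed location

    tokenizer = SMILESTokenizer()                    # child class defining the transforms
    tokenizer.add_dataset(['CCO', 'C1=CC=CC=C1'])    # checks, transforms and adds tokens
    print(tokenizer.smiles_to_token_indexes('CCO'))  # assumed encoding method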

SMILESDataset and SMILESTokenizerDataset

SMILESDataset now merely returns SMILES strings, as one might expect from the name. This is a breaking change; users of the old SMILESDataset can switch to SMILESTokenizerDataset.
SMILESTokenizerDataset uses SMILESTokenizer as the default SMILES language to transform items obtained via a SMILESDataset.

SMILESTokenizerDataset can now optionally load a vocabulary and skip iterating the dataset.
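
To illustrate the split, a side-by-side sketch: 'molecules.smi' is a hypothetical .smi file, and the constructor calls are assumptions about the API, not details taken from this note.

    # Sketch under assumptions; 'molecules.smi' is a hypothetical SMILES file
    # and the constructor signatures are not confirmed by this note.
    from pytoda.datasets import SMILESDataset, SMILESTokenizerDataset

    raw = SMILESDataset('molecules.smi')               # items are SMILES strings
    encoded = SMILESTokenizerDataset('molecules.smi')  # items are token-index tensors
    print(raw[0])
    print(encoded[0])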

Protein Language Modelling

10 Mar 11:16
365bdef

Added various functionalities for protein language modelling, including a submodule pytoda.proteins with a ProteinLanguage class and two new types of datasets available through pytoda.datasets: ProteinSequenceDataset and ProteinProteinInteractionDataset.

Webservice migration

19 Feb 09:57
d25855b

Several improvements were made, partially in response to the migration of our webservice (https://ibm.biz/paccmann-aas) to a PyTorch-deployed model.

Extend SMILES functionalities

26 Nov 10:14
c114f79
0.0.2

Major update to SMILES functionality