
Random seed is useless in get_molnet_dataset function #418

Open
Minys233 opened this issue Mar 14, 2020 · 1 comment

@Minys233

In this function, the signature includes a seed=777 argument:

def get_molnet_dataset(dataset_name, preprocessor=None, labels=None,
                       split=None, frac_train=.8, frac_valid=.1,
                       frac_test=.1, seed=777, return_smiles=False,
                       return_pdb_id=False, target_index=None, task_index=0,
                       **kwargs):
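For context, here is a minimal sketch of a typical call (the dataset name and preprocessor are only illustrative; any MoleculeNet dataset with a scaffold split shows the same problem):

from chainer_chemistry.datasets.molnet import get_molnet_dataset
from chainer_chemistry.dataset.preprocessors import GGNNPreprocessor

# One would expect the same seed to produce the same split on every run.
data = get_molnet_dataset('bbbp', preprocessor=GGNNPreprocessor(),
                          split='scaffold', seed=777, return_smiles=True)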

However, the seed is never passed to any splitter. In the same function, the splitter is called without a seed argument:

    if dataset_config['dataset_type'] == 'one_file_csv':
        split = dataset_config['split'] if split is None else split
        if isinstance(split, str):
            splitter = split_method_dict[split]()
        elif isinstance(split, BaseSplitter):
            splitter = split
        else:
            raise TypeError("split must be None, str or instance of"
                            " BaseSplitter, but got {}".format(type(split)))
        if isinstance(splitter, ScaffoldSplitter):
            get_smiles = True
        else:
            get_smiles = return_smiles
        result = parser.parse(get_molnet_filepath(dataset_name),
                              return_smiles=get_smiles,
                              target_index=target_index, **kwargs)
        dataset = result['dataset']
        smiles = result['smiles']
        train_ind, valid_ind, test_ind = \
            splitter.train_valid_test_split(dataset, smiles_list=smiles,
                                            task_index=task_index,
                                            frac_train=frac_train,
                                            frac_valid=frac_valid,
                                            frac_test=frac_test, **kwargs)
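So the only way a seed could reach the splitter from here is through **kwargs, and it is never put there. A toy illustration of the lookup the splitter performs internally (the dict contents are made up):

kwargs = {'smiles_list': ['CCO', 'c1ccccc1'], 'include_chirality': False}  # illustrative only
print(kwargs.get('seed', None))  # -> None, so RandomState(None) is used downstream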

Then, in the splitter (here the ScaffoldSplitter), the seed argument defaults to None:

    def train_valid_test_split(self, dataset, smiles_list, frac_train=0.8,
                               frac_valid=0.1, frac_test=0.1, converter=None,
                               return_index=True, seed=None,
                               include_chirality=False, **kwargs):

According to the implementation, seed=None means the random state is initialized by reading data from /dev/urandom (or the Windows analogue), as described in the numpy docs:

    def _split(self, dataset, frac_train=0.8, frac_valid=0.1, frac_test=0.1,
               **kwargs):
        numpy.testing.assert_almost_equal(frac_train + frac_valid + frac_test,
                                          1.)
        seed = kwargs.get('seed', None)
        smiles_list = kwargs.get('smiles_list')
        include_chirality = kwargs.get('include_chirality')
        if len(dataset) != len(smiles_list):
            raise ValueError("The lengths of dataset and smiles_list are "
                             "different")
        rng = numpy.random.RandomState(seed)
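A quick standalone check of what that means in practice (plain numpy, independent of chainer-chemistry):

import numpy

a = numpy.random.RandomState(None).permutation(10)  # seeded from OS entropy, differs per run
b = numpy.random.RandomState(None).permutation(10)
print((a == b).all())  # almost certainly False

c = numpy.random.RandomState(777).permutation(10)   # explicit seed, reproducible
d = numpy.random.RandomState(777).permutation(10)
print((c == d).all())  # True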

This bug makes the data split inconsistent across different models and different runs, even if we explicitly specify the same seed, so the default seed of 777 here is useless.

PS: I used the PyCharm debugger to verify the behavior described above.

@corochann
Member

corochann commented Mar 14, 2020

Thank you for the report, @Minys233.

I think you are right. It seems we need to change this line to splitter = split_method_dict[split](seed=seed):

splitter = split_method_dict[split]()

JFYI: sorry, we have moved to the maintenance phase and development has not been very active recently, but I can merge a PR for the bug fix.
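A rough sketch of the direction (not a tested patch): either give the seed to the splitter when it is constructed, which requires the splitter __init__ to accept a seed argument, or forward it at the call site, which already matches the train_valid_test_split signature quoted above.

# Option 1: construct the splitter with the seed
# (assumes the splitter __init__ accepts and stores a seed argument)
if isinstance(split, str):
    splitter = split_method_dict[split](seed=seed)

# Option 2: forward the seed where the split is performed
train_ind, valid_ind, test_ind = \
    splitter.train_valid_test_split(dataset, smiles_list=smiles,
                                    task_index=task_index,
                                    frac_train=frac_train,
                                    frac_valid=frac_valid,
                                    frac_test=frac_test,
                                    seed=seed, **kwargs)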
