v0

first batch of v0 models

This is the first attempt at mass-scale training. We only trained the original small model (twice, due to a cluster interruption) and the 10-rollout model (twice, due to a cluster interruption), plus a huge one that is 10-dimensional with 100 points. The results are less than ideal, but this marks the first systematic attempt at training.

The 10-dimensional models produce a gigantic amount of logs, so we omit them.

Code: v0.py

Metrics: As introduced in the main repo, the quantity $\rho$ is a good measure of the relative strength between host and agent. We measure $\rho$ for

  • (host_net, agent_net),
  • (host_net, RandomAgent),
  • (agent_net, RandomHost),
  • (RandomHost, RandomAgent)

every 1000 steps as well as at the end. (The last pair, (RandomHost, RandomAgent), only serves as a sanity check: it should be a constant depending only on the dimension and the maximum number of points.)
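For concreteness, here is a minimal sketch of this periodic evaluation, assuming a hypothetical `compute_rho(host, agent, num_games)` helper and hypothetical `random_host` / `random_agent` baseline players (the actual names live in the main repo):

```python
# Minimal sketch of the periodic rho evaluation. `compute_rho`, `random_host`,
# and `random_agent` are hypothetical stand-ins for the main repo's objects.
def evaluate_all(host_net, agent_net, random_host, random_agent,
                 compute_rho, num_games=100):
    """Return rho for the four pairings listed above."""
    pairs = {
        "(host_net, agent_net)": (host_net, agent_net),
        "(host_net, RandomAgent)": (host_net, random_agent),
        "(agent_net, RandomHost)": (random_host, agent_net),
        "(RandomHost, RandomAgent)": (random_host, random_agent),  # sanity check
    }
    return {name: compute_rho(host, agent, num_games=num_games)
            for name, (host, agent) in pairs.items()}
```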

Method: DQN (from DQNTrainer, whose core logic is exactly the same as DQN from stable-baselines3). We train the host and agent pair simultaneously for 10 steps, generate new experiences, and rinse and repeat.
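A rough sketch of this loop, with hypothetical `train(gradient_steps=...)` and `collect_experiences()` methods standing in for whatever DQNTrainer actually exposes in v0.py:

```python
# Sketch of the alternating train/collect loop; the method names
# `train` and `collect_experiences` are hypothetical.
def train_pair(host_trainer, agent_trainer, iterations, eval_every=1000,
               evaluate=lambda step: None):
    steps = 0
    for _ in range(iterations):
        host_trainer.train(gradient_steps=10)   # 10 DQN updates for the host
        agent_trainer.train(gradient_steps=10)  # 10 DQN updates for the agent
        host_trainer.collect_experiences()      # fresh games vs. the updated agent
        agent_trainer.collect_experiences()     # fresh games vs. the updated host
        steps += 10
        if steps % eval_every == 0:
            evaluate(steps)                     # e.g. the rho evaluation above
    evaluate(steps)                             # final evaluation at the end
```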

It is well known and easily observed that, since the metric is only relative to the specific (host, agent) pair, the two players develop preferences of their own that do not generalize (as in Go, chess, etc.). As a result, the (host, agent) pair settles into some kind of equilibrium and stops improving (it converges). For every config script, we train 8 (host, agent) pairs in parallel with randomized initialization and watch their behaviors.
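A sketch of the parallel launch, assuming a hypothetical `make_pair(seed)` factory that builds a randomly initialized (host, agent) trainer pair; the real parallelization in the config scripts may differ:

```python
# Sketch of launching 8 independently seeded (host, agent) pairs.
# `make_pair` is hypothetical; the actual setup is config-dependent.
import multiprocessing as mp

def train_one(seed):
    host_trainer, agent_trainer = make_pair(seed)  # randomized initialization
    train_pair(host_trainer, agent_trainer, iterations=1000)  # loop sketched above

if __name__ == "__main__":
    with mp.Pool(processes=8) as pool:
        pool.map(train_one, range(8))  # one seed per (host, agent) pair
```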

To break out of this equilibrium, a proper algorithm should also include model selection, evaluating players across different pairs. This is not dealt with in the current experiment; the challenges include experiment design and GPU synchronization.
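One possible shape of such a selection step (purely illustrative, not what this experiment does), reusing the hypothetical `compute_rho` from above:

```python
# Illustrative cross-pair model selection: evaluate every host against every
# agent and keep the strongest of each. Assumes lower rho favors the host and
# higher rho favors the agent; flip the min/max if the convention is reversed.
def select_best(hosts, agents, compute_rho, num_games=100):
    table = [[compute_rho(h, a, num_games=num_games) for a in agents]
             for h in hosts]  # round-robin rho table over all 8 pairs
    host_scores = [sum(row) / len(row) for row in table]
    agent_scores = [sum(col) / len(col) for col in zip(*table)]
    best_host = hosts[min(range(len(hosts)), key=host_scores.__getitem__)]
    best_agent = agents[max(range(len(agents)), key=agent_scores.__getitem__)]
    return best_host, best_agent
```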

Observations: All the models seem to converge somewhere, but hosts are significantly weaker than agents. Agents show a certain amount of generalizability when matched against other strategies (including RandomHost), but hosts do not generalize at all.

Charts: the losses (host_net and agent_net) and the $\rho$ curves (host_net compared with RandomHost, and agent_net compared with RandomAgent, measured every 1000 steps)