Cluster missing executions #71

Closed
Luzifer opened this issue Feb 5, 2016 · 3 comments · Fixed by #73

Luzifer commented Feb 5, 2016

As an administrator of a dkron cluster, I want my users to have a reliable cron system that guarantees "at least once" execution of their cron tasks.

My setup:

Three AWS EC2 instances running CoreOS stable, all three part of an etcd cluster and all three running dkron 0.6.3 as a cluster. All machines have the role: executor tag and the following cron job:

{
    "name": "date",
    "schedule": "0 */5 * * * *",
    "command": "/usr/bin/date",
    "owner": "Knut Ahlers",
    "owner_email": "knut****.com",
    "run_as_user": "core",
    "success_count": 58,
    "error_count": 0,
    "last_success": "2016-02-05T10:45:00.00621402Z",
    "last_error": "0001-01-01T00:00:00Z",
    "disabled": false,
    "tags": {
        "role": "executor:1"
    }
}
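
(For context: as far as I understand, the trailing ":1" in the role tag tells dkron to run the job on at most one node matching that tag per execution, which is why only one of the three machines is picked each run. A rough sketch of that interpretation follows; this is not dkron's actual code and splitTag is a hypothetical helper.)

// Sketch only: how a job tag value like "executor:1" appears to be read.
// The part before the colon must match the node's tag value; the number
// after it caps how many matching nodes run the job.
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// splitTag is a hypothetical helper, not part of dkron.
func splitTag(v string) (value string, count int) {
	parts := strings.SplitN(v, ":", 2)
	if len(parts) == 2 {
		if n, err := strconv.Atoi(parts[1]); err == nil {
			return parts[0], n
		}
	}
	return v, 0 // 0 = no cap, run on every matching node
}

func main() {
	value, count := splitTag("executor:1")
	fmt.Printf("match nodes tagged role=%s, run on at most %d of them\n", value, count)
}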

Expected result:

The cron is executed every 5 minutes and will succeed every single time as it just calls /usr/bin/date (which is the location of the date utility on CoreOS).

Observed result:

The job is sent to one of the machines every 5 minutes and is eventually executed, but several executions are missing: they have no end date, no date process running on the machine, and a failed state.

[screenshot: screen shot 2016-02-05 at 11 43 02]

Question

Can you help me with this / tell me what I'm doing wrong? Is there anything I can do to prevent this?

@vcastellm (Member)

Weird one. Could you please enable the debug flag and send me the logs of the three nodes? Thanks!


Luzifer commented Feb 8, 2016

Found an interesting part in the logs: I did not have a firewall rule for that cluster allowing connections on port 6868 (RPC). After opening that port I now get reports for all runs.

However, jobs are still being assigned to a machine that has already left the cluster (status left instead of failed). In that case the job execution just vanishes and is never recorded in the interface…

Logs from the current leader after its takeover from the node which left the cluster at 13:37:

# journalctl -u dkron | grep -E '(246850c9e0a46d3182e2fb2ad24b0d884d2eca13|10-8-4-212)'
[...]
Feb 08 13:37:56 ip-10-8-3-141.eu-west-1.compute.internal bash[1199]: time="2016-02-08T13:37:56Z" level=debug msg="store: Retrieved leader from datastore" key=246850c9e0a46d3182e2fb2ad24b0d884d2eca13
Feb 08 13:37:56 ip-10-8-3-141.eu-west-1.compute.internal bash[1199]: time="2016-02-08T13:37:56Z" level=debug msg="store: Retrieved leader from datastore" key=246850c9e0a46d3182e2fb2ad24b0d884d2eca13
Feb 08 13:37:56 ip-10-8-3-141.eu-west-1.compute.internal bash[1199]: time="2016-02-08T13:37:56Z" level=debug msg="store: Leader Swap" new_leader=e3d721b9473b3cb6a9241f34ce7b0f16000ed331 old_leader=246850c9e0a46d3182e2fb2ad24b0d884d2eca13
Feb 08 13:50:00 ip-10-8-3-141.eu-west-1.compute.internal bash[1199]: time="2016-02-08T13:50:00Z" level=debug msg="agent: Filtered nodes to run: [ip-10-8-4-212.eu-west-1.compute.internal]"
Feb 08 13:55:00 ip-10-8-3-141.eu-west-1.compute.internal bash[1199]: time="2016-02-08T13:55:00Z" level=debug msg="agent: Filtered nodes to run: [ip-10-8-4-212.eu-west-1.compute.internal]"
Feb 08 14:15:00 ip-10-8-3-141.eu-west-1.compute.internal bash[1199]: time="2016-02-08T14:15:00Z" level=debug msg="agent: Filtered nodes to run: [ip-10-8-4-212.eu-west-1.compute.internal]"
Feb 08 15:00:00 ip-10-8-3-141.eu-west-1.compute.internal bash[1199]: time="2016-02-08T15:00:00Z" level=debug msg="agent: Filtered nodes to run: [ip-10-8-4-212.eu-west-1.compute.internal]"
Feb 08 15:10:00 ip-10-8-3-141.eu-west-1.compute.internal bash[1199]: time="2016-02-08T15:10:00Z" level=debug msg="agent: Filtered nodes to run: [ip-10-8-4-212.eu-west-1.compute.internal]"
[...]

Logs from 15:00:

# journalctl -u dkron | grep -E 'Feb 08 15:00'
Feb 08 15:00:00 ip-10-8-3-141.eu-west-1.compute.internal bash[1199]: time="2016-02-08T15:00:00Z" level=debug msg="scheduler: Run job" job=date schedule="0 */5 * * * *"
Feb 08 15:00:00 ip-10-8-3-141.eu-west-1.compute.internal bash[1199]: time="2016-02-08T15:00:00Z" level=debug msg="agent: Filtered nodes to run: [ip-10-8-4-211.eu-west-1.compute.internal]"
Feb 08 15:00:00 ip-10-8-3-141.eu-west-1.compute.internal bash[1199]: time="2016-02-08T15:00:00Z" level=debug msg="agent: Filtered tags to run: map[role:executor:1]"
Feb 08 15:00:00 ip-10-8-3-141.eu-west-1.compute.internal bash[1199]: time="2016-02-08T15:00:00Z" level=debug msg="agent: Sending query" job_name=date json="{\"job_name\":\"date\",\"started_at\":\"0001-01-01T00:00:00Z\",\"finished_at\":\"0001-01-01T00:00:00Z\",\"group\":1454943600000392357,\"job\":{\"name\":\"date\",\"schedule\":\"0 */5 * * * *\",\"command\":\"/usr/bin/date\",\"owner\":\"Knut Ahlers\",\"owner_email\":\"knut.ahlers@jimdo.com\",\"run_as_user\":\"core\",\"success_count\":1,\"error_count\":0,\"last_success\":\"2016-02-08T13:10:00.479231499Z\",\"last_error\":\"0001-01-01T00:00:00Z\",\"disabled\":false,\"tags\":{\"role\":\"executor:1\"}}}" query="run:job"
Feb 08 15:00:00 ip-10-8-3-141.eu-west-1.compute.internal bash[1199]: time="2016-02-08T15:00:00Z" level=debug msg="scheduler: Run job" job="cluster_health" schedule="0 */5 * * * *"
Feb 08 15:00:00 ip-10-8-3-141.eu-west-1.compute.internal bash[1199]: time="2016-02-08T15:00:00Z" level=debug msg="agent: Filtered nodes to run: [ip-10-8-4-212.eu-west-1.compute.internal]"
Feb 08 15:00:00 ip-10-8-3-141.eu-west-1.compute.internal bash[1199]: time="2016-02-08T15:00:00Z" level=debug msg="agent: Filtered tags to run: map[role:executor:1]"
Feb 08 15:00:00 ip-10-8-3-141.eu-west-1.compute.internal bash[1199]: time="2016-02-08T15:00:00Z" level=debug msg="agent: Sending query" job_name="cluster_health" json="{\"job_name\":\"cluster_health\",\"started_at\":\"0001-01-01T00:00:00Z\",\"finished_at\":\"0001-01-01T00:00:00Z\",\"group\":1454943600000550408,\"job\":{\"name\":\"cluster_health\",\"schedule\":\"0 */5 * * * *\",\"command\":\"/usr/bin/etcdctl cluster-health\",\"owner\":\"Knut Ahlers\",\"owner_email\":\"knut.ahlers@jimdo.com\",\"run_as_user\":\"core\",\"success_count\":0,\"error_count\":0,\"last_success\":\"0001-01-01T00:00:00Z\",\"last_error\":\"0001-01-01T00:00:00Z\",\"disabled\":false,\"tags\":{\"role\":\"executor:1\"}}}" query="run:job"
Feb 08 15:00:03 ip-10-8-3-141.eu-west-1.compute.internal bash[1199]: time="2016-02-08T15:00:03Z" level=debug msg="agent: Done receiving acks and responses" query="run:job"
Feb 08 15:00:03 ip-10-8-3-141.eu-west-1.compute.internal bash[1199]: time="2016-02-08T15:00:03Z" level=debug msg="agent: Done receiving acks and responses" query="run:job"
Feb 08 15:00:07 ip-10-8-3-141.eu-west-1.compute.internal bash[1199]: time="2016-02-08T15:00:07Z" level=debug msg="store: Retrieved leader from datastore" key=e3d721b9473b3cb6a9241f34ce7b0f16000ed331
Feb 08 15:00:07 ip-10-8-3-141.eu-west-1.compute.internal bash[1199]: time="2016-02-08T15:00:07Z" level=debug msg="store: Retrieved leader from datastore" key=e3d721b9473b3cb6a9241f34ce7b0f16000ed331
Feb 08 15:00:27 ip-10-8-3-141.eu-west-1.compute.internal bash[1199]: time="2016-02-08T15:00:27Z" level=debug msg="store: Retrieved leader from datastore" key=e3d721b9473b3cb6a9241f34ce7b0f16000ed331
Feb 08 15:00:27 ip-10-8-3-141.eu-west-1.compute.internal bash[1199]: time="2016-02-08T15:00:27Z" level=debug msg="store: Retrieved leader from datastore" key=e3d721b9473b3cb6a9241f34ce7b0f16000ed331
Feb 08 15:00:47 ip-10-8-3-141.eu-west-1.compute.internal bash[1199]: time="2016-02-08T15:00:47Z" level=debug msg="store: Retrieved leader from datastore" key=e3d721b9473b3cb6a9241f34ce7b0f16000ed331
Feb 08 15:00:47 ip-10-8-3-141.eu-west-1.compute.internal bash[1199]: time="2016-02-08T15:00:47Z" level=debug msg="store: Retrieved leader from datastore" key=e3d721b9473b3cb6a9241f34ce7b0f16000ed331

Both job executions vanished because they were sent to machines which had already left the cluster (replaced by the autoscaling group).

I think this happens because in agent.go:743 there is no check for the state (left / failed) of the serf member, and the leader still knows about former members of the cluster… (But this is only a quick assumption because I have not read all of your code yet!)
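
To make the idea concrete, here is a rough sketch of the check I have in mind. This is not your actual agent.go code, and jobTagMatches is just a hypothetical stand-in for the real tag filtering (it also ignores the ":count" suffix handling):

package main

import (
	"fmt"

	"github.com/hashicorp/serf/serf"
)

// jobTagMatches is a hypothetical helper, not dkron's tag filtering.
func jobTagMatches(jobTags, memberTags map[string]string) bool {
	for k, v := range jobTags {
		if memberTags[k] != v {
			return false
		}
	}
	return true
}

// filterNodes keeps only members that match the job's tags AND are still
// alive, i.e. the status check that seems to be missing around agent.go:743.
func filterNodes(members []serf.Member, jobTags map[string]string) []string {
	var nodes []string
	for _, m := range members {
		if m.Status != serf.StatusAlive {
			continue // skip members that have left or failed
		}
		if jobTagMatches(jobTags, m.Tags) {
			nodes = append(nodes, m.Name)
		}
	}
	return nodes
}

func main() {
	members := []serf.Member{
		{Name: "ip-10-8-3-141", Tags: map[string]string{"role": "executor"}, Status: serf.StatusAlive},
		{Name: "ip-10-8-4-212", Tags: map[string]string{"role": "executor"}, Status: serf.StatusLeft},
	}
	// Prints [ip-10-8-3-141]; the node that already left is never selected.
	fmt.Println(filterNodes(members, map[string]string{"role": "executor"}))
}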

@vcastellm (Member)

@Luzifer both things make sense.

allow connections on port 6868 (RPC)

This is necessary for the clients to send back execution reports of a reasonable size, as serf doesn't support large messages (yet). I'll add a note about this in the docs.
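
To illustrate why, here is a rough sketch (not dkron's actual code; sendViaRPC is a hypothetical placeholder): serf query responses are capped at a small size (1024 bytes by default in hashicorp/serf), so the full execution report has to travel over the separate RPC channel on port 6868.

package main

import (
	"fmt"

	"github.com/hashicorp/serf/serf"
)

// sendViaRPC is a hypothetical placeholder for shipping the full report
// to the leader over the RPC port (6868).
func sendViaRPC(report []byte) {
	fmt.Printf("sending %d bytes over RPC\n", len(report))
}

func main() {
	limit := serf.DefaultConfig().QueryResponseSizeLimit // 1024 bytes by default
	report := make([]byte, 4096)                          // pretend this is a job's captured output

	if len(report) > limit {
		// Too large for a serf query response, so it must go over RPC,
		// which is why port 6868 has to be reachable between the nodes.
		sendViaRPC(report)
	}
}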

Regarding the executions that vanished due to autoscaling: good catch, this is indeed a bug. I missed checking the member status when getting the cluster members.

Expect a hotfix release in the following days.

vcastellm pushed a commit that referenced this issue Feb 10, 2016
Currently we're including in the node filtering, nodes that could
potentially be gone.

fixes #71