etcdserver/ARM: starting etcd on crashes the first time, succeeds subsequently #2308

Closed
miekg opened this issue Feb 14, 2015 · 25 comments

Comments

@miekg

miekg commented Feb 14, 2015

Title says it all. First time -> crash, second time -> works (this is on a Raspberry Pi B+).

2015/02/14 07:47:55 no data-dir provided, using default data-dir ./pi0.etcd
2015/02/14 07:47:55 etcd: listening for peers on http://10.248.0.1:2380
2015/02/14 07:47:55 etcd: listening for client requests on http://localhost:2379
2015/02/14 07:47:55 etcd: listening for client requests on http://localhost:4001
2015/02/14 07:47:55 etcdserver: name = pi0
2015/02/14 07:47:55 etcdserver: data dir = pi0.etcd
2015/02/14 07:47:55 etcdserver: member dir = pi0.etcd/member
2015/02/14 07:47:55 etcdserver: heartbeat = 100ms
2015/02/14 07:47:55 etcdserver: election = 1000ms
2015/02/14 07:47:55 etcdserver: snapshot count = 10000
2015/02/14 07:47:55 etcdserver: advertise client URLs = http://localhost:2379,http://localhost:4001
2015/02/14 07:47:55 etcdserver: initial advertise peer URLs = http://10.248.0.1:2380
2015/02/14 07:47:55 etcdserver: initial cluster = pi0=http://10.248.0.1:2380,pi1=http://10.248.0.2:2380,pi2=http://10.248.0.3:2380
2015/02/14 07:47:55 etcdserver: start member 1dc862383bd186a1 in cluster 985788d45a3370a0
2015/02/14 07:47:55 raft: 1dc862383bd186a1 became follower at term 0
2015/02/14 07:47:55 raft: newRaft 1dc862383bd186a1 [peers: [], term: 0, commit: 0, applied: 0, lastindex: 0, lastterm: 0]
2015/02/14 07:47:55 raft: 1dc862383bd186a1 became follower at term 1
2015/02/14 07:47:55 etcdserver: added local member 1dc862383bd186a1 [http://10.248.0.1:2380] to cluster 985788d45a3370a0
panic: runtime error: invalid memory address or nil pointer dereference
[signal 0xb code=0x1 addr=0x4 pc=0xb9dd0]

goroutine 18 [running]:
sync/atomic.storeUint64(0x1073065c, 0x1, 0x0)
        /home/miek/upstream/go/src/sync/atomic/64bit_arm.go:20 +0x48
github.com/coreos/etcd/etcdserver.(*EtcdServer).apply(0x10730630, 0x107882c0, 0x3, 0x4, 0x1078a380, 0x0, 0x0, 0x0)
        /home/miek/g/src/github.com/coreos/etcd/etcdserver/server.go:696 +0x210
github.com/coreos/etcd/etcdserver.(*EtcdServer).run(0x10730630)
        /home/miek/g/src/github.com/coreos/etcd/etcdserver/server.go:398 +0xfd0
created by github.com/coreos/etcd/etcdserver.(*EtcdServer).start
        /home/miek/g/src/github.com/coreos/etcd/etcdserver/server.go:294 +0x2cc

goroutine 1 [runnable]:
github.com/coreos/etcd/etcdmain.Main()
        /home/miek/g/src/github.com/coreos/etcd/etcdmain/etcd.go:79 +0x56c
main.main()
        /home/miek/g/src/github.com/coreos/etcd/main.go:48 +0x250

goroutine 5 [syscall]:
os/signal.loop()
        /home/miek/upstream/go/src/os/signal/signal_unix.go:21 +0x1c
created by os/signal.init·1
        /home/miek/upstream/go/src/os/signal/signal_unix.go:27 +0x40

goroutine 19 [runnable]:
github.com/coreos/etcd/etcdserver.(*EtcdServer).Do(0x10730630, 0xb6eef6f8, 0x107a0180, 0xe70d0001, 0xa14b870e, 0x38e5a0, 0x3, 0x10732510, 0x26, 0x10710190, ...)
        /home/miek/g/src/github.com/coreos/etcd/etcdserver/server.go:488 +0x49c
github.com/coreos/etcd/etcdserver.(*EtcdServer).publish(0x10730630, 0x2a05f200, 0x1)
        /home/miek/g/src/github.com/coreos/etcd/etcdserver/server.go:650 +0x294
created by github.com/coreos/etcd/etcdserver.(*EtcdServer).Start
        /home/miek/g/src/github.com/coreos/etcd/etcdserver/server.go:275 +0x58

goroutine 20 [runnable]:
github.com/coreos/etcd/etcdserver.(*EtcdServer).purgeFile(0x10730630)
        /home/miek/g/src/github.com/coreos/etcd/etcdserver/server.go:305 +0x38c
created by github.com/coreos/etcd/etcdserver.(*EtcdServer).Start
        /home/miek/g/src/github.com/coreos/etcd/etcdserver/server.go:276 +0x80

goroutine 21 [IO wait]:
net.(*pollDesc).Wait(0x107370f8, 0x72, 0x0, 0x0)
        /home/miek/upstream/go/src/net/fd_poll_runtime.go:84 +0x3c
net.(*pollDesc).WaitRead(0x107370f8, 0x0, 0x0)
        /home/miek/upstream/go/src/net/fd_poll_runtime.go:89 +0x38
net.(*netFD).accept(0x107370c0, 0x0, 0xb6eedab0, 0x1073a638)
        /home/miek/upstream/go/src/net/fd_unix.go:419 +0x390
net.(*TCPListener).AcceptTCP(0x1071c620, 0x0, 0x0, 0x0)
        /home/miek/upstream/go/src/net/tcpsock_posix.go:234 +0x50
net.(*TCPListener).Accept(0x1071c620, 0x0, 0x0, 0x0, 0x0)
        /home/miek/upstream/go/src/net/tcpsock_posix.go:244 +0x3c
github.com/coreos/etcd/pkg/transport.(*rwTimeoutListener).Accept(0x1078a1c0, 0x0, 0x0, 0x0, 0x0)
        /home/miek/g/src/github.com/coreos/etcd/pkg/transport/timeout_listener.go:44 +0x64
net/http.(*Server).Serve(0x107a0280, 0xb6eef018, 0x1078a1c0, 0x0, 0x0)
        /home/miek/upstream/go/src/net/http/server.go:1728 +0x98
github.com/coreos/etcd/etcdmain.serveHTTP(0xb6eef018, 0x1078a1c0, 0xb6eef638, 0x1078a280, 0xd964b800, 0x45, 0x0, 0x0)
        /home/miek/g/src/github.com/coreos/etcd/etcdmain/http.go:36 +0x1b8
github.com/coreos/etcd/etcdmain.func·005(0xb6eef018, 0x1078a1c0)
        /home/miek/g/src/github.com/coreos/etcd/etcdmain/etcd.go:180 +0x58
created by github.com/coreos/etcd/etcdmain.startEtcd
        /home/miek/g/src/github.com/coreos/etcd/etcdmain/etcd.go:118 +0x17d8

goroutine 22 [IO wait]:
net.(*pollDesc).Wait(0x107371b8, 0x72, 0x0, 0x0)
        /home/miek/upstream/go/src/net/fd_poll_runtime.go:84 +0x3c
net.(*pollDesc).WaitRead(0x107371b8, 0x0, 0x0)
        /home/miek/upstream/go/src/net/fd_poll_runtime.go:89 +0x38
net.(*netFD).accept(0x10737180, 0x0, 0xb6eedab0, 0x1073a650)
        /home/miek/upstream/go/src/net/fd_unix.go:419 +0x390
net.(*TCPListener).AcceptTCP(0x1071c6f0, 0x0, 0x0, 0x0)
        /home/miek/upstream/go/src/net/tcpsock_posix.go:234 +0x50
net.(*TCPListener).Accept(0x1071c6f0, 0x0, 0x0, 0x0, 0x0)
        /home/miek/upstream/go/src/net/tcpsock_posix.go:244 +0x3c
github.com/coreos/etcd/pkg/transport.(*keepaliveListener).Accept(0x1071c6f8, 0x0, 0x0, 0x0, 0x0)
        /home/miek/g/src/github.com/coreos/etcd/pkg/transport/keepalive_listener.go:48 +0x64
net/http.(*Server).Serve(0x107a02c0, 0xb6eef038, 0x1071c6f8, 0x0, 0x0)
        /home/miek/upstream/go/src/net/http/server.go:1728 +0x98
github.com/coreos/etcd/etcdmain.serveHTTP(0xb6eef038, 0x1071c6f8, 0xb6eef720, 0x1073a450, 0x0, 0x0, 0x0, 0x0)
        /home/miek/g/src/github.com/coreos/etcd/etcdmain/http.go:36 +0x1b8
github.com/coreos/etcd/etcdmain.func·006(0xb6eef038, 0x1071c6f8)
        /home/miek/g/src/github.com/coreos/etcd/etcdmain/etcd.go:188 +0x8c
created by github.com/coreos/etcd/etcdmain.startEtcd
        /home/miek/g/src/github.com/coreos/etcd/etcdmain/etcd.go:118 +0x18e0

goroutine 23 [IO wait]:
net.(*pollDesc).Wait(0x10737278, 0x72, 0x0, 0x0)
        /home/miek/upstream/go/src/net/fd_poll_runtime.go:84 +0x3c
net.(*pollDesc).WaitRead(0x10737278, 0x0, 0x0)
        /home/miek/upstream/go/src/net/fd_poll_runtime.go:89 +0x38
net.(*netFD).accept(0x10737240, 0x0, 0xb6eedab0, 0x1073a6b8)
        /home/miek/upstream/go/src/net/fd_unix.go:419 +0x390
net.(*TCPListener).AcceptTCP(0x1071c740, 0x0, 0x0, 0x0)
        /home/miek/upstream/go/src/net/tcpsock_posix.go:234 +0x50
net.(*TCPListener).Accept(0x1071c740, 0x0, 0x0, 0x0, 0x0)
        /home/miek/upstream/go/src/net/tcpsock_posix.go:244 +0x3c
github.com/coreos/etcd/pkg/transport.(*keepaliveListener).Accept(0x1071c748, 0x0, 0x0, 0x0, 0x0)
        /home/miek/g/src/github.com/coreos/etcd/pkg/transport/keepalive_listener.go:48 +0x64
net/http.(*Server).Serve(0x107a0300, 0xb6eef038, 0x1071c748, 0x0, 0x0)
        /home/miek/upstream/go/src/net/http/server.go:1728 +0x98
github.com/coreos/etcd/etcdmain.serveHTTP(0xb6eef038, 0x1071c748, 0xb6eef720, 0x1073a450, 0x0, 0x0, 0x0, 0x0)
        /home/miek/g/src/github.com/coreos/etcd/etcdmain/http.go:36 +0x1b8
github.com/coreos/etcd/etcdmain.func·006(0xb6eef038, 0x1071c748)
        /home/miek/g/src/github.com/coreos/etcd/etcdmain/etcd.go:188 +0x8c
created by github.com/coreos/etcd/etcdmain.startEtcd
        /home/miek/g/src/github.com/coreos/etcd/etcdmain/etcd.go:118 +0x18e0

goroutine 24 [chan receive]:
github.com/coreos/etcd/pkg/osutil.func·001()
        /home/miek/g/src/github.com/coreos/etcd/pkg/osutil/osutil.go:66 +0x50
created by github.com/coreos/etcd/pkg/osutil.HandleInterrupts
        /home/miek/g/src/github.com/coreos/etcd/pkg/osutil/osutil.go:82 +0x1d4

goroutine 25 [runnable]:
github.com/coreos/etcd/pkg/fileutil.func·001()
        /home/miek/g/src/github.com/coreos/etcd/pkg/fileutil/purge.go:69 +0x5d0
created by github.com/coreos/etcd/pkg/fileutil.PurgeFile
        /home/miek/g/src/github.com/coreos/etcd/pkg/fileutil/purge.go:75 +0x26c

goroutine 26 [runnable]:
github.com/coreos/etcd/pkg/fileutil.func·001()
        /home/miek/g/src/github.com/coreos/etcd/pkg/fileutil/purge.go:69 +0x5d0
created by github.com/coreos/etcd/pkg/fileutil.PurgeFile
        /home/miek/g/src/github.com/coreos/etcd/pkg/fileutil/purge.go:75 +0x26c
@xiang90
Contributor

xiang90 commented Feb 14, 2015

@miekg Can you reliably reproduce this? Is it a transient bug?

@miekg
Author

miekg commented Feb 14, 2015

Yes. Deleting the datadir re-triggers this.

@xiang90 xiang90 changed the title Starting etcd on arm crashes the first time, succeeds subsequently etcdserver/ARM: starting etcd on crashes the first time, succeeds subsequently Feb 14, 2015
@xiang90
Contributor

xiang90 commented Feb 14, 2015

> Yes. Deleting the datadir re-triggers this.

I tend to think this is an ARM-specific problem; I cannot reproduce it on x86.
I do not have an ARM environment available. It should be easy to debug by adding a few checks around the panic line.
If you have time, can you help fix it? We have probably made some bad assumptions or stupid mistakes, since it can be reliably reproduced in your environment.

@miekg
Author

miekg commented Feb 14, 2015

From some prints, the error actually seems to be in

/home/miek/upstream/go/src/sync/atomic/64bit_arm.go:20 +0x48

Which makes it a wider Go (arm) problem.

@miekg
Author

miekg commented Feb 14, 2015

This atomic.StoreUint64 is fishy on ARM (Pi B+). If I just use plain assignment instead, i.e.

diff --git a/etcdserver/server.go b/etcdserver/server.go
index 0249506..569de00 100644
--- a/etcdserver/server.go
+++ b/etcdserver/server.go
@@ -693,8 +693,10 @@ func (s *EtcdServer) apply(es []raftpb.Entry, confState *raftpb
                default:
                        log.Panicf("entry type should be either EntryNormal or Entry
                }
-               atomic.StoreUint64(&s.r.index, e.Index)
-               atomic.StoreUint64(&s.r.term, e.Term)
+               s.r.index = e.Index
+               //atomic.StoreUint64(&s.r.index, e.Index)
+               s.r.term = e.Term
+               //atomic.StoreUint64(&s.r.term, e.Term)
                applied = e.Index
        }
        return applied, shouldstop

It crashes on

goroutine 18 [running]:
sync/atomic.storeUint64(0x1070a66c, 0x0, 0x0)
    /home/miek/upstream/go/src/sync/atomic/64bit_arm.go:20 +0x48
github.com/coreos/etcd/etcdserver.(*EtcdServer).run(0x1070a630)
    /home/miek/g/src/github.com/coreos/etcd/etcdserver/server.go:359 +0x378
created by github.com/coreos/etcd/etcdserver.(*EtcdServer).start
    /home/miek/g/src/github.com/coreos/etcd/etcdserver/server.go:294 +0x2cc

which is another storeUint64

@miekg
Author

miekg commented Feb 14, 2015

OK, as a starting point: the following just works.

package main

import "sync/atomic"

type X struct {
    index uint64
}

func main() {
    x := new(X)
    println(x.index)
    atomic.StoreUint64(&x.index, 1)
    println(x.index)
}

@miekg
Author

miekg commented Feb 14, 2015

Maybe it is this bug (from the sync/atomic docs):

On both ARM and x86-32, it is the caller's responsibility to arrange for
64-bit alignment of 64-bit words accessed atomically. The first word in a
global variable or in an allocated struct or slice can be relied upon to
be 64-bit aligned.

@davecheney

This is one problem, but looking at the original post there may be two.

The first is an honest-to-god nil pointer dereference: something is passing 0x4 as the address of the uint64 inside the struct, which is nil plus the offset of the field.

The second problem is that, while the runtime guarantees that heap-allocated structures are always aligned on 8-byte boundaries, if you have something like this

type T struct {
       p *Something
       val uint64
...
}

val will be offset 4 bytes from the start of the structure on 32-bit platforms. The 32-bit compilers do not insert the correct padding here. There are a number of ways to tackle this; the simplest is to move val to the top of the structure so it will be 8-byte aligned. You could also use build tags to provide a properly padded version of the struct depending on the platform.

@miekg
Author

miekg commented Feb 15, 2015

thanks @davecheney

This patch makes the crash go away (it fixes the alignment).
(Still wondering about that nil pointer, though.)

diff --git a/etcdserver/raft.go b/etcdserver/raft.go
index 0d7d5d0..b533078 100644
--- a/etcdserver/raft.go
+++ b/etcdserver/raft.go
@@ -37,6 +37,11 @@ type RaftTimer interface {
 }

 type raftNode struct {
+       // Cache of the latest raft index and raft term the server has seen
+       index uint64
+       term  uint64
+       lead  uint64
+
        raft.Node

        // config
@@ -51,11 +56,6 @@ type raftNode struct {
        // clients should timeout and reissue their messages.
        // If transport is nil, server will panic.
        transport rafthttp.Transporter
-
-       // Cache of the latest raft index and raft term the server has seen
-       index uint64
-       term  uint64
-       lead  uint64
 }

 // for testing
diff --git a/etcdserver/server.go b/etcdserver/server.go
index 0249506..6b7a9d2 100644
--- a/etcdserver/server.go
+++ b/etcdserver/server.go
@@ -113,10 +113,9 @@ type Server interface {

 // EtcdServer is the production implementation of the Server interface
 type EtcdServer struct {
+       r   raftNode
        cfg *ServerConfig

-       r raftNode
-
        w          wait.Wait
        stop       chan struct{}
        done       chan struct{}
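
To illustrate why both moves in the patch matter, here is a sketch with simplified stand-ins (`node` and `server` are hypothetical, heavily trimmed versions of raftNode and EtcdServer, and `counterOffsets` is a made-up helper). The counters are only at a known 8-byte-aligned position if they come first in the inner struct and the inner struct comes first in the outer one:

```go
package main

import (
	"fmt"
	"unsafe"
)

// node is a hypothetical, trimmed stand-in for raftNode after the patch:
// the atomically accessed counters are the first fields.
type node struct {
	index uint64
	term  uint64
	lead  uint64
	name  string // placeholder for the remaining fields
}

// server is a hypothetical, trimmed stand-in for EtcdServer after the patch:
// the node value is the first field.
type server struct {
	r   node
	cfg *int // placeholder for *ServerConfig
}

// counterOffsets returns the byte offsets of index, term, and lead measured
// from the start of a server value.
func counterOffsets() [3]uintptr {
	base := unsafe.Offsetof(server{}.r)
	return [3]uintptr{
		base + unsafe.Offsetof(node{}.index),
		base + unsafe.Offsetof(node{}.term),
		base + unsafe.Offsetof(node{}.lead),
	}
}

func main() {
	for i, off := range counterOffsets() {
		fmt.Printf("counter %d: offset %d, multiple of 8: %v\n", i, off, off%8 == 0)
	}
}
```

With this layout the three offsets are 0, 8, and 16 regardless of pointer size, so the counters stay 64-bit aligned whenever the enclosing server value itself is.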

@davecheney

LGTM. What is the definition of raft.Node?

@xiang90
Contributor

xiang90 commented Feb 15, 2015

@miekg I do not think you are 100% safe by changing only this.

We are not in control of all the dependencies, and we are not sure about the other sub-packages in etcd. We probably need more effort before we can say we support 32-bit or ARM well enough.

@miekg
Author

miekg commented Feb 15, 2015

@xiang90 Ack. At least I can play some more now :)

etcdserver/raft.go has the def (note this is my changed struct)

type raftNode struct {
    // Cache of the latest raft index and raft term the server has seen
    index uint64
    term  uint64
    lead  uint64

    raft.Node

    // config
    snapCount uint64 // number of entries to trigger a snapshot

    // utility
    ticker      <-chan time.Time
    raftStorage *raft.MemoryStorage
    storage     Storage
    // transport specifies the transport to send and receive msgs to members.
    // Sending messages MUST NOT block. It is okay to drop messages, since
    // clients should timeout and reissue their messages.
    // If transport is nil, server will panic.
    transport rafthttp.Transporter
}

@Audumla

Audumla commented Mar 4, 2015

I just made the above changes and this is now working on an odroid u3 as well. It would be great to have this working out of the box though.

@miekg
Author

miekg commented Mar 4, 2015

[ Quoting notifications@github.com in "Re: [etcd] etcdserver/ARM: starting..." ]

I just made the above changes and this is now working on an odroid u3 as well. It would be great to have this working out of the box though.

Note that the same bug is triggered on 32-bit Intel.

(My lousy virtual machine, which I sometimes use for development, is only 32 bits.)

@AlexeyRaga

@Audumla do you have a repo/changeset I can build from for odroid?

@Audumla

Audumla commented Mar 12, 2015

I did not create one for this; I just made the change to the file mentioned above and rebuilt from a checkout of git master. It just worked!
If I get a chance I'll make a branch, but that might be weeks away due to time constraints.

@hh

hh commented Apr 9, 2015

I put this into a changeset and rebuilt:

https://github.com/hh/etcd/tree/32bit

I had to start with a clean data-dir, but it seems to be working.

hh@3060f86

@ajazam

ajazam commented Apr 11, 2015

@hh I've used your version of the source code with Go 1.4.2 (built from source), and I'm getting

etcdserver: publish error: etcdserver: request timed out

My Go skills are nonexistent, otherwise I would have a look at the code.

I'm building and running it on a Raspberry Pi 2 with the Hypriot image (http://blog.hypriot.com/post/hypriotos-back-again-with-docker-on-arm/)

@ajazam

ajazam commented Apr 11, 2015

I've found some patches described at http://mkaczanowski.com/building-arm-cluster-part-3-docker-fleet-etcd-distribute-containers/#install_docker that make etcd 2.0.4 run on a Raspberry Pi. Is it possible to merge these changes? They are the same as @hh's, but for some reason etcd works with those patches. Unless I'm being a noob.

@matiwinnetou

+1 for patch on raspberry pi (ARM)

@matiwinnetou

Getting the exact same error as @ajazam, also on a Raspberry Pi 2 with Go 1.4.2.

@miekg
Author

miekg commented May 24, 2015

@luxas
Contributor

luxas commented Jun 6, 2015

Applied the patches and etcd works fine for now.
+1 for 32-bit branch or something until etcd is completely cross-platform

Now I can move on to the next problem...
I'm building a "cloud-in-a-box" with Raspberry Pi 2 and Kubernetes, which is not the easiest thing :)

@rcarmo

rcarmo commented Jun 19, 2015

I've come across this as well and can confirm that the patches from @mkaczanowski work. Changing struct alignment seems to be the key here, and I now have etcd running happily on both a Raspberry Pi 2 and an ODROID-U3 (both running Ubuntu 14.04 armhf).

It might be worthwhile applying these on an ARM-related branch (just for archival purposes) before merging...

@xiang90
Contributor

xiang90 commented Aug 9, 2015

Fixed via #3249
