
apk fetch hangs #307

Closed
dkirrane opened this issue Jul 17, 2017 · 39 comments

@dkirrane

Fetching the apk index just hangs. I've now hit this on an Ubuntu server and on Docker for Windows.

Step 1/14 : FROM maven:3.3.9-jdk-8-alpine
 ---> dd9d4e1cd9db
Step 2/14 : RUN apk update && apk upgrade       && apk add --no-cache --update  ca-certificates         bash    wget    curl    tree    libxml2-utils   putty   git     && rm -rf /var/lib/apt/lists/*     && rm -rf /var/cache/apk/*
 ---> Running in 536cbd484c36
fetch http://dl-cdn.alpinelinux.org/alpine/v3.5/main/x86_64/APKINDEX.tar.gz

Docker version 17.03.1-ce, build c6d412e

@andyshinn
Contributor

Are you able to start another container shell and curl dl-cdn.alpinelinux.org? Sounds like a networking issue somewhere.

@dkirrane
Author

dkirrane commented Jul 24, 2017

Seems like a DNS issue. Not sure why; I've set the correct DNS settings in %programdata%\docker\config\daemon.json
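For reference, a DNS override in daemon.json looks roughly like this (a sketch; the Google DNS servers here are placeholders, substitute your own):

{
    "dns": ["8.8.8.8", "8.8.4.4"]
}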

nslookup dl-cdn.alpinelinux.org
nslookup: can't resolve '(null)': Name does not resolve

Name:      dl-cdn.alpinelinux.org
Address 1: 151.101.48.249

Got around this by using HTTPS:

RUN sed -i 's/http\:\/\/dl-cdn.alpinelinux.org/https\:\/\/alpine.global.ssl.fastly.net/g' /etc/apk/repositories

@charlescanato

Doesn't seem like a DNS issue, since the name has resolved. Unfortunately, for a while now I've been at the same point: names are resolved, but I can't connect to anything.

@andyshinn
Contributor

In another issue I thought this could be a DNS issue, because the CDN POP IP addresses may change more frequently. If DNS is being cached somewhere and the TTL is not honored, then an outdated IP address may be returned for dl-cdn.alpinelinux.org.

This is why I need debugging information to help pinpoint it. When the issue happens, I need the output of curl -v -s http://dl-cdn.alpinelinux.org/alpine/v3.5/main/x86_64/APKINDEX.tar.gz > /dev/null from inside a container and from the host, so we can compare.

@kachkaev

kachkaev commented Oct 14, 2017

Also facing this from time to time. Here's typical GitLab CI output when the fetch fails:

[screenshot: GitLab CI log of the hanging fetch, 2017-10-14]

Manually stopping and retrying the stuck CI job helps, but there's no guarantee of reliability.

@antonmarin

antonmarin commented Dec 4, 2017

/ # curl -v -s http://dl-cdn.alpinelinux.org/alpine/v3.5/main/x86_64/APKINDEX.tar.gz > /dev/null
*   Trying 151.101.84.249...
* TCP_NODELAY set
* Connected to dl-cdn.alpinelinux.org (151.101.84.249) port 80 (#0)
> GET /alpine/v3.5/main/x86_64/APKINDEX.tar.gz HTTP/1.1
> Host: dl-cdn.alpinelinux.org
> User-Agent: curl/7.56.1
> Accept: */*
> 
< HTTP/1.1 200 OK
< Server: nginx
< Content-Type: application/octet-stream
< Last-Modified: Mon, 04 Dec 2017 09:08:18 GMT
< ETag: "5a251082-b3195"
< Accept-Ranges: bytes
< Content-Length: 733589
< Accept-Ranges: bytes
< Date: Mon, 04 Dec 2017 16:03:17 GMT
< Via: 1.1 varnish
< Connection: keep-alive
< X-Served-By: cache-bma7035-BMA
< X-Cache: MISS
< X-Cache-Hits: 0
< X-Timer: S1512403397.178165,VS0,VE64
< 
{ [8688 bytes data]
* Connection #0 to host dl-cdn.alpinelinux.org left intact
/ # 

The curl succeeds while another process hangs:

Step 3/12 : RUN apk add --no-cache --update     icu-dev     libxml2-dev     openldap-dev     php7-xdebug     && mv /usr/lib/php7/modules/xdebug.so /usr/local/lib/php/extensions/no-debug-non-zts-20160303     && rm -f /etc/php7/conf.d/xdebug.ini     && docker-php-ext-install     intl     ldap     soap     && curl -sS https://getcomposer.org/installer | php -- --install-dir=/usr/local/bin --filename=composer
 ---> Running in 1e9ed33cd4b3
fetch http://dl-cdn.alpinelinux.org/alpine/v3.4/main/x86_64/APKINDEX.tar.gz

One more, with the matching URL:

/ # curl -v -s http://dl-cdn.alpinelinux.org/alpine/v3.4/main/x86_64/APKINDEX.tar.gz > /dev/null
*   Trying 151.101.84.249...
* TCP_NODELAY set
* Connected to dl-cdn.alpinelinux.org (151.101.84.249) port 80 (#0)
> GET /alpine/v3.4/main/x86_64/APKINDEX.tar.gz HTTP/1.1
> Host: dl-cdn.alpinelinux.org
> User-Agent: curl/7.56.1
> Accept: */*
> 
< HTTP/1.1 200 OK
< Server: nginx
< Content-Type: application/octet-stream
< Last-Modified: Thu, 23 Nov 2017 09:53:36 GMT
< ETag: "5a169aa0-a53fe"
< Accept-Ranges: bytes
< Content-Length: 676862
< Accept-Ranges: bytes
< Date: Mon, 04 Dec 2017 16:06:47 GMT
< Via: 1.1 varnish
< Connection: keep-alive
< X-Served-By: cache-bma7022-BMA
< X-Cache: MISS
< X-Cache-Hits: 0
< X-Timer: S1512403608.785535,VS0,VE75
< 
{ [2896 bytes data]
* Connection #0 to host dl-cdn.alpinelinux.org left intact
/ # 

@andremarianiello

andremarianiello commented Dec 22, 2017

I ran into this issue in Kubernetes. I bounced the kube-dns pod to flush any records it might be caching. This fixed the problem for me.

EDIT: Actually it didn't. Still having the problem.
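For reference, bouncing kube-dns typically looks like this (a sketch; the label selector varies by cluster):

kubectl -n kube-system delete pod -l k8s-app=kube-dns

The deployment recreates the pod with an empty cache.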

@unkaktus

Are you running Docker-in-Docker by any chance? I have this issue only in dind containers.

@andremarianiello

I only have this with dind + Kubernetes. However, it doesn't happen if I use '--network host' or '--net host'. I am using the Weave overlay network.

@unkaktus

unkaktus commented Jan 12, 2018

@andremarianiello thanks for the info, I am also using dind + Kubernetes (though with flannel). Have you enabled hostNetwork: true for the dind pods?

@andremarianiello

@nogoegst No I haven't. It worked without doing that.

@unkaktus

@andremarianiello so you mean you set --network host for the dockerized Docker daemon? Where did you set it?

@andremarianiello

I added it to my docker client commands, e.g. 'docker build --network host ...'

@neclimdul

I've seen the k8s issue quite a bit. Wireshark shows Fastly getting stuck sending oversized packets with a do-not-fragment flag. I don't think this is OP's issue, though, as theirs is Docker for Windows.

I recently started running into a similar issue as well. On Linux, but the behavior is the same: apk fails fetching, mostly on the index. Again, I pulled up Wireshark and recreated the problem. I see things going smoothly, then the apk process seems to stop ACK'ing segments from the Fastly server. Fastly starts throttling and resending segments, and it lags out.

I've never recreated this with curl, but it looks like apk uses a built-in BSD libfetch for its HTTP communications, so maybe there's a bug in there?

My understanding of network communication is just enough to get me this far, so here's a link to the Wireshark log of the communications. Hopefully an Alpine dev has a better understanding and can parse out a clue or find the problem.

@ncopa
Collaborator

ncopa commented May 14, 2018

It seems like Fastly is filtering the ICMP need-to-frag packets, which means that PMTU discovery does not work. This can be a problem if your traffic goes via a network link with an MTU lower than 1500 (typically tunnels/VPNs, PPPoE and similar). This can be worked around by enabling TCP MSS clamping in the network.
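For reference, MSS clamping on a Linux gateway is usually a single iptables rule like the following (a sketch; which table and chain to use depends on your setup):

# rewrite the MSS option on TCP SYNs so both ends pick a segment size
# that fits the smallest link MTU, sidestepping the blocked PMTU discovery
iptables -t mangle -A FORWARD -p tcp --tcp-flags SYN,RST SYN -j TCPMSS --clamp-mss-to-pmtu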

@neclimdul

Yeah, I was treating this as a different issue because it has slightly different characteristics and is not the same as #279.

The Wireshark link in #307 (comment) shows a different traffic behavior. Instead of the traffic getting killed at the bridge, it is never ACK'd by libfetch, and Fastly's TCP session gets stuck trying to recover. I don't know if it's even Fastly's fault, as on the surface it seems to be doing the right thing.

... misc traffic 
ack:                   container -> bridge -> fastly
Transmission:          container <- bridge <- fastly
Transmission:          container <- bridge <- fastly
ack:                   container -> bridge -> fastly
Transmission:          container <- bridge <- fastly
Transmission:          container <- bridge <- fastly
ack:                   container -> bridge -> fastly
Transmission:          container <- bridge <- fastly 1
Transmission:          container <- bridge <- fastly 2
Transmission:          container <- bridge <- fastly 3
Transmission:          container <- bridge <- fastly 4
Transmission:          container <- bridge <- fastly 5
... some number of other packets
Transmission:          container <- bridge <- fastly X
Transmission:          container <- bridge <- fastly 1
Transmission:          container <- bridge <- fastly 1
Transmission:          container <- bridge <- fastly 1
Transmission:          container <- bridge <- fastly 1
Transmission:          container <- bridge <- fastly 1
Transmission:          container <- bridge <- fastly 1
Transmission:          container <- bridge <- fastly 1
Transmission:          container <- bridge <- fastly 1
.... repeat

@neclimdul

Observation: networking is hard.

@evanrich

evanrich commented Oct 5, 2018

I added it to my docker client commands, e.g. 'docker build --network host ...'

Where exactly did you put this? I'm facing this on Kubernetes right now, where GitLab spins up a container with Docker running in Docker... it's been driving me nuts for the last 4 hours.

@andremarianiello

@evanrich My GitLab CI was using docker:dind as a service container, and my main build container had a Docker client in it which I used to connect to the service container. My repo has a Dockerfile in it that needs to be built by the GitLab runner. My .gitlab-ci.yaml file contained the command

docker build .

This builds my Docker image. One of the layers in the Dockerfile runs apk update. This command hangs, causing the docker build command and the CI job as a whole to fail.
However, if I modify my .gitlab-ci.yaml file to have

docker build --network host .

docker will run the apk update command from my Dockerfile without hanging.

@ncopa
Collaborator

ncopa commented Oct 5, 2018

I believe that the problem is that in Docker the MTU is lower than on the host. The way this is supposed to work is via path MTU discovery, but Fastly appears to block the PMTU ICMP packets (I guess it is part of their DDoS defence). The way to "fix" this properly is to enable MSS clamping on the host.
https://blog.ipspace.net/2013/01/tcp-mss-clamping-what-is-it-and-why-do.html

The other alternative is to use a different mirror that does not block the PMTU traffic.

@ncopa ncopa closed this as completed Oct 5, 2018
@andremarianiello

@ncopa How can we check to see if our docker mtu is lower than our host mtu?
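One way to compare is to check the interface MTUs on both sides (a sketch, not from this thread; interface names vary by setup):

# MTU of the host uplink
ip link show eth0

# MTU inside a container
docker run --rm alpine ip link show eth0

If the container interface advertises a larger MTU than the narrowest link on the path, large packets depend on PMTU discovery, which is exactly what Fastly appears to block.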

@evanrich

evanrich commented Oct 5, 2018

@evanrich My GitLab CI was using docker:dind as a service container, and my main build container had a Docker client in it which I used to connect to the service container. My repo has a Dockerfile in it that needs to be built by the GitLab runner. My .gitlab-ci.yaml file contained the command

docker build .

This builds my Docker image. One of the layers in the Dockerfile runs apk update. This command hangs, causing the docker build command and the CI job as a whole to fail.
However, if I modify my .gitlab-ci.yaml file to have

docker build --network host .

docker will run the apk update command from my Dockerfile without hanging.

Are you not using Auto DevOps? I haven't specified a .gitlab-ci.yml file yet. I seem to have worked around part of it by switching to alpine.global.ssl.fastly.net, but I get this:

Status: Downloaded newer image for golang:alpine
 ---> 95ec94706ff6
Step 2/13 : RUN sed -i 's/http\:\/\/dl-cdn.alpinelinux.org/https\:\/\/alpine.global.ssl.fastly.net/g' /etc/apk/repositories
 ---> Running in a3de349b32f8
Removing intermediate container a3de349b32f8
 ---> 39505fc0c5f2
Step 3/13 : RUN apk update;     apk add git gcc build-base;     go get -v github.com/cloudflare/cloudflared/cmd/cloudflared
 ---> Running in 548789a2500b
fetch https://alpine.global.ssl.fastly.net/alpine/v3.8/main/x86_64/APKINDEX.tar.gz
fetch https://alpine.global.ssl.fastly.net/alpine/v3.8/community/x86_64/APKINDEX.tar.gz
v3.8.1-22-g24d67bab3a [https://alpine.global.ssl.fastly.net/alpine/v3.8/main]
v3.8.1-16-g96e1e57fed [https://alpine.global.ssl.fastly.net/alpine/v3.8/community]
OK: 9539 distinct packages available
(1/25) Installing binutils (2.30-r5)

and it just hangs at installing binutils every time. Found this: #279. Seems to be a widespread issue in k8s due to the lower MTU.

I was able to get slightly further by changing my mirror from a Fastly mirror to mirror.clarkson.edu using
RUN sed -i 's/http\:\/\/dl-cdn.alpinelinux.org/http\:\/\/mirror.clarkson.edu/g' /etc/apk/repositories

builds are running, will update when they finish.

Edit: Just finished successfully... build 174 (that's how many attempts it took to get this working).

Removing intermediate container 5c42267a84e9
 ---> 339cedacd0cf
Step 12/13 : EXPOSE 54/udp
 ---> Running in 8308f4f1cb00
Removing intermediate container 8308f4f1cb00
 ---> b917125f9e41
Step 13/13 : EXPOSE 34411/tcp
 ---> Running in 5d3115c32a0f
Removing intermediate container 5d3115c32a0f
 ---> 33616623b643
Successfully built 33616623b643
Successfully tagged registry.evanrichardsonphotography.com/docker/cloudflared/master:a66a757bee6a6de2276ed4a8d3a8de121efc8705
Pushing to GitLab Container Registry...
The push refers to repository [registry.evanrichardsonphotography.com/docker/cloudflared/master]
75ddfc9ca656: Preparing
ff665015151e: Preparing
434f9e907dc9: Preparing
e834c1681702: Preparing
676adc5a23cc: Preparing
e834c1681702: Layer already exists
676adc5a23cc: Layer already exists
434f9e907dc9: Pushed
ff665015151e: Pushed
75ddfc9ca656: Pushed
a66a757bee6a6de2276ed4a8d3a8de121efc8705: digest: sha256:75efdf757e24da3a27a3674f49508e9f85d0d115e921231ae52835f56a28e1b7 size: 1368

Job succeeded

@unkaktus

unkaktus commented Oct 5, 2018

On Kubernetes one should run these containers with hostNetwork: true.
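A pod spec with that setting looks roughly like this (a minimal sketch; the pod name and image tag are placeholders, and dind additionally needs privileged mode):

apiVersion: v1
kind: Pod
metadata:
  name: dind-build
spec:
  hostNetwork: true        # share the host's network namespace, so the host MTU applies
  containers:
    - name: dind
      image: docker:dind
      securityContext:
        privileged: true   # required for docker-in-docker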

yussufsh pushed a commit to yussufsh/deploy-ibm-cloud-private that referenced this issue Feb 1, 2019
Made this small change on the playbook to include the `--network host` parameter on the docker build command, to avoid apk hanging when trying to fetch while building the Alpine Linux image, as described in [docker-alpine GitHub issue 307](gliderlabs/docker-alpine#307).
ikaruswill added a commit to ikaruswill/drone-yamls that referenced this issue Jan 22, 2020
GeraldWodni added a commit to GeraldWodni/kern.js that referenced this issue Feb 19, 2020
@Stark-X

Stark-X commented May 22, 2020

Hi, is anyone here still suffering from this issue? It seems to have gone away somehow.
I can't reproduce the hang now.

@smnbbrv

smnbbrv commented May 22, 2020

20 days ago it was still present, see above

@Stark-X

Stark-X commented May 22, 2020

20 days ago it was still present, see above

It was still present last week; what about the last 2 days? Could you give it a try 👀

@smnbbrv

smnbbrv commented May 24, 2020

I tried, and it worked for those two days. However, it hangs again now.

@akhfa

akhfa commented May 27, 2020

If you came here from Drone CI and its Docker plugin, set an MTU that fits your network in the plugin's settings. It could save you some hours of debugging and desperate attempts:

kind: pipeline
type: kubernetes
name: default

steps:
  - name: dockerize
    image: plugins/docker
    settings:
      ...
      mtu: 1000

Hi, thanks for this. It helped my build. I remember finding an article about MTU that may give some useful background:
https://medium.com/@liejuntao001/fix-docker-in-docker-network-issue-in-kubernetes-cc18c229d9e5

@slaecker

@smnbbrv Thanks a lot for the MTU hint, now Drone is finally building ...

@shaopeng-lin

shaopeng-lin commented Jun 12, 2020

For me apk fetch hangs indefinitely, and it's not because of the MTU but because of a network glitch.

My Alpine and apk versions:
~# cat /etc/os-release
NAME="Alpine Linux"
ID=alpine
VERSION_ID=3.11.3
PRETTY_NAME="Alpine Linux v3.11"
HOME_URL="https://alpinelinux.org/"
BUG_REPORT_URL="https://bugs.alpinelinux.org/"

~# apk --version
apk-tools 2.10.4, compiled for x86_64.

I can reproduce the issue by simply shutting down my Ethernet interface while apk fetch is downloading packages.

~# apk fetch -R linux-lts
Downloading linux-firmware-sun-20191215-r0
Downloading linux-firmware-microchip-20191215-r0
Downloading linux-firmware-rtl_nic-20191215-r0
Downloading xz-libs-5.2.4-r0
Downloading linux-firmware-keyspan-20191215-r0
Downloading linux-firmware-mwlwifi-20191215-r0
Downloading linux-firmware-cpia2-20191215-r0
Downloading libcrypto1.1-1.1.1d-r3
Downloading linux-firmware-ti-connectivity-20191215-r0
Downloading linux-firmware-slicoss-20191215-r0
Downloading linux-firmware-korg-20191215-r0
Downloading linux-firmware-atmel-20191215-r0
Downloading linux-firmware-tehuti-20191215-r0
Downloading linux-firmware-nvidia-20191215-r0
Downloading linux-firmware-netronome-20191215-r0
5% ###########

Then I bring the interface back up and check that network connectivity is fine.
But apk fetch neither fails nor resumes downloading.
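The repro amounts to cutting the link mid-download and restoring it (a sketch, assuming the interface is eth0):

apk fetch -R linux-lts &   # start the download in the background
ip link set eth0 down      # cut the network mid-transfer
sleep 10
ip link set eth0 up        # restore it; apk neither errors out nor resumes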

UPDATE 2020-06-16:
Since ssbarnea mentioned IPv6, I checked my environment:
~# sysctl net.ipv6.conf.all.disable_ipv6
net.ipv6.conf.all.disable_ipv6 = 1
And I am running the test on a physical server, not inside a container.

@ssbarnea

ssbarnea commented Jun 16, 2020

I was able to narrow down the issue, and it is IPv6. If the Docker host has IPv6 enabled you are pretty much f****, as apk fetch from inside the container will get stuck trying to fetch from dl-cdn.alpinelinux.org, which returns "dualstack" results, but we all know that IPv6 does not work in containers.

apk gets fully stuck without ever timing out or falling back to the IPv4 addresses, which would likely work.

The problem is a huge PITA, as normal debugging techniques will not give any usable results:

  • using --network host does not matter
  • ping and nslookup on dl-cdn.alpinelinux.org from inside the container work too
  • even wget works (curl is absent from the base image)

UPDATE: we have a working hack.

I can confirm that the https://stackoverflow.com/a/41497555/99834 hack works on both docker and podman; mainly, adding --dns-opt='options single-request' --sysctl net.ipv6.conf.all.disable_ipv6=1 when running/building the containers.
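Spelled out as a full command, the workaround looks roughly like this (a sketch; the image and the trailing apk command are placeholders):

docker run --rm \
  --dns-opt='options single-request' \
  --sysctl net.ipv6.conf.all.disable_ipv6=1 \
  alpine:3.12 apk update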

@FabioPaes12

FabioPaes12 commented Jun 25, 2020

Old problem, but it still happens!

For me, none of the options worked!
I will mention some of the steps that alleviated the problem and allowed me to build the image, even if it took 2 or 3 attempts, which is already good, since before I could not build the image at all!

1. Change the repository to any mirror, by adding a RUN line or joining an existing RUN:
echo "http://dl-4.alpinelinux.org/alpine/v3.12/main" > /etc/apk/repositories && apk update ...
The official list is here: https://mirrors.alpinelinux.org/

2. The approach that behaved best was changing the image's DNS, by adding a RUN line or joining an existing RUN:
RUN printf "nameserver 208.67.222.222\nnameserver 8.8.4.4\nnameserver 1.1.1.1\nnameserver 9.9.9.9\nnameserver 8.8.8.8" > /etc/resolv.conf && apk update && apk add ...
*** This line must be included in every RUN that updates packages.

3. Change the Docker daemon's DNS:
On Ubuntu, just edit the file /etc/default/docker (e.g. sudo gedit /etc/default/docker) and include the line:
DOCKER_OPTS="--dns 208.67.222.222 --dns 8.8.8.8 --dns 1.1.1.1 --dns 8.8.4.4 --dns 208.67.220.220 --dns 9.9.9.9"
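Note: on systemd-based Ubuntu releases, /etc/default/docker is not read unless the service unit explicitly sources it; there the equivalent is the "dns" key in /etc/docker/daemon.json (as in the daemon.json example earlier in this thread), followed by a daemon restart.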

@pysen

pysen commented Aug 24, 2020

Running the official Drone Helm chart on k3os (v0.11), I had to set the MTU to 1450 for my build to finish and not stall fetching the APKINDEX.

- name: docker-build
  image: plugins/docker
  settings:
    mtu: 1450

@silviokuehn

silviokuehn commented Nov 4, 2020

I had a similar issue. We have a docker-in-docker build container within a Rancher 2 / Kubernetes environment. I had to decrease the MTU of the inner Docker service by adding "mtu": 1200 to /etc/docker/daemon.json. The host server's MTU is 1500.

daemon.json

{
    "mtu": 1200
}
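Note that the daemon has to be restarted for the new MTU to take effect (assuming a systemd host):

systemctl restart docker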

@JohnnyElvis

I had a similar issue. We have a docker-in-docker build container within a Rancher 2 / Kubernetes environment. I had to decrease the MTU of the inner Docker service by adding "mtu": 1200 to /etc/docker/daemon.json. The host server's MTU is 1500.

daemon.json

{
    "mtu": 1200
}

did the trick, thx!

@VengefulAncient

VengefulAncient commented Sep 24, 2021

We just got hit by this, running the Drone Docker plugin in a Kubernetes cluster. Decreasing the MTU to the value used by the eth0 interface in the plugin's container fixed the issue; thank you so much for sharing this fix.

What I absolutely do not understand is how it worked for almost a year without this workaround. We didn't change anything about our cluster, our Drone setup, or the Alpine versions used in our pipelines. If someone has discovered more information about this, please do share.

@the0s

the0s commented Sep 25, 2021

After 4 hours of debugging, I managed to solve this by changing this in the .gitlab-ci.yml file:

services:
  - name: docker:dind

TO

services:
  - name: docker:dind
    command: ["--mtu=1300"]

source: docker-library/docker#103 (comment)

@heruscode

On Codefresh runners you can set the MTU in their Helm chart values, like:

re: 
  dindDaemon:
    mtu: 1400
