This repository has been archived by the owner on Jan 9, 2020. It is now read-only.

Error: Only local python files are supported: gs://... #527

Open
paulreimer opened this issue Oct 18, 2017 · 12 comments

Comments

@paulreimer

I extended the docker image using the recent spark-2.2.0-k8s-0.4.0-bin-2.7.3 release to add the GCS (Google Cloud Storage) connector.

Observed:
It works great for Scala jobs/jars with a gs://&lt;bucket&gt;/ prefix: I can see that it creates the init container and populates the Spark files from what was already in GCS. However, when I try to submit a Python job (or use --py-files), the spark-submit client does not allow the gs:// prefix and rejects the job.

```
Error: Only local python files are supported: gs://<my_bucket_name>/pi.py
Run with --help for usage help or --verbose for debug output
```

Expected:
The job to be accepted by spark-submit, the relevant files populated by an init container, and made available for spark-driver-py and spark-executor-py to use successfully.

(FYI: to add the GCS connector, I added these lines to the spark-base Dockerfile:)

```dockerfile
ENV hadoop_ver 2.7.4

# Add Hadoop 2.x native libs
ADD http://www.us.apache.org/dist/hadoop/common/hadoop-${hadoop_ver}/hadoop-${hadoop_ver}.tar.gz /opt/
RUN cd /opt/ && \
    tar xf hadoop-${hadoop_ver}.tar.gz && \
    ln -s hadoop-${hadoop_ver} hadoop

# Add the GCS connector.
ADD https://storage.googleapis.com/hadoop-lib/gcs/gcs-connector-latest-hadoop2.jar ${SPARK_HOME}/jars/
```
@paulreimer
Author

I should note that for the GCS connector I also had to add some runtime config files (notably core-site.xml and start-common.sh, which I merged into this repo's entrypoint.sh), mostly based on https://github.com/kubernetes-incubator/application-images/tree/master/spark

I also had to append :${SPARK_HOME}/conf to SPARK_CLASSPATH in the spark-driver-py and spark-executor-py images, so that core-site.xml is picked up.
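As a rough sketch of that classpath change (the SPARK_CLASSPATH base value and paths below are assumptions about what the driver/executor images export, not taken from the actual Dockerfiles):

```shell
# Hypothetical: extend the image's classpath so ${SPARK_HOME}/conf (and thus
# core-site.xml) is visible to the JVM. Base paths are placeholders.
SPARK_HOME="${SPARK_HOME:-/opt/spark}"
SPARK_CLASSPATH="${SPARK_CLASSPATH:-${SPARK_HOME}/jars/*}"
SPARK_CLASSPATH="${SPARK_CLASSPATH}:${SPARK_HOME}/conf"
echo "${SPARK_CLASSPATH}"
```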

@foxish
Member

foxish commented Oct 18, 2017

cc @liyinan926
This looks similar to your work with GCS.

@liyinan926
Member

BTW @paulreimer: I found that instead of baking core-site.xml into the image just for GCS connector configuration (e.g., service account settings), you can pass the configuration properties using --conf spark.hadoop.[ConfigurationName].
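Concretely, the suggestion above might look like the following (a sketch only: the service-account property names are the GCS connector's documented ones, and the values are placeholders, not taken from this thread):

```shell
# Hypothetical: pass GCS connector settings as spark.hadoop.* properties
# instead of baking a core-site.xml into the image.
spark-submit \
  --conf spark.hadoop.fs.gs.project.id=my-gcp-project \
  --conf spark.hadoop.google.cloud.auth.service.account.enable=true \
  --conf spark.hadoop.google.cloud.auth.service.account.json.keyfile=/path/to/key.json \
  ...  # remaining submit arguments unchanged
```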

@paulreimer
Author

Interesting. It took me a long time to figure out that I needed to add ${SPARK_HOME}/conf to SPARK_CLASSPATH to get core-site.xml picked up, and I also tried to set fs.gs.project.id from the command line but couldn't figure it out. Does your suggestion mean I could use --conf spark.hadoop.fs.gs.project.id (i.e., prefix it with spark.hadoop.)? That would have saved me a lot of time.

One nice thing about baking it in, though, is that the start-common.sh script detects the GCE project name and writes the fs.gs.project.id setting into core-site.xml before starting. That way, at least on GCE clusters, you don't need to pass that information to spark-submit and can reuse the same image: gs:// URIs work automatically, assuming the GCE nodes have access to the storage bucket.
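A sketch of what that detect-and-write step amounts to (an assumption about start-common.sh's behavior, not its actual code; the metadata endpoint is GCE's documented one, and off-GCE the lookup falls back to a placeholder):

```shell
# Hypothetical: read the project id from the GCE metadata server, then write a
# minimal core-site.xml before starting Spark. Falls back when not on GCE.
project_id=$(curl -sf -H "Metadata-Flavor: Google" \
  "http://metadata.google.internal/computeMetadata/v1/project/project-id" \
  2>/dev/null || echo "unknown-project")
conf_dir="${SPARK_HOME:-/tmp/spark}/conf"
mkdir -p "${conf_dir}"
cat > "${conf_dir}/core-site.xml" <<EOF
<configuration>
  <property>
    <name>fs.gs.project.id</name>
    <value>${project_id}</value>
  </property>
</configuration>
EOF
```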

@paulreimer
Author

(I was using only GCE resources, so "application default credentials" Just Work without manually specifying service accounts.)

@liyinan926
Member

@paulreimer Yes, you can use --conf spark.hadoop.fs.gs.project.id. Spark will peel off the prefix spark.hadoop.
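The prefix handling can be illustrated in one line (a toy sketch of the behavior described above, not Spark's actual implementation): a property named spark.hadoop.&lt;key&gt; is forwarded to the Hadoop configuration as &lt;key&gt;.

```shell
# Illustration: stripping the spark.hadoop. prefix yields the Hadoop property.
submit_conf="spark.hadoop.fs.gs.project.id=my-gcp-project"
hadoop_conf="${submit_conf#spark.hadoop.}"
echo "${hadoop_conf}"   # fs.gs.project.id=my-gcp-project
```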

@paulreimer
Author

Sounds good, I will need something like that for the non-GCE clusters.

I was unable to build a working distribution with the !isKubernetesCluster change applied (it did build, but the init container didn't work).

My build also fails for Scala jobs that previously worked with my image with the GCS connector added (using the 0.4.0 release jars), so something must be wrong with my build environment (I have never built Spark before). I used build/mvn -T4 -DskipTests package, and I noticed that there are far more jars in the official release tarball than were generated in assembly/target/scala-2.11/jars. I also didn't get a dist tarball at the end of the process; I'm not sure whether that is expected.

I would be happy to test updated binaries from a working build, with the !isKubernetesCluster change applied, if anyone else can build them.

@liyinan926
Member

Try this build command:

```shell
./dev/make-distribution.sh --pip --tgz -Pmesos -Pyarn -Pkinesis-asl -Phive -Phive-thriftserver -Pkubernetes -Phadoop-2.7 -Dhadoop.version=2.7.3
```

@paulreimer
Author

Right on! That command worked for me, and the suggested change also worked: I was able to successfully submit my Python job using gs:// on GCE (without a local copy of the file, and without using the resource staging server). I also applied the change to the check for R files in the same place in that file.

Note that I only had to replace the spark-submit client binary from my build; I was able to keep using my existing images, based on the official 0.4.0 binaries, with the GCS connector added. It seems it really was just that client-side check rejecting the job; the init container part worked smoothly with a gs:// URI.
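For reference, the submission that now goes through looks along these lines (a hedged sketch: the API server address and project id are placeholders, and only the gs:// application path is taken from the thread):

```shell
# Hypothetical end-to-end submit: pi.py lives only in GCS and is fetched by the
# init container. Placeholders: API server address, GCP project id.
spark-submit \
  --deploy-mode cluster \
  --master k8s://https://<k8s-apiserver>:443 \
  --conf spark.hadoop.fs.gs.project.id=<my-gcp-project> \
  gs://<my_bucket_name>/pi.py
```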

Thanks so much, I really appreciate your help, @liyinan926 !

@liyinan926
Member

Cool! Can you submit a PR with the change? Thanks!
