This repository has been archived by the owner on Jan 9, 2020. It is now read-only.

Error: Only local python files are supported: gs://... #527

Open
paulreimer opened this issue Oct 18, 2017 · 12 comments

Comments

@paulreimer

I extended the docker image using the recent spark-2.2.0-k8s-0.4.0-bin-2.7.3 release to add the GCS (Google Cloud Storage) connector.

Observed:
It works great for Scala jobs/jars with a gs://&lt;bucket&gt;/ prefix: I can see that it creates the init container and populates the Spark files from what was already in GCS. However, when I try to submit a Python job (or use --py-files), the spark-submit client does not allow the gs:// prefix and rejects the job.

```
Error: Only local python files are supported: gs://<my_bucket_name>/pi.py
Run with --help for usage help or --verbose for debug output
```

Expected:
The job to be accepted by spark-submit, the relevant files populated by an init container, and made available for spark-driver-py and spark-executor-py to use successfully.

(FYI: to add the GCS connector, I added these lines to the spark-base Dockerfile:)

```dockerfile
ENV hadoop_ver 2.7.4

# Add Hadoop 2.x native libs
ADD http://www.us.apache.org/dist/hadoop/common/hadoop-${hadoop_ver}/hadoop-${hadoop_ver}.tar.gz /opt/
RUN cd /opt/ && \
    tar xf hadoop-${hadoop_ver}.tar.gz && \
    ln -s hadoop-${hadoop_ver} hadoop

# Add the GCS connector.
ADD https://storage.googleapis.com/hadoop-lib/gcs/gcs-connector-latest-hadoop2.jar ${SPARK_HOME}/jars/
```
@paulreimer
Author

I should note that for the GCS connector I also had to add some runtime config files (notably core-site.xml and start-common.sh, which I merged into this repo's entrypoint.sh), mostly based on https://github.com/kubernetes-incubator/application-images/tree/master/spark

I also had to append :${SPARK_HOME}/conf to SPARK_CLASSPATH in the spark-driver-py and spark-executor-py images, so that core-site.xml is picked up.
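As a rough sketch of that classpath change (the SPARK_CLASSPATH base value and paths below are assumptions about what the driver/executor images export, not taken from the actual Dockerfiles):

```shell
# Hypothetical: extend the image's classpath so ${SPARK_HOME}/conf (and thus
# core-site.xml) is visible to the JVM. Base paths are placeholders.
SPARK_HOME="${SPARK_HOME:-/opt/spark}"
SPARK_CLASSPATH="${SPARK_CLASSPATH:-${SPARK_HOME}/jars/*}"
SPARK_CLASSPATH="${SPARK_CLASSPATH}:${SPARK_HOME}/conf"
echo "${SPARK_CLASSPATH}"
```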

@foxish
Member

foxish commented Oct 18, 2017

cc @liyinan926
This looks similar to your work with GCS.

@liyinan926
Member

BTW @paulreimer: I found that instead of baking core-site.xml into the image just for GCS connector configuration (e.g., service account settings), you can pass the configuration properties using --conf spark.hadoop.[ConfigurationName].
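Concretely, the suggestion above might look like the following (a sketch only: the service-account property names are the GCS connector's documented ones, and the values are placeholders, not taken from this thread):

```shell
# Hypothetical: pass GCS connector settings as spark.hadoop.* properties
# instead of baking a core-site.xml into the image.
spark-submit \
  --conf spark.hadoop.fs.gs.project.id=my-gcp-project \
  --conf spark.hadoop.google.cloud.auth.service.account.enable=true \
  --conf spark.hadoop.google.cloud.auth.service.account.json.keyfile=/path/to/key.json \
  ...  # remaining submit arguments unchanged
```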

@paulreimer
Author

Interesting. It took me a long time to figure out that I needed to add ${SPARK_HOME}/conf to SPARK_CLASSPATH to get core-site.xml picked up, and I also tried to set fs.gs.project.id from the command line but couldn't figure it out. Does your suggestion mean I could use --conf spark.hadoop.fs.gs.project.id (i.e., prefix it with spark.hadoop.)? That would have saved me a lot of time.

One nice thing about baking it in, though, is that the start-common.sh script detects the GCE project name and writes the fs.gs.project.id setting into core-site.xml before starting. That way, at least on GCE clusters, you don't need to pass that information to spark-submit and can reuse the same image: gs:// URIs work automatically, assuming the GCE nodes have access to the storage bucket.
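A sketch of what that detect-and-write step amounts to (an assumption about start-common.sh's behavior, not its actual code; the metadata endpoint is GCE's documented one, and off-GCE the lookup falls back to a placeholder):

```shell
# Hypothetical: read the project id from the GCE metadata server, then write a
# minimal core-site.xml before starting Spark. Falls back when not on GCE.
project_id=$(curl -sf -H "Metadata-Flavor: Google" \
  "http://metadata.google.internal/computeMetadata/v1/project/project-id" \
  2>/dev/null || echo "unknown-project")
conf_dir="${SPARK_HOME:-/tmp/spark}/conf"
mkdir -p "${conf_dir}"
cat > "${conf_dir}/core-site.xml" <<EOF
<configuration>
  <property>
    <name>fs.gs.project.id</name>
    <value>${project_id}</value>
  </property>
</configuration>
EOF
```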

@paulreimer
Author

(I was using only GCE resources, so "application default credentials" Just Work without manually specifying service accounts.)

@liyinan926
Member

@paulreimer Yes, you can use --conf spark.hadoop.fs.gs.project.id. Spark will peel off the prefix spark.hadoop.
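The prefix handling can be illustrated in one line (a toy sketch of the behavior described above, not Spark's actual implementation): a property named spark.hadoop.&lt;key&gt; is forwarded to the Hadoop configuration as &lt;key&gt;.

```shell
# Illustration: stripping the spark.hadoop. prefix yields the Hadoop property.
submit_conf="spark.hadoop.fs.gs.project.id=my-gcp-project"
hadoop_conf="${submit_conf#spark.hadoop.}"
echo "${hadoop_conf}"   # fs.gs.project.id=my-gcp-project
```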

@paulreimer
Author

Sounds good, I will need something like that for the non-GCE clusters.

I was unable to build a working distribution with the !isKubernetesCluster change applied (it did build, but the init container didn't work).

My build also fails for Scala jobs that previously worked with my image with the GCS connector added (using the 0.4.0 release jars), so something must be wrong with my build environment (I have never built Spark before). I used build/mvn -T4 -DskipTests package, and I noticed that there are far more jars in the official release tarball than were generated in assembly/target/scala-2.11/jars. I also didn't get a dist tarball at the end of the process; I'm not sure whether that is expected.

I would be happy to test updated binaries from a working build, with the !isKubernetesCluster change applied, if anyone else can build them.

@liyinan926
Member

Try this build command:

```shell
./dev/make-distribution.sh --pip --tgz -Pmesos -Pyarn -Pkinesis-asl -Phive -Phive-thriftserver -Pkubernetes -Phadoop-2.7 -Dhadoop.version=2.7.3
```

@paulreimer
Author

Right on! That command worked for me, and the suggested change also worked: I was able to successfully submit my Python job using gs:// on GCE (without a local copy of the file, and without using the resource staging server). I also applied the change to the check for R files in the same place in that file.

Note that I only had to replace the spark-submit client binary from my build; I was able to keep using my existing images, based on the official 0.4.0 binaries, with the GCS connector added. It seems it really was just that client-side check rejecting the job; the init container part worked smoothly with a gs:// URI.
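For reference, the submission that now goes through looks along these lines (a hedged sketch: the API server address and project id are placeholders, and only the gs:// application path is taken from the thread):

```shell
# Hypothetical end-to-end submit: pi.py lives only in GCS and is fetched by the
# init container. Placeholders: API server address, GCP project id.
spark-submit \
  --deploy-mode cluster \
  --master k8s://https://<k8s-apiserver>:443 \
  --conf spark.hadoop.fs.gs.project.id=<my-gcp-project> \
  gs://<my_bucket_name>/pi.py
```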

Thanks so much, I really appreciate your help, @liyinan926 !

@liyinan926
Member

Cool! Can you submit a PR with the change? Thanks!
