[Question] Got 'Provider "gs" not installed' on Dataproc #1348

Open
allan-silva opened this issue Feb 5, 2024 · 1 comment
Labels
api: storage Issues related to the googleapis/java-storage-nio API. priority: p3 Desirable enhancement or fix. May not be included in next release. status: investigating The issue is under investigation, which is determined to be non-trivial. type: question Request for information or clarification. Not an issue.

Comments

@allan-silva

Hi, I'm trying to use this library to access data in a GCS bucket from a Dataproc Spark job.

So far I have tried:

  • Adding this library as a dependency in my Scala 2.12 project:
    libraryDependencies ++= Seq(
      "com.google.cloud" % "google-cloud-nio" % "0.123.10",
      "org.apache.spark" %% "spark-core" % "3.5.0" % "provided",
      "org.apache.spark" %% "spark-sql" % "3.5.0" % "provided",
      "br.dev.contrib.gov.sus.opendata" % "libdatasus-parquet-dbf" % "1.0.5" % "provided"
    ),
  • Passing com.google.cloud:google-cloud-nio:0.123.10 (and the newest version too) as the --packages parameter for the Spark job.
  • Sending the google-cloud-nio jar via the --jars Spark parameter.
  • Even trying to load the service provider (SP) manually:
ServiceLoader.load(classOf[CloudStorageFileSystemProvider])
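
A quick way to see what the JVM itself considers installed is to list the schemes returned by `FileSystemProvider.installedProviders()`, since that is the list `Paths.get(URI)` consults. The JDK documents that installed providers are loaded through the system class loader, so a bare `ServiceLoader.load(...)` call does not add anything to it. The sketch below is only a diagnostic; the object and method names (`InstalledProviders`, `installedSchemes`, `isInstalled`) are made up for illustration.

```scala
import java.nio.file.spi.FileSystemProvider
import scala.collection.JavaConverters._

// Hypothetical helper (not part of google-cloud-nio): reports which
// file system providers the running JVM has installed.
object InstalledProviders {

  /** URI schemes of the installed providers, e.g. "file" and usually "jar". */
  def installedSchemes(): Seq[String] =
    FileSystemProvider.installedProviders().asScala.map(_.getScheme).toList

  /** True if a provider for the given scheme is installed (case-insensitive). */
  def isInstalled(scheme: String): Boolean =
    installedSchemes().exists(_.equalsIgnoreCase(scheme))
}
```

Because the exception is raised on an executor, it probably makes sense to log `InstalledProviders.isInstalled("gs")` from inside the task as well as on the driver; the two JVMs may not agree.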

From the README, it looks like I only need to add this library as a dependency. Am I supposed to do any other step?

I always get 'Provider "gs" not installed' from the Dataproc job.

      val sourceFileURI = URI.create(row.getAs[String]("file_uri"))
     ...
      val outputFileURI = URI.create(s"$outputBucket/${sourceFileHadoopPath.getName}.parquet")

      val converter = DbfParquet.builder().build()

      converter.convert(
        Paths.get(sourceFileURI),
        Paths.get(outputFileURI)
      )

Results in:

24/02/05 22:25:29 INFO BigQueryDataSourceReaderContext: Got read session for GenericData{classInfo=[datasetId, projectId, tableId], {datasetId=ingestion_info, projectId=puc-tcc-412315, tableId=_bqc_b89f019aa87446dd960bcb0c3ace5ff2}}: projects/puc-tcc-412315/locations/us/sessions/CAISDGZtVGJxdGlQaTR1QhoCcHoaAnB4 for application id: application_1707171685672_0002
+-------------------------------------------------+------+
|file_uri                                         |source|
+-------------------------------------------------+------+
|gs://informacoes-ambulatoriais-raw/CIHASE1310.dbc|SIA   |
|gs://informacoes-ambulatoriais-raw/CIHADF1206.dbc|SIA   |
+-------------------------------------------------+------+

> ^^^ Files to be processed
24/02/05 22:25:39 INFO ReadSessionCreator: Reusing read session: projects/puc-tcc-412315/locations/us/sessions/CAISDGZtVGJxdGlQaTR1QhoCcHoaAnB4, for table: GenericData{classInfo=[datasetId, projectId, tableId], {datasetId=ingestion_info, projectId=puc-tcc-412315, tableId=_bqc_b89f019aa87446dd960bcb0c3ace5ff2}}
24/02/05 22:25:39 INFO BigQueryDataSourceReaderContext: Got read session for GenericData{classInfo=[datasetId, projectId, tableId], {datasetId=ingestion_info, projectId=puc-tcc-412315, tableId=_bqc_b89f019aa87446dd960bcb0c3ace5ff2}}: projects/puc-tcc-412315/locations/us/sessions/CAISDGZtVGJxdGlQaTR1QhoCcHoaAnB4 for application id: application_1707171685672_0002
24/02/05 22:25:43 WARN TaskSetManager: Lost task 0.0 in stage 1.0 (TID 1) (informacoes-ambulatoriais-r4u3ielni3jbu-w-0.us-central1-c.c.puc-tcc-412315.internal executor 1): java.nio.file.FileSystemNotFoundException: Provider "gs" not installed
	at java.base/java.nio.file.Path.of(Path.java:212)
	at java.base/java.nio.file.Paths.get(Paths.java:97)
	at br.dev.contrib.gov.sus.opendata.jobs.FileConversionJob$.$anonfun$convertFiles$1(FileConversionJob.scala:68)
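
The trace shows the failure inside `Path.of`, which only searches the installed providers. One possible explanation, which is an assumption and not confirmed in this thread, is that the jar is visible to the task's class loader but not to the system class loader. Under that assumption, a fallback can look the provider up through `ServiceLoader` with an explicit class loader and ask it for the path directly. `UriPaths.resolve` below is a hypothetical helper, not an API of google-cloud-nio, and it has not been tested against Dataproc.

```scala
import java.net.URI
import java.nio.file.{FileSystemNotFoundException, Path, Paths}
import java.nio.file.spi.FileSystemProvider
import java.util.ServiceLoader
import scala.collection.JavaConverters._

// Hypothetical helper: tries the normal lookup first, then retries with
// providers discovered through the supplied class loader.
object UriPaths {
  def resolve(
      uri: URI,
      loader: ClassLoader = Thread.currentThread().getContextClassLoader
  ): Path =
    try Paths.get(uri) // consults installed providers only
    catch {
      case _: FileSystemNotFoundException =>
        ServiceLoader
          .load(classOf[FileSystemProvider], loader)
          .asScala
          .find(_.getScheme.equalsIgnoreCase(uri.getScheme))
          .map(_.getPath(uri)) // the provider builds the Path itself
          .getOrElse(
            throw new FileSystemNotFoundException(
              s"""Provider "${uri.getScheme}" not found via $loader"""
            )
          )
    }
}
```

In the snippet above, `Paths.get(sourceFileURI)` would become `UriPaths.resolve(sourceFileURI)`. If the fallback also fails, the jar is likely not reaching the executor at all, which points back to packaging rather than class loading.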
@product-auto-label product-auto-label bot added the api: storage Issues related to the googleapis/java-storage-nio API. label Feb 5, 2024
@cojenco cojenco added type: question Request for information or clarification. Not an issue. priority: p3 Desirable enhancement or fix. May not be included in next release. status: investigating The issue is under investigation, which is determined to be non-trivial. labels Feb 6, 2024
@cojenco
Contributor

cojenco commented Feb 13, 2024

Hi @allan-silva, based on the error message, this seems to be an issue with a dependency that is missing or not packaged in the required way. Please check how similar issues were resolved. Hopefully those previous discussions will help.
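
One way to check the packaging side is to look for the service descriptor that `ServiceLoader` relies on, `META-INF/services/java.nio.file.spi.FileSystemProvider`, on a given class loader. This is a rough sketch under the assumption that the provider is registered through such a descriptor; when jars are merged into a single artifact, these files can be overwritten or dropped. `ServiceDescriptors` and its methods are hypothetical names used only for this example.

```scala
import java.net.URL
import scala.collection.JavaConverters._
import scala.io.Source

// Hypothetical diagnostic: lists the FileSystemProvider implementations
// declared in service descriptors visible to a class loader.
object ServiceDescriptors {
  val Resource = "META-INF/services/java.nio.file.spi.FileSystemProvider"

  /** Every descriptor file the loader can see (one per jar that ships one). */
  def descriptorUrls(loader: ClassLoader): Seq[URL] =
    loader.getResources(Resource).asScala.toList

  /** Class names declared in those descriptors, comments and blanks removed. */
  def declaredProviders(loader: ClassLoader): Seq[String] =
    descriptorUrls(loader).flatMap { url =>
      val src = Source.fromInputStream(url.openStream(), "UTF-8")
      try src.getLines().map(_.takeWhile(_ != '#').trim).filter(_.nonEmpty).toList
      finally src.close()
    }
}
```

Calling `ServiceDescriptors.declaredProviders(Thread.currentThread().getContextClassLoader)` inside a task should list `CloudStorageFileSystemProvider` under its fully qualified name if the jar and its descriptor made it to the executor. An empty result would suggest the dependency is missing from the job's classpath.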
