java.lang.RuntimeException: Unsupported literal type class scala.collection.immutable.Set$Set1 Set(liberal.ca) #529

Closed
JakeBickUKGWA opened this issue Mar 29, 2022 · 1 comment

Describe the bug
I am working through the AUT walkthrough at https://aut.docs.archivesunleashed.org/docs/toolkit-walkthrough, using the installation instructions at https://github.com/archivesunleashed/docker-aut#build-and-run. I am using an Ubuntu-based EC2 instance.

I can run the first step OK and get the count of domains in the sample material (though it does print a few errors at the beginning).

But when I try the second step, which extracts text, it just gives me this error message:

java.lang.RuntimeException: Unsupported literal type class scala.collection.immutable.Set$Set1 Set(liberal.ca)
at org.apache.spark.sql.catalyst.expressions.Literal$.apply(literals.scala:90)
at org.apache.spark.sql.catalyst.expressions.Literal$.$anonfun$create$2(literals.scala:152)
at scala.util.Failure.getOrElse(Try.scala:222)
at org.apache.spark.sql.catalyst.expressions.Literal$.create(literals.scala:152)
at org.apache.spark.sql.functions$.typedLit(functions.scala:131)
at org.apache.spark.sql.functions$.lit(functions.scala:114)
... 59 elided

I've copied the full terminal content to the attached text file.

To Reproduce
Steps to reproduce the behavior (e.g.):

  1. Start AUT (I use sudo docker run --rm -it -v "/home/jbickford/Desktop/AUTdata:/data" aut)
  2. In paste mode, run
import io.archivesunleashed._
import io.archivesunleashed.udfs._

RecordLoader.loadArchives("/aut-resources/Sample-Data/*.gz", sc)
  .all()
  .keepValidPagesDF()
  .groupBy(extractDomain($"url").alias("domain"))
  .count()
  .sort($"count".desc)
  .show(10, false)
  3. This gives some errors, but as expected generates a table of the top domains in the sample collection
  4. Again in paste mode, run
import io.archivesunleashed._
import io.archivesunleashed.udfs._

val domains = Set("liberal.ca")

RecordLoader.loadArchives("/aut-resources/Sample-Data/*.gz", sc)
  .webpages()
  .select($"crawl_date", extractDomain($"url").alias("domain"), $"url", $"content")
  .filter(hasDomains($"domain", lit(domains)))
  .write.csv("/data/liberal-party-text")

This generates the java.lang.RuntimeException error mentioned above.

Expected behavior
As I understand it AUT should generate a folder called liberal-party-text, containing extracted text files from the sample data.

Screenshots
Attached

Environment information

  • AUT version: I'm afraid I'm not sure, it's the version in the docker image in the walkthrough
  • OS: Ubuntu 20.04.4 LTS (in an EC2 instance)
  • Java version: OpenJDK 64-Bit Server VM, Java 11.0.14.1
  • Apache Spark version: 3.11
  • Apache Spark w/aut: sorry, I'm also unsure about this, I'm guessing it's determined by the docker image, but if not let me know
  • Apache Spark command used to run AUT: sudo docker run --rm -it -v "/home/jbickford/Desktop/AUTdata:/data" aut

AUTissue.txt

ruebot added a commit to archivesunleashed/aut-docs that referenced this issue Mar 29, 2022
ruebot (Member) commented Mar 29, 2022

@JakeBickUKGWA sorry about that, it was a documentation issue. I forgot to update the type used for the variable. It should be Array not Set. The documentation has been updated: https://aut.docs.archivesunleashed.org/docs/toolkit-walkthrough#extracting-some-text
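
For reference, a minimal sketch of the second step with that one change applied, assuming nothing else in the snippet from the report needs to differ. It is meant to be pasted into the AUT Spark shell in paste mode, where `sc`, the `$` column syntax, and `lit` are already in scope (as they were in the original report); the paths are the ones used in the report.

```scala
// Paste into the AUT Spark shell (paste mode); `sc`, `$`, and `lit` come from the shell.
import io.archivesunleashed._
import io.archivesunleashed.udfs._

// Array instead of Set: lit() can build a Spark literal from an Array[String],
// whereas a Set triggers "Unsupported literal type class ... Set$Set1".
val domains = Array("liberal.ca")

RecordLoader.loadArchives("/aut-resources/Sample-Data/*.gz", sc)
  .webpages()
  .select($"crawl_date", extractDomain($"url").alias("domain"), $"url", $"content")
  .filter(hasDomains($"domain", lit(domains)))
  .write.csv("/data/liberal-party-text")
```

If this runs cleanly, the output should land in `/data/liberal-party-text` inside the container, which the `docker run` command in the report maps to the mounted host directory.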

ruebot closed this as completed Mar 29, 2022