
CoreNLP fails to serialise with Protobuf in Spark #1311

Closed
mkarmona opened this issue Oct 25, 2022 · 9 comments
Comments

@mkarmona

mkarmona commented Oct 25, 2022

I'm not entirely sure this is my fault, but this is still an issue, at least in my case: I cannot serialise in Spark without hitting an exception. Regardless, the bundled protobuf version is not up to date, at least with respect to the 3.x branch.

[Screenshot attached: notebook cell showing the failure, 2022-10-25]

The exception says

org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 2.0 failed 4 times, most recent failure: Lost task 0.3 in stage 2.0 (TID 5) (10.4.1.12 executor driver): org.apache.spark.SparkException: [FAILED_EXECUTE_UDF] Failed to execute user defined function ($read$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$Lambda$7743/415629414: (string) => array<tinyint>)
	at org.apache.spark.sql.errors.QueryExecutionErrors$.failedExecuteUserDefinedFunctionError(QueryExecutionErrors.scala:284)
	at org.apache.spark.sql.errors.QueryExecutionErrors.failedExecuteUserDefinedFunctionError(QueryExecutionErrors.scala)
	at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.generate_doConsume_0$(Unknown Source)
	at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source)
	at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
	at org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:761)
	at org.apache.spark.sql.execution.collect.UnsafeRowBatchUtils$.encodeUnsafeRows(UnsafeRowBatchUtils.scala:80)
	at org.apache.spark.sql.execution.collect.Collector.$anonfun$processFunc$1(Collector.scala:179)
	at org.apache.spark.scheduler.ResultTask.$anonfun$runTask$3(ResultTask.scala:75)
	at com.databricks.spark.util.ExecutorFrameProfiler$.record(ExecutorFrameProfiler.scala:110)
	at org.apache.spark.scheduler.ResultTask.$anonfun$runTask$1(ResultTask.scala:75)
	at com.databricks.spark.util.ExecutorFrameProfiler$.record(ExecutorFrameProfiler.scala:110)
	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:55)
	at org.apache.spark.scheduler.Task.doRunTask(Task.scala:168)
	at org.apache.spark.scheduler.Task.$anonfun$run$1(Task.scala:136)
	at com.databricks.spark.util.ExecutorFrameProfiler$.record(ExecutorFrameProfiler.scala:110)
	at org.apache.spark.scheduler.Task.run(Task.scala:96)
	at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$13(Executor.scala:889)
	at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1692)
	at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$4(Executor.scala:892)
	at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
	at com.databricks.spark.util.ExecutorFrameProfiler$.record(ExecutorFrameProfiler.scala:110)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:747)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.VerifyError: Bad type on operand stack
Exception Details:
  Location:
    com/google/protobuf/GeneratedMessageV3$ExtendableMessage.hasExtension(Lcom/google/protobuf/GeneratedMessage$GeneratedExtension;)Z @2: invokevirtual
  Reason:
    Type 'com/google/protobuf/GeneratedMessage$GeneratedExtension' (current frame, stack[1]) is not assignable to 'com/google/protobuf/ExtensionLite'
  Current Frame:
    bci: @2
    flags: { }
    locals: { 'com/google/protobuf/GeneratedMessageV3$ExtendableMessage', 'com/google/protobuf/GeneratedMessage$GeneratedExtension' }
    stack: { 'com/google/protobuf/GeneratedMessageV3$ExtendableMessage', 'com/google/protobuf/GeneratedMessage$GeneratedExtension' }
  Bytecode:
    0x0000000: 2a2b b600 21ac                         

	at edu.stanford.nlp.pipeline.ProtobufAnnotationSerializer.toProtoBuilder(ProtobufAnnotationSerializer.java:673)
	at edu.stanford.nlp.pipeline.ProtobufAnnotationSerializer.toProto(ProtobufAnnotationSerializer.java:641)
	at edu.stanford.nlp.pipeline.ProtobufAnnotationSerializer.write(ProtobufAnnotationSerializer.java:184)
	at $line8375436040684b87ae4990b863aa80eb37.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw.$anonfun$serialiseTDA$2(command-2148266106888542:10)
	at scala.util.Try$.apply(Try.scala:213)
	at $line8375436040684b87ae4990b863aa80eb37.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw.$anonfun$serialiseTDA$1(command-2148266106888542:6)
	at scala.Option.map(Option.scala:230)
	at $line8375436040684b87ae4990b863aa80eb37.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw.serialiseTDA(command-2148266106888542:5)
	at $line8375436040684b87ae4990b863aa80eb37.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw.$anonfun$serialiseTDASpark$1(command-2148266106888542:31)
	... 24 more

Driver stacktrace:
	at org.apache.spark.scheduler.DAGScheduler.failJobAndIndependentStages(DAGScheduler.scala:3257)
	at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2(DAGScheduler.scala:3189)
	at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2$adapted(DAGScheduler.scala:3180)
	at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)
	at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)
	at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)
	at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:3180)
	at org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1(DAGScheduler.scala:1414)
	at org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1$adapted(DAGScheduler.scala:1414)
	at scala.Option.foreach(Option.scala:407)
	at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:1414)
	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:3466)
	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:3407)
	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:3395)
	at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:51)
	at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:1166)
	at org.apache.spark.SparkContext.runJobInternal(SparkContext.scala:2702)
	at org.apache.spark.sql.execution.collect.Collector.runSparkJobs(Collector.scala:292)
	at org.apache.spark.sql.execution.collect.Collector.collect(Collector.scala:302)
	at org.apache.spark.sql.execution.collect.Collector$.collect(Collector.scala:101)
	at org.apache.spark.sql.execution.collect.Collector$.collect(Collector.scala:108)
	at org.apache.spark.sql.execution.qrc.InternalRowFormat$.collect(cachedSparkResults.scala:115)
	at org.apache.spark.sql.execution.qrc.InternalRowFormat$.collect(cachedSparkResults.scala:104)
	at org.apache.spark.sql.execution.qrc.InternalRowFormat$.collect(cachedSparkResults.scala:88)
	at org.apache.spark.sql.execution.qrc.ResultCacheManager.collectResult$1(ResultCacheManager.scala:515)
	at org.apache.spark.sql.execution.qrc.ResultCacheManager.computeResult(ResultCacheManager.scala:526)
	at org.apache.spark.sql.execution.qrc.ResultCacheManager.$anonfun$getOrComputeResultInternal$1(ResultCacheManager.scala:388)
	at scala.Option.getOrElse(Option.scala:189)
	at org.apache.spark.sql.execution.qrc.ResultCacheManager.getOrComputeResultInternal(ResultCacheManager.scala:382)
	at org.apache.spark.sql.execution.qrc.ResultCacheManager.getOrComputeResult(ResultCacheManager.scala:284)
	at org.apache.spark.sql.execution.SparkPlan.$anonfun$executeCollectResult$1(SparkPlan.scala:429)
	at com.databricks.spark.util.FrameProfiler$.record(FrameProfiler.scala:80)
	at org.apache.spark.sql.execution.SparkPlan.executeCollectResult(SparkPlan.scala:426)
	at org.apache.spark.sql.Dataset.collectResult(Dataset.scala:3423)
	at org.apache.spark.sql.Dataset.$anonfun$collectResult$1(Dataset.scala:3414)
	at org.apache.spark.sql.Dataset.$anonfun$withAction$3(Dataset.scala:4288)
	at org.apache.spark.sql.execution.QueryExecution$.withInternalError(QueryExecution.scala:774)
	at org.apache.spark.sql.Dataset.$anonfun$withAction$2(Dataset.scala:4286)
	at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withCustomExecutionEnv$8(SQLExecution.scala:241)
	at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:389)
	at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withCustomExecutionEnv$1(SQLExecution.scala:187)
	at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:973)
	at org.apache.spark.sql.execution.SQLExecution$.withCustomExecutionEnv(SQLExecution.scala:142)
	at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:339)
	at org.apache.spark.sql.Dataset.withAction(Dataset.scala:4286)
	at org.apache.spark.sql.Dataset.collectResult(Dataset.scala:3413)
	at com.databricks.backend.daemon.driver.OutputAggregator$.withOutputAggregation0(OutputAggregator.scala:267)
	at com.databricks.backend.daemon.driver.OutputAggregator$.withOutputAggregation(OutputAggregator.scala:101)
	at com.databricks.backend.daemon.driver.ScalaDriverLocal.$anonfun$getResultBufferInternal$3(ScalaDriverLocal.scala:345)
	at scala.Option.map(Option.scala:230)
	at com.databricks.backend.daemon.driver.ScalaDriverLocal.$anonfun$getResultBufferInternal$1(ScalaDriverLocal.scala:325)
	at scala.Option.map(Option.scala:230)
	at com.databricks.backend.daemon.driver.ScalaDriverLocal.getResultBufferInternal(ScalaDriverLocal.scala:289)
	at com.databricks.backend.daemon.driver.DriverLocal.getResultBuffer(DriverLocal.scala:890)
	at com.databricks.backend.daemon.driver.ScalaDriverLocal.repl(ScalaDriverLocal.scala:267)
	at com.databricks.backend.daemon.driver.DriverLocal.$anonfun$execute$22(DriverLocal.scala:765)
	at com.databricks.unity.EmptyHandle$.runWith(UCSHandle.scala:41)
	at com.databricks.backend.daemon.driver.DriverLocal.$anonfun$execute$20(DriverLocal.scala:750)
	at com.databricks.logging.UsageLogging.$anonfun$withAttributionContext$1(UsageLogging.scala:377)
	at scala.util.DynamicVariable.withValue(DynamicVariable.scala:62)
	at com.databricks.logging.AttributionContext$.withValue(AttributionContext.scala:108)
	at com.databricks.logging.UsageLogging.withAttributionContext(UsageLogging.scala:375)
	at com.databricks.logging.UsageLogging.withAttributionContext$(UsageLogging.scala:372)
	at com.databricks.backend.daemon.driver.DriverLocal.withAttributionContext(DriverLocal.scala:62)
	at com.databricks.logging.UsageLogging.withAttributionTags(UsageLogging.scala:420)
	at com.databricks.logging.UsageLogging.withAttributionTags$(UsageLogging.scala:405)
	at com.databricks.backend.daemon.driver.DriverLocal.withAttributionTags(DriverLocal.scala:62)
	at com.databricks.backend.daemon.driver.DriverLocal.execute(DriverLocal.scala:728)
	at com.databricks.backend.daemon.driver.DriverWrapper.$anonfun$tryExecutingCommand$1(DriverWrapper.scala:622)
	at scala.util.Try$.apply(Try.scala:213)
	at com.databricks.backend.daemon.driver.DriverWrapper.tryExecutingCommand(DriverWrapper.scala:614)
	at com.databricks.backend.daemon.driver.DriverWrapper.executeCommandAndGetError(DriverWrapper.scala:533)
	at com.databricks.backend.daemon.driver.DriverWrapper.executeCommand(DriverWrapper.scala:568)
	at com.databricks.backend.daemon.driver.DriverWrapper.runInnerLoop(DriverWrapper.scala:438)
	at com.databricks.backend.daemon.driver.DriverWrapper.runInner(DriverWrapper.scala:381)
	at com.databricks.backend.daemon.driver.DriverWrapper.run(DriverWrapper.scala:232)
	at java.lang.Thread.run(Thread.java:748)
Caused by: org.apache.spark.SparkException: [FAILED_EXECUTE_UDF] Failed to execute user defined function ($read$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$Lambda$7743/415629414: (string) => array<tinyint>)
	at org.apache.spark.sql.errors.QueryExecutionErrors$.failedExecuteUserDefinedFunctionError(QueryExecutionErrors.scala:284)
	at org.apache.spark.sql.errors.QueryExecutionErrors.failedExecuteUserDefinedFunctionError(QueryExecutionErrors.scala)
	at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.generate_doConsume_0$(Unknown Source)
	at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source)
	at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
	at org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:761)
	at org.apache.spark.sql.execution.collect.UnsafeRowBatchUtils$.encodeUnsafeRows(UnsafeRowBatchUtils.scala:80)
	at org.apache.spark.sql.execution.collect.Collector.$anonfun$processFunc$1(Collector.scala:179)
	at org.apache.spark.scheduler.ResultTask.$anonfun$runTask$3(ResultTask.scala:75)
	at com.databricks.spark.util.ExecutorFrameProfiler$.record(ExecutorFrameProfiler.scala:110)
	at org.apache.spark.scheduler.ResultTask.$anonfun$runTask$1(ResultTask.scala:75)
	at com.databricks.spark.util.ExecutorFrameProfiler$.record(ExecutorFrameProfiler.scala:110)
	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:55)
	at org.apache.spark.scheduler.Task.doRunTask(Task.scala:168)
	at org.apache.spark.scheduler.Task.$anonfun$run$1(Task.scala:136)
	at com.databricks.spark.util.ExecutorFrameProfiler$.record(ExecutorFrameProfiler.scala:110)
	at org.apache.spark.scheduler.Task.run(Task.scala:96)
	at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$13(Executor.scala:889)
	at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1692)
	at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$4(Executor.scala:892)
	at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
	at com.databricks.spark.util.ExecutorFrameProfiler$.record(ExecutorFrameProfiler.scala:110)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:747)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	... 1 more
Caused by: java.lang.VerifyError: Bad type on operand stack
Exception Details:
  Location:
    com/google/protobuf/GeneratedMessageV3$ExtendableMessage.hasExtension(Lcom/google/protobuf/GeneratedMessage$GeneratedExtension;)Z @2: invokevirtual
  Reason:
    Type 'com/google/protobuf/GeneratedMessage$GeneratedExtension' (current frame, stack[1]) is not assignable to 'com/google/protobuf/ExtensionLite'
  Current Frame:
    bci: @2
    flags: { }
    locals: { 'com/google/protobuf/GeneratedMessageV3$ExtendableMessage', 'com/google/protobuf/GeneratedMessage$GeneratedExtension' }
    stack: { 'com/google/protobuf/GeneratedMessageV3$ExtendableMessage', 'com/google/protobuf/GeneratedMessage$GeneratedExtension' }
  Bytecode:
    0x0000000: 2a2b b600 21ac                         

	at edu.stanford.nlp.pipeline.ProtobufAnnotationSerializer.toProtoBuilder(ProtobufAnnotationSerializer.java:673)
	at edu.stanford.nlp.pipeline.ProtobufAnnotationSerializer.toProto(ProtobufAnnotationSerializer.java:641)
	at edu.stanford.nlp.pipeline.ProtobufAnnotationSerializer.write(ProtobufAnnotationSerializer.java:184)
	at $line8375436040684b87ae4990b863aa80eb37.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw.$anonfun$serialiseTDA$2(command-2148266106888542:10)
	at scala.util.Try$.apply(Try.scala:213)
	at $line8375436040684b87ae4990b863aa80eb37.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw.$anonfun$serialiseTDA$1(command-2148266106888542:6)
	at scala.Option.map(Option.scala:230)
	at $line8375436040684b87ae4990b863aa80eb37.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw.serialiseTDA(command-2148266106888542:5)
	at $line8375436040684b87ae4990b863aa80eb37.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw.$anonfun$serialiseTDASpark$1(command-2148266106888542:31)
	... 24 more

I am trying to serialise the annotations (tokens, lemmas, and dependency parses) so I can reuse them later with semgrex.

Spark 3.3.0 on Azure Databricks
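
The VerifyError (a GeneratedMessage$GeneratedExtension that is not assignable to ExtensionLite) looks like two different protobuf versions clashing on the classpath. As a sanity check, one can print which jar each of the implicated classes is actually loaded from. This is just a diagnostic sketch, not code from the failing job:

    // Diagnostic sketch: report which jar provides each protobuf class named in the VerifyError.
    // Worth running both on the driver and inside a task, since executors can resolve differently.
    for (name <- Seq("com.google.protobuf.GeneratedMessageV3",
                     "com.google.protobuf.GeneratedMessage",
                     "com.google.protobuf.ExtensionLite")) {
      val source = Option(Class.forName(name).getProtectionDomain.getCodeSource)
      println(s"$name -> ${source.map(_.getLocation).getOrElse("bootstrap/unknown")}")
    }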

@mkarmona
Author

mkarmona commented Oct 25, 2022

I checked against Databricks Runtime 11.3, which ships Hadoop 3.3.4, and it failed as well.

@AngledLuffa
Contributor

How certain are you that upgrading the protobuf package would fix this issue?

@mkarmona
Author

mkarmona commented Oct 26, 2022

@AngledLuffa not at all; it works on plain Spark 3.3.1 outside the Databricks environment, so it may well be Databricks' fault. Let me reformulate my question: what is the easiest approach to serialising (not with protobuf) the indexed words (with lemmas and POS tags), the sentences, and the dependency parses into XML or JSON, just to load them back again and run semgrex over them? Is there any file you could point me to, even if I have to write some code on my side? The main point for me is to save the results so I don't have to recompute them when the rules change.

@AngledLuffa
Contributor

AngledLuffa commented Oct 28, 2022 via email

@AngledLuffa
Contributor

AngledLuffa commented Nov 4, 2022 via email

@AngledLuffa
Contributor

Looking over the differences between those protoc versions, I think updating from 3.19.2 to 3.19.6 will not make a difference for your case. I did it anyway in the dev branch, since GitHub was complaining about the dependency.

Having said that, searching on StackOverflow for this particular error makes me think there is a missed compiler error somewhere... not sure where, though.

https://stackoverflow.com/questions/30365106/reason-for-the-exception-java-lang-verifyerror-bad-type-on-operand-stack

You asked about a protobuf format suitable for semgrex requests. As it turns out, all you need is the tokens (with all their attributes) and the dependency graph, right? That already exists in CoreNLP.proto as SemgrexRequest. You would need to serialize it yourself, though, I believe. The ProtobufAnnotationSerializer.java methods public CoreNLPProtos.Token toProto(CoreLabel coreLabel) and public static CoreNLPProtos.DependencyGraph toProto(SemanticGraph graph) would do most of that work for you, but there is no wrapper which turns a list of sentences and a list of graphs into a SemgrexRequest. If that works for you, and you're able to add some code to build the requests, we'd be happy to accept a PR.
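
For concreteness, such a wrapper might look roughly like the following. This is a minimal, unverified sketch; the builder methods (addSemgrex, addQuery, addToken, setGraph) are assumptions based on the SemgrexRequest field layout in CoreNLP.proto, while the two toProto calls are the methods mentioned above.

    import edu.stanford.nlp.ling.CoreLabel
    import edu.stanford.nlp.pipeline.{CoreNLPProtos, ProtobufAnnotationSerializer}
    import edu.stanford.nlp.semgraph.SemanticGraph

    // Sketch: turn (tokens, graph) pairs plus semgrex patterns into a SemgrexRequest.
    def buildSemgrexRequest(sentences: Seq[(Seq[CoreLabel], SemanticGraph)],
                            patterns: Seq[String]): CoreNLPProtos.SemgrexRequest = {
      val serializer = new ProtobufAnnotationSerializer()
      val request = CoreNLPProtos.SemgrexRequest.newBuilder()
      patterns.foreach(request.addSemgrex)                          // assumed field: semgrex
      for ((tokens, graph) <- sentences) {
        val query = CoreNLPProtos.SemgrexRequest.Dependencies.newBuilder()
        tokens.foreach(t => query.addToken(serializer.toProto(t)))  // toProto(CoreLabel)
        query.setGraph(ProtobufAnnotationSerializer.toProto(graph)) // static toProto(SemanticGraph)
        request.addQuery(query.build())                             // assumed field: query
      }
      request.build()
    }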

@mkarmona
Author

mkarmona commented Jan 5, 2023

@AngledLuffa thanks for the pointers. I learnt from the toProto and fromProto functions for SemanticGraph and implemented my own custom de/serialisation, without protobuf, for Spark. I can now restore the SemanticGraph and check it against any number of rules at scale.

Three main inner serialisations are needed to be able to deserialise a CoreNLP SemanticGraph:

  • tokens (CoreLabel)
  • edges (SemanticGraphEdge)
  • roots (Int)

I keep the tokens because, for the edges and roots, I only store token indices (a sketch of the reconstruction follows the example below). As a simple test on my side, here is the string representation of a deserialised semantic graph for a random sentence.

In adults, FMRFamide is primarily transcribed in the head and thorax, and FMRFamideR is primarily transcribed in the thorax.

[transcribed/VBN
  obl:in>[adults/NNS case>In/IN]
  punct>,/,
  nsubj:pass>FMRFamide/NNP
  aux:pass>is/VBZ
  advmod>primarily/RB
  obl:in>[head/NN case>in/IN det>the/DT conj:and>[thorax/NN cc>and/CC]]
  obl:in>[thorax/NN cc>and/CC]
  punct>,/,
  conj:and>[transcribed/VBN
            cc>and/CC
            nsubj:pass>FMRFamideR/NNP
            aux:pass>is/VBZ
            advmod>primarily/RB
            obl:in>[thorax/NN case>in/IN det>the/DT]]
  punct>./.]
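
For reference, the reconstruction side is roughly the following. This is a simplified sketch of my approach, where EdgeRow is my own hypothetical storage type and relation names are parsed back with GrammaticalRelation.valueOf:

    import scala.collection.JavaConverters._
    import edu.stanford.nlp.ling.{CoreLabel, IndexedWord}
    import edu.stanford.nlp.semgraph.SemanticGraph
    import edu.stanford.nlp.trees.GrammaticalRelation

    // Storage row for one edge: governor/dependent token indices plus the relation data.
    case class EdgeRow(gov: Int, dep: Int, relation: String, weight: Double, isExtra: Boolean)

    def rebuildGraph(tokens: Seq[CoreLabel], edges: Seq[EdgeRow], roots: Seq[Int]): SemanticGraph = {
      val graph = new SemanticGraph()
      // Wrap each stored CoreLabel and key it by its sentence-level token index.
      val byIndex = tokens.map(t => t.index() -> new IndexedWord(t)).toMap
      byIndex.values.foreach(graph.addVertex)
      for (e <- edges) {
        graph.addEdge(byIndex(e.gov), byIndex(e.dep),
          GrammaticalRelation.valueOf(e.relation), e.weight, e.isExtra)
      }
      graph.setRoots(roots.map(byIndex).asJava)
      graph
    }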

@mkarmona mkarmona closed this as completed Jan 5, 2023
@AngledLuffa
Contributor

Were you able to figure out a root cause for the problem?

@mkarmona
Author

mkarmona commented Jan 6, 2023

@AngledLuffa I didn't dig into it further. The Databricks platform ships old dependencies, so finding the root cause might take more time than I would expect.
