
CoreNLP fails to serialise with Protobuf in Spark #1311

Closed
mkarmona opened this issue Oct 25, 2022 · 9 comments
Comments

@mkarmona

mkarmona commented Oct 25, 2022

I'm not entirely sure this is my fault, but this is still an issue, at least in my case: I cannot serialise in Spark without hitting an exception. Regardless, the bundled protobuf version is not up to date, at least with respect to the 3.x branch.

[Screenshot attached: notebook cell showing the failure, 2022-10-25]

The exception says

org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 2.0 failed 4 times, most recent failure: Lost task 0.3 in stage 2.0 (TID 5) (10.4.1.12 executor driver): org.apache.spark.SparkException: [FAILED_EXECUTE_UDF] Failed to execute user defined function ($read$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$Lambda$7743/415629414: (string) => array<tinyint>)
	at org.apache.spark.sql.errors.QueryExecutionErrors$.failedExecuteUserDefinedFunctionError(QueryExecutionErrors.scala:284)
	at org.apache.spark.sql.errors.QueryExecutionErrors.failedExecuteUserDefinedFunctionError(QueryExecutionErrors.scala)
	at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.generate_doConsume_0$(Unknown Source)
	at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source)
	at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
	at org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:761)
	at org.apache.spark.sql.execution.collect.UnsafeRowBatchUtils$.encodeUnsafeRows(UnsafeRowBatchUtils.scala:80)
	at org.apache.spark.sql.execution.collect.Collector.$anonfun$processFunc$1(Collector.scala:179)
	at org.apache.spark.scheduler.ResultTask.$anonfun$runTask$3(ResultTask.scala:75)
	at com.databricks.spark.util.ExecutorFrameProfiler$.record(ExecutorFrameProfiler.scala:110)
	at org.apache.spark.scheduler.ResultTask.$anonfun$runTask$1(ResultTask.scala:75)
	at com.databricks.spark.util.ExecutorFrameProfiler$.record(ExecutorFrameProfiler.scala:110)
	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:55)
	at org.apache.spark.scheduler.Task.doRunTask(Task.scala:168)
	at org.apache.spark.scheduler.Task.$anonfun$run$1(Task.scala:136)
	at com.databricks.spark.util.ExecutorFrameProfiler$.record(ExecutorFrameProfiler.scala:110)
	at org.apache.spark.scheduler.Task.run(Task.scala:96)
	at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$13(Executor.scala:889)
	at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1692)
	at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$4(Executor.scala:892)
	at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
	at com.databricks.spark.util.ExecutorFrameProfiler$.record(ExecutorFrameProfiler.scala:110)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:747)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.VerifyError: Bad type on operand stack
Exception Details:
  Location:
    com/google/protobuf/GeneratedMessageV3$ExtendableMessage.hasExtension(Lcom/google/protobuf/GeneratedMessage$GeneratedExtension;)Z @2: invokevirtual
  Reason:
    Type 'com/google/protobuf/GeneratedMessage$GeneratedExtension' (current frame, stack[1]) is not assignable to 'com/google/protobuf/ExtensionLite'
  Current Frame:
    bci: @2
    flags: { }
    locals: { 'com/google/protobuf/GeneratedMessageV3$ExtendableMessage', 'com/google/protobuf/GeneratedMessage$GeneratedExtension' }
    stack: { 'com/google/protobuf/GeneratedMessageV3$ExtendableMessage', 'com/google/protobuf/GeneratedMessage$GeneratedExtension' }
  Bytecode:
    0x0000000: 2a2b b600 21ac                         

	at edu.stanford.nlp.pipeline.ProtobufAnnotationSerializer.toProtoBuilder(ProtobufAnnotationSerializer.java:673)
	at edu.stanford.nlp.pipeline.ProtobufAnnotationSerializer.toProto(ProtobufAnnotationSerializer.java:641)
	at edu.stanford.nlp.pipeline.ProtobufAnnotationSerializer.write(ProtobufAnnotationSerializer.java:184)
	at $line8375436040684b87ae4990b863aa80eb37.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw.$anonfun$serialiseTDA$2(command-2148266106888542:10)
	at scala.util.Try$.apply(Try.scala:213)
	at $line8375436040684b87ae4990b863aa80eb37.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw.$anonfun$serialiseTDA$1(command-2148266106888542:6)
	at scala.Option.map(Option.scala:230)
	at $line8375436040684b87ae4990b863aa80eb37.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw.serialiseTDA(command-2148266106888542:5)
	at $line8375436040684b87ae4990b863aa80eb37.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw.$anonfun$serialiseTDASpark$1(command-2148266106888542:31)
	... 24 more

Driver stacktrace:
	at org.apache.spark.scheduler.DAGScheduler.failJobAndIndependentStages(DAGScheduler.scala:3257)
	at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2(DAGScheduler.scala:3189)
	at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2$adapted(DAGScheduler.scala:3180)
	at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)
	at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)
	at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)
	at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:3180)
	at org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1(DAGScheduler.scala:1414)
	at org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1$adapted(DAGScheduler.scala:1414)
	at scala.Option.foreach(Option.scala:407)
	at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:1414)
	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:3466)
	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:3407)
	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:3395)
	at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:51)
	at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:1166)
	at org.apache.spark.SparkContext.runJobInternal(SparkContext.scala:2702)
	at org.apache.spark.sql.execution.collect.Collector.runSparkJobs(Collector.scala:292)
	at org.apache.spark.sql.execution.collect.Collector.collect(Collector.scala:302)
	at org.apache.spark.sql.execution.collect.Collector$.collect(Collector.scala:101)
	at org.apache.spark.sql.execution.collect.Collector$.collect(Collector.scala:108)
	at org.apache.spark.sql.execution.qrc.InternalRowFormat$.collect(cachedSparkResults.scala:115)
	at org.apache.spark.sql.execution.qrc.InternalRowFormat$.collect(cachedSparkResults.scala:104)
	at org.apache.spark.sql.execution.qrc.InternalRowFormat$.collect(cachedSparkResults.scala:88)
	at org.apache.spark.sql.execution.qrc.ResultCacheManager.collectResult$1(ResultCacheManager.scala:515)
	at org.apache.spark.sql.execution.qrc.ResultCacheManager.computeResult(ResultCacheManager.scala:526)
	at org.apache.spark.sql.execution.qrc.ResultCacheManager.$anonfun$getOrComputeResultInternal$1(ResultCacheManager.scala:388)
	at scala.Option.getOrElse(Option.scala:189)
	at org.apache.spark.sql.execution.qrc.ResultCacheManager.getOrComputeResultInternal(ResultCacheManager.scala:382)
	at org.apache.spark.sql.execution.qrc.ResultCacheManager.getOrComputeResult(ResultCacheManager.scala:284)
	at org.apache.spark.sql.execution.SparkPlan.$anonfun$executeCollectResult$1(SparkPlan.scala:429)
	at com.databricks.spark.util.FrameProfiler$.record(FrameProfiler.scala:80)
	at org.apache.spark.sql.execution.SparkPlan.executeCollectResult(SparkPlan.scala:426)
	at org.apache.spark.sql.Dataset.collectResult(Dataset.scala:3423)
	at org.apache.spark.sql.Dataset.$anonfun$collectResult$1(Dataset.scala:3414)
	at org.apache.spark.sql.Dataset.$anonfun$withAction$3(Dataset.scala:4288)
	at org.apache.spark.sql.execution.QueryExecution$.withInternalError(QueryExecution.scala:774)
	at org.apache.spark.sql.Dataset.$anonfun$withAction$2(Dataset.scala:4286)
	at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withCustomExecutionEnv$8(SQLExecution.scala:241)
	at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:389)
	at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withCustomExecutionEnv$1(SQLExecution.scala:187)
	at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:973)
	at org.apache.spark.sql.execution.SQLExecution$.withCustomExecutionEnv(SQLExecution.scala:142)
	at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:339)
	at org.apache.spark.sql.Dataset.withAction(Dataset.scala:4286)
	at org.apache.spark.sql.Dataset.collectResult(Dataset.scala:3413)
	at com.databricks.backend.daemon.driver.OutputAggregator$.withOutputAggregation0(OutputAggregator.scala:267)
	at com.databricks.backend.daemon.driver.OutputAggregator$.withOutputAggregation(OutputAggregator.scala:101)
	at com.databricks.backend.daemon.driver.ScalaDriverLocal.$anonfun$getResultBufferInternal$3(ScalaDriverLocal.scala:345)
	at scala.Option.map(Option.scala:230)
	at com.databricks.backend.daemon.driver.ScalaDriverLocal.$anonfun$getResultBufferInternal$1(ScalaDriverLocal.scala:325)
	at scala.Option.map(Option.scala:230)
	at com.databricks.backend.daemon.driver.ScalaDriverLocal.getResultBufferInternal(ScalaDriverLocal.scala:289)
	at com.databricks.backend.daemon.driver.DriverLocal.getResultBuffer(DriverLocal.scala:890)
	at com.databricks.backend.daemon.driver.ScalaDriverLocal.repl(ScalaDriverLocal.scala:267)
	at com.databricks.backend.daemon.driver.DriverLocal.$anonfun$execute$22(DriverLocal.scala:765)
	at com.databricks.unity.EmptyHandle$.runWith(UCSHandle.scala:41)
	at com.databricks.backend.daemon.driver.DriverLocal.$anonfun$execute$20(DriverLocal.scala:750)
	at com.databricks.logging.UsageLogging.$anonfun$withAttributionContext$1(UsageLogging.scala:377)
	at scala.util.DynamicVariable.withValue(DynamicVariable.scala:62)
	at com.databricks.logging.AttributionContext$.withValue(AttributionContext.scala:108)
	at com.databricks.logging.UsageLogging.withAttributionContext(UsageLogging.scala:375)
	at com.databricks.logging.UsageLogging.withAttributionContext$(UsageLogging.scala:372)
	at com.databricks.backend.daemon.driver.DriverLocal.withAttributionContext(DriverLocal.scala:62)
	at com.databricks.logging.UsageLogging.withAttributionTags(UsageLogging.scala:420)
	at com.databricks.logging.UsageLogging.withAttributionTags$(UsageLogging.scala:405)
	at com.databricks.backend.daemon.driver.DriverLocal.withAttributionTags(DriverLocal.scala:62)
	at com.databricks.backend.daemon.driver.DriverLocal.execute(DriverLocal.scala:728)
	at com.databricks.backend.daemon.driver.DriverWrapper.$anonfun$tryExecutingCommand$1(DriverWrapper.scala:622)
	at scala.util.Try$.apply(Try.scala:213)
	at com.databricks.backend.daemon.driver.DriverWrapper.tryExecutingCommand(DriverWrapper.scala:614)
	at com.databricks.backend.daemon.driver.DriverWrapper.executeCommandAndGetError(DriverWrapper.scala:533)
	at com.databricks.backend.daemon.driver.DriverWrapper.executeCommand(DriverWrapper.scala:568)
	at com.databricks.backend.daemon.driver.DriverWrapper.runInnerLoop(DriverWrapper.scala:438)
	at com.databricks.backend.daemon.driver.DriverWrapper.runInner(DriverWrapper.scala:381)
	at com.databricks.backend.daemon.driver.DriverWrapper.run(DriverWrapper.scala:232)
	at java.lang.Thread.run(Thread.java:748)
Caused by: org.apache.spark.SparkException: [FAILED_EXECUTE_UDF] Failed to execute user defined function ($read$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$Lambda$7743/415629414: (string) => array<tinyint>)
	at org.apache.spark.sql.errors.QueryExecutionErrors$.failedExecuteUserDefinedFunctionError(QueryExecutionErrors.scala:284)
	at org.apache.spark.sql.errors.QueryExecutionErrors.failedExecuteUserDefinedFunctionError(QueryExecutionErrors.scala)
	at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.generate_doConsume_0$(Unknown Source)
	at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source)
	at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
	at org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:761)
	at org.apache.spark.sql.execution.collect.UnsafeRowBatchUtils$.encodeUnsafeRows(UnsafeRowBatchUtils.scala:80)
	at org.apache.spark.sql.execution.collect.Collector.$anonfun$processFunc$1(Collector.scala:179)
	at org.apache.spark.scheduler.ResultTask.$anonfun$runTask$3(ResultTask.scala:75)
	at com.databricks.spark.util.ExecutorFrameProfiler$.record(ExecutorFrameProfiler.scala:110)
	at org.apache.spark.scheduler.ResultTask.$anonfun$runTask$1(ResultTask.scala:75)
	at com.databricks.spark.util.ExecutorFrameProfiler$.record(ExecutorFrameProfiler.scala:110)
	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:55)
	at org.apache.spark.scheduler.Task.doRunTask(Task.scala:168)
	at org.apache.spark.scheduler.Task.$anonfun$run$1(Task.scala:136)
	at com.databricks.spark.util.ExecutorFrameProfiler$.record(ExecutorFrameProfiler.scala:110)
	at org.apache.spark.scheduler.Task.run(Task.scala:96)
	at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$13(Executor.scala:889)
	at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1692)
	at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$4(Executor.scala:892)
	at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
	at com.databricks.spark.util.ExecutorFrameProfiler$.record(ExecutorFrameProfiler.scala:110)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:747)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	... 1 more
Caused by: java.lang.VerifyError: Bad type on operand stack
Exception Details:
  Location:
    com/google/protobuf/GeneratedMessageV3$ExtendableMessage.hasExtension(Lcom/google/protobuf/GeneratedMessage$GeneratedExtension;)Z @2: invokevirtual
  Reason:
    Type 'com/google/protobuf/GeneratedMessage$GeneratedExtension' (current frame, stack[1]) is not assignable to 'com/google/protobuf/ExtensionLite'
  Current Frame:
    bci: @2
    flags: { }
    locals: { 'com/google/protobuf/GeneratedMessageV3$ExtendableMessage', 'com/google/protobuf/GeneratedMessage$GeneratedExtension' }
    stack: { 'com/google/protobuf/GeneratedMessageV3$ExtendableMessage', 'com/google/protobuf/GeneratedMessage$GeneratedExtension' }
  Bytecode:
    0x0000000: 2a2b b600 21ac                         

	at edu.stanford.nlp.pipeline.ProtobufAnnotationSerializer.toProtoBuilder(ProtobufAnnotationSerializer.java:673)
	at edu.stanford.nlp.pipeline.ProtobufAnnotationSerializer.toProto(ProtobufAnnotationSerializer.java:641)
	at edu.stanford.nlp.pipeline.ProtobufAnnotationSerializer.write(ProtobufAnnotationSerializer.java:184)
	at $line8375436040684b87ae4990b863aa80eb37.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw.$anonfun$serialiseTDA$2(command-2148266106888542:10)
	at scala.util.Try$.apply(Try.scala:213)
	at $line8375436040684b87ae4990b863aa80eb37.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw.$anonfun$serialiseTDA$1(command-2148266106888542:6)
	at scala.Option.map(Option.scala:230)
	at $line8375436040684b87ae4990b863aa80eb37.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw.serialiseTDA(command-2148266106888542:5)
	at $line8375436040684b87ae4990b863aa80eb37.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw.$anonfun$serialiseTDASpark$1(command-2148266106888542:31)
	... 24 more

I am trying to serialise the annotations (tokens, lemmas, and dependency parses) so I can reuse them later with semgrex.

Spark 3.3.0 on Azure Databricks
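
The VerifyError (a GeneratedMessage$GeneratedExtension that is not assignable to ExtensionLite) looks like two different protobuf versions clashing on the classpath. As a sanity check, one can print which jar each of the implicated classes is actually loaded from. This is just a diagnostic sketch, not code from the failing job:

    // Diagnostic sketch: report which jar provides each protobuf class named in the VerifyError.
    // Worth running both on the driver and inside a task, since executors can resolve differently.
    for (name <- Seq("com.google.protobuf.GeneratedMessageV3",
                     "com.google.protobuf.GeneratedMessage",
                     "com.google.protobuf.ExtensionLite")) {
      val source = Option(Class.forName(name).getProtectionDomain.getCodeSource)
      println(s"$name -> ${source.map(_.getLocation).getOrElse("bootstrap/unknown")}")
    }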

@mkarmona
Author

mkarmona commented Oct 25, 2022

I checked against Databricks Runtime 11.3, which ships Hadoop 3.3.4, and it failed as well.

@AngledLuffa
Contributor

How certain are you that upgrading the protobuf package would fix this issue?

@mkarmona
Author

mkarmona commented Oct 26, 2022

@AngledLuffa not at all; it works on plain Spark 3.3.1 outside the Databricks environment, so it may well be Databricks' fault. Let me reformulate my question: what is the easiest approach to serialising (not with protobuf) the indexed words (with lemmas and POS tags), the sentences, and the dependency parses into XML or JSON, just to load them back again and run semgrex over them? Is there any file you could point me to, even if I have to write some code on my side? The main point for me is to save the results so I don't have to recompute them when the rules change.

@AngledLuffa
Contributor

AngledLuffa commented Oct 28, 2022 via email

@AngledLuffa
Contributor

AngledLuffa commented Nov 4, 2022 via email

@AngledLuffa
Contributor

Looking over the differences between those protoc versions, I think updating from 3.19.2 to 3.19.6 will not make a difference for your case. I did it anyway in the dev branch, since GitHub was complaining about the dependency.

Having said that, searching on StackOverflow for this particular error makes me think there is a missed compiler error somewhere... not sure where, though.

https://stackoverflow.com/questions/30365106/reason-for-the-exception-java-lang-verifyerror-bad-type-on-operand-stack

You asked about a protobuf format suitable for semgrex requests. As it turns out, all you need is the tokens (with all their attributes) and the dependency graph, right? That already exists in CoreNLP.proto as SemgrexRequest. You would need to serialize it yourself, though, I believe. The ProtobufAnnotationSerializer.java methods public CoreNLPProtos.Token toProto(CoreLabel coreLabel) and public static CoreNLPProtos.DependencyGraph toProto(SemanticGraph graph) would do most of that work for you, but there is no wrapper which turns a list of sentences and a list of graphs into a SemgrexRequest. If that works for you, and you're able to add some code to build the requests, we'd be happy to accept a PR.
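
For concreteness, such a wrapper might look roughly like the following. This is a minimal, unverified sketch; the builder methods (addSemgrex, addQuery, addToken, setGraph) are assumptions based on the SemgrexRequest field layout in CoreNLP.proto, while the two toProto calls are the methods mentioned above.

    import edu.stanford.nlp.ling.CoreLabel
    import edu.stanford.nlp.pipeline.{CoreNLPProtos, ProtobufAnnotationSerializer}
    import edu.stanford.nlp.semgraph.SemanticGraph

    // Sketch: turn (tokens, graph) pairs plus semgrex patterns into a SemgrexRequest.
    def buildSemgrexRequest(sentences: Seq[(Seq[CoreLabel], SemanticGraph)],
                            patterns: Seq[String]): CoreNLPProtos.SemgrexRequest = {
      val serializer = new ProtobufAnnotationSerializer()
      val request = CoreNLPProtos.SemgrexRequest.newBuilder()
      patterns.foreach(request.addSemgrex)                          // assumed field: semgrex
      for ((tokens, graph) <- sentences) {
        val query = CoreNLPProtos.SemgrexRequest.Dependencies.newBuilder()
        tokens.foreach(t => query.addToken(serializer.toProto(t)))  // toProto(CoreLabel)
        query.setGraph(ProtobufAnnotationSerializer.toProto(graph)) // static toProto(SemanticGraph)
        request.addQuery(query.build())                             // assumed field: query
      }
      request.build()
    }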

@mkarmona
Author

mkarmona commented Jan 5, 2023

@AngledLuffa thanks for the pointers. I learnt from the toProto and fromProto functions for SemanticGraph and implemented my own custom de/serialisation, without protobuf, for Spark. I can now restore the SemanticGraph and check it against any number of rules at scale.

Three main inner serialisations are needed to be able to deserialise a CoreNLP SemanticGraph:

  • tokens (CoreLabel)
  • edges (SemanticGraphEdge)
  • roots (Int)

I keep the tokens because, for the edges and roots, I only store token indices (a sketch of the reconstruction follows the example below). As a simple test on my side, here is the string representation of a deserialised semantic graph for a random sentence.

In adults, FMRFamide is primarily transcribed in the head and thorax, and FMRFamideR is primarily transcribed in the thorax.

[transcribed/VBN
  obl:in>[adults/NNS case>In/IN]
  punct>,/,
  nsubj:pass>FMRFamide/NNP
  aux:pass>is/VBZ
  advmod>primarily/RB
  obl:in>[head/NN case>in/IN det>the/DT conj:and>[thorax/NN cc>and/CC]]
  obl:in>[thorax/NN cc>and/CC]
  punct>,/,
  conj:and>[transcribed/VBN
            cc>and/CC
            nsubj:pass>FMRFamideR/NNP
            aux:pass>is/VBZ
            advmod>primarily/RB
            obl:in>[thorax/NN case>in/IN det>the/DT]]
  punct>./.]
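
For reference, the reconstruction side is roughly the following. This is a simplified sketch of my approach, where EdgeRow is my own hypothetical storage type and relation names are parsed back with GrammaticalRelation.valueOf:

    import scala.collection.JavaConverters._
    import edu.stanford.nlp.ling.{CoreLabel, IndexedWord}
    import edu.stanford.nlp.semgraph.SemanticGraph
    import edu.stanford.nlp.trees.GrammaticalRelation

    // Storage row for one edge: governor/dependent token indices plus the relation data.
    case class EdgeRow(gov: Int, dep: Int, relation: String, weight: Double, isExtra: Boolean)

    def rebuildGraph(tokens: Seq[CoreLabel], edges: Seq[EdgeRow], roots: Seq[Int]): SemanticGraph = {
      val graph = new SemanticGraph()
      // Wrap each stored CoreLabel and key it by its sentence-level token index.
      val byIndex = tokens.map(t => t.index() -> new IndexedWord(t)).toMap
      byIndex.values.foreach(graph.addVertex)
      for (e <- edges) {
        graph.addEdge(byIndex(e.gov), byIndex(e.dep),
          GrammaticalRelation.valueOf(e.relation), e.weight, e.isExtra)
      }
      graph.setRoots(roots.map(byIndex).asJava)
      graph
    }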

@mkarmona mkarmona closed this as completed Jan 5, 2023
@AngledLuffa
Contributor

Were you able to figure out a root cause for the problem?

@mkarmona
Author

mkarmona commented Jan 6, 2023

@AngledLuffa I didn't dig into it further. The Databricks platform ships old dependencies, so finding the root cause might take more time than I would expect.
