CoreNLP fails to serialise with Protobuf in Spark #1311
I checked against Databricks Runtime 11.3, which contains Hadoop 3.3.4, and it failed.
How certain are you that upgrading the protobuf package would fix this issue?
@AngledLuffa not at all; it works in plain Spark 3.3.1 outside of the Databricks env, so it can indeed be DB's fault. If I can reformulate my question: what is the easiest approach to serialising (not protobuf) the indexed words (with lemmas and POS tags), the sentences, and the dependency parses into XML or JSON, just to load them back again and run semgrex over them? Is there any file you could point me to, even if I have to code something on my side? The main point for me is to avoid recomputing everything when the rules change.
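One minimal way to do this is a hand-rolled JSON layout. The sketch below is only an illustration of the idea, not CoreNLP's own output format: the field names (`tokens`, `edges`, `roots`, etc.) are assumptions made up for this example.

```python
import json

# Hypothetical minimal serialisation of one parsed sentence. The field
# names here are illustrative, not CoreNLP's actual schema.
sentence = {
    "tokens": [
        {"index": 1, "word": "Dogs", "lemma": "dog", "pos": "NNS"},
        {"index": 2, "word": "bark", "lemma": "bark", "pos": "VBP"},
    ],
    # Edges reference token indices, so the graph survives a round trip.
    "edges": [{"source": 2, "target": 1, "relation": "nsubj"}],
    "roots": [2],
}

serialized = json.dumps(sentence)   # write this string to disk
restored = json.loads(serialized)   # load it back later, no re-parsing

# Rebuild an adjacency view keyed by governor index.
graph = {}
for e in restored["edges"]:
    graph.setdefault(e["source"], []).append((e["relation"], e["target"]))

print(graph[2])  # [('nsubj', 1)]
```

Because edges hold indices rather than token copies, the file stays small and token attributes live in exactly one place.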
Give me a few days and I will address this - there is a deadline early next week.
If I update the version of protobuf we use and send you a zip file as a
fake release, are you able to test that, or are you only able to test Maven
releases?
Looking over the differences in protoc, I think updating from 3.19.2 to 3.19.6 will not make a difference for your case. I did it anyway in the dev branch, since GitHub was complaining about the dependency. Having said that, searching StackOverflow for this particular error makes me think there is a missed compiler error somewhere... not sure where, though.
You asked about a protobuf format suitable for semgrex requests. As it turns out, all you need is the tokens (with all their attributes) and the dependency graph, right? That exists in
@AngledLuffa thanks for the pointers. I learnt that there are three main inner serialisations needed to be able to deserialise a CoreNLP semantic graph. I keep the tokens because, for the edges and roots, I just store token indices.
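The index-based scheme described above can be sketched as follows. Everything here (field names, the `[governor, dependent, relation]` triple layout) is a hypothetical illustration of storing indices in edges and roots, not CoreNLP's API:

```python
import json

# Hypothetical three-part serialisation: tokens, edges, roots.
# Edges and roots hold token indices only; tokens are stored once.
data = json.loads(json.dumps({
    "tokens": {"1": {"word": "It", "lemma": "it", "pos": "PRP"},
               "2": {"word": "works", "lemma": "work", "pos": "VBZ"}},
    "edges": [[2, 1, "nsubj"]],   # [governor, dependent, relation]
    "roots": [2],
}))

# After the round trip, resolve indices back to full tokens.
tokens = {int(i): tok for i, tok in data["tokens"].items()}
root_words = [tokens[i]["word"] for i in data["roots"]]
deps = [(tokens[g]["word"], rel, tokens[d]["word"])
        for g, d, rel in data["edges"]]

print(root_words)  # ['works']
print(deps)        # [('works', 'nsubj', 'It')]
```

Note that JSON object keys are always strings, so the token indices need an `int(...)` conversion on load; storing tokens as a list indexed by position would avoid that.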
Were you able to figure out a root cause for the problem?
@AngledLuffa I didn't dig into it further. The Databricks platform has old dependencies, so it might take me more time than I would expect to find the root cause.
Not fully sure it is my fault, but this is still an issue indeed, at least in my case: I cannot serialise with Spark without hitting an exception. Regardless, the version of protobuf is not up to date, at least with respect to the 3.x branch.
The exception says
I am trying to serialise the tokens, lemmas, and dependency parse to reuse them later against semgrex.
Spark 3.3.0 on Azure Databricks