Transforming DELA into SQL or JSON #2
Hi Peter, no, there is no such recipe. In which context does this approach help? |
Hi @eric-laporte, Portuguese is not such a "natural" language: there are "official rules", such as the Portuguese Language Orthographic Agreement of 1990. So 90% of the educated Portuguese vocabulary is defined "by law", not "by statistics". Each corpus-dictionary pair must be organized by date range. SQL is the only "universal tool" for managing all these relationships. In my opinion, SQL management is good for:
Nowadays, with the SQL:2016 standard (so with PostgreSQL v9.5+), all the listed features are possible. Unitex-core is perfect for micro-management and specialist operations; SQL is perfect for managing big data and macro operations. It is not necessary to "plug" the two together, because it is easy to export/import data automatically in both directions (!). But, with a little investment, it is also possible to plug them together through SQL C++ modules (copying some strategic Unitex-core functions into the database), unifying the systems. 10 years ago we used SQL in the São Carlos municipality, and in 2009 something (for plurals) in the LexML project. It was a proof of concept for DELA translation and SQL use; I think the results were positive. |
Peter,
|
Peter, one more comment: your proposal is not specific to the language resources of Brazilian Portuguese. The present Unitex/GramLab code for corpora and dictionaries is the same for all languages, except for some configuration files that select some code for some languages. |
I am starting with open data conversions before SQL and/or JSON: any database user who wants to "play" with Unitex dictionary data needs to access it without friction (see the Frictionless Data initiative), by simple copy/paste or SQL's "copy from CSV". I converted everything to UTF-8, which is the Web standard and is easy to manage on GitHub; then I adapted everything to a tabular CSV format, which is the simplest and most widely accepted format for data interchange (RFC 4180 and W3C's tabular data model standards). I am trying to do this first step here. Answering your comments:
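To make the CSV step concrete, here is a minimal Python sketch of turning DELAF-style entries into tabular rows. It assumes the common entry shape `inflected,lemma.CODES:inflection` (for example `casas,casa.N:fp`) and that an empty lemma means "same as the inflected form". The column names and the helper names `parse_delaf_line` and `delaf_to_csv` are hypothetical, and escaped separators and trailing comments are deliberately not handled.

```python
import csv
import io
import re

# Assumed entry shape: inflected,lemma.CODES[:inflection[:inflection...]]
# Escaped separators (such as "\,") and "/comment" suffixes are NOT handled.
ENTRY = re.compile(
    r"^(?P<inflected>[^,]+),(?P<lemma>[^.]*)\.(?P<codes>[^:]+)(?P<infl>(?::[^:]+)*)$"
)

# Hypothetical column names, chosen for this example only.
FIELDS = ["inflected", "lemma", "pos", "traits", "inflections"]


def parse_delaf_line(line):
    """Split one DELAF-style line into a flat dict of columns."""
    m = ENTRY.match(line.strip())
    if m is None:
        raise ValueError(f"unrecognized entry: {line!r}")
    codes = m["codes"].split("+")
    inflections = [c for c in m["infl"].split(":") if c]
    return {
        "inflected": m["inflected"],
        # Assumption: an empty lemma stands for the inflected form itself.
        "lemma": m["lemma"] or m["inflected"],
        "pos": codes[0],
        "traits": "+".join(codes[1:]),
        "inflections": " ".join(inflections),
    }


def delaf_to_csv(lines):
    """Return CSV text (with header) for an iterable of DELAF-style lines."""
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=FIELDS, lineterminator="\n")
    writer.writeheader()
    for line in lines:
        if line.strip():  # skip blank lines
            writer.writerow(parse_delaf_line(line))
    return out.getvalue()
```

A file produced this way can be loaded with any CSV-aware tool, including a database's bulk "copy from CSV" command; keeping the multi-valued inflection codes in one space-separated column is only one of several reasonable layouts.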
Hmm... OK, I'm alone in this demand: I am abandoning it for now.
The idea is to solve, in the distant future, "the problem of the many pt-BR dictionaries": one for each historical period, with well-defined date-range constraints (the laws enacting orthographic reforms). Typical applications are a "dated spell checker" (in OCR and digital preservation contexts) or a "dated dictionary" (for search optimization).
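As a rough illustration of the "dated dictionary" idea, the sketch below stores each spelling with a validity range and checks whether a form was valid on a given date. It uses Python's built-in `sqlite3` only to stay self-contained (the thread itself has PostgreSQL in mind). The table layout and the names `build_db`, `add_spelling` and `is_valid` are hypothetical.

```python
import sqlite3


def build_db():
    """Create an in-memory database with one date-ranged spelling table."""
    con = sqlite3.connect(":memory:")
    # valid_from is inclusive; valid_to is exclusive, NULL means "still valid".
    # Dates are ISO-8601 strings, so plain string comparison orders them.
    con.execute(
        """
        CREATE TABLE spelling (
            form       TEXT NOT NULL,
            lemma      TEXT NOT NULL,
            valid_from TEXT NOT NULL,
            valid_to   TEXT
        )
        """
    )
    return con


def add_spelling(con, form, lemma, valid_from, valid_to=None):
    """Register a spelling together with its validity range."""
    con.execute(
        "INSERT INTO spelling (form, lemma, valid_from, valid_to) VALUES (?, ?, ?, ?)",
        (form, lemma, valid_from, valid_to),
    )


def is_valid(con, form, on_date):
    """Return True if `form` is an accepted spelling on `on_date` (ISO string)."""
    row = con.execute(
        """
        SELECT 1 FROM spelling
        WHERE form = ?
          AND valid_from <= ?
          AND (valid_to IS NULL OR ? < valid_to)
        LIMIT 1
        """,
        (form, on_date, on_date),
    ).fetchone()
    return row is not None
```

For instance, after `add_spelling(con, "idéia", "ideia", "1943-01-01", "2009-01-01")` and `add_spelling(con, "ideia", "ideia", "2009-01-01")`, a dated spell checker would accept "idéia" for a document dated 2000 and reject it for one dated 2010. The dates here are placeholders chosen for the example, not a statement about when any reform took legal effect.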
... Not so huge: as I commented above, it needs only a "little investment ... copy/paste some strategic Unitex-core functions". The first step is to define a good data model, which is what I want to do in the coming weeks...
Yes, we need to focus on these goals.
Yes, we can extend it to other languages... But the main justification (my rationale above) is that pt-BR is "defined by law". For others, like English (defined by a "cultural ecosystem"), the date-range reference is not attractive: for English it is difficult to assign corpus segments or dictionaries to a date range.
Yes, I am also assuming that for pt-BR and all other Indo-European languages the default can change from UTF-16 to UTF-8. The ideal for an open dataset is to adopt its local base charset by default. ... I will offer a Perl or shell script that rebuilds the files, all in UTF-16 and the other formats Unitex expects. My next step is to understand what the minimal primary source for a Unitex dictionary is... I am assuming that DELAS + DELACF + Inflections are all (the minimum) that we need to generate a complete dictionary (e.g. DELAF_PB). |
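The rebuild step could look roughly like the following. This is a sketch in Python rather than the Perl or shell script mentioned above, and it assumes the target is UTF-16 little-endian with a byte-order mark. The function names `utf8_to_utf16le_bom` and `convert_file` are hypothetical.

```python
import codecs


def utf8_to_utf16le_bom(data: bytes) -> bytes:
    """Re-encode UTF-8 bytes as UTF-16 LE, prefixed with a byte-order mark."""
    # "utf-8-sig" accepts input with or without a leading UTF-8 BOM.
    text = data.decode("utf-8-sig")
    # Assumption: the consumer expects little-endian UTF-16 with a BOM.
    return codecs.BOM_UTF16_LE + text.encode("utf-16-le")


def convert_file(src_path: str, dst_path: str) -> None:
    """Read a UTF-8 file and write its UTF-16 LE (with BOM) counterpart."""
    with open(src_path, "rb") as src:
        data = src.read()
    with open(dst_path, "wb") as dst:
        dst.write(utf8_to_utf16le_bom(data))
```

Working on bytes rather than text-mode files keeps line endings untouched, so the UTF-8 files in the repository and the regenerated UTF-16 files differ only in encoding.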
Hi Peter, |
I'm closing this issue. Transforming DELA into SQL or JSON is not, at least for now, part of the project's goals. Nevertheless, feel free to open a PR on the core repository if you want to share such a module. |
Hi, I am "new" here: I used Unitex ~10 years ago, and we translated everything to SQL, which was a good approach for managing big data... Is there a "UnitexGramLab recipe" for transforming DELA datasets into SQL (PostgreSQL) or JSON?