Next steps, issues and backlog
jameskrobinson edited this page May 26, 2021 · 7 revisions
- Refactor the Saxon usage in the UDTF. It should be in its own class;
- Add support for date, datetime / timestamp and decimal / money data types;
- More unit tests, and a general unit test tidy-up;
- Improved error handling and logging;
- For the XSD -> XSLT process, determine data types from the XSD;
- Create a single XSD -> XSLT converter, which works for any flavour of XSD;
- As part of this, allow the user to stipulate the level of entity decomposition we do. Presently, it's none (in the Summit version) and everything (in the FpML one);
- Test schema evolution and allow for defaulting, e.g. when we get a new version of an XSD, what happens next?
- Spark SQL testing. This might require some serialization changes;
- A new UDF to live alongside the UDTF, which uses Saxon to validate XML against an XSD;
- A JOLT version of the UDTF: https://github.com/bazaarvoice/jolt ;
- Create a set of DDL / view creation scripts based on the XSD;
- Figure out how basic Data Lineage could be supported;
- Be able to pass the XSL file or XSD file parameter as a dynamic value. Right now it's a constant. If it were dynamic, you could use a lookup table of XML -> XSL file, which would be useful for all kinds of things;
- Similar to above, pass the XSL / XSD as a string, allowing us to store these values in a table rather than on the filesystem;
- And, similar to the above, check how files can be stored in HDFS / an S3 bucket / GS / etc. instead of on local filesystems;
- Performance testing and volume testing... need data and an environment!
- Test the combination of the UDTF, the XPath UDFs and the XML SerDe!
- Test with various Storage Handlers;
- Test with Impala / other Hive-like parts of Hadoop;
- Test with, and make use of, the Enterprise Edition of Saxon;
- See if I can get it to work with FIXML;
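
The backlog item about passing the XSL as a string (so it can live in a table rather than on the filesystem) could look roughly like the sketch below. It is only an illustration using the standard JAXP API, not the project's actual UDTF code: the class name `XslFromString`, the method `transform` and the sample `trade` document are all hypothetical. JAXP uses whichever `TransformerFactory` it finds, so this assumes either the JDK's built-in XSLT 1.0 processor or Saxon on the classpath.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

// Hypothetical helper: applies a stylesheet supplied as a string
// (e.g. read from a lookup table column) instead of a file path.
public class XslFromString {

    public static String transform(String xslText, String xmlText)
            throws TransformerException {
        // JAXP picks the processor; with Saxon on the classpath this
        // would normally resolve to Saxon's factory (assumption).
        TransformerFactory factory = TransformerFactory.newInstance();

        // Compile the stylesheet from the in-memory string.
        Transformer transformer =
                factory.newTransformer(new StreamSource(new StringReader(xslText)));

        // Run the transformation and capture the result as a string.
        StringWriter out = new StringWriter();
        transformer.transform(
                new StreamSource(new StringReader(xmlText)),
                new StreamResult(out));
        return out.toString();
    }

    public static void main(String[] args) throws TransformerException {
        // Made-up stylesheet and document, for illustration only.
        String xsl =
                "<xsl:stylesheet version=\"1.0\""
              + " xmlns:xsl=\"http://www.w3.org/1999/XSL/Transform\">"
              + "<xsl:output method=\"text\"/>"
              + "<xsl:template match=\"/trade\">"
              + "<xsl:value-of select=\"id\"/>"
              + "</xsl:template>"
              + "</xsl:stylesheet>";
        String xml = "<trade><id>42</id></trade>";

        System.out.println(transform(xsl, xml));
    }
}
```

Compiling the stylesheet on every call would be wasteful inside a UDTF that processes many rows; a real implementation would probably cache compiled `Templates` objects keyed by the stylesheet text or by a lookup key.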