Next steps, issues and backlog

jameskrobinson edited this page May 26, 2021 · 7 revisions

Must:

  • Refactor the Saxon usage in the UDTF into its own class;
  • Add support for date, datetime / timestamp and decimal / money data types;
  • Add more unit tests and tidy up the existing ones;
  • Improve error handling and logging;
  • For the XSD -> XSL process, determine data types from the XSD;
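
As a rough illustration of the first item above, the transformation logic could sit behind a small class that the UDTF simply calls. This is only a sketch under stated assumptions: the class name `XsltRunner` and its methods are hypothetical, it takes the stylesheet as a string for brevity, and it uses the JDK's standard JAXP API rather than Saxon-specific calls, so whichever XSLT processor JAXP finds is the one used.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.Templates;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

/** Hypothetical helper: owns stylesheet compilation so the UDTF only calls transform(). */
final class XsltRunner {
    // Compiled stylesheet; JAXP requires Templates to be safe to share across threads.
    private final Templates templates;

    XsltRunner(String xslText) {
        try {
            // Standard JAXP lookup. Assumption: the project would plug Saxon in here.
            TransformerFactory factory = TransformerFactory.newInstance();
            this.templates = factory.newTemplates(new StreamSource(new StringReader(xslText)));
        } catch (TransformerException e) {
            throw new IllegalArgumentException("Could not compile stylesheet", e);
        }
    }

    /** Applies the compiled stylesheet to one XML document and returns the result as text. */
    String transform(String xml) {
        try {
            // A Transformer is not thread-safe, so create one per call.
            Transformer transformer = templates.newTransformer();
            StringWriter out = new StringWriter();
            transformer.transform(new StreamSource(new StringReader(xml)), new StreamResult(out));
            return out.toString();
        } catch (TransformerException e) {
            throw new IllegalArgumentException("Could not transform document", e);
        }
    }
}
```

Compiling the stylesheet once in the constructor and reusing it per row keeps the per-record cost low, and wrapping failures in one place gives the error handling and logging items a single spot to hook into.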

Should:

  • Create a single XSD -> XSLT converter, which works for any flavour of XSD;
  • As part of this, let the user specify the level of entity decomposition. At present it is none (in the Summit version) and everything (in the FpML one);
  • Test schema evolution and allow for defaulting, e.g. what should happen when we receive a new XSD?
  • Spark SQL testing. This might require some serialization changes;

Could:

  • A new UDF to live alongside the UDTF, which uses Saxon to validate XML against an XSD;
  • A JOLT version of the UDTF: https://github.com/bazaarvoice/jolt ;
  • Create a set of DDL / view creation scripts based on the XSD;
  • Figure out how basic Data Lineage could be supported;
  • Be able to pass the XSL file or XSD file parameter as a dynamic value. Right now it's a constant. If it were dynamic, you could use a lookup table mapping XML -> XSL file, which would be useful for all kinds of things;
  • Similar to above, pass the XSL / XSD as a string, allowing us to store these values in a table rather than on the filesystem;
  • Similarly, check how files can be stored in HDFS / S3 bucket / GS / etc. instead of on the local filesystem;
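
To make the validation idea and the "XSD as a string" idea above a little more concrete, here is a minimal sketch. It is an assumption-laden stand-in, not the planned implementation: the class name `XsdCheck` is hypothetical, and it uses the JDK's `javax.xml.validation` API in place of Saxon so that it runs without extra dependencies. The schema arrives as a string, as it would if it were read from a table column.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import javax.xml.XMLConstants;
import javax.xml.transform.stream.StreamSource;
import javax.xml.validation.Schema;
import javax.xml.validation.SchemaFactory;
import javax.xml.validation.Validator;
import org.xml.sax.SAXException;

/** Hypothetical helper: validates one XML document against an XSD supplied as text. */
final class XsdCheck {
    static boolean isValid(String xml, String xsdText) {
        Schema schema;
        try {
            SchemaFactory factory = SchemaFactory.newInstance(XMLConstants.W3C_XML_SCHEMA_NS_URI);
            schema = factory.newSchema(new StreamSource(new StringReader(xsdText)));
        } catch (SAXException e) {
            // A broken schema is a caller error, not an "invalid document" result.
            throw new IllegalArgumentException("Could not compile XSD", e);
        }
        try {
            Validator validator = schema.newValidator();
            // With no ErrorHandler set, the first validation error is thrown as a SAXException.
            validator.validate(new StreamSource(new StringReader(xml)));
            return true;
        } catch (SAXException e) {
            return false; // document is malformed or does not conform to the schema
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

A real UDF would probably cache the compiled `Schema` per distinct XSD rather than recompiling it for every row; that is left out here to keep the sketch short.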

Would:

  • Performance testing and volume testing... need data and an environment!
  • Test the combination of the UDTF, XPath UDFs and the XML SerDe!
  • Test with various Storage Handlers;
  • Test with Impala / other Hive-like parts of Hadoop;
  • Test with, and make use of, the Enterprise Edition of Saxon;
  • See if I can get it to work with FixML;