Skip to content
New issue

Have a question about this project? # for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “#”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? # to your account

[Bugfix] Improve Exception handling #23

Merged
merged 1 commit into from
Jul 27, 2016

Conversation

sakama
Copy link
Contributor

@sakama sakama commented Jul 27, 2016

Current implementation throws NumberFormatException if long value comes at column defined as double column.

In case of using Embulk from other program (e.g. EmbulkEmbed), program couldn't detect if Exception is a retryable Exception or not. And program will retry infinitely.

I changed code to throw subclass of DataException if invalid value comes.
DataException and ConfigException isn't a retryable Exception at Embulk, so program will be able to stop retrying.

Appendix

I tested with following config.yml and TSV file.

config.yml
in:
  type: file
  path_prefix: /path/to/test.tsv
  parser:
    type: csv
    delimiter: "\t"
    skip_header_lines: 1
    columns:
    - {name: id, type: long}
    - {name: json, type: json}
filters:
  - type: expand_json
    json_column_name: json_payload
    stop_on_invalid_record: 1
    root: "$."
    expanded_columns:
      - {name: "string", type: string}
      - {name: "boolean", type: boolean}

      - {name: "double", type: long} # invalid (expected DataException)
      #- {name: "double", type: double} # valid

      - {name: "long", type: double} # invalid (expected DataException)
      #- {name: "long", type: long} # valid

      - {name: "timestamp", type: long} # invalid (expected DataException)
      #- {name: "timestamp", type: timestamp, format: "%Y-%m-%d %H:%M:%S"} # valid

      #- {name: "json", type: double} # invalid (expected DataException)
      - {name: "json", type: json} # valid
out:
  type: stdout
test.tsv
id  json_payload
1   {"string":"abcde","boolean":1,"double":1.23,"long":10023,"timestamp":"2015-10-07 20:23:57","json":{"k1":"v1"}}
2   {"string":"fghij","boolean":0,"double":312.5,"long":20000,"timestamp":"2015-11-17 11:31:12","json":{"k1":"v1"}}
3   {"string":"klmno","boolean":1,"double":3.1415,"long":31200,"timestamp":"2015-12-13 13:25:32","json":{"k2":1.23}}
4   {"string":"pqrst","boolean":1,"double":2.34,"long":51000,"timestamp":"2016-01-03 23:41:33","json":{"k2":3.14}}
5   {"string":"uvwxy","boolean":0,"double":5.12,"long":43000,"timestamp":"2016-03-15 12:11:05","json":{"k3":1}}

@coveralls
Copy link

coveralls commented Jul 27, 2016

Coverage Status

Coverage increased (+0.8%) to 96.035% when pulling a8095b9 on sakama:improve-exception-handling into 1f19d6b on civitaspo:master.

@civitaspo
Copy link
Member

Could you paste the results before and after?

@civitaspo
Copy link
Member

I think this p-r has no problem, but I want to consider that which is better.

  1. when a value is invalid, then this plugin regard the record as invalid.
  2. when a value is invalid, then this plugin regard the only value as invalid.

If the latter is better, I think one more options should be added like null_if_invalid_value because the latter case cannot be supported by stop_on_invalid_record. (Of course this is implemented on another p-r, if needed.)
How do you think?
I think the latter is better.

@sakama
Copy link
Contributor Author

sakama commented Jul 27, 2016

Could you paste the results before and after?

Before

stop_on_invalid_record: true

Plugin throws NumberFormatException and faild.

$ embulk run /path/to/config.yml
org.embulk.exec.PartialExecutionException: java.lang.NumberFormatException: For input string: "{"k1":"v1"}"
...
Caused by: java.lang.NumberFormatException: For input string: "{"k1":"v1"}"
    at sun.misc.FloatingDecimal.readJavaFormatString(FloatingDecimal.java:2043)
    at sun.misc.FloatingDecimal.parseDouble(FloatingDecimal.java:110)
    at java.lang.Double.parseDouble(Double.java:538)
    at org.embulk.filter.expand_json.FilteredPageOutput.setExpandedJsonColumns(FilteredPageOutput.java:311)
    at org.embulk.filter.expand_json.FilteredPageOutput.add(FilteredPageOutput.java:214)
    at org.embulk.exec.LocalExecutorPlugin$ScatterTransactionalPageOutput$OutputWorker.call(LocalExecutorPlugin.java:394)
    at org.embulk.exec.LocalExecutorPlugin$ScatterTransactionalPageOutput$OutputWorker.call(LocalExecutorPlugin.java:319)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)

Error: java.lang.NumberFormatException: For input string: "{"k1":"v1"}"

stop_on_invalid_record: false

Actual: Same result as stop_on_invlaid_record: true
Expected: plugin only show warning messages and job will succeeded.

After

stop_on_invalid_record: true

Plugin throws DataException and faild.

$ embulk run /path/to/config.yml
org.embulk.exec.PartialExecutionException: org.embulk.spi.DataException: Found an invalid record
    ...
Caused by: org.embulk.spi.DataException: Found an invalid record
    at org.embulk.filter.expand_json.FilteredPageOutput.add(FilteredPageOutput.java:222)
    at org.embulk.exec.LocalExecutorPlugin$ScatterTransactionalPageOutput$OutputWorker.call(LocalExecutorPlugin.java:394)
    at org.embulk.exec.LocalExecutorPlugin$ScatterTransactionalPageOutput$OutputWorker.call(LocalExecutorPlugin.java:319)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)
Caused by: org.embulk.filter.expand_json.FilteredPageOutput$JsonValueInvalidException: Failed to parse '{"k1":"v1"}' as double
    at org.embulk.filter.expand_json.FilteredPageOutput.setExpandedJsonColumns(FilteredPageOutput.java:317)
    at org.embulk.filter.expand_json.FilteredPageOutput.add(FilteredPageOutput.java:216)
    ... 6 more
Caused by: java.lang.NumberFormatException: For input string: "{"k1":"v1"}"
    at sun.misc.FloatingDecimal.readJavaFormatString(FloatingDecimal.java:2043)
    at sun.misc.FloatingDecimal.parseDouble(FloatingDecimal.java:110)
    at java.lang.Double.parseDouble(Double.java:538)
    at org.embulk.filter.expand_json.FilteredPageOutput.setExpandedJsonColumns(FilteredPageOutput.java:314)
    ... 7 more

Error: org.embulk.spi.DataException: Found an invalid record

stop_on_invalid_record: false

Plugin shows warnings but successfully executed.

$ embulk run /path/to/config.yml
...
2016-07-27 13:28:10.482 +0900 [WARN] (embulk-output-executor-0): Skipped an invalid record (Failed to parse '{"k1":"v1"}' as double)
2016-07-27 13:28:10.487 +0900 [WARN] (embulk-output-executor-0): Skipped an invalid record (Failed to parse '{"k1":"v1"}' as double)
2016-07-27 13:28:10.490 +0900 [WARN] (embulk-output-executor-0): Skipped an invalid record (Failed to parse '{"k2":1.23}' as double)
2016-07-27 13:28:10.496 +0900 [WARN] (embulk-output-executor-0): Skipped an invalid record (Failed to parse '{"k2":3.14}' as double)
2016-07-27 13:28:10.502 +0900 [WARN] (embulk-output-executor-0): Skipped an invalid record (Failed to parse '{"k3":1}' as double)
2016-07-27 13:28:10.504 +0900 [INFO] (0001:transaction): {done:  1 / 1, running: 0}
2016-07-27 13:28:10.514 +0900 [INFO] (main): Committed.
2016-07-27 13:28:10.515 +0900 [INFO] (main): Next config diff: {"in":{"last_path":"/path/to/test.tsv"},"out":{}}

@sakama
Copy link
Contributor Author

sakama commented Jul 27, 2016

  1. when a value is invalid, then this plugin regard the record as invalid.
  2. when a value is invalid, then this plugin regard the only value as invalid.

In my usecase, I think 1 is better.
We're dealing with record as minimum unit at our embulk environment.
If we deal with value as minimum unit, Retry logic(including both of automatically and manually) must become too complex.

@civitaspo
Copy link
Member

If we deal with value as minimum unit, Retry logic(including both of automatically and manually) must become too complex.

ok. I got it. Actually I do not need that option in my case. I had just thought that that option was useful in some cases (like when importing from schema-less datastore and sometimes some values of the data are invalid). If someone wants, we consider this again.

@civitaspo civitaspo merged commit 39a7c2e into embulk:master Jul 27, 2016
@civitaspo
Copy link
Member

@sakama sakama deleted the improve-exception-handling branch July 27, 2016 05:58
@sakama
Copy link
Contributor Author

sakama commented Jul 27, 2016

Thanks 👍

# for free to join this conversation on GitHub. Already have an account? # to comment
Labels
None yet
Development

Successfully merging this pull request may close these issues.

3 participants