
Add JsonL-Format as possible logformat #47

Open
wants to merge 4 commits into main

Conversation

Daveismus

This PR adds support for JSONL (a.k.a. ndjson) logs. Files are detected by the .jsonl file extension.

Currently, the timestamp is searched for among all top-level keys of each JSON object. Each key is printed on its own line (\n characters are not escaped); that is the simplest option for displaying JSON objects. More complex display options could be added in the future.

There is also a small example file that can be read with the new reader.
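A rough sketch of the adaptive detection described above - scan the top-level keys of a record and pick the first string value that parses as a timestamp. The helper name and the format list are illustrative assumptions, not the PR's actual code:

```python
import json
from datetime import datetime

# Illustrative formats only; the real reader may accept more.
TIMESTAMP_FORMATS = ["%Y-%m-%d %H:%M:%S", "%Y-%m-%dT%H:%M:%S"]

def detect_timestamp_key(record):
    """Return the first top-level key whose value parses as a timestamp."""
    for key, value in record.items():
        if not isinstance(value, str):
            continue  # timestamps arrive as JSON strings
        for fmt in TIMESTAMP_FORMATS:
            try:
                datetime.strptime(value, fmt)
                return key
            except ValueError:
                pass
    return None

line = '{"level": "WARN", "LOG_TIME": "2023-07-14 08:00:01", "message": "Connection lost"}'
print(detect_timestamp_key(json.loads(line)))  # LOG_TIME
```

Note that LOG_TIME is neither the first key nor named "timestamp", so a fixed-name lookup would miss it.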

@ptmcg
Owner

ptmcg commented Feb 10, 2025

Please also add a log9.jsonl file with this content:

{"level": "WARN", "LOG_TIME": "2023-07-14 08:00:01", "message":"Connection lost due to timeout"}
{"level": "CRITICAL", "LOG_TIME": "2023-07-14 08:00:04", "message": "Request processed unsuccessfully", "stacktrace": "Something went wrong\nTraceback (last line is latest):\n    sample.py: line 32\n        divide(100, 0)\n    sample.py: line 8\n        return a / b\nZeroDivisionError: division by zero"}
{"level": "INFO", "LOG_TIME": "2023-07-14 08:00:06", "message": "User authentication failed"}
{"level": "DEBUG", "LOG_TIME": "2023-07-14 08:00:08", "message": "Starting data synchronization"}
{"level": "INFO", "LOG_TIME": "2023-07-14 08:00:11", "message": "Processing incoming request"}
{"level": "INFO", "LOG_TIME": "2023-07-14 08:00:11", "message": "Processing incoming request (a little more...)"}
{"level": "DEBUG", "LOG_TIME": "2023-07-14 08:00:14", "message": "Performing database backup"}
{"level": "WARN", "LOG_TIME": "2023-07-14 08:00:16", "message": "Invalid input received: missing required field"}
{"level": "ERROR", "LOG_TIME": "2023-07-14 08:00:19", "message": "Failed to connect to remote server"}
{"level": "INFO", "LOG_TIME": "2023-07-14 08:00:22", "message": "Sending email notification"}
{"level": "WARN", "LOG_TIME": "2023-07-14 08:00:25", "message": "Slow response time detected"}
{"level": "INFO", "LOG_TIME": "2023-07-14 08:00:27", "message": "Data synchronization completed"}

to test out your logic for detecting the timestamp key, even when

  • it is not the first key in the JSON
  • it is not named 'timestamp'

EDIT: also please add lines that are problematic:

  • timestamp is an empty string
  • timestamp key is missing

@ptmcg
Owner

ptmcg commented Feb 10, 2025

Overall, this looks very good. I like your adaptive timestamp detection and removal from the logged content.

If you just remove the print statement and add the extra test file, I'll go ahead and merge this. I'll take care of updating the changelog and other overhead bits (docs for describing supported file types, etc.).

Also, I'll add support for detecting the presence of the orjson JSON parser (written in Rust). I've used this on other projects, and it has a number of features that improve on the json module of the stdlib.

EDIT: well, I found a few more things to discuss before merging. See other comments.
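The optional-dependency pattern mentioned above (use orjson when available, fall back to the stdlib) is commonly written like this; a sketch, not the code that was actually added:

```python
# Prefer the faster Rust-based orjson parser when it is installed; otherwise
# fall back to the stdlib json module. Callers only ever see loads().
try:
    import orjson

    def loads(s):
        return orjson.loads(s)
except ImportError:
    import json

    def loads(s):
        return json.loads(s)

record = loads('{"level": "INFO", "LOG_TIME": "2023-07-14 08:00:06"}')
print(record["level"])  # INFO
```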

@ptmcg ptmcg self-requested a review February 10, 2025 08:18
Owner

@ptmcg ptmcg left a comment


See previous comments.

@ptmcg
Owner

ptmcg commented Feb 10, 2025

This file format poses some interesting issues compared to the other types. Most of the others deal with processing lines of text, so all the fields are strings by default. Now that we are getting data that has been through the JSON parser, it can be any of a number of types (int, float, bool, None, even dict if the JSON is nested), so we can't assume strings for everything.
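As an illustration of the point above, a display routine has to stringify each value explicitly, since parsed JSON values may be int, float, bool, None, or nested structures. The function name is hypothetical:

```python
import json

def format_record(record):
    """Render each top-level key on its own line, stringifying non-str values."""
    lines = []
    for key, value in record.items():
        if isinstance(value, (dict, list)):
            text = json.dumps(value)   # re-serialize nested structures
        elif value is None:
            text = ""                  # JSON null -> empty field
        else:
            text = str(value)          # int, float, bool, str
        lines.append(f"{key}: {text}")
    return "\n".join(lines)

print(format_record({"level": "INFO", "retries": 3, "ok": True, "ctx": {"a": 1}}))
```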

@Daveismus
Author

EDIT: also please add lines that are problematic:

  • timestamp is an empty string
  • timestamp key is missing

What do you think should happen if no timestamp key is available? I don't think it makes sense to append such a row to the previous row.

@ptmcg
Owner

ptmcg commented Feb 17, 2025

I think this brings up a difference in approach from the log reader that reads loosely-formatted text files. In the case of a text file, lines with no timestamp are presumed to be continuations of the last timestamped line (if a traceback gets logged, for instance). With parsed JSON, it is kind of a puzzle: why would we get a line with no timestamp? And what should we do with such a line? Dropping it on the floor seems easiest, but also least friendly to the user - that line might have important stuff in it. I guess we could just log a warning in that case so it doesn't get lost, but not make any extra assumptions about it.
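The warn-and-skip behavior suggested here could look roughly like this (the helper name is an assumption, not the PR's code):

```python
import logging

logger = logging.getLogger("jsonl_reader")

def extract_timestamp(record, ts_key):
    """Return the record's timestamp string, or None (with a warning)
    if the key is missing or its value is an empty string."""
    value = record.get(ts_key)
    if not value:
        logger.warning("record has no usable %r value: %r", ts_key, record)
        return None
    return value

print(extract_timestamp({"LOG_TIME": "2023-07-14 08:00:01"}, "LOG_TIME"))
# 2023-07-14 08:00:01
```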

Also, with regard to the code that looks for a new timestamp key if the old one changes to a new one - I'm a little wary of being too helpful there. In the past I have written APIs with similar helpfulness in mind, and I ended up getting tied up in some knots because an API was too flexible, and this feels similar. Do you think this is going to be a common occurrence? Have you seen this in the log files you work with? To begin, I feel we should start strict, and require the key to be the same throughout a given jsonl file, and if this becomes more common, we'll address it in a future version.
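The "start strict" policy - require the same timestamp key throughout a given jsonl file and fail fast otherwise - could be sketched as follows (names are hypothetical):

```python
def merge_jsonl(records, ts_key):
    """Pair each record with its timestamp; reject records missing ts_key."""
    out = []
    for i, record in enumerate(records):
        if ts_key not in record:
            raise ValueError(f"line {i + 1}: timestamp key {ts_key!r} missing")
        out.append((record[ts_key], record))
    return out

rows = merge_jsonl([{"t": "08:00:01", "m": "ok"}], "t")
```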

@Daveismus
Author

Without the dynamic timestamp column search, a row without a timestamp will be appended to the previous line.

The only time I found changing keys was when I changed the logging mechanism. But to be honest, it was just bad practice to have both versions in the same file.

@Daveismus Daveismus requested a review from ptmcg February 24, 2025 12:33