In read mode, request sincedb disk flush on each read loop iteration. #196

Merged: 5 commits merged into logstash-plugins:master on Jun 12, 2018

Conversation

@guyboertje (Contributor) commented Jun 6, 2018

The actual write occurs at the sincedb_write interval.

Note: the ratio of chunk size to events-per-second determines how often a request is made. For example, with a large chunk size (1MB) and lines of, say, 128 bytes, a pipeline running at 256 events/s takes 32 seconds per read loop iteration, while one running at 32768 events/s takes 250 ms.
With the default chunk size of 32768 bytes and an average line size of 128 bytes, a pipeline running at 256 events/s iterates every 1 second, but at 32768 events/s the loop takes 7.8125 ms, i.e. 128 no-op calls to request_disk_flush every second.
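A quick Ruby sketch (not part of the plugin, just reproducing the arithmetic above with the illustrative chunk and line sizes):

  # Reproduces the iteration timings quoted above; values are illustrative.
  def iteration_seconds(chunk_size_bytes, line_bytes, events_per_second)
    lines_per_chunk = chunk_size_bytes / line_bytes
    lines_per_chunk.to_f / events_per_second
  end

  iteration_seconds(1_048_576, 128, 256)    # => 32.0      (1MB chunk, 256 eps)
  iteration_seconds(1_048_576, 128, 32_768) # => 0.25      (1MB chunk, 32768 eps)
  iteration_seconds(32_768, 128, 256)       # => 1.0       (default chunk, 256 eps)
  iteration_seconds(32_768, 128, 32_768)    # => 0.0078125 (7.8125 ms, ~128 loops/s)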

Fixes #195

@@ -28,6 +28,9 @@ def handle_specifically(watched_file)
   watched_file.file_close
   # unset_watched_file will set sincedb_value.position to be watched_file.bytes_read
   sincedb_collection.unset_watched_file(watched_file)
+  # we know we're done, we should not wait until the discovery of a
+  # new file (or LS shutdown) to write out the sincedb
+  sincedb_collection.immediate_write
Member

Since we're now guaranteed to write the sincedb at exit, should we still keep this change that forces a sincedb write at EOF?

Contributor Author

Yes, I think so.
If not, then consider the case where one has, say, 5 small files and a default sincedb write interval of 15 seconds:

  • Each file takes 10 read loops, so 10 request_disk_flush calls are made per file, and none are granted because all files are processed in less than 15 seconds (see the sketch after this list).
  • The 11th loop is an immediate EOF.
  • No further write requests are made; only LS exit will force the write. This is a vulnerable situation, e.g. in a docker container (with persistent disks) that is killed, because we don't have the read-all-files-and-exit feature yet.
  • The user is forced to understand enough to set a very low sincedb write interval.
  • The discover/stat loop continues to discover and stat, though.
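A minimal sketch of the interval gating described above (illustrative only, not the plugin's actual SincedbCollection; it assumes request_disk_flush simply declines to write until the sincedb write interval has elapsed since the last write):

  # Illustrative sketch, not the real implementation.
  class GatedSincedb
    def initialize(write_interval)
      @write_interval = write_interval  # e.g. the default 15 seconds
      @last_write = Time.now
    end

    # Called on every read loop iteration.
    def request_disk_flush
      return if Time.now - @last_write < @write_interval  # no-op inside the window
      write_to_disk
    end

    def write_to_disk
      # ... serialize the sincedb entries here ...
      @last_write = Time.now
    end
  end

  # 5 small files x 10 read loops, all finishing well inside 15 seconds:
  # every one of the 50 request_disk_flush calls is a no-op, so only the
  # EOF immediate_write (or LS exit) persists the read positions.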

Earlier I had a nagging feeling about changing request_disk_flush so that it only sets a flag to true instead of checking the interval itself; the flag is set back to false when the file is written. Then a call to disk_flush_if_needed would be added to the iterate_on_state method (writing if the flag is set and the interval is exceeded).
I could not see the logic for it before, but now, after these changes and the SDH, I feel this is a better change.

WDYT?
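A rough sketch of that flag-based alternative (the names request_disk_flush, disk_flush_if_needed and iterate_on_state come from the comment above; everything else is illustrative, not the plugin's actual code):

  # Illustrative sketch of the flag-based flush.
  class FlaggedSincedb
    def initialize(write_interval)
      @write_interval = write_interval
      @last_write = Time.now
      @write_requested = false
    end

    # Called on every read loop iteration: just record that a write is wanted.
    def request_disk_flush
      @write_requested = true
    end

    # Added to iterate_on_state, so it runs on every pass over the watched files.
    def disk_flush_if_needed
      return unless @write_requested
      return if Time.now - @last_write < @write_interval
      write_to_disk
      @write_requested = false           # cleared once the file is written
    end

    def write_to_disk
      # ... serialize the sincedb entries here ...
      @last_write = Time.now
    end
  end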

Member

I understand the scenario, but IMO it is an edge case that could be worked around by terminating logstash gracefully before destroying the docker container.

Thinking the other way around, what about the downside to this? Calling a sincedb write on every EOF can be slow when there are many small files, since they create a large sincedb that needs to be rewritten on each EOF.

Do you think the tradeoff is worth it?

Contributor Author

Well my alternate suggestion (the nagging feeling) would not call immediate_write at EOF.

How about I implement it and you can critique again?

@jsvd (Member) commented Jun 11, 2018

There seems to be a failing spec:

  1) LogStash::Inputs::File testing with new, register, run and stop when #run is called multiple times should only actually open files when content changes are detected
     Failure/Error: subject { described_class.new(conf) }
     
     LogStash::ConfigurationError:
       Something is wrong with your configuration.
     # /home/travis/build/logstash/logstash-core/lib/logstash/config/mixin.rb:86:in `config_init'
     # /home/travis/build/logstash/logstash-core/lib/logstash/inputs/base.rb:60:in `initialize'
     # ./spec/inputs/file_tail_spec.rb:283:in `block in subject'
     # ./spec/inputs/file_tail_spec.rb:301:in `block in (root)'

@guyboertje (Contributor Author)

I don't know why that test failed. I re-ran it and it is OK now.

@jsvd (Member) left a comment

LGTM! 🚢 it

@guyboertje merged commit 323e669 into logstash-plugins:master on Jun 12, 2018
@guyboertje deleted the fix/195 branch on June 13, 2018 10:40