Skip to content
New issue

Have a question about this project? # for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “#”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? # to your account

unix.c: use posix_fallocate() #409

Merged
merged 1 commit into from
Jul 29, 2020
Merged

Conversation

wladmis
Copy link
Contributor

@wladmis wladmis commented Jul 3, 2020

Using of posix_fallocate() guarantees that, if it succeed, the
attempting to write to allocated space range does not fail because of
lack of storage space. This prevents SIGBUS when trying to write to
mmaped file and no space left.

Co-Authored-by: Ivan Zakharyaschev imz@altlinux.org
Reported-by: Mikhail Kulagin <m.kulagin at postgrespro dot ru>

@knet-ci-bot
Copy link

Can one of the admins verify this patch?

@chrissie-c
Copy link
Contributor

test this please

lib/unix.c Outdated
@@ -138,7 +151,7 @@ qb_sys_mmap_file_open(char *path, const char *file, size_t bytes,
}
free(buffer);
}

#endif
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As discovered by @wladmis , a newline was lost unfortunately.

Suggested change
#endif
#endif

Copy link
Contributor

@imz imz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The resulting patch is quite elegant and robust; on systems without posix_fallocate, the old code employing write would be at work as a fallback.


A sidenote: We have seen problems on linux because of write not actually allocating disk space. Perhaps, the problem is that linux optimizes write(fd, calloc(1, write_size), write_size) as creating a hole, because it sees that the pointer returned by calloc points to a zeroed page (also as a result of optimizing calloc).


Another optimization idea: further in the code, this fd is mmapped and zeroed with memset before usage. (Something to be checked more accurately!) Now, if we see that write(fd, calloc(1, write_size), write_size) is optimized by linux and works faster than memset (because we got the actual SIGBUS "no space" only when reaching the memset), we could stick to the write(fd, calloc(1, write_size), write_size) method of zeroing instead of memset if zeroing is needed by the logic of the code.

@chrissie-c
Copy link
Contributor

retest this please

@chrissie-c
Copy link
Contributor

I'm not sure how much optimisation that code needs (though I am in favour of posix_fallocate() part, thank you) as it's only called at rb_open() time.

@imz
Copy link
Contributor

imz commented Jul 3, 2020

I'm not sure how much optimisation that code needs (though I am in favour of posix_fallocate() part, thank you) as it's only called at rb_open() time.

Ok, so we shouldn't bother with optimizing memset as something interesting.

Continuing to think about this: As far as I can see now, even doing write after posix_fallocate shouldn't be needed to get the newly allocated blocks be read as zeros. Several sources coincide on saying they will be read as zeros:

The original question says something about writing zeros to the file. None of these calls write anything but metadata. If you read from space that's been preallocated but not yet written, you'll get zeros (not whatever was in that disk space previously, that would be a big security hole). You can only read up to the end of a file (the length, set by fallocate, ftruncate, or various other ways)

When posix_fallocate() is applied to an unallocated region in a regular file (a ``hole''), the hole is filled and the visible contents are unaffected; both holes and newly allocated regions read as all zeros.

The old existing blocks in the file remain unzeroed, though. If we consider qb_sys_mmap_file_open in isolation, can this mean that this patch changes its intended behavior in non-equivalent manner?.. (I.e., posix_fallocate not zeroing the initial bytes.) No, because it is used from qb_rb_open in rungbuffer.c; there the flags are set simultanously:

	if (flags & QB_RB_FLAG_CREATE) {
		file_flags |= O_CREAT | O_TRUNC;
	}

So, the zeroing used to happen when O_CREAT is set, and O_TRUNC is set then always, too. So, first, the file is truncated to zero length, and only then extended by the new blocks.

@imz
Copy link
Contributor

imz commented Jul 3, 2020

The old existing blocks in the file remain unzeroed, though. If we consider qb_sys_mmap_file_open in isolation, can this mean that this patch changes its intended behavior in non-equivalent manner?.. (I.e., posix_fallocate not zeroing the initial bytes.) No, because it is used from qb_rb_open in rungbuffer.c; there the flags are set simultanously:

	if (flags & QB_RB_FLAG_CREATE) {
		file_flags |= O_CREAT | O_TRUNC;
	}

So, the zeroing used to happen when O_CREAT is set, and O_TRUNC is set then always, too. So, first, the file is truncated to zero length, and only then extended by the new blocks.

The only other uses of this function are from ipc_us.c, and there, too, O_CREAT (when zeroing happens) and O_TRUNC got always together:

	fd_hdr = qb_sys_mmap_file_open(path, r->request,
				       SHM_CONTROL_SIZE, O_RDWR);

or

	fd_hdr = qb_sys_mmap_file_open(path, r->request,
				       SHM_CONTROL_SIZE,
				       O_CREAT | O_TRUNC | O_RDWR);

Using of posix_fallocate() guarantees that, if it succeed, the
attempting to write to allocated space range does not fail because of
lack of storage space. This prevents SIGBUS when trying to write to
mmaped file and no space left.

Co-Authored-by: Ivan Zakharyaschev <imz@altlinux.org>
Reported-by: Mikhail Kulagin <m.kulagin at postgrespro dot ru>
Copy link
Contributor

@imz imz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, it looks more safe for the code after posix_fallocate to not invoke any write (as it was already done in the previous revision of the patch) and to not invoke even ftruncate (new in this revision) in order not to get the allocated blocks deallocated by the OS (by creating or "smartly" detecting holes).

@chrissie-c
Copy link
Contributor

retest this please

@chrissie-c chrissie-c merged commit 1c6229c into ClusterLabs:master Jul 29, 2020
@chrissie-c
Copy link
Contributor

Merged. Thanks very much!

# for free to join this conversation on GitHub. Already have an account? # to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants