path: root/sys/kern
Commit message | Author | Age | Files | Lines
...
* pipespace_new(): decrease uidinfo pipebuf usage if reservation check failed | Konstantin Belousov | 2024-09-20 | 1 | -0/+1
  Submitted by: markj
  Sponsored by: The FreeBSD Foundation
  MFC after: 1 week
* pipe: use pipe subsystem KVA counter instead of pipe_map size | Konstantin Belousov | 2024-09-20 | 1 | -3/+2
  to calculate the superuser-reserved amount of the pipe space

  Sponsored by: The FreeBSD Foundation
  MFC after: 1 week
* socket: Only log splice structs to ktrace if KTR_STRUCT is configured | Mark Johnston | 2024-09-20 | 1 | -1/+2
  Fixes: a1da7dc1cdad ("socket: Implement SO_SPLICE")
* socket: wrap ktrsplice call with KTRACE ifdef | Siva Mahadevan | 2024-09-20 | 1 | -0/+2
  This fixes a build error when the kernel is built without KTRACE support.

  Reviewed by: emaste, markj
  Fixes: a1da7dc1cdad ("socket: Implement SO_SPLICE")
  Pull Request: https://github.com/freebsd/freebsd-src/pull/1426
* pipes: reserve configured percentage of buffers zone to superuser | Konstantin Belousov | 2024-09-20 | 1 | -2/+21
  Sponsored by: The FreeBSD Foundation
  MFC after: 1 week
  Differential revision: https://reviews.freebsd.org/D46619
* kernel: add RLIMIT_PIPEBUF | Konstantin Belousov | 2024-09-20 | 2 | -0/+23
  Sponsored by: The FreeBSD Foundation
  MFC after: 1 week
  Differential revision: https://reviews.freebsd.org/D46619
* procfs require PRIV_PROC_MEM_WRITE to write mem | Simon J. Gerraty | 2024-09-19 | 1 | -1/+3
  Add a priv_check for PRIV_PROC_MEM_WRITE which will be blocked by
  mac_veriexec if being enforced, unless the process has a maclabel to
  grant priv.

  Reviewed by: stevek
  Sponsored by: Juniper Networks, Inc.
  Differential Revision: https://reviews.freebsd.org/D46692
* pctrie: create iterator | Doug Moore | 2024-09-13 | 1 | -44/+365
  Define a pctrie iterator type. A pctrie iterator is a wrapper around a
  pctrie that remembers a position in the trie where the last search left
  off, and where a new search can resume. When the next search is for an
  item very near in the trie to where the last search left off, iter-based
  search is faster because instead of starting from the root, the search
  usually only has to back up one or two steps up the root-to-last-search
  path to find the branch that leads to the new search target.

  Every kind of lookup (plain, lookup_ge, lookup_le) that can begin with
  the trie root can begin with an iterator instead. An iterator can also do
  a relative search ("look for the item 4 greater than the last item I
  found") because it remembers where that last search ended. It can also
  search within limits ("look for the item bigger than this one, but it has
  to be less than 100"), which can save time when the next item is beyond
  the limits and that is known before we actually know what that item is.
  An iterator can also be used to remove an item that has already been
  found, without having to search for it again.

  Iterators are vulnerable to unsynchronized data changes. If the iterator
  is created with a lock held, and that lock is released and acquired
  again, there's no guarantee that the iterator path remains valid.

  Reviewed by: markj
  Tested by: pho
  Differential Revision: https://reviews.freebsd.org/D45627
* Revert "Assert that mbufs are writable if we write to them" | Kristof Provost | 2024-09-11 | 1 | -2/+0
  This reverts commit f08247fd888e6f7db0ecf2aaa39377144ac40b4c.

  This assertion is triggered by
  ktls_test:ktls_transmit_aes128_cbc_1_0_sha1_control. Remove the assertion
  until we fully understand why.

  Sponsored by: Rubicon Communications, LLC ("Netgate")
* Assert that mbufs are writable if we write to them | Kristof Provost | 2024-09-11 | 1 | -0/+2
  m_copyback() modifies the mbuf, so it must be a writable mbuf.

  Reviewed by: glebius
  Sponsored by: Rubicon Communications, LLC ("Netgate")
  Differential Revision: https://reviews.freebsd.org/D46627
* arm: Assume __ARM_ARCH == 7 | Andrew Turner | 2024-09-11 | 1 | -1/+1
  The only supported 32-bit Arm architecture is Armv7. Remove old checks
  for earlier architecture revisions.

  Sponsored by: Arm Ltd
  Differential Revision: https://reviews.freebsd.org/D45957
* socket: Implement SO_SPLICE | Mark Johnston | 2024-09-10 | 2 | -6/+730
  This is a feature which allows one to splice two TCP sockets together
  such that data which arrives on one socket is automatically pushed into
  the send buffer of the spliced socket. This can be used to make TCP
  proxying more efficient as it eliminates the need to copy data into and
  out of userspace.

  The interface is copied from OpenBSD, and this implementation aims to be
  compatible. Splicing is enabled by setting the SO_SPLICE socket option.
  When spliced, data that arrives on the receive buffer is automatically
  forwarded to the other socket. In particular, splicing is a
  unidirectional operation; to splice a socket pair in both directions,
  SO_SPLICE needs to be applied to both sockets. More concretely, when
  setting the option one passes the following struct:

      struct splice {
              int fd;
              off_t max;
              struct timeval idle;
      };

  where "fd" refers to the socket to which the first socket is to be
  spliced, and two setsockopt(SO_SPLICE) calls are required to set up a
  bi-directional splice.

  select(), poll() and kevent() do not return when data arrives in the
  receive buffer of a spliced socket, as such data is expected to be
  removed automatically once space is available in the corresponding send
  buffer. Userspace can perform I/O on spliced sockets, but it will be
  unpredictably interleaved with splice I/O.

  A splice can be configured to unsplice once a certain number of bytes
  have been transmitted, or after a given time period. Once unspliced, the
  socket behaves normally from userspace's perspective. The number of bytes
  transmitted via the splice can be retrieved using getsockopt(SO_SPLICE);
  this works after unsplicing as well, up until the socket is closed or
  spliced again. Userspace can also manually trigger unsplicing by splicing
  to -1.

  Splicing work is handled by dedicated threads, similar to KTLS. A worker
  thread is assigned at splice creation time. At some point it would be
  nice to have a direct dispatch mode, wherein the thread which places data
  into a receive buffer is also responsible for pushing it into the sink,
  but this requires tighter integration with the protocol stack in order to
  avoid reentrancy problems.

  Currently, sowakeup() and related functions will signal the worker thread
  assigned to a spliced socket. so_splice_xfer() does the hard work of
  moving data between socket buffers.

  Co-authored by: gallatin
  Reviewed by: brooks (interface bits)
  MFC after: 3 months
  Sponsored by: Klara, Inc.
  Sponsored by: Stormshield
  Sponsored by: Netflix
  Differential Revision: https://reviews.freebsd.org/D46411
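
The setup described in this commit message can be sketched from userspace as follows. This is only an illustration under stated assumptions: `struct splice_request`, `splice_one_way()` and `splice_both_ways()` are hypothetical names, the struct mirrors the layout quoted in the message (the installed header may spell the member names differently), and treating a zero `max` or `idle` as "no limit" is an assumption rather than something the message states. On systems without SO_SPLICE the helper simply fails.

```c
#include <sys/types.h>
#include <sys/socket.h>
#include <sys/time.h>
#include <errno.h>
#include <string.h>

/*
 * Local mirror of the struct as quoted in the commit message. This is a
 * hypothetical stand-in: real code should use the definition from the
 * system's <sys/socket.h>, whose member names may differ.
 */
struct splice_request {
	int		fd;	/* socket to splice to; -1 requests unsplicing */
	off_t		max;	/* byte limit before unsplicing (0 assumed: none) */
	struct timeval	idle;	/* idle timeout (zero assumed: none) */
};

/* Ask the kernel to forward data arriving on 'from' into 'to'. */
int
splice_one_way(int from, int to, off_t max)
{
	struct splice_request req;

	memset(&req, 0, sizeof(req));
	req.fd = to;
	req.max = max;
#ifdef SO_SPLICE
	return (setsockopt(from, SOL_SOCKET, SO_SPLICE, &req, sizeof(req)));
#else
	/* No SO_SPLICE on this platform: report the option as unsupported. */
	(void)from;
	errno = ENOPROTOOPT;
	return (-1);
#endif
}

/* A splice is unidirectional, so a two-way proxy needs two calls. */
int
splice_both_ways(int a, int b)
{
	if (splice_one_way(a, b, 0) == -1)
		return (-1);
	return (splice_one_way(b, a, 0));
}
```

A proxy would call `splice_both_ways(client_fd, server_fd)` once both TCP connections are established, and `splice_one_way(fd, -1, 0)` to unsplice manually.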
* mbuf: improve KASSERT(9) failure messages in the m_apply() | Maxim Sobolev | 2024-09-10 | 1 | -2/+4
  - Make less ambiguous;
  - extend to provide more context for post-mortem.

  Reviewed by: markj
  Differential Revision: https://reviews.freebsd.org/D43776
  MFC after: 2 weeks
* rangeset: speed up range traversal | Doug Moore | 2024-09-09 | 1 | -8/+15
  For rangeset-next search, use exact search rather than greater-than
  search. Move a bit of the testing logic from the pmap code to the common
  rangeset code.

  Reviewed by: kib (previous version)
  Tested by: pho (previous version)
  Differential Revision: https://reviews.freebsd.org/D46314
* ntptime: Use time_t for tv_sec related variables | Sebastian Huber | 2024-09-06 | 1 | -2/+3
  The struct timespec tv_sec member is of type time_t. Make sure that all
  variables related to this member are of the type time_t. This is
  important for targets where long is a 32-bit type and time_t a 64-bit
  type.

  Reviewed by: imp
  Pull Request: https://github.com/freebsd/freebsd-src/pull/1373
* kvprintf(): Fix '+' conversion handling | Sebastian Huber | 2024-09-06 | 1 | -14/+13
  For example, printf("%+i", 1) prints "+1". However, kvprintf() did print
  just "1" for this example. According to PRINTF(3):

      A sign must always be placed before a number produced by a signed
      conversion.

  For "%+r" radix conversions, keep the "+" handling as it is, since this
  is a non-standard conversion. For "%+p" pointer conversions, continue to
  ignore the sign modifier to be in line with libc.

  This change makes it possible to support the ' conversion modifier in the
  future.

  Reviewed by: imp
  Pull Request: https://github.com/freebsd/freebsd-src/pull/1310
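
The libc behavior that this commit aligns kvprintf() with can be checked from userspace. The sketch below is only an illustration: `format_signed()` is a hypothetical helper built on the standard snprintf(), not kernel code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/*
 * Format 'value' with the '+' flag, which forces a sign in front of the
 * result of a signed conversion. Returns 'buf' for convenience.
 */
const char *
format_signed(char *buf, size_t len, int value)
{
	assert(buf != NULL && len > 0);
	snprintf(buf, len, "%+i", value);
	return (buf);
}
```

With a 16-byte buffer, `format_signed(buf, sizeof(buf), 1)` yields "+1" and a negative argument keeps its "-" sign.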
* vop_stdadvise(): restore correct handling of length == 0 | Konstantin Belousov | 2024-09-05 | 1 | -7/+7
  Switch to unsigned arithmetic to handle overflow without relying on
  -fwrapv, and specially treat the case of length == 0 from
  posix_fadvise(), which passes OFF_MAX as the end to VOP. There, roundup()
  overflows and -fwrapv causes bend and endn to become negative. Using
  uintmax_t gives roundup() room not to wrap.

  Also remove locals with single use, and move calculations out from under
  bo lock.

  Reported by: tmunro
  Reviewed by: markj, tmunro
  Sponsored by: The FreeBSD Foundation
  MFC after: 1 week
  Differential revision: https://reviews.freebsd.org/D46518
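
The overflow described above can be illustrated in isolation. This is a minimal sketch, not the code from vfs_default.c: `roundup_unsigned()` is a hypothetical helper, and the 4096-byte block size in the usage note is an assumption. Rounding the largest signed 64-bit offset up to a block boundary does not fit in a signed 64-bit type, but it does fit once the arithmetic is done in uintmax_t.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Round a non-negative offset 'x' up to the next multiple of 'y' using
 * unsigned arithmetic. uintmax_t is at least 64 bits wide, so the
 * intermediate sum x + (y - 1) cannot wrap for any x <= INT64_MAX and
 * any 32-bit y.
 */
uintmax_t
roundup_unsigned(int64_t x, uint32_t y)
{
	uintmax_t ux;

	assert(x >= 0 && y > 0);
	ux = (uintmax_t)x;
	return ((ux + (y - 1)) / y * y);
}
```

For example, `roundup_unsigned(INT64_MAX, 4096)` returns 2^63, a value that a signed 64-bit variable cannot hold.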
* vfs_default.c: trim whitespace | Konstantin Belousov | 2024-09-04 | 1 | -3/+3
  Sponsored by: The FreeBSD Foundation
  MFC after: 3 days
* umtx: shm: 'ushm_refcnt > 0' => 'ushm_refcnt != 0' | Olivier Certner | 2024-09-04 | 1 | -2/+2
  'ushm_refcnt' is unsigned. Don't leave the impression it isn't.

  No functional change (intended).

  Reviewed by: kib
  Approved by: emaste (mentor)
  Sponsored by: The FreeBSD Foundation
  Differential Revision: https://reviews.freebsd.org/D46126
* umtx: shm: Prevent reference counting overflow | Olivier Certner | 2024-09-04 | 1 | -22/+54
  This hardens against provoked use-after-free occurrences should there be
  reference counting leaks in the future (which is currently not the case).

  At the deepest level, umtx_shm_find_reg_unlocked() now returns EOVERFLOW
  when it cannot grant an additional reference to the registry object, and
  so will umtx_shm_find_reg().

  umtx_shm_create_reg() will fail if calling umtx_shm_find_reg() returns
  EOVERFLOW (meaning a SHM object for the passed key already exists, but we
  can't acquire another reference on it), avoiding the creation of a
  duplicate registry entry for a given key (this wouldn't pose a problem
  for the rest of the code in its current form, but is expressly avoided
  for intelligibility and hardening purposes).

  Since umtx_shm_find_reg*(), and consequently the whole _umtx_op() system
  call, can only return EOVERFLOW on such a bug manifesting, we don't
  document that return value.

  Reviewed by: kib, emaste
  Approved by: emaste (mentor)
  Sponsored by: The FreeBSD Foundation
  Differential Revision: https://reviews.freebsd.org/D46126
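
The hardening idea, refusing a new reference instead of letting the counter wrap, can be sketched as follows. This is not the kern_umtx.c code: `struct reg_obj` and `reg_obj_acquire()` are hypothetical, and the sketch assumes the caller already holds whatever lock protects the counter.

```c
#include <assert.h>
#include <errno.h>
#include <limits.h>

/* Hypothetical registry object with an unsigned reference count. */
struct reg_obj {
	unsigned int refcnt;
};

/*
 * Take one more reference on a live object. If the counter is already at
 * its maximum, fail with EOVERFLOW rather than wrapping to 0, which would
 * let a later release free an object that is still in use.
 */
int
reg_obj_acquire(struct reg_obj *obj)
{
	assert(obj->refcnt != 0);	/* object must still be referenced */
	if (obj->refcnt == UINT_MAX)
		return (EOVERFLOW);
	obj->refcnt++;
	return (0);
}
```

A lookup routine built on this helper would propagate EOVERFLOW to its caller instead of handing out the object.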
* umtx: shm: Fix use-after-free due to multiple drops of the registry reference | Olivier Certner | 2024-09-04 | 1 | -18/+33
  umtx_shm_unref_reg_locked() would unconditionally drop the "registry"
  reference, tied to USHMF_LINKED.

  This is not a problem for caller umtx_shm_object_terminated(), which
  operates under the 'umtx_shm_lock' lock end-to-end, but it is for
  indirect caller umtx_shm(), which drops the lock between
  umtx_shm_find_reg() and the call to umtx_shm_unref_reg(true) that
  deregisters the umtx shared region (from 'umtx_shm_registry';
  umtx_shm_find_reg() only finds registered shared mutexes).

  Thus, two concurrent user-space callers of _umtx_op() with UMTX_OP_SHM
  and flags UMTX_SHM_DESTROY, both progressing past umtx_shm_find_reg() but
  before umtx_shm_unref_reg(true), would then decrease twice the reference
  count for the single reference standing for the shared mutex's
  registration.

  Reported by: Synacktiv
  Reviewed by: kib
  Approved by: emaste (mentor)
  Security: FreeBSD-SA-24:14.umtx
  Security: CVE-2024-43102
  Security: CAP-01
  Sponsored by: The Alpha-Omega Project
  Sponsored by: The FreeBSD Foundation
  Differential Revision: https://reviews.freebsd.org/D46126
* umtx: shm: Collapse USHMF_REG_LINKED and USHMF_OBJ_LINKED flags | Olivier Certner | 2024-09-04 | 1 | -9/+5
  ...into the only USHMF_LINKED, as they are always set or unset together.

  This is both to stop giving the impression that they can be set/unset
  independently, which they can't with the current code, and to make it
  clearer that an upcoming reference counting fix is correct.

  Reviewed by: kib
  Approved by: emaste (mentor)
  Sponsored by: The FreeBSD Foundation
  Differential Revision: https://reviews.freebsd.org/D46126
* subr_bus: Stop checking for failures from malloc(M_WAITOK) | Zhenlei Huang | 2024-09-03 | 1 | -2/+0
  MFC after: 1 week
  Differential Revision: https://reviews.freebsd.org/D45852
* boottrace: Stop checking for failures from realloc(M_WAITOK) | Zhenlei Huang | 2024-09-03 | 1 | -3/+0
  MFC after: 1 week
  Differential Revision: https://reviews.freebsd.org/D45852
* kern_fail: Stop checking for failures from fp_malloc(M_WAITOK) | Zhenlei Huang | 2024-09-03 | 1 | -5/+4
  `fp_malloc` is defined as a macro that redirects to `malloc`.

  MFC after: 1 week
  Differential Revision: https://reviews.freebsd.org/D45852
* kernel: Make some compile time constant variables const | Zhenlei Huang | 2024-08-30 | 2 | -20/+20
  Those variables are not going to be changed at runtime. Make them const
  to avoid potential overwriting. This will also help spot accidental
  shadowing of global variables, since variable names such as `version` are
  short and commonly used.

  This change was inspired by reviewing khng's work D44760.

  No functional change intended.

  MFC after: 2 weeks
  Differential Revision: https://reviews.freebsd.org/D45227
* rangelocks: stop caching per-thread rl_q_entry | Konstantin Belousov | 2024-08-29 | 2 | -25/+3
  This should reduce the frequency of smr_synchronize() calls, that
  otherwise occur on almost each rangelock unlock.

  Reviewed by: markj
  Sponsored by: The FreeBSD Foundation
  Differential revision: https://reviews.freebsd.org/D46482
* x86: Detect NVMM hypervisor | Kevin Bowling | 2024-08-28 | 1 | -0/+1
  MFC after: 1 week
* rangelocks: remove unneeded cast of the atomic_load_ptr() result | Konstantin Belousov | 2024-08-28 | 1 | -4/+4
  Noted and reviewed by: markj
  Sponsored by: The FreeBSD Foundation
  Differential revision: https://reviews.freebsd.org/D46465
* rangelocks: re-enable cheat mode | Konstantin Belousov | 2024-08-28 | 1 | -1/+1
  Tested by: lwhsu
  Reviewed by: markj
  Sponsored by: The FreeBSD Foundation
  Differential revision: https://reviews.freebsd.org/D46465
* kern_copy_file_range(): handle rangelock recursion | Konstantin Belousov | 2024-08-28 | 1 | -5/+7
  PR: 281073
  Reviewed by: markj
  Tested by: lwhsu
  Sponsored by: The FreeBSD Foundation
  Differential revision: https://reviews.freebsd.org/D46465
* Add rangelock_may_recurse(9) | Konstantin Belousov | 2024-08-28 | 1 | -0/+41
  Reviewed by: markj
  Tested by: lwhsu
  Sponsored by: The FreeBSD Foundation
  Differential revision: https://reviews.freebsd.org/D46465
* rangelocks: extract the cheat mode drain code | Konstantin Belousov | 2024-08-28 | 1 | -11/+19
  Reviewed by: markj
  Tested by: lwhsu
  Sponsored by: The FreeBSD Foundation
  Differential revision: https://reviews.freebsd.org/D46465
* rangelock: Disable cheat mode by default | Mark Johnston | 2024-08-27 | 1 | -1/+1
  Cheat mode is incompatible with code which locks multiple ranges in the
  same vnode, with at least one range being write-locked. This can arise in
  kern_copy_file_range(). Until that's handled somehow, avoid the problem
  to make the fusefs tests stable.

  PR: 281073
  Fixes: 9ef425e560a9 ("rangelocks: add fast cheating mode")
  Reviewed by: kib
  Differential Revision: https://reviews.freebsd.org/D46457
* rangelock: Fix an off-by-one error | Mark Johnston | 2024-08-27 | 1 | -1/+1
  A rangelock entry covers the range [start, end), so entries e1 and e2
  with e1->end == e2->start do not overlap.

  PR: 281073
  Fixes: 5badbeeaf061 ("Re-implement rangelocks part 2")
  Reviewed by: kib
  Differential Revision: https://reviews.freebsd.org/D46458
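
The half-open interval rule stated in this message can be sketched as a standalone overlap test. `struct range` and `ranges_overlap()` are hypothetical names used only for illustration; they are not the kernel's rangelock types.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for an entry covering the bytes [start, end). */
struct range {
	uint64_t start;	/* first byte covered */
	uint64_t end;	/* first byte NOT covered */
};

/*
 * Two half-open ranges overlap only if each one starts strictly before
 * the other ends. Using <= here would be the off-by-one: it would treat
 * adjacent ranges (e1->end == e2->start) as conflicting.
 */
bool
ranges_overlap(const struct range *e1, const struct range *e2)
{
	assert(e1->start < e1->end && e2->start < e2->end);
	return (e1->start < e2->end && e2->start < e1->end);
}
```

With this test, [0, 10) and [10, 20) do not overlap, while [0, 10) and [5, 15) do.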
* sysent: regen after d0675399 | Mariusz Zaborski | 2024-08-27 | 1 | -1/+1
* capsicum: allow subset of wait4(2) functionality | Edward Tomasz Napierala | 2024-08-27 | 2 | -1/+13
  The usual way of handling process exit in capsicum(4) mode is by using
  process descriptors (pdfork(2)) instead of the traditional
  fork(2)/wait4(2) API. But most apps have not been converted this way, and
  many cannot be because the wait is hidden behind library APIs that
  revolve around PID numbers and not descriptors; GLib's
  g_spawn_check_wait_status(3) is one example.

  Thus, provide backwards compatibility by allowing the wait(2) family of
  functions in Capsicum mode, except for child processes created by
  pdfork(2).

  Reviewed by: brooks, oshogbo
  Sponsored by: Innovate UK
  Differential Revision: https://reviews.freebsd.org/D44372
* kern: Align the declaration of kernconfstring with its definition | Zhenlei Huang | 2024-08-22 | 1 | -3/+3
  It is defined as const char[] in config.c which is auto generated by
  usr.sbin/config/kernconf.tmpl.

  While here prefer macro SYSCTL_CONST_STRING to avoid casting.

  MFC after: 1 week
* rangelocks: fix typo in rl_w_validate | Konstantin Belousov | 2024-08-21 | 1 | -1/+1
  The freed elements should be threaded using rl_q_free pointer.

  Reported by: dougm, markj
  Tested by: markj
  Sponsored by: The FreeBSD Foundation
* rangelocks: recheck that entry is not marked after sleepq is locked in rl_w_validate() | Konstantin Belousov | 2024-08-21 | 1 | -0/+6
  Otherwise we might lose the wakeup.

  Reported and tested by: markj
  Sponsored by: The FreeBSD Foundation
* rangelock: if CAS for removal failed, restart list iteration | Konstantin Belousov | 2024-08-21 | 1 | -6/+11
  Our next pointer is invalid and cannot be followed.

  Tested by: markj, pho
  Sponsored by: The FreeBSD Foundation
* rangelock: assert that we never insert or remove our entry after a logically deleted one | Konstantin Belousov | 2024-08-21 | 1 | -0/+2
  Tested by: markj, pho
  Sponsored by: The FreeBSD Foundation
* rangelock_destroy(): poison lock->head to trip fault on lock attempt | Konstantin Belousov | 2024-08-21 | 1 | -0/+1
  Tested by: markj, pho
  Sponsored by: The FreeBSD Foundation
* rangelock_destroy(): do not remove lock entries from under live lock acquirer | Konstantin Belousov | 2024-08-21 | 1 | -9/+47
  Tested by: markj, pho
  Sponsored by: The FreeBSD Foundation
* rangelocks: add rangelock_free_free() helper to free free list | Konstantin Belousov | 2024-08-21 | 1 | -14/+24
  Tested by: markj, pho
  Sponsored by: The FreeBSD Foundation
* init_main: Sprinkle const qualifiers where appropriate | Zhenlei Huang | 2024-08-21 | 1 | -3/+3
  No functional change intended.

  MFC after: 1 week
* socket: Set lock flags properly | Mark Johnston | 2024-08-20 | 1 | -1/+1
  Fixes: fb901935f257 ("socket: Split up sosend_generic()")
  Reported by: cy
  Sponsored by: Klara, Inc.
  Sponsored by: Stormshield
* socket: Microoptimize soreceive_stream_locked() | Mark Johnston | 2024-08-19 | 1 | -5/+3
  There is no need to hold the sockbuf lock while checking uio_resid.

  No functional change intended.

  MFC after: 2 weeks
  Sponsored by: Klara, Inc.
  Sponsored by: Stormshield
* socket: Split up sosend_generic() | Mark Johnston | 2024-08-19 | 1 | -18/+29
  Factor out the bits that run with the sock I/O lock held into a separate
  function. In this implementation, we are doing a bit more work under the
  I/O lock than before. However, lock contention is only a problem when
  multiple threads are transmitting on the same socket, which is an unusual
  case that is not expected to perform well in any case.

  No functional change intended.

  Reviewed by: gallatin, glebius
  MFC after: 2 weeks
  Sponsored by: Klara, Inc.
  Sponsored by: Stormshield
  Differential Revision: https://reviews.freebsd.org/D46305
* socket: Split up soreceive_generic() | Mark Johnston | 2024-08-19 | 1 | -17/+34
  Factor out the bits that run with the sock I/O lock held into a separate
  function.

  No functional change intended.

  Reviewed by: gallatin, glebius
  MFC after: 2 weeks
  Sponsored by: Klara, Inc.
  Sponsored by: Stormshield
  Differential Revision: https://reviews.freebsd.org/D46304