path: root/sys/netinet
Commit message  Author  Age  Files  Lines
* netinet: Remove unneeded mb_unmapped_to_ext() callsMark Johnston2021-11-242-21/+6
| | | | | | | | | | in_cksum_skip() now handles unmapped mbufs on platforms where they're permitted. Reviewed by: glebius, jhb MFC after: 1 week Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D33097
* netinet: Implement in_cksum_skip() using m_apply()Mark Johnston2021-11-241-31/+32
| | | | | | | | | | | | This allows it to work with unmapped mbufs. In particular, in_cksum_skip() calls no longer need to be preceded by calls to mb_unmapped_to_ext() to avoid a page fault. PR: 259645 Reviewed by: gallatin, glebius, jhb MFC after: 1 week Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D33096
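The idea of summing a checksum through an apply-style callback can be sketched in userland C. This is only an illustration under stated assumptions: `struct seg`, `seg_apply()`, and `chain_cksum_skip()` are hypothetical stand-ins, not the kernel's mbuf, m_apply(), or in_cksum_skip(). The point it shows is that the callback only ever sees one contiguous region at a time, so byte parity must be carried across segment boundaries.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for one buffer in a chain (not a real mbuf). */
struct seg {
	const uint8_t *data;
	size_t len;
	const struct seg *next;
};

/* State carried between callback invocations. */
struct cksum_state {
	uint64_t sum;	/* unfolded sum of big-endian 16-bit words */
	int odd;	/* nonzero after an odd number of bytes was summed */
};

/* Callback: add one contiguous region to the running sum. */
static int
cksum_cb(void *arg, const uint8_t *p, size_t len)
{
	struct cksum_state *st = arg;

	for (size_t i = 0; i < len; i++) {
		/* Bytes at even offsets are the high half of a word. */
		st->sum += st->odd ? p[i] : (uint64_t)p[i] << 8;
		st->odd = !st->odd;
	}
	return (0);
}

/* Call cb on each contiguous region covering [off, off + len). */
static int
seg_apply(const struct seg *s, size_t off, size_t len,
    int (*cb)(void *, const uint8_t *, size_t), void *arg)
{
	/* Skip whole segments that end before the starting offset. */
	while (s != NULL && off >= s->len) {
		off -= s->len;
		s = s->next;
	}
	while (len > 0) {
		if (s == NULL)
			return (-1);	/* chain is shorter than requested */
		size_t n = s->len - off;
		if (n > len)
			n = len;
		int rc = cb(arg, s->data + off, n);
		if (rc != 0)
			return (rc);
		len -= n;
		off = 0;
		s = s->next;
	}
	return (0);
}

/*
 * Internet checksum of bytes [skip, len) of the chain, stored in *out
 * as a host-order value of the big-endian word. Returns 0 on success.
 */
static int
chain_cksum_skip(const struct seg *chain, size_t len, size_t skip,
    uint16_t *out)
{
	struct cksum_state st = { 0, 0 };

	assert(skip <= len);
	if (seg_apply(chain, skip, len - skip, cksum_cb, &st) != 0)
		return (-1);
	/* Fold carries back into the low 16 bits, then complement. */
	while (st.sum >> 16)
		st.sum = (st.sum & 0xffff) + (st.sum >> 16);
	*out = (uint16_t)~st.sum;
	return (0);
}
```

Because the walker never requires the data to be contiguous or directly addressable as a whole, the same structure works for any backing store that can hand out one region at a time.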
* netinet: Deduplicate most in_cksum() implementationsMark Johnston2021-11-241-0/+257
| | | | | | | | | | | | | | | | | | | in_cksum() and related routines are implemented separately for each platform, but only i386 and arm have optimized versions. Other platforms' copies of in_cksum.c are identical except for style differences and support for big-endian CPUs. Deduplicate the implementations for the rest of the platforms. This will make it easier to implement in_cksum() for unmapped mbufs. On arm and i386, define HAVE_MD_IN_CKSUM to mean that the MI implementation is not to be compiled. No functional change intended. Reviewed by: kp, glebius MFC after: 1 week Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D33095
* netinet: Remove in_cksum.cMark Johnston2021-11-241-148/+0
| | | | | | | | | | It does not get compiled into the kernel. No functional change intended. Reviewed by: kp, glebius, cy MFC after: 1 week Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D33094
* cc_newreno(4): Fix a typo in a source code commentGordon Bergling2021-11-191-1/+1
| | | | | | - s/conditons/conditions/ MFC after: 3 days
* Add tcp_freecb() - single place to free tcpcb.Gleb Smirnoff2021-11-194-104/+101
| | | | | | | | | | | | Until this change there were two places where we would free tcpcb - tcp_discardcb() in case if all timers are drained and tcp_timer_discard() otherwise. They were pretty much copy-n-paste, except that in the default case we would run tcp_hc_update(). Merge this into single function tcp_freecb() and move new short version of tcp_timer_discard() to tcp_timer.c and make it static. Reviewed by: rrs, hselasky Differential revision: https://reviews.freebsd.org/D32965
* tcp_timewait: use on stack struct tcptw as last resortGleb Smirnoff2021-11-191-4/+5
| | | | | | | | | In case we failed to uma_zalloc() and also failed to reuse with tcp_tw_2msl_scan(), then just use on stack tcptw. This will allow to run through tcp_twrespond() and standard tcpcb discard routine. Reviewed by: rrs Differential revision: https://reviews.freebsd.org/D32965
* tcp: Rack ack war with a mis-behaving firewall or nat with resets.Randall Stewart2021-11-173-15/+62
| | | | | | | | | | | | | | | | | | Previously we added ack-war prevention for misbehaving firewalls. This is where the f/w or nat messes up its sequence numbers and causes an ack-war. There is yet another type of ack war that we have found in the wild that is similar to this. Basically the f/w or nat gets an ack (keep-alive probe or such) and instead of turning the ack/seq around and adding a TH_RST it does something real stupid and sends a new packet with seq=0. This of course triggers the challenge ack in the reset processing, which then sends a challenge ack (if the seq=0 is within the range of possible sequence numbers allowed by the challenge) and then we rinse-repeat. This will add the needed tweaks (similar to the last ack-war prevention using the same sysctls and counters) to prevent it and allow say 5 per second by default. Reviewed by: Michael Tuexen Sponsored by: Netflix Inc. Differential Revision: https://reviews.freebsd.org/D32938
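A per-second cap such as the "5 per second" mentioned above can be sketched as a fixed-window counter. This is a minimal illustration, not rack's actual mechanism: `struct ack_limiter` and `ack_allowed()` are hypothetical names, and the millisecond clock is an assumption.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical limiter state: start of the current window and its count. */
struct ack_limiter {
	uint64_t window_start_ms;
	unsigned int sent;
};

/*
 * Return true if another challenge ack may be sent at time now_ms,
 * allowing at most max_per_sec acks in each one-second window.
 */
static bool
ack_allowed(struct ack_limiter *l, uint64_t now_ms, unsigned int max_per_sec)
{
	assert(now_ms >= l->window_start_ms);
	if (now_ms - l->window_start_ms >= 1000) {
		/* A new window begins: reset the counter. */
		l->window_start_ms = now_ms;
		l->sent = 0;
	}
	if (l->sent >= max_per_sec)
		return (false);	/* over budget: suppress the ack */
	l->sent++;
	return (true);
}
```

Suppressing the reply once the budget is spent is what breaks the loop: the misbehaving middlebox stops receiving the packets that provoke its next seq=0 segment.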
* sctp: Remove now-unneeded mb_unmapped_to_ext() callsMark Johnston2021-11-162-18/+0
| | | | | | | | | | | sctp_delayed_checksum() now handles unmapped mbufs, thanks to m_apply(). No functional change intended. Reviewed by: tuexen MFC after: 2 weeks Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D32942
* sctp: Use m_apply() to calculate a checksum for an mbuf chainMark Johnston2021-11-161-24/+21
| | | | | | | | | | | | | | | | | m_apply() works on unmapped mbufs, so this will let us elide mb_unmapped_to_ext() calls preceding sctp_calculate_cksum() calls in the network stack. Modify sctp_calculate_cksum() to assume it's passed an mbuf header. This assumption appears to be true in practice, and we need to know the full length of the chain. No functional change intended. Reviewed by: tuexen, jhb MFC after: 2 weeks Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D32941
* kernel: partially revert e9efb1125a15, default inet maskMike Karels2021-11-141-4/+13
| | | | | | | | | | When no mask is supplied to the ioctl adding an Internet interface address, revert to using the historical class mask rather than a single default. Similarly for the NFS bootp code. MFC after: 3 weeks Reviewed by: melifaro glebius Differential Revision: https://reviews.freebsd.org/D32951
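For readers unfamiliar with the historical class mask, the classful rule derives the mask from the leading bits of the address. The sketch below is an illustration only: `ipv4()` and `class_mask()` are hypothetical helpers, addresses are in host byte order, and returning 0 for non-unicast classes is an assumed convention, not the kernel's behavior.

```c
#include <assert.h>
#include <stdint.h>

/* Build a host-order IPv4 address from four octets (hypothetical helper). */
static uint32_t
ipv4(unsigned int a, unsigned int b, unsigned int c, unsigned int d)
{
	assert(a <= 255 && b <= 255 && c <= 255 && d <= 255);
	return ((uint32_t)a << 24 | (uint32_t)b << 16 | (uint32_t)c << 8 | d);
}

/* Historical classful netmask for a host-order address. */
static uint32_t
class_mask(uint32_t addr)
{
	if ((addr & 0x80000000u) == 0)
		return (0xff000000u);	/* Class A: leading 0, /8 */
	if ((addr & 0xc0000000u) == 0x80000000u)
		return (0xffff0000u);	/* Class B: leading 10, /16 */
	if ((addr & 0xe0000000u) == 0xc0000000u)
		return (0xffffff00u);	/* Class C: leading 110, /24 */
	return (0);	/* Class D/E: no network mask (assumed convention) */
}
```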
* tcp: Fix a locking issue related to loggingMichael Tuexen2021-11-141-15/+23
| | | | | | | | | | | | tcp_respond() is sometimes called with only a read lock. The logging however, requires a write lock. So either try to upgrade the lock if needed, or don't log the packet. Reported by: syzbot+8151ef969c170f76706b@syzkaller.appspotmail.com Reported by: syzbot+eb679adb3304c511c1e4@syzkaller.appspotmail.com Reviewed by: markj, rrs Sponsored by: Netflix, Inc. Differential Revision: https://reviews.freebsd.org/D32983
* tcp_usr_detach: revert debugging piece from f5cf1e5f5a500.Gleb Smirnoff2021-11-131-20/+3
| | | | | | | | The code was probably useful during the problem being chased down, but for brevity makes sense just to return to the original KASSERT. Reviewed by: rrs Differential revision: https://reviews.freebsd.org/D32968
* tcp_timers: check for (INP_TIMEWAIT | INP_DROPPED) only onceGleb Smirnoff2021-11-131-37/+4
| | | | | | | | | | All timers keep inpcb locked through their execution. We need to check these flags only once. Checking for INP_TIMEWAIT earlier is also safer, since such inpcbs point into tcptw rather than tcpcb, and any dereferences of inp_ppcb as tcpcb are erroneous. Reviewed by: rrs, hselasky Differential revision: https://reviews.freebsd.org/D32967
* tcp: Fix a locking issueMichael Tuexen2021-11-121-4/+9
| | | | | | | | | | INP_WLOCK_RECHECK_CLEANUP() and INP_WLOCK_RECHECK() might return from the function, so any locks held must be released. Reported by: syzbot+b1a888df08efaa7b4bf1@syzkaller.appspotmail.com Reviewed by: markj Sponsored by: Netflix, Inc. Differential Revision: https://reviews.freebsd.org/D32975
* tcp: Ensure that vnets have an initialized V_default_cc_ptrMark Johnston2021-11-121-0/+17
| | | | | | | | | This causes new vnets to inherit the cc algorithm from vnet0. This is a temporary patch to fix vnet jail creation. With encouragement from: glebius Fixes: b8d60729deef ("tcp: Congestion control cleanup.") Differential Revision: https://reviews.freebsd.org/D32970
* tcp: better congestion control defaultsWarner Losh2021-11-121-0/+7
| | | | | | | | | | | | Define CC_NEWRENO in all the appropriate DEFAULTS and std.* config files. It's the default congestion control algorithm. Add code to cc.c so that CC_DEFAULT is "newreno" if it's not overridden in the config file. Sponsored by: Netflix Fixes: b8d60729deef ("tcp: Congestion control cleanup.") Reviewed by: manu, hselasky, jhb, glebius, tuexen Differential Revision: https://reviews.freebsd.org/D32964
* Add net.inet.ip.source_address_validationGleb Smirnoff2021-11-121-0/+16
| | | | | | | | | | | Drop packets arriving from the network that have our source IP address. If maliciously crafted they can create evil effects like an RST exchange between two of our listening TCP ports. Such packets just can't be legitimate. Enable the tunable by default. Long time due for a modern Internet host. Reviewed by: donner, melifaro Differential revision: https://reviews.freebsd.org/D32914
* Add in_localip_fib(), in6_localip_fib().Gleb Smirnoff2021-11-122-0/+19
| | | | | | | Check if given address/FIB exists locally. Reviewed by: melifaro Differential revision: https://reviews.freebsd.org/D32913
* ip_input: packet filters shall not modify m_pkthdr.rcvifGleb Smirnoff2021-11-121-3/+2
| | | | | | | | | Quick review confirms that they do not, also IPv6 doesn't expect such a change in mbuf. In IPv4 this appeared in 0aade26e6d061, which doesn't seem to have a valid explanation why. Reviewed by: donner, kp, melifaro Differential revision: https://reviews.freebsd.org/D32913
* Rename net.inet.ip.check_interface to rfc1122_strong_es and document it.Gleb Smirnoff2021-11-121-43/+26
| | | | | | | | | | | | | | | | | This very questionable feature was enabled in FreeBSD for a very short time. It was disabled very soon upon merging to RELENG_4 - 23d7f14119bf. And in HEAD was also disabled pretty soon - 4bc37f9836fb1. The tunable has very vague name. Check interface for what? Given that it was never documented and almost never enabled, I think it is fine to rename it together with documenting it. Also, count packets dropped by this tunable as ips_badaddr, otherwise they fall down to ips_cantforward counter, which is misleading, as packet was not supposed to be forwarded, it was destined locally. Reviewed by: donner, kp Differential revision: https://reviews.freebsd.org/D32912
* net: sprinkle __predict_false in ip_input on error conditionsMateusz Guzik2021-11-121-12/+15
| | | | | | | | | | While here rearrange the RSVP check to inspect proto first and avoid evaluating V_rsvp in the common case to begin with (most notably avoid the expensive read). Reviewed by: glebius Sponsored by: Rubicon Communications, LLC ("Netgate") Differential Revision: https://reviews.freebsd.org/D32929
* tcp: Rack may still calculate long RTT on persists probes.Randall Stewart2021-11-113-29/+163
| | | | | | | | | | | | | | | | | | When a persists probe is lost, we will end up calculating a long RTT based on the initial probe and when the response comes from the second probe (or third etc). This means we have a minimum confidence level of 3 on an incorrect probe. This commit will change it so that we have one of two options a) Just not count RTT of probes where we had a loss <or> b) Count them still but degrade the confidence to 0. I have set in this the default being to just not measure them, but I am open to having the default be otherwise. Reviewed by: Michael Tuexen Sponsored by: Netflix Inc. Differential Revision: https://reviews.freebsd.org/D32897
* tcp: Congestion control cleanup.Randall Stewart2021-11-1112-320/+668
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | NOTE: HEADS UP read the note below if your kernel config is not including GENERIC!! This patch does a bit of cleanup on TCP congestion control modules. There were some rather interesting surprises that one could get i.e. where you use a socket option to change from one CC (say cc_cubic) to another CC (say cc_vegas) and you could in theory get a memory failure and end up on cc_newreno. This is not what one would expect. The new code fixes this by requiring a cc_data_sz() function so we can malloc with M_WAITOK and pass in to the init function preallocated memory. The CC init is expected in this case *not* to fail but if it does and a module does break the "no fail with memory given" contract we do fall back to the CC that was in place at the time. This also fixes up a set of common newreno utilities that can be shared amongst other CC modules instead of the other CC modules reaching into newreno and executing what they think is a "common and understood" function. Let's put these functions in cc.c and that way we have a common place that is easily findable by future developers or bug fixers. This also allows newreno to evolve and grow support for its features i.e. ABE and HYSTART++ without having to dance through hoops for other CC modules, instead both newreno and the other modules just call into the common functions if they desire that behavior or roll their own if that makes more sense. Note: This commit changes the kernel configuration!! If you are not using GENERIC in some form you must add a CC module option (one of CC_NEWRENO, CC_VEGAS, CC_CUBIC, CC_CDG, CC_CHD, CC_DCTCP, CC_HTCP, CC_HD). You can have more than one defined as well if you desire. Note that if you create a kernel configuration that does not define a congestion control module and includes INET or INET6 the kernel compile will break.
Also you need to define a default, generic adds 'options CC_DEFAULT=\"newreno\"' but you can specify any string that represents the name of the CC module (same names that show up in the CC module list under net.inet.tcp.cc). If you fail to add the options CC_DEFAULT in your kernel configuration the kernel build will also break. Reviewed by: Michael Tuexen Sponsored by: Netflix Inc. RELNOTES:YES Differential Revision: https://reviews.freebsd.org/D32693
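For a custom kernel configuration that does not include GENERIC, the two requirements above would look roughly like the fragment below. This is a sketch based only on the option names given in the commit message; the choice of newreno is an example, and any of the listed CC_* modules could be used instead.

```
# At least one congestion control module must be compiled in
# when INET or INET6 is configured.
options 	CC_NEWRENO

# Name of the default module; it must match a module compiled in above.
options 	CC_DEFAULT=\"newreno\"
```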
* Don't require the socket lock for sorele().John Baldwin2021-11-091-1/+0
| | | | | | | | | | | | | | | | | | | | Previously, sorele() always required the socket lock and dropped the lock if the released reference was not the last reference. Many callers locked the socket lock just before calling sorele() resulting in a wasted lock/unlock when not dropping the last reference. Move the previous implementation of sorele() into a new sorele_locked() function and use it instead of sorele() for various places in uipc_socket.c that called sorele() while already holding the socket lock. The sorele() macro now uses refcount_release_if_not_last() to try to drop the socket reference without locking the socket. If that shortcut fails, it locks the socket and calls sorele_locked(). Reviewed by: kib, markj Sponsored by: Chelsio Communications Differential Revision: https://reviews.freebsd.org/D32741
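The "release if not last" fast path can be sketched with C11 atomics. This is a userland illustration under stated assumptions, not the kernel's refcount(9) code: `ref_release_if_not_last()`, `struct obj`, and `obj_release()` are hypothetical, and the `slow_path_entries` counter merely stands in for where a real lock would be taken.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/*
 * Drop one reference without any lock, unless the caller holds the
 * last one. Returns true if the reference was dropped.
 */
static bool
ref_release_if_not_last(atomic_uint *count)
{
	unsigned int old = atomic_load(count);

	for (;;) {
		assert(old > 0);
		if (old == 1)
			return (false);	/* last reference: caller must lock */
		/* On failure the CAS reloads 'old' and we retry. */
		if (atomic_compare_exchange_weak(count, &old, old - 1))
			return (true);
	}
}

/* Hypothetical refcounted object. */
struct obj {
	atomic_uint refs;
	int slow_path_entries;	/* stand-in for lock acquisitions */
	bool freed;
};

static void
obj_release(struct obj *o)
{
	if (ref_release_if_not_last(&o->refs))
		return;			/* fast path: no lock needed */
	o->slow_path_entries++;		/* real code would lock here */
	if (atomic_fetch_sub(&o->refs, 1) == 1)
		o->freed = true;	/* last reference: tear down */
}
```

The lock is only needed when teardown may happen, so releases that leave other references outstanding avoid the lock/unlock pair entirely.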
* kernel: deprecate Internet Class A/B/CMike Karels2021-11-092-15/+26
| | | | | | | | | | | | Hide historical Class A/B/C macros unless IN_HISTORICAL_NETS is defined; define it for user level. Define IN_MULTICAST separately from IN_CLASSD, and use it in pf instead of IN_CLASSD. Stop using class for setting default masks when not specified; instead, define new default mask (24 bits). Warn when an Internet address is set without a mask. MFC after: 1 month Reviewed by: cy Differential Revision: https://reviews.freebsd.org/D32708
* tcp: Printf should be removed.Randall Stewart2021-11-081-3/+0
| | | | | | | | | | | There is a printf when a socket option down to the CC module fails, this really should not be a printf. In fact this whole option needs to be re-thought in coordination with some other changes in the CC modules (it's just not right but it's ok what it does here if it fails since it will just use the ECN beta). Reviewed by: Michael Tuexen Sponsored by: Netflix Inc. Differential Revision: https://reviews.freebsd.org/D32894
* Use layer five checksum flags in the mbuf packet header to pass on crypto state.Hans Petter Selasky2021-11-041-1/+10
| | | | | | | | | | | | | | | The mbuf protocol flags get cleared between layers, and also it was discovered that M_DECRYPTED conflicts with M_HASFCS when receiving ethernet packets. Add the proper CSUM_TLS_MASK and CSUM_TLS_DECRYPTED defines, and start using these instead of M_DECRYPTED inside the TCP LRO code. This change is needed by coming TLS RX hardware offload support patches. Suggested by: kib@ Reviewed by: jhb@ MFC after: 1 week Sponsored by: NVIDIA Networking
* SIFTR: Fix compilation with -DSIFTR_IPV6Allan Jude2021-11-041-4/+5
| | | | | | | | | | | A few pieces of the SIFTR code that are behind #ifdef SIFTR_IPV6 have not been updated as APIs have changed, etc. Reported by: Alexander Sideropoulos <Alexander.Sideropoulos@netapp.com> Reviewed by: rscheff, lstewart Sponsored by: NetApp Sponsored by: Klara Inc. Differential Revision: https://reviews.freebsd.org/D32698
* blackhole(4): disable for locally originated TCP/UDP packetsGleb Smirnoff2021-11-033-3/+25
| | | | | | | | | In most cases blackholing for locally originated packets is undesired, leads to different kind of lags and delays. Provide sysctls to enforce it, e.g. for debugging purposes. Reviewed by: rrs Differential revision: https://reviews.freebsd.org/D32718
* Fix a common typo in sysctl descriptionsGordon Bergling2021-11-032-2/+2
| | | | | | - s/maxiumum/maximum/ MFC after: 3 days
* udp_input: remove a BSD stack relictGleb Smirnoff2021-11-031-16/+6
| | | | | | | | | | | | | | | | I should have removed it 9 years ago in 8ad458a471ca. That commit left save_ip as a write-only variable. With save_ip removed we got one case when IP header can be modified: the calculation of IP checksum with zeroed out header. This place already has had a header saver char b[9]. However, the b[9] saver didn't cover the ip_sum field, which we explicitly overwrite aliased as (struct ipovly *)->ih_len. This was fine in cb34210012d4e, since checksum doesn't need to be restored if packet is consumed. Now we need to extend up to ip_sum field. In collaboration with: ae Differential revision: https://reviews.freebsd.org/D32719
* netinet: Fix a common typo in source code commentsGordon Bergling2021-11-032-4/+4
| | | | | | - s/writting/writing/ MFC after: 3 days
* ip_divert: calculate delayed checksum for IPv6 address familyAndrey V. Elsukov2021-11-031-0/+19
| | | | | | | | | | | Before passing an IPv6 packet to application apply delayed checksum calculation. Mbuf flags will be lost when divert listener will return a packet back, so we will not be able to do delayed checksum calculation later. Also an application will get a packet with correct checksum. Reviewed by: donner MFC after: 1 week Differential Revision: https://reviews.freebsd.org/D32807
* inet: remove tcp_debug from netinet/tcp_debug.hMateusz Guzik2021-11-011-6/+0
| | | | | | | | It was a hack only needed for trpt, which can just define it locally. This makes it possible to fix up systat which also includes the file. Sponsored by: Rubicon Communications, LLC ("Netgate")
* carp: deal with negative net.inet.carp.demotionMarius Halden2021-11-011-1/+3
| | | | | | | | | | | | | | | | | | | | | | Given nodes 1 and 2, where node 1 has an advskew of 0 and node 2 has an advskew of 100, making them master and backup respectively. If net.inet.carp.demotion is set to a negative value on node 1, node 2 might become master while node 1 still retains its master status. Whether or not node 2 becomes master seems to depend on the nodes' advskew and what the demotion sysctl was set to on node 1. The reason for node 2 becoming master seems to be that the calculated advskew taking demotion into account is truncated to a single unsigned byte when copied into the carp header for sending, and node 1 stays master since it uses the whole non-truncated calculated advskew when deciding whether to stay master. PR: 259528 Reviewed by: donner, glebius MFC after: 3 weeks Sponsored by: Modirum MDPay Differential Revision: https://reviews.freebsd.org/D32759
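The truncation described above is easy to reproduce in isolation. The sketch below is illustrative only: both function names are hypothetical, and clamping to a caller-supplied upper bound is an assumed remedy used for demonstration, not a statement of what the commit does.

```c
#include <assert.h>
#include <stdint.h>

/* Value that ends up on the wire if the sum is simply narrowed to a byte. */
static uint8_t
wire_advskew_truncated(int advskew, int demotion)
{
	/* A negative sum wraps modulo 256, e.g. -1 becomes 255. */
	return ((uint8_t)(advskew + demotion));
}

/* Same computation, clamped to [0, max_skew] before narrowing. */
static uint8_t
wire_advskew_clamped(int advskew, int demotion, int max_skew)
{
	int v = advskew + demotion;

	assert(max_skew >= 0 && max_skew <= UINT8_MAX);
	if (v < 0)
		v = 0;
	if (v > max_skew)
		v = max_skew;
	return ((uint8_t)v);
}
```

With an advskew of 0 and a demotion of -1, the truncated value is 255, so a peer advertising 100 sees itself as preferred, while the sender, comparing with the untruncated -1, still considers itself the best candidate.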
* tcp: Rack might retransmit forever.Randall Stewart2021-10-292-33/+71
| | | | | | | | | | If we get a Sacked peer with an MTU change we can retransmit forever if the last bytes are sacked and the client goes away (think power off). Then we never see the end condition and continually retransmit. Reviewed by: Michael Tuexen Sponsored by: Netflix Inc. Differential Revision: https://reviews.freebsd.org/D32671
* tcp: Rack at times can miscalculate the RTT from what it thinks is a persists probe response.Randall Stewart2021-10-291-7/+14
| | | | | | | | | | | | | | Turns out that if a peer sends in a window update right after rack fires off a persists probe, we can mis-interpret the window update and calculate a bogus RTT (very short). We still process the window update and send the data but we incorrectly generate an RTT. We should be only doing the RTT stuff if the rwnd is still small and has not changed. Reviewed by: Michael Tuexen Sponsored by: Netflix Inc. Differential Revision: https://reviews.freebsd.org/D32717
* Enable net.inet.tcp.nolocaltimewait.Gleb Smirnoff2021-10-281-3/+3
| | | | | This feature has been used for many years at large sites and didn't show any pitfalls.
* mroute: add missing WUNLOCKWojciech Macek2021-10-281-1/+3
| | | | | | | Add missing WUNLOCK as in all other error cases. Reported by: Stormshield Obtained from: Semihalf
* mroute: fix memory leakWojciech Macek2021-10-281-0/+3
| | | | | | | | | Add MFC to linked list to store incoming packets before MCAST JOIN was captured. Sponsored by: Stormshield Obtained from: Semihalf MFC after: 2 weeks
* rack: Update the fast send block on setsockopt(2)Gleb Smirnoff2021-10-271-1/+28
| | | | | | | | | | Rack caches TCP/IP header for fast send, so it doesn't call tcpip_fillheaders(). After certain socket option changes, namely IPV6_TCLASS, IP_TOS and IP_TTL it needs to update its fast block to be in sync with the inpcb. Reviewed by: rrs Differential Revision: https://reviews.freebsd.org/D32655
* Factor out tcp6_use_min_mtu() to handle IPV6_USE_MIN_MTU by TCP.Gleb Smirnoff2021-10-275-37/+84
| | | | | | | | | | Pass control for IP/IP6 level options from generic tcp_ctloutput_set() down to per-stack ctloutput. Call tcp6_use_min_mtu() from tcp stack tcp_default_ctloutput(). Reviewed by: rrs Differential Revision: https://reviews.freebsd.org/D32655
* Several IP level socket options may affect TCP.Gleb Smirnoff2021-10-271-21/+54
| | | | | | | | | | | | | After handling them in IP level ctloutput, pass them down to TCP ctloutput. We already have a hack to handle IPV6_USE_MIN_MTU. Leave it in place for now, but comment out how it should be handled. For IPv4 we are interested in IP_TOS and IP_TTL. Reviewed by: rrs Differential Revision: https://reviews.freebsd.org/D32655
* Split tcp_ctloutput() into set/get parts.Gleb Smirnoff2021-10-271-35/+90
| | | | | Reviewed by: rrs Differential Revision: https://reviews.freebsd.org/D32655
* tcp: socket option to get stack alias namePeter Lei2021-10-274-4/+44
| | | | | | | | TCP stack sysctl nodes are currently inserted using the stack name alias. Allow the user to get the current stack's alias to allow for programmatic sysctl access. Obtained from: Netflix
* tcp: The rack stack can incorrectly have an overflow when calculating a burst delay.Randall Stewart2021-10-261-3/+4
| | | | | | | | | | | | | If the congestion window is very large the fact that we multiply it by 1000 (for microseconds) can cause the uint32_t to overflow and we incorrectly calculate a very small divisor. This will then cause the burst timer to be very large when it should be 0. Instead let's make the three variables uint64_t and avoid the issue. Reviewed by: Michael Tuexen Sponsored by: Netflix Inc. Differential Revision: https://reviews.freebsd.org/D32668
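The wraparound itself can be shown with a few lines of C. This sketch only demonstrates the multiplication by 1000 in 32-bit versus 64-bit arithmetic; `scale32()` and `scale64()` are hypothetical names, and the surrounding burst-delay formula from rack is not reproduced here.

```c
#include <assert.h>
#include <stdint.h>

/* 32-bit product: wraps modulo 2^32 once cwnd exceeds about 4.29 million. */
static uint32_t
scale32(uint32_t cwnd)
{
	return ((uint32_t)(cwnd * UINT32_C(1000)));
}

/* 64-bit product: widening before the multiply keeps the full value. */
static uint64_t
scale64(uint32_t cwnd)
{
	uint64_t scaled = (uint64_t)cwnd * 1000u;

	assert(scaled / 1000u == cwnd);	/* no wraparound occurred */
	return (scaled);
}
```

For a 5,000,000-byte window the 32-bit product is 705,032,704 instead of 5,000,000,000, which is how a quantity derived from it ends up far too small.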
* tcp: allow new reno functions to be called from other CC modulesMichael Tuexen2021-10-251-13/+34
| | | | | | | | | | | | Some new reno functions use the internal data, but are also called from functions of other CC modules. Ensure that in this case, the internal data is not accessed. Reported by: syzbot+1d219ea351caa5109d4b@syzkaller.appspotmail.com Reported by: syzbot+b08144f8cad9c67258c5@syzkaller.appspotmail.com Reviewed by: rrs Sponsored by: Netflix, Inc. Differential Revision: https://reviews.freebsd.org/D32649
* Don't run ip_ctloutput() for divert socket.Gleb Smirnoff2021-10-251-1/+0
| | | | | | | | | It was here since divert(4) was introduced, probably just came with a protocol definition boilerplate. There is no useful socket option that can be set or get for a divert socket. Reviewed by: donner Differential Revision: https://reviews.freebsd.org/D32608
* Remove div_ctlinput().Gleb Smirnoff2021-10-251-13/+0
| | | | | | | | This function does nothing since 97d8d152c28b. It was introduced in 252f24a2cf40 with a sidenote "may not be needed". Reviewed by: donner Differential Revision: https://reviews.freebsd.org/D32608