aboutsummaryrefslogtreecommitdiff
path: root/sys/netinet/in_pcb.c
Commit message (Collapse)AuthorAgeFilesLines
...
* Continue work to optimize performance of "options MAC" when no MAC policyRobert Watson2009-06-031-2/+0
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | modules are loaded by avoiding mbuf label lookups when policies aren't loaded, pushing further socket locking into MAC policy modules, and avoiding locking MAC ifnet locks when no policies are loaded: - Check mac_policies_count before looking for mbuf MAC label m_tags in MAC Framework entry points. We will still pay label lookup costs if MAC policies are present but don't require labels (typically a single mbuf header field read, but perhaps further indirection if IPSEC or other m_tag consumers are in use). - Further push socket locking for socket-related access control checks and events into MAC policies from the MAC Framework, so that sockets are only locked if a policy specifically requires a lock to protect a label. This resolves lock order issues during sonewconn() and also in local domain socket cross-connect where multiple socket locks could not be held at once for the purposes of propagatig MAC labels across multiple sockets. Eliminate mac_policy_count check in some entry points where it no longer avoids locking. - Add mac_policy_count checking in some entry points relating to network interfaces that otherwise lock a global MAC ifnet lock used to protect ifnet labels. Obtained from: TrustedBSD Project Notes: svn path=/head/; revision=193391
* - Rename IP_NONLOCALOK IP socket option to IP_BINDANY, to be more consistentPawel Jakub Dawidek2009-06-011-7/+3
| | | | | | | | | | | | | | | | with OpenBSD (and BSD/OS originally). We can't easly do it SOL_SOCKET option as there is no more space for more SOL_SOCKET options, but this option also fits better as an IP socket option, it seems. - Implement this functionality also for IPv6 and RAW IP sockets. - Always compile it in (don't use additional kernel options). - Remove sysctl to turn this functionality on and off. - Introduce new privilege - PRIV_NETINET_BINDANY, which allows to use this functionality (currently only unjail root can use it). Discussed with: julian, adrian, jhb, rwatson, kmacy Notes: svn path=/head/; revision=193217
* Add hierarchical jails. A jail may further virtualize its environmentJamie Gritton2009-05-271-7/+9
| | | | | | | | | | | | | | | | | | | | | | | | | by creating a child jail, which is visible to that jail and to any parent jails. Child jails may be restricted more than their parents, but never less. Jail names reflect this hierarchy, being MIB-style dot-separated strings. Every thread now points to a jail, the default being prison0, which contains information about the physical system. Prison0's root directory is the same as rootvnode; its hostname is the same as the global hostname, and its securelevel replaces the global securelevel. Note that the variable "securelevel" has actually gone away, which should not cause any problems for code that properly uses securelevel_gt() and securelevel_ge(). Some jail-related permissions that were kept in global variables and set via sysctls are now per-jail settings. The sysctls still exist for backward compatibility, used only by the now-deprecated jail(2) system call. Approved by: bz (mentor) Notes: svn path=/head/; revision=192895
* Staticize two functions not used outside of in_pcb.c: in_pcbremlists() andRobert Watson2009-05-141-2/+4
| | | | | | | | | db_print_inpcb(). MFC after: 1 month Notes: svn path=/head/; revision=192116
* Permit buiding kernels with options VIMAGE, restricted to only a singleMarko Zec2009-04-301-1/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | active network stack instance. Turning on options VIMAGE at compile time yields the following changes relative to default kernel build: 1) V_ accessor macros for virtualized variables resolve to structure fields via base pointers, instead of being resolved as fields in global structs or plain global variables. As an example, V_ifnet becomes: options VIMAGE: ((struct vnet_net *) vnet_net)->_ifnet default build: vnet_net_0._ifnet options VIMAGE_GLOBALS: ifnet 2) INIT_VNET_* macros will declare and set up base pointers to be used by V_ accessor macros, instead of resolving to whitespace: INIT_VNET_NET(ifp->if_vnet); becomes struct vnet_net *vnet_net = (ifp->if_vnet)->mod_data[VNET_MOD_NET]; 3) Memory for vnet modules registered via vnet_mod_register() is now allocated at run time in sys/kern/kern_vimage.c, instead of per vnet module structs being declared as globals. If required, vnet modules can now request the framework to provide them with allocated bzeroed memory by filling in the vmi_size field in their vmi_modinfo structures. 4) structs socket, ifnet, inpcbinfo, tcpcb and syncache_head are extended to hold a pointer to the parent vnet. options VIMAGE builds will fill in those fields as required. 5) curvnet is introduced as a new global variable in options VIMAGE builds, always pointing to the default and only struct vnet. 6) struct sysctl_oid has been extended with additional two fields to store major and minor virtualization module identifiers, oid_v_subs and oid_v_mod. SYSCTL_V_* family of macros will fill in those fields accordingly, and store the offset in the appropriate vnet container struct in oid_arg1. In sysctl handlers dealing with virtualized sysctls, the SYSCTL_RESOLVE_V_ARG1() macro will compute the address of the target variable and make it available in arg1 variable for further processing. Unused fields in structs vnet_inet, vnet_inet6 and vnet_ipfw have been deleted. Reviewed by: bz, rwatson Approved by: julian (mentor) Notes: svn path=/head/; revision=191688
* Do not assume that ip6_moptions is always set, it isBruce M Simpson2009-04-291-1/+2
| | | | | | | a lazy-allocated structure. Notes: svn path=/head/; revision=191658
* Lock interface address lists in in_pcbladdr() when searching for aRobert Watson2009-04-191-1/+12
| | | | | | | | | | source address for a connection and there's no route or now interface for the route. MFC after: 2 weeks Notes: svn path=/head/; revision=191286
* Correct a number of evolved problems with inp_vflag and inp_flags:Robert Watson2009-03-151-19/+19
| | | | | | | | | | | | | | | | | | | | | | | | certain flags that should have been in inp_flags ended up in inp_vflag, meaning that they were inconsistently locked, and in one case, interpreted. Move the following flags from inp_vflag to gaps in the inp_flags space (and clean up the inp_flags constants to make gaps more obvious to future takers): INP_TIMEWAIT INP_SOCKREF INP_ONESBCAST INP_DROPPED Some aspects of this change have no effect on kernel ABI at all, as these are UDP/TCP/IP-internal uses; however, netstat and sockstat detect INP_TIMEWAIT when listing TCP sockets, so any MFC will need to take this into account. MFC after: 1 week (or after dependencies are MFC'd) Reviewed by: bz Notes: svn path=/head/; revision=189848
* Add INP_INHASHLIST flag for inpcb->inp_flags to indicate whetherRobert Watson2009-03-111-4/+10
| | | | | | | | | | | | | or not the inpcb is currenty on various hash lookup lists, rather than using (lport != 0) to detect this. This means that the full 4-tuple of a connection can be retained after close, which should lead to more sensible netstat output in the window between TCP close and socket close. MFC after: 2 weeks Notes: svn path=/head/; revision=189657
* Remove redundant calls of prison_local_ip4 in in_pcbbind_setup, and ofJamie Gritton2009-02-051-11/+4
| | | | | | | | | prison_local_ip6 in in6_pcbbind. Approved by: bz (mentor) Notes: svn path=/head/; revision=188148
* Standardize the various prison_foo_ip[46] functions and prison_if toJamie Gritton2009-02-051-29/+28
| | | | | | | | | | | | | | | | | | return zero on success and an error code otherwise. The possible errors are EADDRNOTAVAIL if an address being checked for doesn't match the prison, and EAFNOSUPPORT if the prison doesn't have any addresses in that address family. For most callers of these functions, use the returned error code instead of e.g. a hard-coded EADDRNOTAVAIL or EINVAL. Always include a jailed() check in these functions, where a non-jailed cred always returns success (and makes no changes). Remove the explicit jailed() checks that preceded many of the function calls. Approved by: bz (mentor) Notes: svn path=/head/; revision=188144
* For consistency with prison_{local,remote,check}_ipN renameBjoern A. Zeeb2009-01-251-4/+4
| | | | | | | | | | prison_getipN to prison_get_ipN. Submitted by: jamie (as part of a larger patch) MFC after: 1 week Notes: svn path=/head/; revision=187684
* Fix fat-fingered comment.Adrian Chadd2009-01-091-1/+1
| | | | | | | Noticed-by: julian Notes: svn path=/head/; revision=186963
* Comment some potentially confusing logic.Adrian Chadd2009-01-091-0/+5
| | | | | | | | | Nitpicking by: mlaier MFC after: 2 weeks Notes: svn path=/head/; revision=186959
* Implement a new IP option (not compiled/enabled by default) to allowAdrian Chadd2009-01-091-1/+6
| | | | | | | | | | | | | | | | | | | | | | | | | applications to specify a non-local IP address when bind()'ing a socket to a local endpoint. This allows applications to spoof the client IP address of connections if (obviously!) they somehow are able to receive the traffic normally destined to said clients. This patch doesn't include any changes to ipfw or the bridging code to redirect the client traffic through the PCB checks so TCP gets a shot at it. The normal behaviour is that packets with a non-local destination IP address are not handled locally. This can be dealth with some IPFW hackery; modifications to IPFW to make this less hacky will occur in subsequent commmits. Thanks to Julian Elischer and others at Ironport. This work was approved and donated before Cisco acquired them. Obtained from: Julian Elischer and others MFC after: 2 weeks Notes: svn path=/head/; revision=186955
* Use inc_flags instead of the inc_isipv6 alias which so farBjoern A. Zeeb2008-12-171-1/+1
| | | | | | | | | | | | | | | | | had been the only flag with random usage patterns. Switch inc_flags to be used as a real bit field by using INC_ISIPV6 with bitops to check for the 'isipv6' condition. While here fix a place or two where in case of v4 inc_flags were not properly initialized before.[1] Found by: rwatson during review [1] Discussed with: rwatson Reviewed by: rwatson MFC after: 4 weeks Notes: svn path=/head/; revision=186222
* This main goals of this project are:Qing Li2008-12-151-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | 1. separating L2 tables (ARP, NDP) from the L3 routing tables 2. removing as much locking dependencies among these layers as possible to allow for some parallelism in the search operations 3. simplify the logic in the routing code, The most notable end result is the obsolescent of the route cloning (RTF_CLONING) concept, which translated into code reduction in both IPv4 ARP and IPv6 NDP related modules, and size reduction in struct rtentry{}. The change in design obsoletes the semantics of RTF_CLONING, RTF_WASCLONE and RTF_LLINFO routing flags. The userland applications such as "arp" and "ndp" have been modified to reflect those changes. The output from "netstat -r" shows only the routing entries. Quite a few developers have contributed to this project in the past: Glebius Smirnoff, Luigi Rizzo, Alessandro Cerri, and Andre Oppermann. And most recently: - Kip Macy revised the locking code completely, thus completing the last piece of the puzzle, Kip has also been conducting active functional testing - Sam Leffler has helped me improving/refactoring the code, and provided valuable reviews - Julian Elischer setup the perforce tree for me and has helped me maintaining that branch before the svn conversion Notes: svn path=/head/; revision=186119
* Add a check, that is currently under discussion for 8 but that we needBjoern A. Zeeb2008-12-141-0/+4
| | | | | | | | | | | | | | | | | | | | | to keep for 7-STABLE when MFCing in_pcbladdr() to not change the behaviour there. With this a destination route via a loopback interface is treated as a valid and reachable thing for IPv4 source address selection, even though nothing of that network is ever directly reachable, but it is more like a blackhole route. With this the source address will be selected and IPsec can grab the packets before we would discard them at a later point, encapsulate them and send them out from a different tunnel endpoint IP. Discussed on: net Reported by: Frank Behrens <frank@harz.behrens.de> Tested by: Frank Behrens <frank@harz.behrens.de> MFC after: 4 weeks (just so that I get the mail) Notes: svn path=/head/; revision=186086
* Remove inconsistent white space from in_pcballoc().Robert Watson2008-12-101-2/+0
| | | | | | | MFC after: pretty soon Notes: svn path=/head/; revision=185858
* Add a reference count to struct inpcb, which may be explicitlyRobert Watson2008-12-081-12/+82
| | | | | | | | | | | | | | | | | | | | | | | | | | incremented using in_pcbref(), and decremented using in_pcbfree() or inpcbrele(). Protocols using only current in_pcballoc() and in_pcbfree() calls will see the same semantics, but it is now possible for TCP to call in_pcbref() and in_pcbrele() to prevent an inpcb from being freed when both tcbinfo and per-inpcb locks are released. This makes it possible to safely transition from holding only the inpcb lock to both tcbinfo and inpcb lock without re-looking up a connection in the input path, timer path, etc. Notice that in_pcbrele() does not unlock the connection after decrementing the refcount, if the connection remains, so that the caller can continue to use it; in_pcbrele() returns a flag indicating whether or not the inpcb pointer is still valid, and in_pcbfee() is now a simple wrapper around in_pcbrele(). MFC after: 1 month Discussed with: bz, kmacy Reviewed by: bz, gnn, kmacy Tested by: kmacy Notes: svn path=/head/; revision=185773
* Rather than using hidden includes (with cicular dependencies),Bjoern A. Zeeb2008-12-021-0/+2
| | | | | | | | | | | | | | directly include only the header files needed. This reduces the unneeded spamming of various headers into lots of files. For now, this leaves us with very few modules including vnet.h and thus needing to depend on opt_route.h. Reviewed by: brooks, gnn, des, zec, imp Sponsored by: The FreeBSD Foundation Notes: svn path=/head/; revision=185571
* MFp4:Bjoern A. Zeeb2008-11-291-51/+111
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Bring in updated jail support from bz_jail branch. This enhances the current jail implementation to permit multiple addresses per jail. In addtion to IPv4, IPv6 is supported as well. Due to updated checks it is even possible to have jails without an IP address at all, which basically gives one a chroot with restricted process view, no networking,.. SCTP support was updated and supports IPv6 in jails as well. Cpuset support permits jails to be bound to specific processor sets after creation. Jails can have an unrestricted (no duplicate protection, etc.) name in addition to the hostname. The jail name cannot be changed from within a jail and is considered to be used for management purposes or as audit-token in the future. DDB 'show jails' command was added to aid debugging. Proper compat support permits 32bit jail binaries to be used on 64bit systems to manage jails. Also backward compatibility was preserved where possible: for jail v1 syscalls, as well as with user space management utilities. Both jail as well as prison version were updated for the new features. A gap was intentionally left as the intermediate versions had been used by various patches floating around the last years. Bump __FreeBSD_version for the afore mentioned and in kernel changes. Special thanks to: - Pawel Jakub Dawidek (pjd) for his multi-IPv4 patches and Olivier Houchard (cognet) for initial single-IPv6 patches. - Jeff Roberson (jeff) and Randall Stewart (rrs) for their help, ideas and review on cpuset and SCTP support. - Robert Watson (rwatson) for lots and lots of help, discussions, suggestions and review of most of the patch at various stages. - John Baldwin (jhb) for his help. - Simon L. Nielsen (simon) as early adopter testing changes on cluster machines as well as all the testers and people who provided feedback the last months on freebsd-jail and other channels. - My employer, CK Software GmbH, for the support so I could work on this. Reviewed by: (see above) MFC after: 3 months (this is just so that I get the mail) X-MFC Before: 7.2-RELEASE if possible Notes: svn path=/head/; revision=185435
* Replace most INP_CHECK_SOCKAF() uses checking if it is anBjoern A. Zeeb2008-11-271-2/+1
| | | | | | | | | | | | IPv6 socket by comparing a constant inp vflag. This is expected to help to reduce extra locking. Suggested by: rwatson Reviewed by: rwatson MFC after: 6 weeks Notes: svn path=/head/; revision=185371
* Merge in6_pcbfree() into in_pcbfree() which after the previousBjoern A. Zeeb2008-11-271-1/+8
| | | | | | | | | | | | | IPsec change in r185366 only differed in two additonal IPv6 lines. Rather than splattering conditional code everywhere add the v6 check centrally at this single place. Reviewed by: rwatson (as part of a larger changset) MFC after: 6 weeks (*) (*) possibly need to leave a stub wrapper in 7 to keep the symbol. Notes: svn path=/head/; revision=185370
* Unify ipsec[46]_delete_pcbpolicy in ipsec_delete_pcbpolicy.Bjoern A. Zeeb2008-11-271-1/+1
| | | | | | | | | | | | Ignoring different names because of macros (in6pcb, in6p_sp) and inp vs. in6p variable name both functions were entirely identical. Reviewed by: rwatson (as part of a larger changeset) MFC after: 6 weeks (*) (*) possibly need to leave a stub wrappers in 7 to keep the symbols. Notes: svn path=/head/; revision=185366
* Merge more of currently non-functional (i.e. resolving toMarko Zec2008-11-261-1/+2
| | | | | | | | | | | | | | | | | | | | whitespace) macros from p4/vimage branch. Do a better job at enclosing all instantiations of globals scheduled for virtualization in #ifdef VIMAGE_GLOBALS blocks. De-virtualize and mark as const saorder_state_alive and saorder_state_any arrays from ipsec code, given that they are never updated at runtime, so virtualizing them would be pointless. Reviewed by: bz, julian Approved by: julian (mentor) Obtained from: //depot/projects/vimage-commit2/... X-MFC after: never Sponsored by: NLnet Foundation, The FreeBSD Foundation Notes: svn path=/head/; revision=185348
* Unify the v4 and v6 versions of pcbdetach and pcbfree as goodBjoern A. Zeeb2008-11-261-3/+3
| | | | | | | | | | | | as possible so that they are easily diffable. No functional changes. Reviewed by: rwatson MFC after: 6 weeks Notes: svn path=/head/; revision=185333
* Change the initialization methodology for global variables scheduledMarko Zec2008-11-191-12/+14
| | | | | | | | | | | | | | | | | | | | | | | | | | | for virtualization. Instead of initializing the affected global variables at instatiation, assign initial values to them in initializer functions. As a rule, initialization at instatiation for such variables should never be introduced again from now on. Furthermore, enclose all instantiations of such global variables in #ifdef VIMAGE_GLOBALS blocks. Essentialy, this change should have zero functional impact. In the next phase of merging network stack virtualization infrastructure from p4/vimage branch, the new initialization methology will allow us to switch between using global variables and their counterparts residing in virtualization containers with minimum code churn, and in the long run allow us to intialize multiple instances of such container structures. Discussed at: devsummit Strassburg Reviewed by: bz, julian Approved by: julian (mentor) Obtained from: //depot/projects/vimage-commit2/... X-MFC after: never Sponsored by: NLnet Foundation, The FreeBSD Foundation Notes: svn path=/head/; revision=185088
* Retire the MALLOC and FREE macros. They are an abomination unto style(9).Dag-Erling Smørgrav2008-10-231-2/+2
| | | | | | | MFC after: 3 months Notes: svn path=/head/; revision=184205
* Update a comment which to my reading had been misplaced in rev. 1.12Bjoern A. Zeeb2008-10-201-2/+3
| | | | | | | | | | | | already (but probably had been way above as the code was there twice) and describe what was last changed in rev. 1.199 there (which now is in sync with in6_src.c r184096). Pointed at by: mlaier MFC after: 2 mmonths Notes: svn path=/head/; revision=184097
* Cache so_cred as inp_cred in the inpcb.Bjoern A. Zeeb2008-10-041-4/+8
| | | | | | | | | | | | | | | This means that inp_cred is always there, even after the socket has gone away. It also means that it is constant for the lifetime of the inp. Both facts lead to simpler code and possibly less locking. Suggested by: rwatson Reviewed by: rwatson MFC after: 6 weeks X-MFC Note: use a inp_pspare for inp_cred Notes: svn path=/head/; revision=183606
* Implement IPv4 source address selection for unbound sockets.Bjoern A. Zeeb2008-10-031-43/+205
| | | | | | | | | | | | | | | | | | | | | | | | | For the jail case we are already looping over the interface addresses before falling back to the only IP address of a jail in case of no match. This is in preparation for the upcoming multi-IPv4/v6/no-IP jail patch this change was developed with initially. This also changes the semantics of selecting the IP for processes within a jail as it now uses the same logic as outside the jail (with additional checks) but no longer is on a mutually exclusive code path. Benchmarks had shown no difference at 95.0% confidence for neither the plain nor the jail case (even with the additional overhead). See: http://lists.freebsd.org/pipermail/freebsd-net/2008-September/019531.html Inpsired by a patch from: Yahoo! (partially) Tested by: latest multi-IP jail patch users (implictly) Discussed with: rwatson (general things around this) Reviewed by: mostly silence (feedback from bms) Help with benchmarking from: kris MFC after: 2 months Notes: svn path=/head/; revision=183571
* Step 1.5 of importing the network stack virtualization infrastructureMarko Zec2008-10-021-31/+53
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | from the vimage project, as per plan established at devsummit 08/08: http://wiki.freebsd.org/Image/Notes200808DevSummit Introduce INIT_VNET_*() initializer macros, VNET_FOREACH() iterator macros, and CURVNET_SET() context setting macros, all currently resolving to NOPs. Prepare for virtualization of selected SYSCTL objects by introducing a family of SYSCTL_V_*() macros, currently resolving to their global counterparts, i.e. SYSCTL_V_INT() == SYSCTL_INT(). Move selected #defines from sys/sys/vimage.h to newly introduced header files specific to virtualized subsystems (sys/net/vnet.h, sys/netinet/vinet.h etc.). All the changes are verified to have zero functional impact at this point in time by doing MD5 comparision between pre- and post-change object files(*). (*) netipsec/keysock.c did not validate depending on compile time options. Implemented by: julian, bz, brooks, zec Reviewed by: julian, bz, brooks, kris, rwatson, ... Approved by: julian (mentor) Obtained from: //depot/projects/vimage-commit2/... X-MFC after: never Sponsored by: NLnet Foundation, The FreeBSD Foundation Notes: svn path=/head/; revision=183550
* Expand comments relating various detach/free/drop inpcb routines.Robert Watson2008-09-291-9/+31
| | | | | | | MFC after: 3 days Notes: svn path=/head/; revision=183461
* Commit step 1 of the vimage project, (network stack)Bjoern A. Zeeb2008-08-171-33/+34
| | | | | | | | | | | | | | | | | | | | | | | | | | | virtualization work done by Marko Zec (zec@). This is the first in a series of commits over the course of the next few weeks. Mark all uses of global variables to be virtualized with a V_ prefix. Use macros to map them back to their global names for now, so this is a NOP change only. We hope to have caught at least 85-90% of what is needed so we do not invalidate a lot of outstanding patches again. Obtained from: //depot/projects/vimage-commit2/... Reviewed by: brooks, des, ed, mav, julian, jamie, kris, rwatson, zec, ... (various people I forgot, different versions) md5 (with a bit of help) Sponsored by: NLnet Foundation, The FreeBSD Foundation X-MFC after: never V_Commit_Message_Reviewed_By: more people than the patch Notes: svn path=/head/; revision=181803
* Minor white space tweaks.Robert Watson2008-08-071-3/+3
| | | | | | | MFC after: 1 week Notes: svn path=/head/; revision=181365
* Correct comment typo.Robert Watson2008-08-071-2/+2
| | | | | | | MFC after: 1 week (after inpcb rwlocking) Notes: svn path=/head/; revision=181364
* Trying to fix compilation bustage:Tai-hwa Liang2008-07-221-3/+3
| | | | | | | | | | | - removing 'const' qualifier from an input parameter to conform to the type required by rw_assert(); - using in_addr->s_addr to retrive 32 bits address value. Observed by: tinderbox Notes: svn path=/head/; revision=180683
* make new accessor functions consistent with existing styleKip Macy2008-07-211-3/+5
| | | | Notes: svn path=/head/; revision=180678
* Add accessor functions for socket fields.Kip Macy2008-07-211-0/+14
| | | | | | | MFC after: 1 week Notes: svn path=/head/; revision=180641
* add inpcb accessor functions for fields needed by TOE devicesKip Macy2008-07-211-0/+54
| | | | Notes: svn path=/head/; revision=180640
* ia is a pointer thus use NULL rather then 0 for initialization andBjoern A. Zeeb2008-07-201-5/+5
| | | | | | | | | in comparisons to make this more obvious. MFC after: 5 days Notes: svn path=/head/; revision=180629
* Pass the ucred along into in{,6}_pcblookup_local for upcomingBjoern A. Zeeb2008-07-101-7/+7
| | | | | | | | | prison checks. Reviewed by: rwatson Notes: svn path=/head/; revision=180427
* For consistency take lport as u_short in in{,6}_pcblookup_local.Bjoern A. Zeeb2008-07-101-2/+1
| | | | | | | | | All callers either pass in an u_short or u_int16_t. Reviewed by: rwatson Notes: svn path=/head/; revision=180425
* For consistency with the rest of the function use the locally cachedBjoern A. Zeeb2008-07-091-1/+1
| | | | | | | | | pointer pcbinfo rather than inp->inp_pcbinfo. MFC after: 3 weeks Notes: svn path=/head/; revision=180392
* Add code to allow the system to handle multiple routing tables.Julian Elischer2008-05-091-1/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This particular implementation is designed to be fully backwards compatible and to be MFC-able to 7.x (and 6.x) Currently the only protocol that can make use of the multiple tables is IPv4 Similar functionality exists in OpenBSD and Linux. From my notes: ----- One thing where FreeBSD has been falling behind, and which by chance I have some time to work on is "policy based routing", which allows different packet streams to be routed by more than just the destination address. Constraints: ------------ I want to make some form of this available in the 6.x tree (and by extension 7.x) , but FreeBSD in general needs it so I might as well do it in -current and back port the portions I need. One of the ways that this can be done is to have the ability to instantiate multiple kernel routing tables (which I will now refer to as "Forwarding Information Bases" or "FIBs" for political correctness reasons). Which FIB a particular packet uses to make the next hop decision can be decided by a number of mechanisms. The policies these mechanisms implement are the "Policies" referred to in "Policy based routing". One of the constraints I have if I try to back port this work to 6.x is that it must be implemented as a EXTENSION to the existing ABIs in 6.x so that third party applications do not need to be recompiled in timespan of the branch. This first version will not have some of the bells and whistles that will come with later versions. It will, for example, be limited to 16 tables in the first commit. Implementation method, Compatible version. (part 1) ------------------------------- For this reason I have implemented a "sufficient subset" of a multiple routing table solution in Perforce, and back-ported it to 6.x. (also in Perforce though not always caught up with what I have done in -current/P4). The subset allows a number of FIBs to be defined at compile time (8 is sufficient for my purposes in 6.x) and implements the changes needed to allow IPV4 to use them. I have not done the changes for ipv6 simply because I do not need it, and I do not have enough knowledge of ipv6 (e.g. neighbor discovery) needed to do it. Other protocol families are left untouched and should there be users with proprietary protocol families, they should continue to work and be oblivious to the existence of the extra FIBs. To understand how this is done, one must know that the current FIB code starts everything off with a single dimensional array of pointers to FIB head structures (One per protocol family), each of which in turn points to the trie of routes available to that family. The basic change in the ABI compatible version of the change is to extent that array to be a 2 dimensional array, so that instead of protocol family X looking at rt_tables[X] for the table it needs, it looks at rt_tables[Y][X] when for all protocol families except ipv4 Y is always 0. Code that is unaware of the change always just sees the first row of the table, which of course looks just like the one dimensional array that existed before. The entry points rtrequest(), rtalloc(), rtalloc1(), rtalloc_ign() are all maintained, but refer only to the first row of the array, so that existing callers in proprietary protocols can continue to do the "right thing". Some new entry points are added, for the exclusive use of ipv4 code called in_rtrequest(), in_rtalloc(), in_rtalloc1() and in_rtalloc_ign(), which have an extra argument which refers the code to the correct row. In addition, there are some new entry points (currently called rtalloc_fib() and friends) that check the Address family being looked up and call either rtalloc() (and friends) if the protocol is not IPv4 forcing the action to row 0 or to the appropriate row if it IS IPv4 (and that info is available). These are for calling from code that is not specific to any particular protocol. The way these are implemented would change in the non ABI preserving code to be added later. One feature of the first version of the code is that for ipv4, the interface routes show up automatically on all the FIBs, so that no matter what FIB you select you always have the basic direct attached hosts available to you. (rtinit() does this automatically). You CAN delete an interface route from one FIB should you want to but by default it's there. ARP information is also available in each FIB. It's assumed that the same machine would have the same MAC address, regardless of which FIB you are using to get to it. This brings us as to how the correct FIB is selected for an outgoing IPV4 packet. Firstly, all packets have a FIB associated with them. if nothing has been done to change it, it will be FIB 0. The FIB is changed in the following ways. Packets fall into one of a number of classes. 1/ locally generated packets, coming from a socket/PCB. Such packets select a FIB from a number associated with the socket/PCB. This in turn is inherited from the process, but can be changed by a socket option. The process in turn inherits it on fork. I have written a utility call setfib that acts a bit like nice.. setfib -3 ping target.example.com # will use fib 3 for ping. It is an obvious extension to make it a property of a jail but I have not done so. It can be achieved by combining the setfib and jail commands. 2/ packets received on an interface for forwarding. By default these packets would use table 0, (or possibly a number settable in a sysctl(not yet)). but prior to routing the firewall can inspect them (see below). (possibly in the future you may be able to associate a FIB with packets received on an interface.. An ifconfig arg, but not yet.) 3/ packets inspected by a packet classifier, which can arbitrarily associate a fib with it on a packet by packet basis. A fib assigned to a packet by a packet classifier (such as ipfw) would over-ride a fib associated by a more default source. (such as cases 1 or 2). 4/ a tcp listen socket associated with a fib will generate accept sockets that are associated with that same fib. 5/ Packets generated in response to some other packet (e.g. reset or icmp packets). These should use the FIB associated with the packet being reponded to. 6/ Packets generated during encapsulation. gif, tun and other tunnel interfaces will encapsulate using the FIB that was in effect withthe proces that set up the tunnel. thus setfib 1 ifconfig gif0 [tunnel instructions] will set the fib for the tunnel to use to be fib 1. Routing messages would be associated with their process, and thus select one FIB or another. messages from the kernel would be associated with the fib they refer to and would only be received by a routing socket associated with that fib. (not yet implemented) In addition Netstat has been edited to be able to cope with the fact that the array is now 2 dimensional. (It looks in system memory using libkvm (!)). Old versions of netstat see only the first FIB. In addition two sysctls are added to give: a) the number of FIBs compiled in (active) b) the default FIB of the calling process. Early testing experience: ------------------------- Basically our (IronPort's) appliance does this functionality already using ipfw fwd but that method has some drawbacks. For example, It can't fully simulate a routing table because it can't influence the socket's choice of local address when a connect() is done. Testing during the generating of these changes has been remarkably smooth so far. Multiple tables have co-existed with no notable side effects, and packets have been routes accordingly. ipfw has grown 2 new keywords: setfib N ip from anay to any count ip from any to any fib N In pf there seems to be a requirement to be able to give symbolic names to the fibs but I do not have that capacity. I am not sure if it is required. SCTP has interestingly enough built in support for this, called VRFs in Cisco parlance. it will be interesting to see how that handles it when it suddenly actually does something. Where to next: -------------------- After committing the ABI compatible version and MFCing it, I'd like to proceed in a forward direction in -current. this will result in some roto-tilling in the routing code. Firstly: the current code's idea of having a separate tree per protocol family, all of the same format, and pointed to by the 1 dimensional array is a bit silly. Especially when one considers that there is code that makes assumptions about every protocol having the same internal structures there. Some protocols don't WANT that sort of structure. (for example the whole idea of a netmask is foreign to appletalk). This needs to be made opaque to the external code. My suggested first change is to add routing method pointers to the 'domain' structure, along with information pointing the data. instead of having an array of pointers to uniform structures, there would be an array pointing to the 'domain' structures for each protocol address domain (protocol family), and the methods this reached would be called. The methods would have an argument that gives FIB number, but the protocol would be free to ignore it. When the ABI can be changed it raises the possibilty of the addition of a fib entry into the "struct route". Currently, the structure contains the sockaddr of the desination, and the resulting fib entry. To make this work fully, one could add a fib number so that given an address and a fib, one can find the third element, the fib entry. Interaction with the ARP layer/ LL layer would need to be revisited as well. Qing Li has been working on this already. This work was sponsored by Ironport Systems/Cisco Reviewed by: several including rwatson, bz and mlair (parts each) Obtained from: Ironport systems/Cisco Notes: svn path=/head/; revision=178888
* When querying the local or foreign address from an IP socket, acquireRobert Watson2008-04-191-6/+6
| | | | | | | | | | | only a read lock on the inpcb. When an external module requests a read lock, acquire only a read lock. MFC after: 3 months Notes: svn path=/head/; revision=178318
* Convert pcbinfo and inpcb mutexes to rwlocks, and modify macros toRobert Watson2008-04-171-31/+40
| | | | | | | | | | | | | | | | | | explicitly select write locking for all use of the inpcb mutex. Update some pcbinfo lock assertions to assert locked rather than write-locked, although in practice almost all uses of the pcbinfo rwlock main exclusive, and all instances of inpcb lock acquisition are exclusive. This change should introduce (ideally) little functional change. However, it lays the groundwork for significantly increased parallelism in the TCP/IP code. MFC after: 3 months Tested by: kris (superset of committered patch) Notes: svn path=/head/; revision=178285
* In in_pcbnotifyall() and in6_pcbnotify(), use LIST_FOREACH_SAFE() andRobert Watson2008-04-061-5/+2
| | | | | | | | | | eliminate unnecessary local variable caching of the list head pointer, making the code a bit easier to read. MFC after: 3 weeks Notes: svn path=/head/; revision=177961
* change inp_wlock_assert to inp_lock_assertKip Macy2008-03-241-2/+2
| | | | Notes: svn path=/head/; revision=177575