Diffstat (limited to 'share/man/man7/tuning.7')
-rw-r--r-- | share/man/man7/tuning.7 | 997
1 files changed, 997 insertions, 0 deletions
diff --git a/share/man/man7/tuning.7 b/share/man/man7/tuning.7 new file mode 100644 index 000000000000..9041181984c3 --- /dev/null +++ b/share/man/man7/tuning.7 @@ -0,0 +1,997 @@ +.\" Copyright (C) 2001 Matthew Dillon. All rights reserved. +.\" +.\" Redistribution and use in source and binary forms, with or without +.\" modification, are permitted provided that the following conditions +.\" are met: +.\" 1. Redistributions of source code must retain the above copyright +.\" notice, this list of conditions and the following disclaimer. +.\" 2. Redistributions in binary form must reproduce the above copyright +.\" notice, this list of conditions and the following disclaimer in the +.\" documentation and/or other materials provided with the distribution. +.\" +.\" THIS SOFTWARE IS PROVIDED BY AUTHOR AND CONTRIBUTORS ``AS IS'' AND +.\" ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE +.\" IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE +.\" ARE DISCLAIMED. IN NO EVENT SHALL AUTHOR OR CONTRIBUTORS BE LIABLE +.\" FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL +.\" DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS +.\" OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) +.\" HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT +.\" LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY +.\" OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF +.\" SUCH DAMAGE. +.\" +.\" $FreeBSD$ +.\" +.Dd January 23, 2009 +.Dt TUNING 7 +.Os +.Sh NAME +.Nm tuning +.Nd performance tuning under FreeBSD +.Sh SYSTEM SETUP - DISKLABEL, NEWFS, TUNEFS, SWAP +When using +.Xr bsdlabel 8 +or +.Xr sysinstall 8 +to lay out your file systems on a hard disk it is important to remember +that hard drives can transfer data much more quickly from outer tracks +than they can from inner tracks. +To take advantage of this you should +try to pack your smaller file systems and swap closer to the outer tracks, +follow with the larger file systems, and end with the largest file systems. +It is also important to size system standard file systems such that you +will not be forced to resize them later as you scale the machine up. +I usually create, in order, a 128M root, 1G swap, 128M +.Pa /var , +128M +.Pa /var/tmp , +3G +.Pa /usr , +and use any remaining space for +.Pa /home . +.Pp +You should typically size your swap space to approximately 2x main memory +for systems with less than 2GB of RAM, or approximately 1x main memory +if you have more. +If you do not have a lot of RAM, though, you will generally want a lot +more swap. +It is not recommended that you configure any less than +256M of swap on a system and you should keep in mind future memory +expansion when sizing the swap partition. +The kernel's VM paging algorithms are tuned to perform best when there is +at least 2x swap versus main memory. +Configuring too little swap can lead +to inefficiencies in the VM page scanning code as well as create issues +later on if you add more memory to your machine. +Finally, on larger systems +with multiple SCSI disks (or multiple IDE disks operating on different +controllers), we strongly recommend that you configure swap on each drive. +The swap partitions on the drives should be approximately the same size. +The kernel can handle arbitrary sizes but +internal data structures scale to 4 times the largest swap partition. 
+Keeping +the swap partitions near the same size will allow the kernel to optimally +stripe swap space across the N disks. +Do not worry about overdoing it a +little, swap space is the saving grace of +.Ux +and even if you do not normally use much swap, it can give you more time to +recover from a runaway program before being forced to reboot. +.Pp +How you size your +.Pa /var +partition depends heavily on what you intend to use the machine for. +This +partition is primarily used to hold mailboxes, the print spool, and log +files. +Some people even make +.Pa /var/log +its own partition (but except for extreme cases it is not worth the waste +of a partition ID). +If your machine is intended to act as a mail +or print server, +or you are running a heavily visited web server, you should consider +creating a much larger partition \(en perhaps a gig or more. +It is very easy +to underestimate log file storage requirements. +.Pp +Sizing +.Pa /var/tmp +depends on the kind of temporary file usage you think you will need. +128M is +the minimum we recommend. +Also note that sysinstall will create a +.Pa /tmp +directory. +Dedicating a partition for temporary file storage is important for +two reasons: first, it reduces the possibility of file system corruption +in a crash, and second it reduces the chance of a runaway process that +fills up +.Oo Pa /var Oc Ns Pa /tmp +from blowing up more critical subsystems (mail, +logging, etc). +Filling up +.Oo Pa /var Oc Ns Pa /tmp +is a very common problem to have. +.Pp +In the old days there were differences between +.Pa /tmp +and +.Pa /var/tmp , +but the introduction of +.Pa /var +(and +.Pa /var/tmp ) +led to massive confusion +by program writers so today programs haphazardly use one or the +other and thus no real distinction can be made between the two. +So it makes sense to have just one temporary directory and +softlink to it from the other +.Pa tmp +directory locations. +However you handle +.Pa /tmp , +the one thing you do not want to do is leave it sitting +on the root partition where it might cause root to fill up or possibly +corrupt root in a crash/reboot situation. +.Pp +The +.Pa /usr +partition holds the bulk of the files required to support the system and +a subdirectory within it called +.Pa /usr/local +holds the bulk of the files installed from the +.Xr ports 7 +hierarchy. +If you do not use ports all that much and do not intend to keep +system source +.Pq Pa /usr/src +on the machine, you can get away with +a 1 gigabyte +.Pa /usr +partition. +However, if you install a lot of ports +(especially window managers and Linux-emulated binaries), we recommend +at least a 2 gigabyte +.Pa /usr +and if you also intend to keep system source +on the machine, we recommend a 3 gigabyte +.Pa /usr . +Do not underestimate the +amount of space you will need in this partition, it can creep up and +surprise you! +.Pp +The +.Pa /home +partition is typically used to hold user-specific data. +I usually size it to the remainder of the disk. +.Pp +Why partition at all? +Why not create one big +.Pa / +partition and be done with it? +Then I do not have to worry about undersizing things! +Well, there are several reasons this is not a good idea. +First, +each partition has different operational characteristics and separating them +allows the file system to tune itself to those characteristics. +For example, +the root and +.Pa /usr +partitions are read-mostly, with very little writing, while +a lot of reading and writing could occur in +.Pa /var +and +.Pa /var/tmp . 
+By properly +partitioning your system, fragmentation introduced in the smaller, more +heavily write-loaded partitions will not bleed over into the mostly-read +partitions. +Additionally, keeping the write-loaded partitions closer to +the edge of the disk (i.e., before the really big partitions instead of after +in the partition table) will increase I/O performance in the partitions +where you need it the most. +Now it is true that you might also need I/O +performance in the larger partitions, but they are so large that shifting +them more towards the edge of the disk will not lead to a significant +performance improvement whereas moving +.Pa /var +to the edge can have a huge impact. +Finally, there are safety concerns. +Having a small, neat root partition that +is essentially read-only gives it a greater chance of surviving a bad crash +intact. +.Pp +Properly partitioning your system also allows you to tune +.Xr newfs 8 +and +.Xr tunefs 8 +parameters. +Tuning +.Xr newfs 8 +requires more experience but can lead to significant improvements in +performance. +There are three parameters that are relatively safe to tune: +.Em blocksize , bytes/i-node , +and +.Em cylinders/group . +.Pp +.Fx +performs best when using 8K or 16K file system block sizes. +The default file system block size is 16K, +which provides best performance for most applications, +with the exception of those that perform random access on large files +(such as database server software). +Such applications tend to perform better with a smaller block size, +although modern disk characteristics are such that the performance +gain from using a smaller block size may not be worth consideration. +Using a block size larger than 16K +can cause fragmentation of the buffer cache and +lead to lower performance. +.Pp +The defaults may be unsuitable +for a file system that requires a very large number of i-nodes +or is intended to hold a large number of very small files. +Such a file system should be created with an 8K or 4K block size. +This also requires you to specify a smaller +fragment size. +We recommend always using a fragment size that is 1/8 +the block size (less testing has been done on other fragment size factors). +The +.Xr newfs 8 +options for this would be +.Dq Li "newfs -f 1024 -b 8192 ..." . +.Pp +If a large partition is intended to be used to hold fewer, larger files, such +as database files, you can increase the +.Em bytes/i-node +ratio, which reduces the number of i-nodes (maximum number of files and +directories that can be created) for that partition. +Decreasing the number +of i-nodes in a file system can greatly reduce +.Xr fsck 8 +recovery times after a crash. +Do not use this option +unless you are actually storing large files on the partition, because if you +overcompensate you can wind up with a file system that has lots of free +space remaining but cannot accommodate any more files. +Using 32768, 65536, or 262144 bytes/i-node is recommended. +You can go higher but +it will have only incremental effects on +.Xr fsck 8 +recovery times. +For example, +.Dq Li "newfs -i 32768 ..." . +.Pp +.Xr tunefs 8 +may be used to further tune a file system. +This command can be run in +single-user mode without having to reformat the file system. +However, this is possibly the most abused program in the system. +Many people attempt to +increase available file system space by setting the min-free percentage to 0. +This can lead to severe file system fragmentation and we do not recommend +that you do this.
+Really the only +.Xr tunefs 8 +option worthwhile here is turning on +.Em softupdates +with +.Dq Li "tunefs -n enable /filesystem" . +(Note: in +.Fx 4.5 +and later, softupdates can be turned on using the +.Fl U +option to +.Xr newfs 8 , +and +.Xr sysinstall 8 +will typically enable softupdates automatically for non-root file systems). +Softupdates drastically improves meta-data performance, mainly file +creation and deletion. +We recommend enabling softupdates on most file systems; however, there +are two limitations to softupdates that you should be aware of when +determining whether to use it on a file system. +First, softupdates guarantees file system consistency in the +case of a crash but could very easily be several seconds (even a minute!\&) +behind on pending writes to the physical disk. +If you crash you may lose more work +than otherwise. +Second, softupdates delays the freeing of file system +blocks. +If you have a file system (such as the root file system) which is +close to full, doing a major update of it, e.g.\& +.Dq Li "make installworld" , +can run it out of space and cause the update to fail. +For this reason, softupdates will not be enabled on the root file system +during a typical install. +There is no loss of performance since the root +file system is rarely written to. +.Pp +A number of run-time +.Xr mount 8 +options exist that can help you tune the system. +The most obvious and most dangerous one is +.Cm async . +Only use this option in conjunction with +.Xr gjournal 8 , +as it is far too dangerous on a normal file system. +A less dangerous and more +useful +.Xr mount 8 +option is called +.Cm noatime . +.Ux +file systems normally update the last-accessed time of a file or +directory whenever it is accessed. +This operation is handled in +.Fx +with a delayed write and normally does not create a burden on the system. +However, if your system is accessing a huge number of files on a continuing +basis, the buffer cache can wind up getting polluted with atime updates, +creating a burden on the system. +For example, if you are running a heavily +loaded web site, or a news server with lots of readers, you might want to +consider turning off atime updates on your larger partitions with this +.Xr mount 8 +option. +However, you should not gratuitously turn off atime +updates everywhere. +For example, the +.Pa /var +file system customarily +holds mailboxes, and atime (in combination with mtime) is used to +determine whether a mailbox has new mail. +You might as well leave +atime turned on for mostly read-only partitions such as +.Pa / +and +.Pa /usr +as well. +This is especially useful for +.Pa / +since some system utilities +use the atime field for reporting. +.Sh STRIPING DISKS +In larger systems you can stripe partitions from several drives together +to create a much larger overall partition. +Striping can also improve +the performance of a file system by splitting I/O operations across two +or more disks. +The +.Xr gstripe 8 , +.Xr gvinum 8 , +and +.Xr ccdconfig 8 +utilities may be used to create simple striped file systems. +Generally +speaking, striping smaller partitions such as the root and +.Pa /var/tmp , +or essentially read-only partitions such as +.Pa /usr +is a complete waste of time. +You should only stripe partitions that require serious I/O performance, +typically +.Pa /var , /home , +or custom partitions used to hold databases and web pages. +Choosing the proper stripe size is also +important.
+File systems tend to store meta-data on power-of-2 boundaries +and you usually want to reduce seeking rather than increase seeking. +This +means you want to use a large off-center stripe size such as 1152 sectors +so sequential I/O does not seek both disks and so meta-data is distributed +across both disks rather than concentrated on a single disk. +If +you really need to get sophisticated, we recommend using a real hardware +RAID controller from the list of +.Fx +supported controllers. +.Sh SYSCTL TUNING +.Xr sysctl 8 +variables permit system behavior to be monitored and controlled at +run-time. +Some sysctls simply report on the behavior of the system; others allow +the system behavior to be modified; +some may be set at boot time using +.Xr rc.conf 5 , +but most will be set via +.Xr sysctl.conf 5 . +There are several hundred sysctls in the system, including many that appear +to be candidates for tuning but actually are not. +In this document we will only cover the ones that have the greatest effect +on the system. +.Pp +The +.Va kern.ipc.maxpipekva +loader tunable is used to set a hard limit on the +amount of kernel address space allocated to mapping of pipe buffers. +Use of the mapping allows the kernel to eliminate a copy of the +data from writer address space into the kernel, directly copying +the content of the mapped buffer to the reader. +Increasing this value to a higher setting, such as `25165824', might +improve performance on systems where space for mapping pipe buffers +is quickly exhausted. +This exhaustion is not fatal; however, it will only cause pipes +to fall back to using double-copy. +.Pp +The +.Va kern.ipc.shm_use_phys +sysctl defaults to 0 (off) and may be set to 0 (off) or 1 (on). +Setting +this parameter to 1 will cause all System V shared memory segments to be +mapped to unpageable physical RAM. +This feature only has an effect if you +are either (A) mapping small amounts of shared memory across many (hundreds) +of processes, or (B) mapping large amounts of shared memory across any +number of processes. +This feature allows the kernel to remove a great deal +of internal memory management page-tracking overhead at the cost of wiring +the shared memory into core, making it unswappable. +.Pp +The +.Va vfs.vmiodirenable +sysctl defaults to 1 (on). +This parameter controls how directories are cached +by the system. +Most directories are small and use but a single fragment +(typically 1K) in the file system and even less (typically 512 bytes) in +the buffer cache. +However, when operating in the default mode the buffer +cache will only cache a fixed number of directories even if you have a huge +amount of memory. +Turning on this sysctl allows the buffer cache to use +the VM Page Cache to cache the directories. +The advantage is that all of +memory is now available for caching directories. +The disadvantage is that +the minimum in-core memory used to cache a directory is the physical page +size (typically 4K) rather than 512 bytes. +We recommend turning this option off in memory-constrained environments; +however, when on, it will substantially improve the performance of services +that manipulate a large number of files. +Such services can include web caches, large mail systems, and news systems. +Turning on this option will generally not reduce performance even with the +wasted memory but you should experiment to find out. +.Pp +The +.Va vfs.write_behind +sysctl defaults to 1 (on).
+This tells the file system to issue media +writes as full clusters are collected, which typically occurs when writing +large sequential files. +The idea is to avoid saturating the buffer +cache with dirty buffers when it would not benefit I/O performance. +However, +this may stall processes, and under certain circumstances you may wish to turn +it off. +.Pp +The +.Va vfs.hirunningspace +sysctl determines how much outstanding write I/O may be queued to +disk controllers system-wide at any given instant. +The default is +usually sufficient but on machines with lots of disks you may want to bump +it up to four or five megabytes. +Note that setting too high a value +(exceeding the buffer cache's write threshold) can lead to extremely +bad clustering performance. +Do not set this value arbitrarily high! +Also, +higher write queueing values may add latency to reads occurring at the same +time. +.Pp +There are various other buffer-cache and VM page cache related sysctls. +We do not recommend modifying these values. +As of +.Fx 4.3 , +the VM system does an extremely good job tuning itself. +.Pp +The +.Va net.inet.tcp.sendspace +and +.Va net.inet.tcp.recvspace +sysctls are of particular interest if you are running network-intensive +applications. +They control the amount of send and receive buffer space +allowed for any given TCP connection. +The default sending buffer is 32K; the default receiving buffer +is 64K. +You can often +improve bandwidth utilization by increasing the default at the cost of +eating up more kernel memory for each connection. +We do not recommend +increasing the defaults if you are serving hundreds or thousands of +simultaneous connections because it is possible to quickly run the system +out of memory due to stalled connections building up. +But if you need +high bandwidth over a smaller number of connections, especially if you have +gigabit Ethernet, increasing these defaults can make a huge difference. +You can adjust the buffer size for incoming and outgoing data separately. +For example, if your machine is primarily doing web serving you may want +to decrease the recvspace in order to be able to increase the +sendspace without eating too much kernel memory. +Note that the routing table (see +.Xr route 8 ) +can be used to introduce route-specific send and receive buffer size +defaults. +.Pp +As an additional management tool you can use pipes in your +firewall rules (see +.Xr ipfw 8 ) +to limit the bandwidth going to or from particular IP blocks or ports. +For example, if you have a T1 you might want to limit your web traffic +to 70% of the T1's bandwidth in order to leave the remainder available +for mail and interactive use. +Normally a heavily loaded web server +will not introduce significant latencies into other services even if +the network link is maxed out, but enforcing a limit can smooth things +out and lead to longer term stability. +Many people also enforce artificial +bandwidth limitations in order to ensure that they are not charged for +using too much bandwidth. +.Pp +Setting the send or receive TCP buffer to values larger than 65535 will result +in only a marginal performance improvement unless both hosts support the window +scaling extension of the TCP protocol, which is controlled by the +.Va net.inet.tcp.rfc1323 +sysctl.
+These extensions should be enabled and the TCP buffer size should be set +to a value larger than 65536 in order to obtain good performance from +certain types of network links; specifically, gigabit WAN links and +high-latency satellite links. +RFC1323 support is enabled by default. +.Pp +The +.Va net.inet.tcp.always_keepalive +sysctl determines whether or not the TCP implementation should attempt +to detect dead TCP connections by intermittently delivering +.Dq keepalives +on the connection. +By default, this is enabled for all applications; by setting this +sysctl to 0, only applications that specifically request keepalives +will use them. +In most environments, TCP keepalives will improve the management of +system state by expiring dead TCP connections, particularly for +systems serving dialup users who may not always terminate individual +TCP connections before disconnecting from the network. +However, in some environments, temporary network outages may be +incorrectly identified as dead sessions, resulting in unexpectedly +terminated TCP connections. +In such environments, setting the sysctl to 0 may reduce the occurrence of +TCP session disconnections. +.Pp +The +.Va net.inet.tcp.delayed_ack +TCP feature is largely misunderstood. +Historically speaking, this feature +was designed to allow the acknowledgement of transmitted data to be returned +along with the response. +For example, when you type over a remote shell, +the acknowledgement of the character you send can be returned along with the +data representing the echo of the character. +With delayed acks turned off, +the acknowledgement may be sent in its own packet, before the remote service +has a chance to echo the data it just received. +This same concept also +applies to any interactive protocol (e.g.\& SMTP, WWW, POP3), and can cut the +number of tiny packets flowing across the network in half. +The +.Fx +delayed ACK implementation also follows the TCP protocol rule that +at least every other packet be acknowledged even if the standard 100ms +timeout has not yet passed. +Normally the worst a delayed ACK can do is +slightly delay the teardown of a connection, or slightly delay the ramp-up +of a slow-start TCP connection. +While we are not sure, we believe that +the several FAQs related to packages such as SAMBA and SQUID which advise +turning off delayed acks may be referring to the slow-start issue. +In +.Fx , +it would be more beneficial to increase the slow-start flightsize via +the +.Va net.inet.tcp.slowstart_flightsize +sysctl rather than disable delayed acks. +.Pp +The +.Va net.inet.tcp.inflight.enable +sysctl turns on bandwidth delay product limiting for all TCP connections. +The system will attempt to calculate the bandwidth delay product for each +connection and limit the amount of data queued to the network to just the +amount required to maintain optimum throughput. +This feature is useful +if you are serving data over modems, GigE, or high-speed WAN links (or +any other link with a high bandwidth*delay product), especially if you are +also using window scaling or have configured a large send window. +If you enable this option, you should also be sure to set +.Va net.inet.tcp.inflight.debug +to 0 (disable debugging), and for production use setting +.Va net.inet.tcp.inflight.min +to at least 6144 may be beneficial. +Note, however, that setting high +minimums may effectively disable bandwidth limiting depending on the link.
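+As a sketch only (the values are the ones suggested above, not universal
+recommendations), such a configuration could be expressed in
+.Xr sysctl.conf 5
+as:
+.Bd -literal -offset indent
+# illustrative example: enable bandwidth delay product limiting
+net.inet.tcp.inflight.enable=1
+# disable debugging output for production use
+net.inet.tcp.inflight.debug=0
+# a higher minimum, as suggested above for production use
+net.inet.tcp.inflight.min=6144
+.Ed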
+The limiting feature reduces the amount of data built up in intermediate +router and switch packet queues as well as reduces the amount of data built +up in the local host's interface queue. +With fewer packets queued up, +interactive connections, especially over slow modems, will also be able +to operate with lower round trip times. +However, note that this feature +only affects data transmission (uploading / server-side). +It does not +affect data reception (downloading). +.Pp +Adjusting +.Va net.inet.tcp.inflight.stab +is not recommended. +This parameter defaults to 20, representing 2 maximal packets added +to the bandwidth delay product window calculation. +The additional +window is required to stabilize the algorithm and improve responsiveness +to changing conditions, but it can also result in higher ping times +over slow links (though still much lower than you would get without +the inflight algorithm). +In such cases you may +wish to try reducing this parameter to 15, 10, or 5, and you may also +have to reduce +.Va net.inet.tcp.inflight.min +(for example, to 3500) to get the desired effect. +Reducing these parameters +should be done as a last resort only. +.Pp +The +.Va net.inet.ip.portrange.* +sysctls control the port number ranges automatically bound to TCP and UDP +sockets. +There are three ranges: a low range, a default range, and a +high range, selectable via the +.Dv IP_PORTRANGE +.Xr setsockopt 2 +call. +Most +network programs use the default range which is controlled by +.Va net.inet.ip.portrange.first +and +.Va net.inet.ip.portrange.last , +which default to 49152 and 65535, respectively. +Bound port ranges are +used for outgoing connections, and it is possible to run the system out +of ports under certain circumstances. +This most commonly occurs when you are +running a heavily loaded web proxy. +The port range is not an issue +when running a server which handles mainly incoming connections, such as a +normal web server, or has a limited number of outgoing connections, such +as a mail relay. +For situations where you may run out of ports, +we recommend decreasing +.Va net.inet.ip.portrange.first +modestly. +A range of 10000 to 30000 ports may be reasonable. +You should also consider firewall effects when changing the port range. +Some firewalls +may block large ranges of ports (usually low-numbered ports) and expect systems +to use higher ranges of ports for outgoing connections. +By default +.Va net.inet.ip.portrange.last +is set at the maximum allowable port number. +.Pp +The +.Va kern.ipc.somaxconn +sysctl limits the size of the listen queue for accepting new TCP connections. +The default value of 128 is typically too low for robust handling of new +connections in a heavily loaded web server environment. +For such environments, +we recommend increasing this value to 1024 or higher. +The service daemon +may itself limit the listen queue size (e.g.\& +.Xr sendmail 8 , +apache) but will +often have a directive in its configuration file to adjust the queue size up. +Larger listen queues also do a better job of fending off denial of service +attacks. +.Pp +The +.Va kern.maxfiles +sysctl determines how many open files the system supports. +The default is +typically a few thousand but you may need to bump this up to ten or twenty +thousand if you are running databases or large descriptor-heavy daemons. +The read-only +.Va kern.openfiles +sysctl may be interrogated to determine the current number of open files +on the system. 
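+.Pp
+To illustrate how several of the preceding settings fit together, a heavily
+loaded web server or proxy might carry entries along the following lines in
+.Xr sysctl.conf 5 .
+The values shown are only a sketch taken from the recommendations above and
+must be adapted to the actual workload:
+.Bd -literal -offset indent
+# hypothetical example for a busy server
+kern.ipc.somaxconn=1024
+kern.maxfiles=20000
+# use a lower start for the default outgoing port range
+net.inet.ip.portrange.first=10000
+.Ed
+Settings placed in
+.Xr sysctl.conf 5
+are applied at boot; they may also be applied to a running system with
+.Xr sysctl 8 .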
+.Pp +The +.Va vm.swap_idle_enabled +sysctl is useful in large multi-user systems where you have lots of users +entering and leaving the system and lots of idle processes. +Such systems +tend to generate a great deal of continuous pressure on free memory reserves. +Turning this feature on and adjusting the swapout hysteresis (in idle +seconds) via +.Va vm.swap_idle_threshold1 +and +.Va vm.swap_idle_threshold2 +allows you to depress the priority of pages associated with idle processes +more quickly than the normal pageout algorithm. +This gives a helping hand +to the pageout daemon. +Do not turn this option on unless you need it, +because the tradeoff you are making is to essentially pre-page memory sooner +rather than later, eating more swap and disk bandwidth. +In a small system +this option will have a detrimental effect but in a large system that is +already doing moderate paging this option allows the VM system to stage +whole processes into and out of memory more easily. +.Sh LOADER TUNABLES +Some aspects of the system behavior may not be tunable at runtime because +memory allocations they perform must occur early in the boot process. +To change loader tunables, you must set their values in +.Xr loader.conf 5 +and reboot the system. +.Pp +.Va kern.maxusers +controls the scaling of a number of static system tables, including defaults +for the maximum number of open files, sizing of network memory resources, etc. +As of +.Fx 4.5 , +.Va kern.maxusers +is automatically sized at boot based on the amount of memory available in +the system, and may be determined at run-time by inspecting the value of the +read-only +.Va kern.maxusers +sysctl. +Some sites will require larger or smaller values of +.Va kern.maxusers +and may set it as a loader tunable; values of 64, 128, and 256 are not +uncommon. +We do not recommend going above 256 unless you need a huge number +of file descriptors; many of the tunable values set to their defaults by +.Va kern.maxusers +may be individually overridden at boot-time or run-time as described +elsewhere in this document. +Systems older than +.Fx 4.4 +must set this value via the kernel +.Xr config 8 +option +.Cd maxusers +instead. +.Pp +The +.Va kern.dfldsiz +and +.Va kern.dflssiz +tunables set the default soft limits for process data and stack size, +respectively. +Processes may increase these up to the hard limits by calling +.Xr setrlimit 2 . +The +.Va kern.maxdsiz , +.Va kern.maxssiz , +and +.Va kern.maxtsiz +tunables set the hard limits for process data, stack, and text size, +respectively; processes may not exceed these limits. +The +.Va kern.sgrowsiz +tunable controls how much the stack segment will grow when a process +needs to allocate more stack. +.Pp +.Va kern.ipc.nmbclusters +may be adjusted to increase the number of network mbufs the system is +willing to allocate. +Each cluster represents approximately 2K of memory, +so a value of 1024 represents 2M of kernel memory reserved for network +buffers. +You can do a simple calculation to figure out how many you need. +If you have a web server which maxes out at 1000 simultaneous connections, +and each connection eats a 16K receive and 16K send buffer, you need +approximately 32MB worth of network buffers to deal with it. +A good rule of +thumb is to multiply by 2, so 32MB x 2 = 64MB, and 64MB / 2K = 32768 clusters. +So for this case +you would want to set +.Va kern.ipc.nmbclusters +to 32768.
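+For that example, the corresponding
+.Xr loader.conf 5
+entry would be (the value is specific to the 1000-connection web server
+example above, not a general recommendation):
+.Bd -literal -offset indent
+kern.ipc.nmbclusters="32768"
+.Ed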
+We recommend values between +1024 and 4096 for machines with moderate amounts of memory, and between 4096 +and 32768 for machines with greater amounts of memory. +Under no circumstances +should you specify an arbitrarily high value for this parameter, as it could +lead to a boot-time crash. +The +.Fl m +option to +.Xr netstat 1 +may be used to observe network cluster use. +Older versions of +.Fx +do not have this tunable and require that the +kernel +.Xr config 8 +option +.Dv NMBCLUSTERS +be set instead. +.Pp +More and more programs are using the +.Xr sendfile 2 +system call to transmit files over the network. +The +.Va kern.ipc.nsfbufs +sysctl controls the number of file system buffers +.Xr sendfile 2 +is allowed to use to perform its work. +This parameter nominally scales +with +.Va kern.maxusers +so you should not need to modify it except under extreme +circumstances. +See the +.Sx TUNING +section in the +.Xr sendfile 2 +manual page for details. +.Sh KERNEL CONFIG TUNING +There are a number of kernel options that you may have to fiddle with in +a large-scale system. +In order to change these options you need to be +able to compile a new kernel from source. +The +.Xr config 8 +manual page and the handbook are good starting points for learning how to +do this. +Generally the first thing you do when creating your own custom +kernel is to strip out all the drivers and services you do not use. +Removing things like +.Dv INET6 +and drivers you do not have will reduce the size of your kernel, sometimes +by a megabyte or more, leaving more memory available for applications. +.Pp +.Dv SCSI_DELAY +may be used to reduce system boot times. +The defaults are fairly high and +can be responsible for 5+ seconds of delay in the boot process. +Reducing +.Dv SCSI_DELAY +to something below 5 seconds could work (especially with modern drives). +.Pp +There are a number of +.Dv *_CPU +options that can be commented out. +If you only want the kernel to run +on a Pentium-class CPU, you can easily remove +.Dv I486_CPU , +but only remove +.Dv I586_CPU +if you are sure your CPU is being recognized as a Pentium II or better. +Some clones may be recognized as a Pentium or even a 486 and not be able +to boot without those options. +If it works, great! +The operating system +will be able to better use higher-end CPU features for MMU, task switching, +timebase, and even device operations. +Additionally, higher-end CPUs support +4MB MMU pages, which the kernel uses to map the kernel itself into memory, +increasing its efficiency under heavy syscall loads. +.Sh IDE WRITE CACHING +.Fx 4.3 +flirted with turning off IDE write caching. +This reduced write bandwidth +to IDE disks but was considered necessary due to serious data consistency +issues introduced by hard drive vendors. +Basically the problem is that +IDE drives lie about when a write completes. +With IDE write caching turned +on, IDE hard drives will not only write data to disk out of order, they +will sometimes delay some of the blocks indefinitely under heavy disk +load. +A crash or power failure can result in serious file system +corruption. +So our default was changed to be safe. +Unfortunately, the +result was such a huge loss in performance that we caved in and changed the +default back to on after the release. +You should check the default on +your system by observing the +.Va hw.ata.wc +sysctl variable. +If IDE write caching is turned off, you can turn it back +on by setting the +.Va hw.ata.wc +loader tunable to 1.
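+For example, a minimal
+.Xr loader.conf 5
+entry would be:
+.Bd -literal -offset indent
+hw.ata.wc="1"
+.Ed
+The running value can be inspected with
+.Dq Li "sysctl hw.ata.wc" .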
+More information on tuning the ATA driver system may be found in the +.Xr ata 4 +manual page. +If you need performance, go with SCSI. +.Sh CPU, MEMORY, DISK, NETWORK +The type of tuning you do depends heavily on where your system begins to +bottleneck as load increases. +If your system runs out of CPU (idle times +are perpetually 0%) then you need to consider upgrading the CPU or moving to +an SMP motherboard (multiple CPUs), or perhaps you need to revisit the +programs that are causing the load and try to optimize them. +If your system +is paging to swap a lot, you need to consider adding more memory. +If your +system is saturating the disk, you typically see high CPU idle times and +total disk saturation. +.Xr systat 1 +can be used to monitor this. +There are many solutions to saturated disks: +increasing memory for caching, mirroring disks, distributing operations across +several machines, and so forth. +If disk performance is an issue and you +are using IDE drives, switching to SCSI can help a great deal. +While modern +IDE drives compare with SCSI in raw sequential bandwidth, the moment you +start seeking around the disk SCSI drives usually win. +.Pp +Finally, you might run out of network suds. +The first line of defense for +improving network performance is to make sure you are using switches instead +of hubs, especially these days when switches are almost as cheap as hubs. +Hubs +have severe problems under heavy loads due to collision back-off and one bad +host can severely degrade the entire LAN. +Second, optimize the network path +as much as possible. +For example, in +.Xr firewall 7 +we describe a firewall protecting internal hosts with a topology where +the externally visible hosts are not routed through it. +Use 100BaseT rather +than 10BaseT, or use 1000BaseT rather than 100BaseT, depending on your needs. +Most bottlenecks occur at the WAN link (e.g.\& +modem, T1, DSL, whatever). +If expanding the link is not an option it may be possible to use the +.Xr dummynet 4 +feature to implement peak shaving or other forms of traffic shaping to +prevent the overloaded service (such as web services) from affecting other +services (such as email), or vice versa. +In home installations this could +be used to give interactive traffic (your browser, +.Xr ssh 1 +logins) priority +over services you export from your box (web services, email). +.Sh SEE ALSO +.Xr netstat 1 , +.Xr systat 1 , +.Xr sendfile 2 , +.Xr ata 4 , +.Xr dummynet 4 , +.Xr login.conf 5 , +.Xr rc.conf 5 , +.Xr sysctl.conf 5 , +.Xr firewall 7 , +.Xr hier 7 , +.Xr ports 7 , +.Xr boot 8 , +.Xr bsdlabel 8 , +.Xr ccdconfig 8 , +.Xr config 8 , +.Xr fsck 8 , +.Xr gjournal 8 , +.Xr gstripe 8 , +.Xr gvinum 8 , +.Xr ifconfig 8 , +.Xr ipfw 8 , +.Xr loader 8 , +.Xr mount 8 , +.Xr newfs 8 , +.Xr route 8 , +.Xr sysctl 8 , +.Xr sysinstall 8 , +.Xr tunefs 8 +.Sh HISTORY +The +.Nm +manual page was originally written by +.An Matthew Dillon +and first appeared +in +.Fx 4.3 , +May 2001.