Diffstat (limited to 'share/man/man7/tuning.7')
-rw-r--r-- | share/man/man7/tuning.7 | 997
1 files changed, 997 insertions, 0 deletions
diff --git a/share/man/man7/tuning.7 b/share/man/man7/tuning.7 new file mode 100644 index 000000000000..9041181984c3 --- /dev/null +++ b/share/man/man7/tuning.7 @@ -0,0 +1,997 @@ +.\" Copyright (C) 2001 Matthew Dillon. All rights reserved. +.\" +.\" Redistribution and use in source and binary forms, with or without +.\" modification, are permitted provided that the following conditions +.\" are met: +.\" 1. Redistributions of source code must retain the above copyright +.\" notice, this list of conditions and the following disclaimer. +.\" 2. Redistributions in binary form must reproduce the above copyright +.\" notice, this list of conditions and the following disclaimer in the +.\" documentation and/or other materials provided with the distribution. +.\" +.\" THIS SOFTWARE IS PROVIDED BY AUTHOR AND CONTRIBUTORS ``AS IS'' AND +.\" ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE +.\" IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE +.\" ARE DISCLAIMED. IN NO EVENT SHALL AUTHOR OR CONTRIBUTORS BE LIABLE +.\" FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL +.\" DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS +.\" OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) +.\" HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT +.\" LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY +.\" OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF +.\" SUCH DAMAGE. +.\" +.\" $FreeBSD$ +.\" +.Dd January 23, 2009 +.Dt TUNING 7 +.Os +.Sh NAME +.Nm tuning +.Nd performance tuning under FreeBSD +.Sh SYSTEM SETUP - DISKLABEL, NEWFS, TUNEFS, SWAP +When using +.Xr bsdlabel 8 +or +.Xr sysinstall 8 +to lay out your file systems on a hard disk it is important to remember +that hard drives can transfer data much more quickly from outer tracks +than they can from inner tracks. +To take advantage of this you should +try to pack your smaller file systems and swap closer to the outer tracks, +follow with the larger file systems, and end with the largest file systems. +It is also important to size system standard file systems such that you +will not be forced to resize them later as you scale the machine up. +I usually create, in order, a 128M root, 1G swap, 128M +.Pa /var , +128M +.Pa /var/tmp , +3G +.Pa /usr , +and use any remaining space for +.Pa /home . +.Pp +You should typically size your swap space to approximately 2x main memory +for systems with less than 2GB of RAM, or approximately 1x main memory +if you have more. +If you do not have a lot of RAM, though, you will generally want a lot +more swap. +It is not recommended that you configure any less than +256M of swap on a system and you should keep in mind future memory +expansion when sizing the swap partition. +The kernel's VM paging algorithms are tuned to perform best when there is +at least 2x swap versus main memory. +Configuring too little swap can lead +to inefficiencies in the VM page scanning code as well as create issues +later on if you add more memory to your machine. +Finally, on larger systems +with multiple SCSI disks (or multiple IDE disks operating on different +controllers), we strongly recommend that you configure swap on each drive. +The swap partitions on the drives should be approximately the same size. +The kernel can handle arbitrary sizes but +internal data structures scale to 4 times the largest swap partition. 
+Keeping +the swap partitions near the same size will allow the kernel to optimally +stripe swap space across the N disks. +Do not worry about overdoing it a +little, swap space is the saving grace of +.Ux +and even if you do not normally use much swap, it can give you more time to +recover from a runaway program before being forced to reboot. +.Pp +How you size your +.Pa /var +partition depends heavily on what you intend to use the machine for. +This +partition is primarily used to hold mailboxes, the print spool, and log +files. +Some people even make +.Pa /var/log +its own partition (but except for extreme cases it is not worth the waste +of a partition ID). +If your machine is intended to act as a mail +or print server, +or you are running a heavily visited web server, you should consider +creating a much larger partition \(en perhaps a gig or more. +It is very easy +to underestimate log file storage requirements. +.Pp +Sizing +.Pa /var/tmp +depends on the kind of temporary file usage you think you will need. +128M is +the minimum we recommend. +Also note that sysinstall will create a +.Pa /tmp +directory. +Dedicating a partition for temporary file storage is important for +two reasons: first, it reduces the possibility of file system corruption +in a crash, and second it reduces the chance of a runaway process that +fills up +.Oo Pa /var Oc Ns Pa /tmp +from blowing up more critical subsystems (mail, +logging, etc). +Filling up +.Oo Pa /var Oc Ns Pa /tmp +is a very common problem to have. +.Pp +In the old days there were differences between +.Pa /tmp +and +.Pa /var/tmp , +but the introduction of +.Pa /var +(and +.Pa /var/tmp ) +led to massive confusion +by program writers so today programs haphazardly use one or the +other and thus no real distinction can be made between the two. +So it makes sense to have just one temporary directory and +softlink to it from the other +.Pa tmp +directory locations. +However you handle +.Pa /tmp , +the one thing you do not want to do is leave it sitting +on the root partition where it might cause root to fill up or possibly +corrupt root in a crash/reboot situation. +.Pp +The +.Pa /usr +partition holds the bulk of the files required to support the system and +a subdirectory within it called +.Pa /usr/local +holds the bulk of the files installed from the +.Xr ports 7 +hierarchy. +If you do not use ports all that much and do not intend to keep +system source +.Pq Pa /usr/src +on the machine, you can get away with +a 1 gigabyte +.Pa /usr +partition. +However, if you install a lot of ports +(especially window managers and Linux-emulated binaries), we recommend +at least a 2 gigabyte +.Pa /usr +and if you also intend to keep system source +on the machine, we recommend a 3 gigabyte +.Pa /usr . +Do not underestimate the +amount of space you will need in this partition, it can creep up and +surprise you! +.Pp +The +.Pa /home +partition is typically used to hold user-specific data. +I usually size it to the remainder of the disk. +.Pp +Why partition at all? +Why not create one big +.Pa / +partition and be done with it? +Then I do not have to worry about undersizing things! +Well, there are several reasons this is not a good idea. +First, +each partition has different operational characteristics and separating them +allows the file system to tune itself to those characteristics. +For example, +the root and +.Pa /usr +partitions are read-mostly, with very little writing, while +a lot of reading and writing could occur in +.Pa /var +and +.Pa /var/tmp . 
+By properly +partitioning your system, fragmentation introduced in the smaller, more +heavily write-loaded partitions will not bleed over into the mostly-read +partitions. +Additionally, keeping the write-loaded partitions closer to +the edge of the disk (i.e., before the really big partitions instead of after +in the partition table) will increase I/O performance in the partitions +where you need it the most. +Now it is true that you might also need I/O +performance in the larger partitions, but they are so large that shifting +them more towards the edge of the disk will not lead to a significant +performance improvement whereas moving +.Pa /var +to the edge can have a huge impact. +Finally, there are safety concerns. +Having a small, neat root partition that +is essentially read-only gives it a greater chance of surviving a bad crash +intact. +.Pp +Properly partitioning your system also allows you to tune +.Xr newfs 8 +and +.Xr tunefs 8 +parameters. +Tuning +.Xr newfs 8 +requires more experience but can lead to significant improvements in +performance. +There are three parameters that are relatively safe to tune: +.Em blocksize , bytes/i-node , +and +.Em cylinders/group . +.Pp +.Fx +performs best when using 8K or 16K file system block sizes. +The default file system block size is 16K, +which provides best performance for most applications, +with the exception of those that perform random access on large files +(such as database server software). +Such applications tend to perform better with a smaller block size, +although modern disk characteristics are such that the performance +gain from using a smaller block size may not be worth consideration. +Using a block size larger than 16K +can cause fragmentation of the buffer cache and +lead to lower performance. +.Pp +The defaults may be unsuitable +for a file system that requires a very large number of i-nodes +or is intended to hold a large number of very small files. +Such a file system should be created with an 8K or 4K block size. +This also requires you to specify a smaller +fragment size. +We recommend always using a fragment size that is 1/8 +the block size (less testing has been done on other fragment size factors). +The +.Xr newfs 8 +options for this would be +.Dq Li "newfs -f 1024 -b 8192 ..." . +.Pp +If a large partition is intended to be used to hold fewer, larger files, such +as database files, you can increase the +.Em bytes/i-node +ratio, which reduces the number of i-nodes (maximum number of files and +directories that can be created) for that partition. +Decreasing the number +of i-nodes in a file system can greatly reduce +.Xr fsck 8 +recovery times after a crash. +Do not use this option +unless you are actually storing large files on the partition, because if you +overcompensate you can wind up with a file system that has lots of free +space remaining but cannot accommodate any more files. +Using 32768, 65536, or 262144 bytes/i-node is recommended. +You can go higher but +it will have only incremental effects on +.Xr fsck 8 +recovery times. +For example, +.Dq Li "newfs -i 32768 ..." . +.Pp +.Xr tunefs 8 +may be used to further tune a file system. +This command can be run in +single-user mode without having to reformat the file system. +However, this is possibly the most abused program in the system. +Many people attempt to +increase available file system space by setting the min-free percentage to 0. +This can lead to severe file system fragmentation and we do not recommend +that you do this.
+Really the only +.Xr tunefs 8 +option worthwhile here is turning on +.Em softupdates +with +.Dq Li "tunefs -n enable /filesystem" . +(Note: in +.Fx 4.5 +and later, softupdates can be turned on using the +.Fl U +option to +.Xr newfs 8 , +and +.Xr sysinstall 8 +will typically enable softupdates automatically for non-root file systems). +Softupdates drastically improves meta-data performance, mainly file +creation and deletion. +We recommend enabling softupdates on most file systems; however, there +are two limitations to softupdates that you should be aware of when +determining whether to use it on a file system. +First, softupdates guarantees file system consistency in the +case of a crash but could very easily be several seconds (even a minute!\&) +behind on pending writes to the physical disk. +If you crash you may lose more work +than otherwise. +Second, softupdates delays the freeing of file system +blocks. +If you have a file system (such as the root file system) which is +close to full, doing a major update of it, e.g.\& +.Dq Li "make installworld" , +can run it out of space and cause the update to fail. +For this reason, softupdates will not be enabled on the root file system +during a typical install. +There is no loss of performance since the root +file system is rarely written to. +.Pp +A number of run-time +.Xr mount 8 +options exist that can help you tune the system. +The most obvious and most dangerous one is +.Cm async . +Only use this option in conjunction with +.Xr gjournal 8 , +as it is far too dangerous on a normal file system. +A less dangerous and more +useful +.Xr mount 8 +option is called +.Cm noatime . +.Ux +file systems normally update the last-accessed time of a file or +directory whenever it is accessed. +This operation is handled in +.Fx +with a delayed write and normally does not create a burden on the system. +However, if your system is accessing a huge number of files on a continuing +basis, the buffer cache can wind up getting polluted with atime updates, +creating a burden on the system. +For example, if you are running a heavily +loaded web site, or a news server with lots of readers, you might want to +consider turning off atime updates on your larger partitions with this +.Xr mount 8 +option. +However, you should not gratuitously turn off atime +updates everywhere. +For example, the +.Pa /var +file system customarily +holds mailboxes, and atime (in combination with mtime) is used to +determine whether a mailbox has new mail. +You might as well leave +atime turned on for mostly read-only partitions such as +.Pa / +and +.Pa /usr +as well. +This is especially useful for +.Pa / +since some system utilities +use the atime field for reporting. +.Sh STRIPING DISKS +In larger systems you can stripe partitions from several drives together +to create a much larger overall partition. +Striping can also improve +the performance of a file system by splitting I/O operations across two +or more disks. +The +.Xr gstripe 8 , +.Xr gvinum 8 , +and +.Xr ccdconfig 8 +utilities may be used to create simple striped file systems. +Generally +speaking, striping smaller partitions such as the root and +.Pa /var/tmp , +or essentially read-only partitions such as +.Pa /usr +is a complete waste of time. +You should only stripe partitions that require serious I/O performance, +typically +.Pa /var , /home , +or custom partitions used to hold databases and web pages. +Choosing the proper stripe size is also +important.
+File systems tend to store meta-data on power-of-2 boundaries +and you usually want to reduce seeking rather than increase seeking. +This +means you want to use a large off-center stripe size such as 1152 sectors +so sequential I/O does not seek both disks and so meta-data is distributed +across both disks rather than concentrated on a single disk. +If +you really need to get sophisticated, we recommend using a real hardware +RAID controller from the list of +.Fx +supported controllers. +.Sh SYSCTL TUNING +.Xr sysctl 8 +variables permit system behavior to be monitored and controlled at +run-time. +Some sysctls simply report on the behavior of the system; others allow +the system behavior to be modified; +some may be set at boot time using +.Xr rc.conf 5 , +but most will be set via +.Xr sysctl.conf 5 . +There are several hundred sysctls in the system, including many that appear +to be candidates for tuning but actually are not. +In this document we will only cover the ones that have the greatest effect +on the system. +.Pp +The +.Va kern.ipc.maxpipekva +loader tunable is used to set a hard limit on the +amount of kernel address space allocated to mapping of pipe buffers. +Use of the mapping allows the kernel to eliminate a copy of the +data from writer address space into the kernel, directly copying +the content of the mapped buffer to the reader. +Increasing this value to a higher setting, such as `25165824', might +improve performance on systems where space for mapping pipe buffers +is quickly exhausted. +This exhaustion is not fatal; however, it will only cause pipes +to fall back to using double-copy. +.Pp +The +.Va kern.ipc.shm_use_phys +sysctl defaults to 0 (off) and may be set to 0 (off) or 1 (on). +Setting +this parameter to 1 will cause all System V shared memory segments to be +mapped to unpageable physical RAM. +This feature only has an effect if you +are either (A) mapping small amounts of shared memory across many (hundreds) +of processes, or (B) mapping large amounts of shared memory across any +number of processes. +This feature allows the kernel to remove a great deal +of internal memory management page-tracking overhead at the cost of wiring +the shared memory into core, making it unswappable. +.Pp +The +.Va vfs.vmiodirenable +sysctl defaults to 1 (on). +This parameter controls how directories are cached +by the system. +Most directories are small and use but a single fragment +(typically 1K) in the file system and even less (typically 512 bytes) in +the buffer cache. +However, when operating in the default mode the buffer +cache will only cache a fixed number of directories even if you have a huge +amount of memory. +Turning on this sysctl allows the buffer cache to use +the VM Page Cache to cache the directories. +The advantage is that all of +memory is now available for caching directories. +The disadvantage is that +the minimum in-core memory used to cache a directory is the physical page +size (typically 4K) rather than 512 bytes. +We recommend turning this option off in memory-constrained environments; +however, when on, it will substantially improve the performance of services +that manipulate a large number of files. +Such services can include web caches, large mail systems, and news systems. +Turning on this option will generally not reduce performance even with the +wasted memory but you should experiment to find out. +.Pp +The +.Va vfs.write_behind +sysctl defaults to 1 (on).
+This tells the file system to issue media +writes as full clusters are collected, which typically occurs when writing +large sequential files. +The idea is to avoid saturating the buffer +cache with dirty buffers when it would not benefit I/O performance. +However, +this may stall processes, and under certain circumstances you may wish to turn +it off. +.Pp +The +.Va vfs.hirunningspace +sysctl determines how much outstanding write I/O may be queued to +disk controllers system-wide at any given instant. +The default is +usually sufficient but on machines with lots of disks you may want to bump +it up to four or five megabytes. +Note that setting too high a value +(exceeding the buffer cache's write threshold) can lead to extremely +bad clustering performance. +Do not set this value arbitrarily high! +Also, +higher write queueing values may add latency to reads occurring at the same +time. +.Pp +There are various other buffer-cache and VM page cache related sysctls. +We do not recommend modifying these values. +As of +.Fx 4.3 , +the VM system does an extremely good job tuning itself. +.Pp +The +.Va net.inet.tcp.sendspace +and +.Va net.inet.tcp.recvspace +sysctls are of particular interest if you are running network-intensive +applications. +They control the amount of send and receive buffer space +allowed for any given TCP connection. +The default sending buffer is 32K; the default receiving buffer +is 64K. +You can often +improve bandwidth utilization by increasing the default at the cost of +eating up more kernel memory for each connection. +We do not recommend +increasing the defaults if you are serving hundreds or thousands of +simultaneous connections because it is possible to quickly run the system +out of memory due to stalled connections building up. +But if you need +high bandwidth over a smaller number of connections, especially if you have +gigabit Ethernet, increasing these defaults can make a huge difference. +You can adjust the buffer size for incoming and outgoing data separately. +For example, if your machine is primarily doing web serving you may want +to decrease the recvspace in order to be able to increase the +sendspace without eating too much kernel memory. +Note that the routing table (see +.Xr route 8 ) +can be used to introduce route-specific send and receive buffer size +defaults. +.Pp +As an additional management tool you can use pipes in your +firewall rules (see +.Xr ipfw 8 ) +to limit the bandwidth going to or from particular IP blocks or ports. +For example, if you have a T1 you might want to limit your web traffic +to 70% of the T1's bandwidth in order to leave the remainder available +for mail and interactive use. +Normally a heavily loaded web server +will not introduce significant latencies into other services even if +the network link is maxed out, but enforcing a limit can smooth things +out and lead to longer term stability. +Many people also enforce artificial +bandwidth limitations in order to ensure that they are not charged for +using too much bandwidth. +.Pp +Setting the send or receive TCP buffer to values larger than 65535 will result +in only a marginal performance improvement unless both hosts support the window +scaling extension of the TCP protocol, which is controlled by the +.Va net.inet.tcp.rfc1323 +sysctl.
+These extensions should be enabled and the TCP buffer size should be set +to a value larger than 65536 in order to obtain good performance from +certain types of network links; specifically, gigabit WAN links and +high-latency satellite links. +RFC1323 support is enabled by default. +.Pp +The +.Va net.inet.tcp.always_keepalive +sysctl determines whether or not the TCP implementation should attempt +to detect dead TCP connections by intermittently delivering +.Dq keepalives +on the connection. +By default, this is enabled for all applications; by setting this +sysctl to 0, only applications that specifically request keepalives +will use them. +In most environments, TCP keepalives will improve the management of +system state by expiring dead TCP connections, particularly for +systems serving dialup users who may not always terminate individual +TCP connections before disconnecting from the network. +However, in some environments, temporary network outages may be +incorrectly identified as dead sessions, resulting in unexpectedly +terminated TCP connections. +In such environments, setting the sysctl to 0 may reduce the occurrence of +TCP session disconnections. +.Pp +The +.Va net.inet.tcp.delayed_ack +TCP feature is largely misunderstood. +Historically speaking, this feature +was designed to allow the acknowledgement of transmitted data to be returned +along with the response. +For example, when you type over a remote shell, +the acknowledgement of the character you send can be returned along with the +data representing the echo of the character. +With delayed acks turned off, +the acknowledgement may be sent in its own packet, before the remote service +has a chance to echo the data it just received. +This same concept also +applies to any interactive protocol (e.g.\& SMTP, WWW, POP3), and can cut the +number of tiny packets flowing across the network in half. +The +.Fx +delayed ACK implementation also follows the TCP protocol rule that +at least every other packet be acknowledged even if the standard 100ms +timeout has not yet passed. +Normally the worst a delayed ACK can do is +slightly delay the teardown of a connection, or slightly delay the ramp-up +of a slow-start TCP connection. +While we are not sure, we believe that +the several FAQs related to packages such as SAMBA and SQUID which advise +turning off delayed acks may be referring to the slow-start issue. +In +.Fx , +it would be more beneficial to increase the slow-start flightsize via +the +.Va net.inet.tcp.slowstart_flightsize +sysctl rather than disable delayed acks. +.Pp +The +.Va net.inet.tcp.inflight.enable +sysctl turns on bandwidth delay product limiting for all TCP connections. +The system will attempt to calculate the bandwidth delay product for each +connection and limit the amount of data queued to the network to just the +amount required to maintain optimum throughput. +This feature is useful +if you are serving data over modems, GigE, or high-speed WAN links (or +any other link with a high bandwidth*delay product), especially if you are +also using window scaling or have configured a large send window. +If you enable this option, you should also be sure to set +.Va net.inet.tcp.inflight.debug +to 0 (disable debugging), and for production use setting +.Va net.inet.tcp.inflight.min +to at least 6144 may be beneficial. +Note, however, that setting high +minimums may effectively disable bandwidth limiting depending on the link.
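+As a sketch only (the values are the ones suggested above, not universal
+recommendations), such a configuration could be expressed in
+.Xr sysctl.conf 5
+as:
+.Bd -literal -offset indent
+# illustrative example: enable bandwidth delay product limiting
+net.inet.tcp.inflight.enable=1
+# disable debugging output for production use
+net.inet.tcp.inflight.debug=0
+# a higher minimum, as suggested above for production use
+net.inet.tcp.inflight.min=6144
+.Ed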
+The limiting feature reduces the amount of data built up in intermediate +router and switch packet queues as well as reduces the amount of data built +up in the local host's interface queue. +With fewer packets queued up, +interactive connections, especially over slow modems, will also be able +to operate with lower round trip times. +However, note that this feature +only affects data transmission (uploading / server-side). +It does not +affect data reception (downloading). +.Pp +Adjusting +.Va net.inet.tcp.inflight.stab +is not recommended. +This parameter defaults to 20, representing 2 maximal packets added +to the bandwidth delay product window calculation. +The additional +window is required to stabilize the algorithm and improve responsiveness +to changing conditions, but it can also result in higher ping times +over slow links (though still much lower than you would get without +the inflight algorithm). +In such cases you may +wish to try reducing this parameter to 15, 10, or 5, and you may also +have to reduce +.Va net.inet.tcp.inflight.min +(for example, to 3500) to get the desired effect. +Reducing these parameters +should be done as a last resort only. +.Pp +The +.Va net.inet.ip.portrange.* +sysctls control the port number ranges automatically bound to TCP and UDP +sockets. +There are three ranges: a low range, a default range, and a +high range, selectable via the +.Dv IP_PORTRANGE +.Xr setsockopt 2 +call. +Most +network programs use the default range which is controlled by +.Va net.inet.ip.portrange.first +and +.Va net.inet.ip.portrange.last , +which default to 49152 and 65535, respectively. +Bound port ranges are +used for outgoing connections, and it is possible to run the system out +of ports under certain circumstances. +This most commonly occurs when you are +running a heavily loaded web proxy. +The port range is not an issue +when running a server which handles mainly incoming connections, such as a +normal web server, or has a limited number of outgoing connections, such +as a mail relay. +For situations where you may run out of ports, +we recommend decreasing +.Va net.inet.ip.portrange.first +modestly. +A range of 10000 to 30000 ports may be reasonable. +You should also consider firewall effects when changing the port range. +Some firewalls +may block large ranges of ports (usually low-numbered ports) and expect systems +to use higher ranges of ports for outgoing connections. +By default +.Va net.inet.ip.portrange.last +is set at the maximum allowable port number. +.Pp +The +.Va kern.ipc.somaxconn +sysctl limits the size of the listen queue for accepting new TCP connections. +The default value of 128 is typically too low for robust handling of new +connections in a heavily loaded web server environment. +For such environments, +we recommend increasing this value to 1024 or higher. +The service daemon +may itself limit the listen queue size (e.g.\& +.Xr sendmail 8 , +apache) but will +often have a directive in its configuration file to adjust the queue size up. +Larger listen queues also do a better job of fending off denial of service +attacks. +.Pp +The +.Va kern.maxfiles +sysctl determines how many open files the system supports. +The default is +typically a few thousand but you may need to bump this up to ten or twenty +thousand if you are running databases or large descriptor-heavy daemons. +The read-only +.Va kern.openfiles +sysctl may be interrogated to determine the current number of open files +on the system. 
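+.Pp
+To illustrate how several of the preceding settings fit together, a heavily
+loaded web server or proxy might carry entries along the following lines in
+.Xr sysctl.conf 5 .
+The values shown are only a sketch taken from the recommendations above and
+must be adapted to the actual workload:
+.Bd -literal -offset indent
+# hypothetical example for a busy server
+kern.ipc.somaxconn=1024
+kern.maxfiles=20000
+# use a lower start for the default outgoing port range
+net.inet.ip.portrange.first=10000
+.Ed
+Settings placed in
+.Xr sysctl.conf 5
+are applied at boot; they may also be applied to a running system with
+.Xr sysctl 8 .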
+.Pp +The +.Va vm.swap_idle_enabled +sysctl is useful in large multi-user systems where you have lots of users +entering and leaving the system and lots of idle processes. +Such systems +tend to generate a great deal of continuous pressure on free memory reserves. +Turning this feature on and adjusting the swapout hysteresis (in idle +seconds) via +.Va vm.swap_idle_threshold1 +and +.Va vm.swap_idle_threshold2 +allows you to depress the priority of pages associated with idle processes +more quickly than the normal pageout algorithm. +This gives a helping hand +to the pageout daemon. +Do not turn this option on unless you need it, +because the tradeoff you are making is to essentially pre-page memory sooner +rather than later, eating more swap and disk bandwidth. +In a small system +this option will have a detrimental effect but in a large system that is +already doing moderate paging this option allows the VM system to stage +whole processes into and out of memory more easily. +.Sh LOADER TUNABLES +Some aspects of the system behavior may not be tunable at runtime because +memory allocations they perform must occur early in the boot process. +To change loader tunables, you must set their values in +.Xr loader.conf 5 +and reboot the system. +.Pp +.Va kern.maxusers +controls the scaling of a number of static system tables, including defaults +for the maximum number of open files, sizing of network memory resources, etc. +As of +.Fx 4.5 , +.Va kern.maxusers +is automatically sized at boot based on the amount of memory available in +the system, and may be determined at run-time by inspecting the value of the +read-only +.Va kern.maxusers +sysctl. +Some sites will require larger or smaller values of +.Va kern.maxusers +and may set it as a loader tunable; values of 64, 128, and 256 are not +uncommon. +We do not recommend going above 256 unless you need a huge number +of file descriptors; many of the tunable values set to their defaults by +.Va kern.maxusers +may be individually overridden at boot-time or run-time as described +elsewhere in this document. +Systems older than +.Fx 4.4 +must set this value via the kernel +.Xr config 8 +option +.Cd maxusers +instead. +.Pp +The +.Va kern.dfldsiz +and +.Va kern.dflssiz +tunables set the default soft limits for process data and stack size, +respectively. +Processes may increase these up to the hard limits by calling +.Xr setrlimit 2 . +The +.Va kern.maxdsiz , +.Va kern.maxssiz , +and +.Va kern.maxtsiz +tunables set the hard limits for process data, stack, and text size, +respectively; processes may not exceed these limits. +The +.Va kern.sgrowsiz +tunable controls how much the stack segment will grow when a process +needs to allocate more stack. +.Pp +.Va kern.ipc.nmbclusters +may be adjusted to increase the number of network mbufs the system is +willing to allocate. +Each cluster represents approximately 2K of memory, +so a value of 1024 represents 2M of kernel memory reserved for network +buffers. +You can do a simple calculation to figure out how many you need. +If you have a web server which maxes out at 1000 simultaneous connections, +and each connection eats a 16K receive and 16K send buffer, you need +approximately 32MB worth of network buffers to deal with it. +A good rule of +thumb is to multiply by 2, so 32MB x 2 = 64MB, and 64MB / 2K = 32768 clusters. +So for this case +you would want to set +.Va kern.ipc.nmbclusters +to 32768.
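+For that example, the corresponding
+.Xr loader.conf 5
+entry would be (the value is specific to the 1000-connection web server
+example above, not a general recommendation):
+.Bd -literal -offset indent
+kern.ipc.nmbclusters="32768"
+.Ed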
+We recommend values between +1024 and 4096 for machines with moderate amounts of memory, and between 4096 +and 32768 for machines with greater amounts of memory. +Under no circumstances +should you specify an arbitrarily high value for this parameter, as it could +lead to a boot-time crash. +The +.Fl m +option to +.Xr netstat 1 +may be used to observe network cluster use. +Older versions of +.Fx +do not have this tunable and require that the +kernel +.Xr config 8 +option +.Dv NMBCLUSTERS +be set instead. +.Pp +More and more programs are using the +.Xr sendfile 2 +system call to transmit files over the network. +The +.Va kern.ipc.nsfbufs +sysctl controls the number of file system buffers +.Xr sendfile 2 +is allowed to use to perform its work. +This parameter nominally scales +with +.Va kern.maxusers +so you should not need to modify it except under extreme +circumstances. +See the +.Sx TUNING +section in the +.Xr sendfile 2 +manual page for details. +.Sh KERNEL CONFIG TUNING +There are a number of kernel options that you may have to fiddle with in +a large-scale system. +In order to change these options you need to be +able to compile a new kernel from source. +The +.Xr config 8 +manual page and the handbook are good starting points for learning how to +do this. +Generally the first thing you do when creating your own custom +kernel is to strip out all the drivers and services you do not use. +Removing things like +.Dv INET6 +and drivers you do not have will reduce the size of your kernel, sometimes +by a megabyte or more, leaving more memory available for applications. +.Pp +.Dv SCSI_DELAY +may be used to reduce system boot times. +The defaults are fairly high and +can be responsible for 5+ seconds of delay in the boot process. +Reducing +.Dv SCSI_DELAY +to something below 5 seconds could work (especially with modern drives). +.Pp +There are a number of +.Dv *_CPU +options that can be commented out. +If you only want the kernel to run +on a Pentium-class CPU, you can easily remove +.Dv I486_CPU , +but only remove +.Dv I586_CPU +if you are sure your CPU is being recognized as a Pentium II or better. +Some clones may be recognized as a Pentium or even a 486 and not be able +to boot without those options. +If it works, great! +The operating system +will be able to better use higher-end CPU features for MMU, task switching, +timebase, and even device operations. +Additionally, higher-end CPUs support +4MB MMU pages, which the kernel uses to map the kernel itself into memory, +increasing its efficiency under heavy syscall loads. +.Sh IDE WRITE CACHING +.Fx 4.3 +flirted with turning off IDE write caching. +This reduced write bandwidth +to IDE disks but was considered necessary due to serious data consistency +issues introduced by hard drive vendors. +Basically the problem is that +IDE drives lie about when a write completes. +With IDE write caching turned +on, IDE hard drives will not only write data to disk out of order, they +will sometimes delay some of the blocks indefinitely under heavy disk +load. +A crash or power failure can result in serious file system +corruption. +So our default was changed to be safe. +Unfortunately, the +result was such a huge loss in performance that we caved in and changed the +default back to on after the release. +You should check the default on +your system by observing the +.Va hw.ata.wc +sysctl variable. +If IDE write caching is turned off, you can turn it back +on by setting the +.Va hw.ata.wc +loader tunable to 1.
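+For example, a minimal
+.Xr loader.conf 5
+entry would be:
+.Bd -literal -offset indent
+hw.ata.wc="1"
+.Ed
+The running value can be inspected with
+.Dq Li "sysctl hw.ata.wc" .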
+More information on tuning the ATA driver system may be found in the +.Xr ata 4 +manual page. +If you need performance, go with SCSI. +.Sh CPU, MEMORY, DISK, NETWORK +The type of tuning you do depends heavily on where your system begins to +bottleneck as load increases. +If your system runs out of CPU (idle times +are perpetually 0%) then you need to consider upgrading the CPU or moving to +an SMP motherboard (multiple CPUs), or perhaps you need to revisit the +programs that are causing the load and try to optimize them. +If your system +is paging to swap a lot, you need to consider adding more memory. +If your +system is saturating the disk, you typically see high CPU idle times and +total disk saturation. +.Xr systat 1 +can be used to monitor this. +There are many solutions to saturated disks: +increasing memory for caching, mirroring disks, distributing operations across +several machines, and so forth. +If disk performance is an issue and you +are using IDE drives, switching to SCSI can help a great deal. +While modern +IDE drives compare with SCSI in raw sequential bandwidth, the moment you +start seeking around the disk SCSI drives usually win. +.Pp +Finally, you might run out of network suds. +The first line of defense for +improving network performance is to make sure you are using switches instead +of hubs, especially these days when switches are almost as cheap as hubs. +Hubs +have severe problems under heavy loads due to collision back-off and one bad +host can severely degrade the entire LAN. +Second, optimize the network path +as much as possible. +For example, in +.Xr firewall 7 +we describe a firewall protecting internal hosts with a topology where +the externally visible hosts are not routed through it. +Use 100BaseT rather +than 10BaseT, or use 1000BaseT rather than 100BaseT, depending on your needs. +Most bottlenecks occur at the WAN link (e.g.\& +modem, T1, DSL, whatever). +If expanding the link is not an option it may be possible to use the +.Xr dummynet 4 +feature to implement peak shaving or other forms of traffic shaping to +prevent the overloaded service (such as web services) from affecting other +services (such as email), or vice versa. +In home installations this could +be used to give interactive traffic (your browser, +.Xr ssh 1 +logins) priority +over services you export from your box (web services, email). +.Sh SEE ALSO +.Xr netstat 1 , +.Xr systat 1 , +.Xr sendfile 2 , +.Xr ata 4 , +.Xr dummynet 4 , +.Xr login.conf 5 , +.Xr rc.conf 5 , +.Xr sysctl.conf 5 , +.Xr firewall 7 , +.Xr hier 7 , +.Xr ports 7 , +.Xr boot 8 , +.Xr bsdlabel 8 , +.Xr ccdconfig 8 , +.Xr config 8 , +.Xr fsck 8 , +.Xr gjournal 8 , +.Xr gstripe 8 , +.Xr gvinum 8 , +.Xr ifconfig 8 , +.Xr ipfw 8 , +.Xr loader 8 , +.Xr mount 8 , +.Xr newfs 8 , +.Xr route 8 , +.Xr sysctl 8 , +.Xr sysinstall 8 , +.Xr tunefs 8 +.Sh HISTORY +The +.Nm +manual page was originally written by +.An Matthew Dillon +and first appeared +in +.Fx 4.3 , +May 2001.