Diffstat (limited to 'share/doc/papers/malloc/performance.ms')
-rw-r--r-- | share/doc/papers/malloc/performance.ms | 113 |
1 file changed, 113 insertions, 0 deletions
diff --git a/share/doc/papers/malloc/performance.ms b/share/doc/papers/malloc/performance.ms
new file mode 100644
index 000000000000..773f92ab7832
--- /dev/null
+++ b/share/doc/papers/malloc/performance.ms
@@ -0,0 +1,113 @@
+.\"
+.\" ----------------------------------------------------------------------------
+.\" "THE BEER-WARE LICENSE" (Revision 42):
+.\" <phk@FreeBSD.org> wrote this file.  As long as you retain this notice you
+.\" can do whatever you want with this stuff. If we meet some day, and you think
+.\" this stuff is worth it, you can buy me a beer in return.   Poul-Henning Kamp
+.\" ----------------------------------------------------------------------------
+.\"
+.\" $FreeBSD$
+.\"
+.ds RH Performance
+.NH
+Performance
+.PP
+Performance for a malloc(3) implementation comes down to two variables:
+.IP
+A: How much time does it use for searching and manipulating data structures?
+We will refer to this as ``overhead time''.
+.IP
+B: How well does it manage the storage?
+This rather vague metric we call ``quality of allocation''.
+.PP
+The overhead time is easy to measure: just do a lot of malloc/free calls
+of various kinds and combinations, and compare the results.
+.PP
+The quality of allocation is not quite as simple as that.
+One measure of quality is the size of the process, which should obviously
+be minimized.
+Another measure is the execution time of the process.
+This is not an obvious indicator of quality, but people will generally
+agree that it should be minimized as well, and if malloc(3) can do
+anything to reduce it, it should.
+An explanation of why it is still a good metric follows:
+.PP
+In a traditional segment/swap kernel, the desirable behavior of a process
+is to keep the brk(2) as low as possible, thus minimizing the size of the
+data/bss/heap segment, which in turn translates to a smaller process and
+a smaller probability of the process being swapped out, and therefore
+faster execution time on average.
+.PP
+In a paging environment this is not a bad choice for a default, but
+a couple of details need to be looked at much more carefully.
+.PP
+First of all, the size of a process becomes a more vague concept, since
+only the pages that are actually used need to be in primary storage
+for execution to progress, and they only need to be there when used.
+That implies that many more processes can fit in the same amount of
+primary storage, since most processes have a high degree of locality
+of reference and thus only need some fraction of their pages to actually
+do their job.
+.PP
+From this it follows that the interesting size of the process is some
+subset of the total amount of virtual memory occupied by the process.
+This number isn't a constant; it varies depending on what the process
+is doing, and it may indeed fluctuate wildly over the lifetime
+of the process.
+.PP
+One of the names for this vague concept is ``current working set''.
+It has been defined in many different ways over the years, mostly to
+satisfy and support claims in marketing or benchmark contexts.
+.PP
+For now we can simply say that it is the number of pages the process
+needs in order to run at a sufficiently low paging rate when primary
+storage is congested.
+(If primary storage isn't congested, this is not really important,
+of course, but most systems would be better off using the pages for
+disk cache or similar functions, so from that perspective it will
+always be congested.)
+If the number of pages is too small, the process will wait for its
+pages to be read from secondary storage much of the time; if it's too
+big, the space would be better used for something else.
+.PP
+From the view of any single process, this number of pages is
+``all of my pages'', but from the point of view of the OS it should
+be tuned to maximize the total throughput of all the processes on
+the machine at the time.
+This is usually done using various kinds of least-recently-used
+replacement algorithms to select page candidates for replacement.
+.PP
+With this knowledge, can we decide what the performance goal is for
+a modern malloc(3)?
+Well, it's almost as simple as it used to be:
+.B
+Minimize the number of pages accessed.
+.R
+.PP
+This really is the core of it all.
+If the number of accessed pages is smaller, then locality of reference is
+higher, and all kinds of caches (which is essentially what the
+primary storage is in a VM system) work better.
+.PP
+It's interesting to notice that the classical malloc fails on this one,
+because the information about free chunks is kept with the free
+chunks themselves.  In some of the benchmarks this came out as all the
+pages being paged in every time a malloc call was made, because malloc
+had to traverse the free list to find a suitable chunk for the allocation.
+If memory is not in use, then you shouldn't access it.
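+.PP
+As a minimal sketch of the problem (the structure and the names here are
+invented for illustration, not taken from any particular implementation),
+the classical arrangement looks roughly like this; notice that the
+first-fit search reads a header stored inside every free chunk it passes,
+and each of those reads can fault in a page:
+.DS
+#include <stddef.h>
+
+/* The bookkeeping lives inside the free chunks themselves. */
+struct chunk {
+        size_t size;            /* size of this free chunk */
+        struct chunk *next;     /* next chunk on the free list */
+};
+
+static struct chunk *free_list;
+
+static void *
+classical_malloc(size_t size)
+{
+        struct chunk **cpp, *cp;
+
+        /* First fit: every header inspected may drag in a page. */
+        for (cpp = &free_list; (cp = *cpp) != NULL; cpp = &cp->next) {
+                if (cp->size >= size) {
+                        *cpp = cp->next;        /* unlink the chunk */
+                        return (cp);
+                }
+        }
+        return (NULL);  /* a real allocator would brk(2) for more */
+}
+.DE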
+.PP
+The secondary goal is more evident:
+.B
+Try to work in pages.
+.R
+.PP
+That makes it easier for the kernel and wastes less virtual memory.
+Most modern implementations do this when they interact with the
+kernel, but few try to avoid objects spanning pages.
+.PP
+If an object's size is less than or equal to a page,
+there is no reason for it to span two pages.
+Having objects span pages means that two pages must be
+paged in if that object is accessed.
+.PP
+With this analysis in the luggage, we can start coding.
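+.PP
+As a concrete illustration of the point about objects spanning pages,
+here is a small helper, purely for illustration (the names and the fixed
+page size are assumptions, not part of any particular allocator): it
+reports whether an object placed at a given address would straddle a
+page boundary, which is exactly what an allocator that works in pages
+should avoid for sub-page objects.
+.DS
+#include <stddef.h>
+#include <stdint.h>
+
+#define PAGE_BYTES 4096         /* assumed page size for this example */
+
+/*
+ * Return non-zero if an object of 'size' bytes (size > 0) starting at
+ * 'addr' touches more than one page.  An allocator that keeps sub-page
+ * objects entirely within single pages never triggers this.
+ */
+static int
+spans_pages(const void *addr, size_t size)
+{
+        uintptr_t first, last;
+
+        first = (uintptr_t)addr / PAGE_BYTES;
+        last = ((uintptr_t)addr + size - 1) / PAGE_BYTES;
+        return (last != first);
+}
+.DE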