1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
|
Introduction
This package constitutes the alpha distribution of the soft update
code updates for the fast filesystem.
Status
My `filesystem torture tests' (described below) run for days without
a hitch (no panic's, hangs, filesystem corruption, or memory leaks).
However, I have had several panic's reported to me by folks that
are field testing the code which I have not yet been able to
reproduce or fix. Although these panic's are rare and do not cause
filesystem corruption, the code should only be put into production
on systems where the system administrator is aware that it is being
run, and knows how to turn it off if problems arise. Thus, you may
hand out this code to others, but please ensure that this status
message is included with any distributions. Please also include
the file ffs_softdep.stub.c in any distributions so that folks that
cannot abide by the need to redistribute source will not be left
with a kernel that will not link. It will resolve all the calls
into the soft update code and simply ignores the request to enable
them. Thus you will be able to ensure that your other hooks have
not broken anything and that your kernel is softdep-ready for those
that wish to use them. Please report problems back to me with
kernel backtraces of panics if possible. This is massively complex
code, and people only have to have their filesystems hosed once or
twice to avoid future changes like the plague. I want to find and
fix as many bugs as soon as possible so as to get the code rock
solid before it gets widely released. Please report any bugs that
you uncover to mckusick@mckusick.com.
Performance
Running the Andrew Benchmarks yields the following raw data:
Phase Normal Softdep What it does
1 3s <1s Creating directories
2 8s 4s Copying files
3 6s 6s Recursive directory stats
4 8s 9s Scanning each file
5 25s 25s Compilation
Normal: 19.9u 29.2s 0:52.8 135+630io
Softdep: 20.3u 28.5s 0:47.8 103+363io
Another interesting datapoint are my `filesystem torture tests'.
They consist of 1000 runs of the andrew benchmarks, 1000 copy and
removes of /etc with randomly selected pauses of 0-60 seconds
between each copy and remove, and 500 find from / with randomly
selected pauses of 100 seconds between each run). The run of the
torture test compares as follows:
With soft updates: writes: 6 sync, 1,113,686 async; run time 19hr, 50min
Normal filesystem: writes: 1,459,147 sync, 487,031 async; run time 27hr, 15min
The upshot is 42% less I/O and 28% shorter running time.
Another interesting test point is a full MAKEDEV. Because it runs
as a shell script, it becomes mostly limited by the execution speed
of the machine on which it runs. Here are the numbers:
With soft updates:
labrat# time ./MAKEDEV std
2.2u 32.6s 0:34.82 100.0% 0+0k 11+36io 0pf+0w
labrat# ls | wc
522 522 3317
Without soft updates:
labrat# time ./MAKEDEV std
2.0u 40.5s 0:42.53 100.0% 0+0k 11+1221io 0pf+0w
labrat# ls | wc
522 522 3317
Of course, some of the system time is being pushed
to the syncer process, but that is a different story.
To show a benchmark designed to highlight the soft update code
consider a tar of zero-sized files and an rm -rf of a directory tree
that has at least 50 files or so at each level. Running a test with
a directory tree containing 28 directories holding 202 empty files
produces the following numbers:
With soft updates:
tar: 0.0u 0.5s 0:00.65 76.9% 0+0k 0+44io 0pf+0w (0 sync, 33 async writes)
rm: 0.0u 0.2s 0:00.20 100.0% 0+0k 0+37io 0pf+0w (0 sync, 72 async writes)
Normal filesystem:
tar: 0.0u 1.1s 0:07.27 16.5% 0+0k 60+586io 0pf+0w (523 sync, 0 async writes)
rm: 0.0u 0.5s 0:01.84 29.3% 0+0k 0+318io 0pf+0w (258 sync, 65 async writes)
The large reduction in writes is because inodes are clustered, so
most of a block gets allocated, then the whole block is written
out once rather than having the same block written once for each
inode allocated from it. Similarly each directory block is written
once rather than once for each new directory entry. Effectively
what the update code is doing is allocating a bunch of inodes
and directory entries without writing anything, then ensuring that
the block containing the inodes is written first followed by the
directory block that references them. If there were data in the
files it would further ensure that the data blocks were written
before their inodes claimed them.
Copyright Restrictions
Please familiarize yourself with the copyright restrictions
contained at the top of either the sys/ufs/ffs/softdep.h or
sys/ufs/ffs/ffs_softdep.c file. The key provision is similar
to the one used by the DB 2.0 package and goes as follows:
Redistributions in any form must be accompanied by information
on how to obtain complete source code for any accompanying
software that uses the this software. This source code must
either be included in the distribution or be available for
no more than the cost of distribution plus a nominal fee,
and must be freely redistributable under reasonable
conditions. For an executable file, complete source code
means the source code for all modules it contains. It does
not mean source code for modules or files that typically
accompany the operating system on which the executable file
runs, e.g., standard library modules or system header files.
The idea is to allow those of you freely redistributing your source
to use it while retaining for myself the right to peddle it for
money to the commercial UNIX vendors. Note that I have included a
stub file ffs_softdep.c.stub that is freely redistributable so that
you can put in all the necessary hooks to run the full soft updates
code, but still allow vendors that want to maintain proprietary
source to have a working system. I do plan to release the code with
a `Berkeley style' copyright once I have peddled it around to the
commercial vendors. If you have concerns about this copyright,
feel free to contact me with them and we can try to resolve any
difficulties.
Soft Dependency Operation
The soft update implementation does NOT require ANY changes
to the on-disk format of your filesystems. Furthermore it is
not used by default for any filesystems. It must be enabled on
a filesystem by filesystem basis by running tunefs to set a
bit in the superblock indicating that the filesystem should be
managed using soft updates. If you wish to stop using
soft updates due to performance or reliability reasons,
you can simply run tunefs on it again to turn off the bit and
revert to normal operation. The additional dynamic memory load
placed on the kernel malloc arena is approximately equal to
the amount of memory used by vnodes plus inodes (for a system
with 1000 vnodes, the additional peak memory load is about 300K).
Kernel Changes
There are two new changes to the kernel functionality that are not
contained in in the soft update files. The first is a `trickle
sync' facility running in the kernel as process 3. This trickle
sync process replaces the traditional `update' program (which should
be commented out of the /etc/rc startup script). When a vnode is
first written it is placed 30 seconds down on the trickle sync
queue. If it still exists and has dirty data when it reaches the
top of the queue, it is sync'ed. This approach evens out the load
on the underlying I/O system and avoids writing short-lived files.
The papers on trickle-sync tend to favor aging based on buffers
rather than files. However, I sync on file age rather than buffer
age because the data structures are much smaller as there are
typically far fewer files than buffers. Although this can make the
I/O spikey when a big file times out, it is still much better than
the wholesale sync's that were happening before. It also adapts
much better to the soft update code where I want to control
aging to improve performance (inodes age in 10 seconds, directories
in 15 seconds, files in 30 seconds). This ensures that most
dependencies are gone (e.g., inodes are written when directory
entries want to go to disk) reducing the amount of rollback that
is needed.
The other main kernel change is to split the vnode freelist into
two separate lists. One for vnodes that are still being used to
identify buffers and the other for those vnodes no longer identifying
any buffers. The latter list is used by getnewvnode in preference
to the former.
Packaging of Kernel Changes
The sys subdirectory contains the changes and additions to the
kernel. My goal in writing this code was to minimize the changes
that need to be made to the kernel. Thus, most of the new code
is contained in the two new files softdep.h and ffs_softdep.c.
The rest of the kernel changes are simply inserting hooks to
call into these two new files. Although there has been some
structural reorganization of the filesystem code to accommodate
gathering the information required by the soft update code,
the actual ordering of filesystem operations when soft updates
are disabled is unchanged.
The kernel changes are packaged as a set of diffs. As I am
doing my development in BSD/OS, the diffs are relative to the
BSD/OS versions of the files. Because BSD/OS recently had
4.4BSD-Lite2 merged into it, the Lite2 files are a good starting
point for figuring out the changes. There are 40 files that
require change plus the two new files. Most of these files have
only a few lines of changes in them. However, four files have
fairly extensive changes: kern/vfs_subr.c, ufs/ufs/ufs_lookup.c,
ufs/ufs/ufs_vnops.c, and ufs/ffs/ffs_alloc.c. For these four
files, I have provided the original Lite2 version, the Lite2
version with the diffs merged in, and the diffs between the
BSD/OS and merged version. Even so, I expect that there will
be some difficulty in doing the merge; I am certainly willing
to assist in helping get the code merged into your system.
Packaging of Utility Changes
The utilities subdirectory contains the changes and additions
to the utilities. There are diffs to three utilities enclosed:
tunefs - add a flag to enable and disable soft updates
mount - print out whether soft updates are enabled and
also statistics on number of sync and async writes
fsck - tighter checks on acceptable errors and a slightly
different policy for what to put in lost+found on
filesystems using soft updates
In addition you should recompile vmstat so as to get reports
on the 13 new memory types used by the soft update code.
It is not necessary to use the new version of fsck, however it
would aid in my debugging if you do. Also, because of the time
lag between deleting a directory entry and the inode it
references, you will find a lot more files showing up in your
lost+found if you do not use the new version. Note that the
new version checks for the soft update flag in the superblock
and only uses the new algorithms if it is set. So, it will run
unchanged on the filesystems that are not using soft updates.
Operation
Once you have booted a kernel that incorporates the soft update
code and installed the updated utilities, do the following:
1) Comment out the update program in /etc/rc.
2) Run `tunefs -n enable' on one or more test filesystems.
3) Mount these filesystems and then type `mount' to ensure that
they have been enabled for soft updates.
4) Copy the test directory to a softdep filesystem, chdir into
it and run `./doit'. You may want to check out each of the
three subtests individually first: doit1 - andrew benchmarks,
doit2 - copy and removal of /etc, doit3 - find from /.
|