Diffstat (limited to 'share/man/man4/geom.4')
-rw-r--r-- | share/man/man4/geom.4 | 467 |
1 files changed, 467 insertions, 0 deletions
diff --git a/share/man/man4/geom.4 b/share/man/man4/geom.4
new file mode 100644
index 000000000000..38573893357f
--- /dev/null
+++ b/share/man/man4/geom.4
@@ -0,0 +1,467 @@
+.\"
+.\" Copyright (c) 2002 Poul-Henning Kamp
+.\" Copyright (c) 2002 Networks Associates Technology, Inc.
+.\" All rights reserved.
+.\"
+.\" This software was developed for the FreeBSD Project by Poul-Henning Kamp
+.\" and NAI Labs, the Security Research Division of Network Associates, Inc.
+.\" under DARPA/SPAWAR contract N66001-01-C-8035 ("CBOSS"), as part of the
+.\" DARPA CHATS research program.
+.\"
+.\" Redistribution and use in source and binary forms, with or without
+.\" modification, are permitted provided that the following conditions
+.\" are met:
+.\" 1. Redistributions of source code must retain the above copyright
+.\" notice, this list of conditions and the following disclaimer.
+.\" 2. Redistributions in binary form must reproduce the above copyright
+.\" notice, this list of conditions and the following disclaimer in the
+.\" documentation and/or other materials provided with the distribution.
+.\" 3. The names of the authors may not be used to endorse or promote
+.\" products derived from this software without specific prior written
+.\" permission.
+.\"
+.\" THIS SOFTWARE IS PROVIDED BY THE AUTHOR AND CONTRIBUTORS ``AS IS'' AND
+.\" ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
+.\" IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
+.\" ARE DISCLAIMED.  IN NO EVENT SHALL THE AUTHOR OR CONTRIBUTORS BE LIABLE
+.\" FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
+.\" DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS
+.\" OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
+.\" HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
+.\" LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
+.\" OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
+.\" SUCH DAMAGE.
+.\"
+.\" $FreeBSD$
+.\"
+.Dd May 25, 2006
+.Os
+.Dt GEOM 4
+.Sh NAME
+.Nm GEOM
+.Nd "modular disk I/O request transformation framework"
+.Sh DESCRIPTION
+The
+.Nm
+framework provides an infrastructure in which
+.Dq classes
+can perform transformations on disk I/O requests on their path from
+the upper kernel to the device drivers and back.
+.Pp
+Transformations in a
+.Nm
+context range from the simple geometric
+displacement performed in typical disk partitioning modules over RAID
+algorithms and device multipath resolution to full-blown cryptographic
+protection of the stored data.
+.Pp
+Compared to traditional
+.Dq "volume management" ,
+.Nm
+differs from most
+and in some cases all previous implementations in the following ways:
+.Bl -bullet
+.It
+.Nm
+is extensible.
+It is trivially simple to write a new class
+of transformation and it will not be given stepchild treatment.
+If
+someone for some reason wanted to mount IBM MVS diskpacks, a class
+recognizing and configuring their VTOC information would be a trivial
+matter.
+.It
+.Nm
+is topologically agnostic.
+Most volume management implementations
+have very strict notions of how classes can fit together; very often
+one fixed hierarchy is provided, for instance, subdisk - plex -
+volume.
+.El
+.Pp
+Being extensible means that new transformations are treated no differently
+than existing transformations.
+.Pp
+Fixed hierarchies are bad because they make it impossible to express
+the intent efficiently.
+In the fixed hierarchy above, it is not possible to mirror two
+physical disks and then partition the mirror into subdisks; instead
+one is forced to make subdisks on the physical volumes and to mirror
+these two and two, resulting in a much more complex configuration.
+.Nm
+on the other hand does not care in which order things are done;
+the only restriction is that cycles in the graph will not be allowed.
+.Sh "TERMINOLOGY AND TOPOLOGY"
+.Nm
+is quite object-oriented and consequently the terminology
+borrows a lot of context and semantics from the OO vocabulary:
+.Pp
+A
+.Dq class ,
+represented by the data structure
+.Vt g_class ,
+implements one
+particular kind of transformation.
+Typical examples are MBR disk
+partition, BSD disklabel, and RAID5 classes.
+.Pp
+An instance of a class is called a
+.Dq geom
+and represented by the data structure
+.Vt g_geom .
+In a typical i386
+.Fx
+system, there
+will be one geom of class MBR for each disk.
+.Pp
+A
+.Dq provider ,
+represented by the data structure
+.Vt g_provider ,
+is the front gate at which a geom offers service.
+A provider is
+.Do
+a disk-like thing which appears in
+.Pa /dev
+.Dc - a logical
+disk in other words.
+All providers have three main properties:
+.Dq name ,
+.Dq sectorsize
+and
+.Dq size .
+.Pp
+A
+.Dq consumer
+is the backdoor through which a geom connects to another
+geom provider and through which I/O requests are sent.
+.Pp
+The topological relationships between these entities are as follows:
+.Bl -bullet
+.It
+A class has zero or more geom instances.
+.It
+A geom has exactly one class it is derived from.
+.It
+A geom has zero or more consumers.
+.It
+A geom has zero or more providers.
+.It
+A consumer can be attached to zero or one providers.
+.It
+A provider can have zero or more consumers attached.
+.El
+.Pp
+All geoms have a rank number assigned, which is used to detect and
+prevent loops in the acyclic directed graph.
+This rank number is
+assigned as follows:
+.Bl -enum
+.It
+A geom with no attached consumers has rank=1.
+.It
+A geom with attached consumers has a rank one higher than the
+highest rank of the geoms of the providers its consumers are
+attached to.
+.El
+.Sh "SPECIAL TOPOLOGICAL MANEUVERS"
+In addition to the straightforward attach, which attaches a consumer
+to a provider, and detach, which breaks the bond, a number of special
+topological maneuvers exist to facilitate configuration and to
+improve the overall flexibility.
+.Bl -inset
+.It Em TASTING
+is a process that happens whenever a new class or new provider
+is created, and it provides the class a chance to automatically configure an
+instance on providers which it recognizes as its own.
+A typical example is the MBR disk-partition class which will look for
+the MBR table in the first sector and, if found and validated, will
+instantiate a geom to multiplex according to the contents of the MBR.
+.Pp
+A new class will be offered to all existing providers in turn and a new
+provider will be offered to all classes in turn.
+.Pp
+Exactly what a class does to recognize if it should accept the offered
+provider is not defined by
+.Nm ,
+but the sensible options are:
+.Bl -bullet
+.It
+Examine specific data structures on the disk.
+.It
+Examine properties like
+.Dq sectorsize
+or
+.Dq mediasize
+for the provider.
+.It
+Examine the rank number of the provider's geom.
+.It
+Examine the method name of the provider's geom.
+.El
+.It Em ORPHANIZATION
+is the process by which a provider is removed while
+it potentially is still being used.
+.Pp
+When a geom orphans a provider, all future I/O requests will
+.Dq bounce
+on the provider with an error code set by the geom.
+Any
+consumers attached to the provider will receive notification about
+the orphanization when the event loop gets around to it, and they
+can take appropriate action at that time.
+.Pp
+A geom which came into being as a result of a normal taste operation
+should self-destruct unless it has a way to keep functioning whilst
+lacking the orphaned provider.
+Geoms like disk slicers should therefore self-destruct whereas
+RAID5 or mirror geoms will be able to continue as long as they do
+not lose quorum.
+.Pp
+When a provider is orphaned, this does not necessarily result in any
+immediate change in the topology: any attached consumers are still
+attached, any opened paths are still open, any outstanding I/O
+requests are still outstanding.
+.Pp
+The typical scenario is:
+.Pp
+.Bl -bullet -offset indent -compact
+.It
+A device driver detects a disk has departed and orphans the provider for it.
+.It
+The geoms on top of the disk receive the orphanization event and
+orphan all their providers in turn.
+Providers which are not attached to will typically self-destruct
+right away.
+This process continues in a quasi-recursive fashion until all
+relevant pieces of the tree have heard the bad news.
+.It
+Eventually the buck stops when it reaches geom_dev at the top
+of the stack.
+.It
+Geom_dev will call
+.Xr destroy_dev 9
+to stop any more requests from
+coming in.
+It will sleep until any and all outstanding I/O requests have
+been returned.
+It will explicitly close (i.e., zero the access counts), a change
+which will propagate all the way down through the mesh.
+It will then detach and destroy its geom.
+.It
+The geom whose provider is now detached will destroy the provider,
+detach and destroy its consumer and destroy its geom.
+.It
+This process percolates all the way down through the mesh, until
+the cleanup is complete.
+.El
+.Pp
+While this approach seems byzantine, it does provide the maximum
+flexibility and robustness in handling disappearing devices.
+.Pp
+The one absolutely crucial detail to be aware of is that if the
+device driver does not return all I/O requests, the tree will
+not unravel.
+.It Em SPOILING
+is a special case of orphanization used to protect
+against stale metadata.
+It is probably easiest to understand spoiling by going through
+an example.
+.Pp
+Imagine a disk,
+.Pa da0 ,
+on top of which an MBR geom provides
+.Pa da0s1
+and
+.Pa da0s2 ,
+and on top of
+.Pa da0s1
+a BSD geom provides
+.Pa da0s1a
+through
+.Pa da0s1e ,
+and that both the MBR and BSD geoms have
+autoconfigured based on data structures on the disk media.
+Now imagine the case where
+.Pa da0
+is opened for writing and those
+data structures are modified or overwritten: now the geoms would
+be operating on stale metadata unless some notification system
+can inform them otherwise.
+.Pp
+To avoid this situation, when the open of
+.Pa da0
+for write happens,
+all attached consumers are told about this and geoms like
+MBR and BSD will self-destruct as a result.
+When
+.Pa da0
+is closed, it will be offered for tasting again
+and, if the data structures for MBR and BSD are still there, new
+geoms will instantiate themselves anew.
+.Pp
+Now for the fine print:
+.Pp
+If any of the paths through the MBR or BSD module were open, they
+would have opened downwards with an exclusive bit thus rendering it
+impossible to open
+.Pa da0
+for writing in that case.
+Conversely,
+the requested exclusive bit would render it impossible to open a
+path through the MBR geom while
+.Pa da0
+is open for writing.
+.Pp
+From this it also follows that changing the size of open geoms can
+only be done with their cooperation.
+.Pp
+Finally: the spoiling only happens when the write count goes from
+zero to non-zero and the retasting happens only when the write count goes
+from non-zero to zero.
+.It Em INSERT/DELETE
+are very special operations which allow a new geom
+to be instantiated between a consumer and a provider attached to
+each other and to remove it again.
+.Pp
+To understand the utility of this, imagine a provider
+being mounted as a file system.
+Between the DEVFS geom's consumer and its provider we insert
+a mirror module which configures itself with one mirror
+copy and consequently is transparent to the I/O requests
+on the path.
+We can now configure yet another mirror copy on the mirror geom,
+request a synchronization, and finally drop the first mirror
+copy.
+We have now, in essence, moved a mounted file system from one
+disk to another while it was being used.
+At this point the mirror geom can be deleted from the path
+again; it has served its purpose.
+.It Em CONFIGURE
+is the process where the administrator issues instructions
+for a particular class to instantiate itself.
+There are multiple
+ways to express intent in this case - a particular provider may be
+specified with a level of override forcing, for instance, a BSD
+disklabel module to attach to a provider which was not found palatable
+during the TASTE operation.
+.Pp
+Finally, I/O is the reason we even do this: it concerns itself with
+sending I/O requests through the graph.
+.It Em "I/O REQUESTS" ,
+represented by
+.Vt "struct bio" ,
+originate at a consumer,
+are scheduled on its attached provider and, when processed, are returned
+to the consumer.
+It is important to realize that the
+.Vt "struct bio"
+which enters through the provider of a particular geom does not
+.Do
+come out on the other side
+.Dc .
+Even simple transformations like MBR and BSD will clone the
+.Vt "struct bio" ,
+modify the clone, and schedule the clone on their
+own consumer.
+Note that cloning the
+.Vt "struct bio"
+does not involve cloning the
+actual data area specified in the I/O request.
+.Pp
+In total, four different I/O requests exist in
+.Nm :
+read, write, delete, and
+.Dq "get attribute".
+.Pp
+Read and write are self-explanatory.
+.Pp
+Delete indicates that a certain range of data is no longer used
+and that it can be erased or freed as the underlying technology
+supports.
+Technologies like flash adaptation layers can arrange to erase
+the relevant blocks before they are reassigned and
+cryptographic devices may want to fill random bits into the
+range to reduce the amount of data available for attack.
+.Pp
+It is important to recognize that a delete indication is not a
+request and consequently there is no guarantee that the data actually
+will be erased or made unavailable unless guaranteed by specific
+geoms in the graph.
+If
+.Dq "secure delete"
+semantics are required, a
+geom should be pushed which converts delete indications into (a
+sequence of) write requests.
+.Pp
+.Dq "Get attribute"
+supports inspection and manipulation
+of out-of-band attributes on a particular provider or path.
+Attributes are named by
+.Tn ASCII
+strings and they will be discussed in
+a separate section below.
+.El
+.Pp
+(Stay tuned while the author rests his brain and fingers: more to come.)
+.Sh DIAGNOSTICS
+Several flags are provided for tracing
+.Nm
+operations and unlocking
+protection mechanisms via the
+.Va kern.geom.debugflags
+sysctl.
+All of these flags are off by default, and great care should be taken in
+turning them on.
+.Bl -tag -width indent
+.It 0x01 Pq Dv G_T_TOPOLOGY
+Provide tracing of topology change events.
+.It 0x02 Pq Dv G_T_BIO
+Provide tracing of buffer I/O requests.
+.It 0x04 Pq Dv G_T_ACCESS
+Provide tracing of access check controls.
+.It 0x08 (unused)
+.It 0x10 (allow foot shooting)
+Allow writing to Rank 1 providers.
+This would, for example, allow the super-user to overwrite the MBR on the root
+disk or write random sectors elsewhere to a mounted disk.
+The implications are obvious.
+.It 0x40 Pq Dv G_F_DISKIOCTL
+This is unused at this time.
+.It 0x80 Pq Dv G_F_CTLDUMP
+Dump contents of gctl requests.
+.El
+.Sh SEE ALSO
+.Xr libgeom 3 ,
+.Xr disk 9 ,
+.Xr DECLARE_GEOM_CLASS 9 ,
+.Xr g_access 9 ,
+.Xr g_attach 9 ,
+.Xr g_bio 9 ,
+.Xr g_consumer 9 ,
+.Xr g_data 9 ,
+.Xr g_event 9 ,
+.Xr g_geom 9 ,
+.Xr g_provider 9 ,
+.Xr g_provider_by_name 9
+.Sh HISTORY
+This software was developed for the
+.Fx
+Project by
+.An Poul-Henning Kamp
+and NAI Labs, the Security Research Division of Network Associates, Inc.\&
+under DARPA/SPAWAR contract N66001-01-C-8035
+.Pq Dq CBOSS ,
+as part of the
+DARPA CHATS research program.
+.Pp
+The first precursor for
+.Nm
+was a gruesome hack to Minix 1.2 and was
+never distributed.
+An earlier attempt to implement a less general scheme
+in
+.Fx
+never succeeded.
+.Sh AUTHORS
+.An "Poul-Henning Kamp" Aq phk@FreeBSD.org
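
Illustration (not part of the patch): the rank rule from TERMINOLOGY AND TOPOLOGY can be tried out with a small toy model. This is only a sketch under stated assumptions. The class `ToyGeom` and its methods `rank`, `depends_on` and `attach` are hypothetical names invented here, not kernel interfaces, and providers and consumers are collapsed into a plain list of the geoms below. The loop check walks the graph directly; the man page says the rank is used to detect loops but does not say how, so that part is an assumption too.

```python
"""Toy model of GEOM rank numbers; not the kernel implementation."""


class ToyGeom:
    """Hypothetical stand-in for a geom; only tracks what its consumers attach to."""

    def __init__(self, name):
        self.name = name
        # Geoms owning the providers that this geom's consumers are attached to.
        self.below = []

    def rank(self):
        """Rank 1 with no attached consumers, else one more than the highest rank below."""
        if not self.below:
            return 1
        return 1 + max(g.rank() for g in self.below)

    def depends_on(self, other):
        """True if `other` is this geom or is reachable by following consumers downward."""
        return self is other or any(g.depends_on(other) for g in self.below)

    def attach(self, lower):
        """Attach one of our consumers to a provider of `lower`, refusing cycles."""
        if lower.depends_on(self):
            raise ValueError(f"attaching {self.name} to {lower.name} would create a loop")
        self.below.append(lower)


# The da0 -> MBR -> BSD stack used in the SPOILING example.
da0 = ToyGeom("da0")  # a disk geom has no consumers
mbr = ToyGeom("MBR")
bsd = ToyGeom("BSD")
mbr.attach(da0)
bsd.attach(mbr)

for g in (da0, mbr, bsd):
    print(g.name, "rank", g.rank())

# Attaching the disk on top of the BSD geom would close a cycle, so it is refused.
try:
    da0.attach(bsd)
except ValueError as err:
    print("refused:", err)
```

In this model the disk keeps rank 1 because it has no attached consumers, which is the case the 0x10 debug flag in DIAGNOSTICS refers to as "Rank 1 providers".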