Block devices store data in fixed-size chunks. The device is divided into sectors, and the sector is the basic unit of transfer. File systems handle data in chunks called blocks. A block comprises one or more sectors on the device.
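The relation between blocks and sectors is simple arithmetic. The following is a minimal sketch in Python, not kernel code; the 512-byte sector size and the helper name block_to_sectors are assumptions made for illustration.

```python
SECTOR_SIZE = 512  # assumed sector size in bytes; real devices may differ


def block_to_sectors(block_nr, block_size, sector_size=SECTOR_SIZE):
    """Return the range of sector numbers that back one file system block."""
    if block_size % sector_size != 0:
        raise ValueError("block size must be a multiple of the sector size")
    sectors_per_block = block_size // sector_size
    first = block_nr * sectors_per_block
    return range(first, first + sectors_per_block)


# Block 3 of a file system with 4096-byte blocks covers sectors 24..31.
print(list(block_to_sectors(3, 4096)))
```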
Device drivers perform the actual I/O on the device. The kernel provides hooks for these drivers so that it can hand over control to the driver when I/O is needed.
There can be several types of block devices, each with its own way of performing I/O. The kernel hides these details and provides a common interface for accessing block devices. Callers can simply ask the kernel to fetch the nth block on a device (there is a mechanism to identify devices).
Since block I/O is a costly operation, the kernel has the additional responsibility of optimizing it. It caches blocks in memory so that subsequent accesses to them need not go to the disk. Write operations just update the block in memory. To protect data integrity, however, the kernel must also sync the modified blocks to the disk.
Each block device has a device driver that performs the actual I/O. The driver has to perform certain steps to register with the kernel so that the kernel can delegate tasks to it.
The kernel identifies block devices by their major and minor numbers. The major number identifies the device driver and the minor number identifies the partition within the device. For example, the device driver for the hard disk manages all the partitions on the disk, so the device files for the different partitions have the same major number but different minor numbers.
These devices are accessed as special files whose file type identifies them as block devices. Each such file has a major number and a minor number. Since the kernel needs only these two fields to identify the device, the file can be located anywhere (i.e., the path is irrelevant) and there can be any number of files with the same identifiers. All of them refer to the same device (and partition). For example, the partitions of the hard disk are normally named hda1, hda2, etc. under the /dev directory. This is just a convention.
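From user space, the file type and the two numbers can be inspected with the standard os and stat modules of Python. This is only a sketch: describe_device_node is a hypothetical helper, and the numbers 3 and 1 are arbitrary example values.

```python
import os
import stat


def describe_device_node(path):
    """Return (major, minor) for a block device file, or None otherwise."""
    st = os.stat(path)
    if not stat.S_ISBLK(st.st_mode):
        return None
    return os.major(st.st_rdev), os.minor(st.st_rdev)


# The path plays no role in the identity: only the packed device number does.
dev = os.makedev(3, 1)  # example numbers only
print(os.major(dev), os.minor(dev))
```

Calling describe_device_node on two different paths that carry the same numbers would return the same pair, which is why the location of the file does not matter.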
The major number of the device identifies the device driver. For this, the device driver has to register the major number with the kernel. register_blkdev is used to register a block device major number.
If the driver is not particular about the major number, it can pass zero so that the kernel assigns a free major number and returns it.
The kernel defines structures called requests, each of which describes an I/O request, and it maintains queues of such requests. The driver needs to create a request queue to which the kernel can add requests. blk_init_queue is used to create a request queue. Drivers also need to register structures called gendisk, and the allocated queue has to be assigned to the disk. gendisk structures can be allocated using alloc_disk and registered using add_disk, which calls register_disk internally.
So these are the broad steps the driver needs to perform:
1. Register the major number.
2. Create a request queue, passing the request handler.
3. Create a gendisk structure and fill in details such as the major and minor numbers. Set the block_device_operations field to the table with the driver handlers. Assign the request queue.
4. Register the disk.
The disk size can be set using set_capacity at the time of device opening.
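The order of the registration steps can be sketched with a toy user-space model in Python. None of this is kernel code: ToyKernel, ToyQueue, ToyDisk and toy_driver_init are hypothetical names, the methods only mimic the roles of the kernel functions named above, and the way a free major number is picked is an assumption of the model.

```python
class ToyKernel:
    """Hypothetical stand-in for the kernel's registration tables."""

    def __init__(self):
        self.majors = {}   # major number -> driver name
        self.disks = []    # registered ToyDisk objects

    def register_blkdev(self, major, name):
        # Passing 0 asks for any free major number, which is returned.
        if major == 0:
            major = next(n for n in range(254, 0, -1) if n not in self.majors)
        elif major in self.majors:
            raise ValueError("major number already in use")
        self.majors[major] = name
        return major

    def register_disk(self, disk):
        if disk.major not in self.majors:
            raise ValueError("major number not registered")
        if disk.queue is None:
            raise ValueError("disk has no request queue")
        self.disks.append(disk)


class ToyQueue:
    def __init__(self, request_fn):
        self.request_fn = request_fn   # the driver's request handler
        self.requests = []


class ToyDisk:
    def __init__(self, major, first_minor, fops, queue):
        self.major, self.first_minor = major, first_minor
        self.fops, self.queue = fops, queue
        self.capacity = 0              # size in sectors

    def set_capacity(self, sectors):
        self.capacity = sectors


def toy_driver_init(kernel):
    major = kernel.register_blkdev(0, "toydrv")              # step 1
    queue = ToyQueue(request_fn=lambda q: q.requests.clear())  # step 2
    disk = ToyDisk(major, 0, fops={"open": None}, queue=queue)  # step 3
    disk.set_capacity(2048)
    kernel.register_disk(disk)                               # step 4
    return disk
```

The model refuses to register a disk whose major number is unknown or whose queue is missing, which is one way to see why the steps are performed in this order.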
After this, the device driver is ready to perform I/O on the device. Operations performed on the disk, such as open, release and ioctl, are handed over to the handlers registered by the device driver. For the I/O itself, the request handler that was registered while allocating the queue is called.
In the kernel, data from block devices is accessed as blocks. A block is described by the buffer_head structure. For example, when a page has to be read from the disk, the number of blocks required to load the page is calculated and an I/O request is submitted to the block layer to load those blocks.
Block I/O is a costly operation and hence the kernel tries to retain the data in memory. Writes to the blocks are not synchronized to the device immediately. The blocks are cached in memory, writes are applied to the cached copy, and the modified blocks are marked as 'dirty'. The kernel flushes them to the device periodically.
The result of this is that when multiple read operations are performed for a particular block, the actual I/O from the device happens only once. Since the buffer is cached, the subsequent reads find it in the cache. Similarly, a write to the buffer just updates the buffer and marks it as dirty. The kernel takes care of writing it back to the disk.
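The caching behaviour described here can be illustrated with a small write-back cache. This is a toy model in Python under stated assumptions: ToyBlockCache is a hypothetical class, a dict stands in for the device, and flush is called by hand instead of by a periodic kernel thread.

```python
class ToyBlockCache:
    """Write-back cache over a dict that stands in for the device."""

    def __init__(self, device):
        self.device = device      # block number -> bytes
        self.cache = {}           # cached block contents
        self.dirty = set()        # blocks modified only in memory
        self.device_reads = 0
        self.device_writes = 0

    def read(self, block_nr):
        if block_nr not in self.cache:           # miss: real I/O happens
            self.cache[block_nr] = self.device[block_nr]
            self.device_reads += 1
        return self.cache[block_nr]

    def write(self, block_nr, data):
        self.cache[block_nr] = data              # update memory only
        self.dirty.add(block_nr)

    def flush(self):
        """Write every dirty block back, as the periodic flush would."""
        for block_nr in sorted(self.dirty):
            self.device[block_nr] = self.cache[block_nr]
            self.device_writes += 1
        self.dirty.clear()
```

Two reads of the same block increment device_reads once, and a write reaches the device dict only after flush is called.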
I/O can be performed on the device in terms of pages or blocks. For example, file systems access metadata in terms of blocks, whereas a file can be mapped into memory so that any read from the mapped address loads the data from the file stored on the disk. The kernel ensures that with both approaches the blocks are available in the cache and there is no duplication of I/O.
A block, which is described by the buffer_head structure, is submitted for I/O using the submit_bh function. The 2.6 kernel added a new structure, bio, to describe the information needed to perform the I/O operation. In the 2.4 kernel, the buffer_head structure was overloaded to hold both the information about the block and the I/O information. This resulted in splitting and joining of requests for large contiguous I/Os. The bio structure separates the two, so that a large contiguous I/O can be described as a single unit.
The submit_bh function creates a bio structure and fills in information such as the sector, the size, and the location in memory that the data is read into or written from. It then calls submit_bio, which checks some parameters and calls generic_make_request to make the request. That function gets the request queue of the device and calls its make_request_fn handler to add the request to the request queue.
Most drivers allocate the request queue using blk_init_queue, which sets make_request_fn to __make_request. If a device driver wants to manage the request queue itself, it needs to allocate the queue and set the fields with its own handlers. The make_request_fn can be set using the blk_queue_make_request function. The ramdisk code does this in its module initialization code, which also shows the other registration steps needed by device drivers apart from the queue allocation. Drivers like the floppy driver use the default queue allocation.
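The call chain and the pluggable make_request_fn can be sketched as a toy model in Python. The function names mirror the kernel functions mentioned above, but the bodies are illustrative assumptions, as are the Toy classes, the 512-byte sector size and the make_ram_handler helper, which plays the role of a driver that services each bio directly instead of queueing it.

```python
from dataclasses import dataclass, field


@dataclass
class ToyBufferHead:
    block_nr: int
    size: int            # block size in bytes
    data: bytearray


@dataclass
class ToyBio:
    sector: int          # starting sector on the device
    size: int            # number of bytes to transfer
    buffer: bytearray    # memory the data is read into or written from
    write: bool


@dataclass
class ToyQueue:
    make_request_fn: object
    requests: list = field(default_factory=list)


def default_make_request(queue, bio):
    """Stand-in for __make_request: queue the bio as a new request."""
    queue.requests.append([bio])


def make_ram_handler(store):
    """Build a handler that copies data at once and never queues it."""
    def handler(queue, bio):
        start = bio.sector * 512          # assumed 512-byte sectors
        if bio.write:
            store[start:start + bio.size] = bio.buffer
        else:
            bio.buffer[:] = store[start:start + bio.size]
    return handler


def generic_make_request(queue, bio):
    queue.make_request_fn(queue, bio)     # hand over to the queue's handler


def submit_bio(queue, bio):
    if bio.size <= 0:                     # a token parameter check
        raise ValueError("empty bio")
    generic_make_request(queue, bio)


def submit_bh(queue, bh, write=False, sector_size=512):
    sector = bh.block_nr * (bh.size // sector_size)
    submit_bio(queue, ToyBio(sector, bh.size, bh.data, write))
```

With default_make_request the bio ends up in queue.requests; with a handler from make_ram_handler the queue stays empty and the data is copied immediately.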
The I/O performance of most block devices depends on the sequence of the block locations. It is best when the requests are sorted in a particular order and suffers heavily if they are scattered randomly. For example, for hard disks the best case is when the block requests (their locations) are in the same direction so that the seek time for the head is minimized. This is similar to the working of an elevator. To exploit such behaviors, the kernel provides several elevator algorithms. The request queue is assigned an elevator algorithm, which organizes and sorts the requests in the queue.
The __make_request function submits the bio structure to the request queue, either by merging it into an existing request or by creating a new request. It uses the request queue's elevator algorithm to insert (or merge) the request.
This function calls elv_merge to find out whether the bio structure can be merged with an existing request in the request queue. The elevator suggests a request structure and indicates whether the bio structure can be merged at the front or the back of that request.
If the elevator suggests a merge, the function calls the corresponding merge handler of the request queue to confirm it. The handler can check whether any constraints prevent the merge and can also update the request structure. If this succeeds, the bio structure is merged into the request.
If the bio cannot be merged, or if the merge fails, a new request structure has to be created. The function calls get_request_wait to get a request structure (it may have to wait if there are no free structures). It fills in the request structure and calls add_request to add the request to the queue.
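The merge-or-create decision can be sketched as follows. This is a toy model in Python, not the kernel's algorithm: requests are reduced to a start sector and a length, the queue's veto is modelled as a hypothetical max_sectors limit, and keeping the list sorted by sector is only one simple elevator policy.

```python
from dataclasses import dataclass


@dataclass
class ToyRequest:
    sector: int       # first sector covered
    nr_sectors: int   # number of contiguous sectors

    @property
    def end(self):
        return self.sector + self.nr_sectors


def elv_merge(requests, sector, nr_sectors):
    """Suggest a request and a merge position for a new bio, if any."""
    for rq in requests:
        if rq.end == sector:
            return rq, "back"       # bio starts where the request ends
        if sector + nr_sectors == rq.sector:
            return rq, "front"      # bio ends where the request starts
    return None, None


def make_request(requests, sector, nr_sectors, max_sectors=256):
    rq, where = elv_merge(requests, sector, nr_sectors)
    # The queue may still veto the merge, here with a size limit.
    if rq is not None and rq.nr_sectors + nr_sectors <= max_sectors:
        if where == "front":
            rq.sector = sector
        rq.nr_sectors += nr_sectors
        return rq
    rq = ToyRequest(sector, nr_sectors)    # no merge: create a new request
    requests.append(rq)
    requests.sort(key=lambda r: r.sector)  # keep the queue in sector order
    return rq
```

Submitting sectors 100..107 and then 108..115 leaves a single request of 16 sectors, while a bio that is not adjacent to any request gets a request of its own.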
Before adding the request to the queue, it checks whether the queue is empty. If it is, the device is 'plugged', which starts a timer that fires after the 'unplug delay' specified in the request queue. When the timer expires, the unplug_fn handler of the request queue is called to 'unplug' the device. The requests are handled once the device is unplugged.
The default handler for unplugging is generic_unplug_device. This will call the request_fn handler of the request queue which was registered by the device driver. The device driver will have to process the requests in the queue.
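Plugging can be modelled with an explicit clock. The sketch below is a toy in Python under stated assumptions: ToyPluggedQueue and make_driver are hypothetical, time is counted in abstract ticks passed in by the caller, and tick plays the role of the timer that leads to the unplug handler.

```python
class ToyPluggedQueue:
    def __init__(self, request_fn, unplug_delay=3):
        self.request_fn = request_fn      # the driver's request handler
        self.unplug_delay = unplug_delay  # in abstract clock ticks
        self.requests = []
        self.unplug_at = None             # tick at which the timer fires

    def add_request(self, rq, now):
        if not self.requests and self.unplug_at is None:
            self.unplug_at = now + self.unplug_delay   # plug the device
        self.requests.append(rq)

    def tick(self, now):
        """Advance the clock and unplug if the timer has expired."""
        if self.unplug_at is not None and now >= self.unplug_at:
            self.unplug_at = None
            self.request_fn(self)         # the driver processes the queue


def make_driver(handled):
    """Build a hypothetical request handler that drains the whole queue."""
    def request_fn(queue):
        handled.extend(queue.requests)
        queue.requests.clear()
    return request_fn
```

Requests added while the queue is plugged simply accumulate, which gives the elevator a chance to merge and sort them before the driver sees any of them.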
I/O requests are submitted by filling in a bio structure and calling submit_bio. Data is accessed in the kernel as blocks or pages. A page may need multiple blocks, depending on the block size. The kernel provides wrapper functions that use submit_bio to submit the request. These functions first check whether the blocks are available in the cache and submit the request only if they are not.
Data is read from files in terms of pages. Device files can be read like any other file. The pages are cached in memory. The file is treated as a sequence of pages and individual pages are loaded as needed. Such pages are added to the page cache. So, before loading a page from the disk, the page cache is searched to see if the page is already present.
Since a file can be read using the read system call or by mapping it into memory (for example, application binaries are mapped into memory), the cached pages are stored with the address space object. The kernel also uses this to serve read requests coming from the read system call.
Each opened inode has an associated address space object, which stores the mapping information such as the loaded pages. It has an associated address space operations table with handlers such as readpage and writepage that operate on the address space.
When the file is mapped into memory, the internal data structures (vm_area_struct) are updated to record that the mapped memory area is valid. The read is triggered only when there is a page fault (this is known as demand paging). The write handlers on the address space trigger a write operation to the device.
Read and write system calls delegate the task to the handlers in the file operation table. File systems can register their own handlers but they normally use generic_file_read for read and generic_file_write for write handlers.
The function generic_file_read looks up the page in the page cache and, if it is not present, calls the readpage handler of the address space object.
The function generic_file_write uses the prepare_write and commit_write handlers of the address space operations table to write the data to the buffer. File systems normally use generic_prepare_write and generic_commit_write as the handlers. The commit_write handler marks the buffers as dirty and adds the inode to the dirty inodes of the superblock (so that the files can be flushed before unmounting).
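The read and write paths through the page cache can be sketched in a few lines. This is a toy model in Python, not the kernel implementation: ToyAddressSpace, toy_file_read and toy_file_write are hypothetical names, the page size of 4096 bytes is an assumption, end-of-file handling is left out, and the prepare and commit steps are collapsed into two lines.

```python
PAGE_SIZE = 4096  # assumed page size


class ToyAddressSpace:
    """Per-inode page cache plus the readpage handler of the file system."""

    def __init__(self, readpage):
        self.pages = {}            # page index -> bytearray
        self.dirty = set()         # indexes of modified pages
        self.readpage = readpage   # handler that loads one page from disk

    def find_or_read(self, index):
        if index not in self.pages:            # not cached: call the handler
            self.pages[index] = bytearray(self.readpage(index))
        return self.pages[index]


def toy_file_read(mapping, pos, count):
    out = bytearray()
    while count > 0:
        index, offset = divmod(pos, PAGE_SIZE)
        chunk = min(count, PAGE_SIZE - offset)
        out += mapping.find_or_read(index)[offset:offset + chunk]
        pos, count = pos + chunk, count - chunk
    return bytes(out)


def toy_file_write(mapping, pos, data):
    while data:
        index, offset = divmod(pos, PAGE_SIZE)
        chunk = min(len(data), PAGE_SIZE - offset)
        page = mapping.find_or_read(index)     # the "prepare" step
        page[offset:offset + chunk] = data[:chunk]
        mapping.dirty.add(index)               # the "commit" step
        pos, data = pos + chunk, data[chunk:]
```

A read that crosses a page boundary calls the readpage handler once per page, and later reads or writes of the same pages are served from the cache.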
__bread is used to read a block from a device. This first checks if the buffer is available in the cache. Instead of maintaining a separate buffer cache, the kernel makes use of the page cache. The pages of data loaded from a file (inode) are cached and are accessed using its address space object. The offset of the page is used to locate the page.
When blocks are loaded from the device, the corresponding pages are added to the page cache. So, while checking whether the block is present in the cache, the kernel computes the page offset from the block number. It checks whether that page is present in the address space object of the device inode. If it finds the page, it looks at the associated buffers to see if there is a buffer_head structure for the required block. If there is one, it is returned.
If there is no page, a page is allocated and buffers are created and linked to it. If there is a page but no corresponding buffers, buffers are allocated and linked to it. The end result is a buffer_head structure. If it is not up to date, it is submitted for I/O and the caller waits for the operation to complete.
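The lookup described above can be sketched as a toy model in Python. ToyBlockDevice and ToyBuffer are hypothetical, the page size of 4096 bytes is an assumption, and the model attaches buffers to a page one at a time as they are requested, which is a simplification made for brevity.

```python
PAGE_SIZE = 4096  # assumed page size


class ToyBuffer:
    def __init__(self, block_nr):
        self.block_nr = block_nr
        self.uptodate = False
        self.data = None


class ToyBlockDevice:
    def __init__(self, blocks, block_size):
        self.blocks = blocks            # block number -> bytes on "disk"
        self.block_size = block_size
        self.pages = {}                 # page index -> {block_nr: ToyBuffer}
        self.io_count = 0

    def getblk(self, block_nr):
        per_page = PAGE_SIZE // self.block_size
        index = block_nr // per_page               # page offset of the block
        page = self.pages.setdefault(index, {})    # allocate page if absent
        if block_nr not in page:                   # attach buffer if absent
            page[block_nr] = ToyBuffer(block_nr)
        return page[block_nr]

    def bread(self, block_nr):
        bh = self.getblk(block_nr)
        if not bh.uptodate:                        # needs real I/O
            bh.data = self.blocks[block_nr]
            bh.uptodate = True
            self.io_count += 1
        return bh
```

With 1024-byte blocks, blocks 4 to 7 share page index 1, so reading block 5 and then block 4 allocates one page and performs two I/Os, while reading block 5 again performs none.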