The multiqueue block layer

By Jonathan Corbet June 5, 2013

The kernel’s block layer is charged with managing I/O to the system’s block (“disk drive”) devices. It was designed in an era when a high-performance drive could handle hundreds of I/O operations per second (IOPs); the fact that it tends to fall down with modern devices, capable of handling possibly millions of IOPs, is thus not entirely surprising. It has been known for years that significant changes would need to be made to enable Linux to perform well on fast solid-state devices. The shape of those changes is becoming clearer as the multiqueue block layer patch set, primarily the work of Jens Axboe and Shaohua Li, gets closer to being ready for mainline merging.

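The kernel's block layer is responsible for managing I/O to the system's block devices (disk drives). It was designed in an era when a high-performance drive could handle a few hundred I/O operations per second. It is therefore not very surprising that the design struggles now that modern devices can handle millions of IOPS. It has been pointed out for years that significant changes are needed for Linux to work well on fast solid-state devices. The outline of those changes is becoming clear as the multiqueue block layer patch set, the work of Jens Axboe and Shaohua Li, gets close to being ready for merging into the mainline.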

The basic structure of the block layer has not changed a whole lot since it was described for 2.6.10 in Linux Device Drivers. It offers two ways for a block driver to hook into the system, one of which is the “request” interface. When run in this mode, the block layer maintains a simple request queue; new I/O requests are submitted to the tail of the queue and the driver receives requests from the head. While requests sit in the queue, the block layer can operate on them in a number of ways: they can be reordered to minimize seek operations, adjacent requests can be coalesced into larger operations, and policies for fairness and bandwidth limits can be applied, for example.

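The basic structure of the block layer has not changed much since it was described for the 2.6.10 kernel. It offers two ways for a block driver to hook into the system, one of which is the "request" interface. When run in this mode, the block layer maintains a simple request queue: new I/O requests are submitted to the tail of the queue, and the driver receives requests from the head. While requests sit in the queue, the block layer can operate on them in several ways. For example, it can reorder them to minimize seek operations, coalesce adjacent requests into larger operations, and apply policies for fairness and bandwidth limits.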
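As a rough illustration of the request interface described above, here is a minimal Python model of a single request queue. It is not kernel code: the `Request` and `RequestQueue` names, the back-merge rule, and the sort-by-sector reordering are all assumptions made for this sketch.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class Request:
    sector: int   # first sector touched by the request
    count: int    # number of sectors

    @property
    def end(self) -> int:
        return self.sector + self.count


class RequestQueue:
    """Toy single request queue: submit at the tail, dispatch from the head."""

    def __init__(self) -> None:
        self.pending = deque()

    def submit(self, req: Request) -> None:
        # Back-merge: grow the tail request when the new one starts where it ends.
        if self.pending and self.pending[-1].end == req.sector:
            self.pending[-1].count += req.count
        else:
            self.pending.append(req)

    def sort_for_seek(self) -> None:
        # Reorder by starting sector, roughly what an elevator does for a disk.
        self.pending = deque(sorted(self.pending, key=lambda r: r.sector))

    def dispatch(self):
        # The driver takes the request at the head, or None if the queue is empty.
        return self.pending.popleft() if self.pending else None
```

Submitting requests for sectors 100 to 107, 108 to 115, and 0 to 7 leaves two entries in the queue, because the first two are adjacent and get coalesced. After `sort_for_seek()`, the driver receives the request at sector 0 first, then a single 16-sector request at sector 100.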

This request queue turns out to be one of the biggest bottlenecks in the entire system. It is protected by a single lock which, on a large system, will bounce frequently between the processors. It is a linked list, a notably cache-unfriendly data structure especially when modifications must be made – as they frequently are in the block layer. As a result, anybody who is trying to develop a driver for high-performance storage devices wants to do away with this request queue and replace it with something better.

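This request queue is one of the biggest bottlenecks in the entire system. It is protected by a single lock that, on a large system, bounces frequently between processors. It is a linked list, which is decidedly cache-unfriendly, especially when modifications must be made, and in the block layer they are made often. As a result, anyone trying to develop a driver for a high-performance storage device wants to get rid of this request queue and replace it with something better.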

The second block driver mode – the “make request” interface – allows a driver to do exactly that. It hooks the driver into a much higher part of the stack, shorting out the request queue and handing I/O requests directly to the driver. This interface was not originally intended for high-performance drivers; instead, it is there for stacked drivers (the MD RAID implementation, for example) that need to process requests before passing them on to the real, underlying device. Using it in other situations incurs a substantial cost: all of the other queue processing done by the block layer is lost and must be reimplemented in the driver.

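The second block driver mode, the "make request" interface, lets a driver do exactly that. It hooks the driver into a much higher part of the stack, bypassing the request queue and handing I/O requests directly to the driver. This interface was not originally intended for high-performance drivers. Instead, it exists for stacked drivers (MD RAID, for example) that need to process requests before passing them on to the real, underlying device. Using it in other situations carries a substantial cost: all of the other queue processing done by the block layer is lost and must be reimplemented in the driver.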
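To make the stacked-driver case concrete, the following Python sketch models a "make request" style hook that receives each I/O directly and remaps it onto member devices before passing it on. It is only loosely inspired by striping; it is not how MD RAID is implemented. `make_striped_device`, `CHUNK_SECTORS`, and the callback signature are hypothetical names chosen for the example.

```python
from typing import Callable, List

# Hypothetical "make request" hook: it is handed each I/O directly, with no
# request queue in front of it. Arguments are (sector, count).
MakeRequest = Callable[[int, int], None]

CHUNK_SECTORS = 8  # assumed stripe chunk size, in sectors


def make_striped_device(members: List[MakeRequest]) -> MakeRequest:
    """Return a make-request hook that stripes I/O across member devices."""

    def make_request(sector: int, count: int) -> None:
        while count > 0:
            chunk, offset = divmod(sector, CHUNK_SECTORS)
            member = chunk % len(members)
            member_sector = (chunk // len(members)) * CHUNK_SECTORS + offset
            # Split the I/O so that no piece crosses a chunk boundary.
            piece = min(count, CHUNK_SECTORS - offset)
            members[member](member_sector, piece)
            sector += piece
            count -= piece

    return make_request
```

With two members, a 12-sector request starting at sector 4 is split into a 4-sector piece at sector 4 of the first member and an 8-sector piece at sector 0 of the second. Nothing here merges or reorders requests, which mirrors the cost noted above: a driver using this mode has to supply any such processing itself.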

The multiqueue block layer work tries to fix this problem by adding a third mode for drivers to use. In this mode, the request queue is split into a number of separate queues:

The multiqueue block layer work tries to solve this problem by adding a third mode for drivers to use. In this mode, the request queue is split into a number of separate queues:

- Submission queues are set up on a per-CPU or per-node basis. Each CPU submits I/O operations into its own queue, with no interaction with the other CPUs. Contention for the submission queue lock is thus eliminated (when per-CPU queues are used) or greatly reduced (for per-node queues).
- One or more hardware dispatch queues simply buffer I/O requests for the driver.

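- Submission queues are set up per CPU or per node. Each CPU submits I/O operations into its own queue without interacting with the other CPUs. Contention for the submission queue lock is therefore eliminated (with per-CPU queues) or greatly reduced (with per-node queues).
- One or more hardware dispatch queues simply buffer I/O requests for the driver.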

While requests are in the submission queue, they can be operated on by the block layer in the usual manner. Reordering of requests for locality offers little or no benefit on solid-state devices; indeed, spreading requests out across the device might help with the parallel processing of requests. So reordering will not be done, but coalescing requests will reduce the total number of I/O operations, improving performance somewhat. Since the submission queues are per-CPU, there is no way to coalesce requests submitted to different queues. With no empirical evidence whatsoever, your editor would guess that adjacent requests are most likely to come from the same process and, thus, will automatically find their way into the same submission queue, so the lack of cross-CPU coalescing is probably not a big problem.

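While requests are in the submission queue, the block layer can operate on them in the usual manner. Reordering requests for locality offers little or no benefit on SSD devices; spreading requests across the device may actually help with processing them in parallel. So reordering is not done, but coalescing requests reduces the total number of I/O operations and improves performance somewhat. Because the submission queues are per-CPU, there is no way to coalesce requests submitted to different queues. With no empirical evidence at all, the editor would guess that adjacent requests are most likely to come from the same process and will therefore land in the same submission queue automatically, so the lack of cross-CPU coalescing is probably not a big problem.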
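The per-CPU behavior described above can be modeled in a few lines of Python. This is a sketch under stated assumptions, not the patch set's data structures: the `MultiQueue` class, its dictionary of queues keyed by CPU number, and the tail-only merge rule are all invented for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Request:
    sector: int
    count: int


@dataclass
class MultiQueue:
    # One submission queue per CPU; in this model they share no state.
    queues: Dict[int, List[Request]] = field(default_factory=dict)

    def submit(self, cpu: int, sector: int, count: int) -> None:
        queue = self.queues.setdefault(cpu, [])
        # Coalesce only with the tail of this CPU's own queue.
        if queue and queue[-1].sector + queue[-1].count == sector:
            queue[-1].count += count
        else:
            queue.append(Request(sector, count))

    def total_requests(self) -> int:
        return sum(len(q) for q in self.queues.values())
```

If CPU 0 submits sectors 0 to 7 and then 8 to 15, the two are coalesced into one request. If CPU 1 then submits sectors 16 to 23, that request stays separate even though it is adjacent on the device, so the model ends up with two requests in total. That is the missing cross-CPU coalescing the article mentions.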

The block layer will move requests from the submission queues into the hardware queues up to the maximum number specified by the driver. Most current devices will have a single hardware queue, but high-end devices already support multiple queues to increase parallelism. On such a device, the entire submission and completion path should be able to run on the same CPU as the process generating the I/O, maximizing cache locality (and, thus, performance). If desired, fairness or bandwidth-cap policies can be applied as requests move to the hardware queues, but there will be an associated performance cost. Given the speed of high-end devices, it may not be worthwhile to try to ensure fairness between users; everybody should be able to get all the I/O bandwidth they can use.

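The block layer moves requests from the submission queues into the hardware queues, up to the maximum number specified by the driver. Most current devices have a single hardware queue, but high-end devices already support multiple queues to increase parallelism. On such a device, the entire submission and completion path should be able to run on the same CPU as the process generating the I/O, which maximizes cache locality (and thus performance). If desired, fairness or bandwidth-cap policies can be applied as requests move to the hardware queues, but they come with a performance cost. Given the speed of high-end devices, trying to ensure fairness between users may not be worthwhile; everyone should be able to get all the I/O bandwidth they can use.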
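The movement of requests from submission queues into hardware queues, capped by a driver-specified limit, might be modeled as follows. The `dispatch` function, the `queue_depth` parameter, and the `cpu % len(hardware)` mapping from CPUs to hardware queues are assumptions made for this sketch; the article does not say how the real mapping is chosen.

```python
from collections import deque
from typing import Deque, Dict, List


def dispatch(submission: Dict[int, Deque[str]],
             hardware: List[Deque[str]],
             queue_depth: int) -> int:
    """Move requests from per-CPU submission queues into hardware queues.

    Each CPU feeds hardware queue `cpu % len(hardware)` (an assumed mapping).
    A hardware queue never holds more than `queue_depth` requests.
    Returns the number of requests moved.
    """
    moved = 0
    for cpu, pending in submission.items():
        hw = hardware[cpu % len(hardware)]
        while pending and len(hw) < queue_depth:
            hw.append(pending.popleft())
            moved += 1
    return moved
```

With one hardware queue of depth 2, three requests pending on CPU 0 and one on CPU 1, only the first two requests from CPU 0 are moved and the rest wait in their submission queues. With two hardware queues, CPU 1 feeds its own queue and its request is dispatched as well.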

This structure makes the writing of a high-performance block driver (relatively) simple. The driver provides a queue_rq() function to handle incoming requests and calls back to the block layer when requests complete. Those wanting to look at an example of how such a driver would work can see null_blk.c in the new-queue branch of Jens’s block repository:

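This structure makes writing a high-performance block driver (relatively) simple. The driver provides a queue_rq() function to handle incoming requests and calls back into the block layer when requests complete. Those who want to see an example of how such a driver works can look at null_blk.c in the new-queue branch of Jens's block repository: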

git://git.kernel.dk/linux-block.git
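For readers who want the shape of that driver contract without reading kernel source, here is a toy Python model. Only the `queue_rq()` name comes from the article; the `NullDriver` and `BlockLayer` classes, the completion callback, and the use of 0 as a success status are assumptions for this sketch and do not reflect the actual signatures in null_blk.c.

```python
from typing import Callable, List

# Hypothetical completion callback supplied by the block layer model.
# Arguments are (request tag, status).
Complete = Callable[[str, int], None]


class NullDriver:
    """Toy driver in the spirit of a null block device: it moves no data."""

    def __init__(self, complete: Complete) -> None:
        self.complete = complete
        self.handled = 0

    def queue_rq(self, request: str) -> bool:
        # Accept the request, "perform" it instantly, and report completion.
        self.handled += 1
        self.complete(request, 0)   # 0 stands for success in this sketch
        return True                 # True means the request was accepted


class BlockLayer:
    """Minimal stand-in for the core: hands requests to the driver's queue_rq()."""

    def __init__(self) -> None:
        self.completed: List[str] = []
        self.driver = NullDriver(self.on_complete)

    def on_complete(self, request: str, status: int) -> None:
        if status == 0:
            self.completed.append(request)

    def run(self, requests: List[str]) -> None:
        for request in requests:
            self.driver.queue_rq(request)
```

Calling `run(["r1", "r2"])` on a `BlockLayer` hands both requests to `queue_rq()`, and each one comes straight back through the completion callback. A real driver would complete requests asynchronously, typically from an interrupt handler, rather than inside `queue_rq()` itself.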

In the current patch set, the multiqueue mode is offered in addition to the existing two modes, so current drivers will continue to work without change. According to this paper on the multiqueue block layer design [PDF], the hope is that drivers will migrate over to the multiqueue API, allowing the eventual removal of the request-based mode.

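In the current patch set, the multiqueue mode is offered in addition to the existing two modes, so current drivers continue to work without change. According to this paper on the multiqueue block layer design [PDF], the hope is that drivers will migrate to the multiqueue API so that the request-based mode can eventually be removed.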

This patch set has been significantly reworked in the last month or so; it has gone from a relatively messy series into something rather cleaner. Merging into the mainline would thus appear to be on the agenda for the near future. Since use of this API is optional, existing drivers should continue to work and this merge could conceivably happen as early as 3.11. But, given that the patch set has not yet been publicly posted to any mailing list and does not appear in linux-next, 3.12 seems like a more likely target. Either way, Linux seems likely to have a much better block layer by the end of the year or so.

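This patch set has been reworked significantly over the last month or so, going from a relatively messy series to something rather cleaner. Merging into the mainline therefore looks likely to be on the agenda in the near future. Since use of this API is optional, existing drivers should continue to work, and the merge could happen as early as 3.11. But given that the patch set has not yet been posted publicly to any mailing list and does not appear in linux-next, 3.12 looks like a more likely target. Either way, Linux seems likely to have a much better block layer by the end of the year.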

Original text https://lwn.net/Articles/603252/
Translation Date 2018/05/21
Translated with the help of Google Translate.