| Commit message (Collapse) | Author | Age | Files | Lines |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
KIOXIA CD8 SSDs routinely take ~25 seconds to delete non-empty
namespace. In some cases like hot-plug it takes longer, triggering
timeout and controller resets after just 30 seconds. Linux for many
years has separate 60 seconds timeout for admin queue. This patch
does the same. And it is good to be consistent.
Sponsored by: iXsystems, Inc.
Reviewed by: imp
MFC after: 1 week
Differential Revision: https://reviews.freebsd.org/D42454
(cherry picked from commit 8d6c0743e36e3cff9279c40468711a82db98df23)
|
|
|
|
|
|
|
| |
Remove /^[\s*]*__FBSDID\("\$FreeBSD\$"\);?\s*\n/
Similar commit in current:
(cherry picked from commit 685dc743dc3b)
|
|
|
|
|
|
|
|
|
|
|
| |
The SPDX folks have obsoleted the BSD-2-Clause-FreeBSD identifier. Catch
up to that fact and revert to their recommended match of BSD-2-Clause.
Discussed with: pfg
MFC After: 3 days
Sponsored by: Netflix
(cherry picked from commit 4d846d260e2b9a3d4d0a701462568268cbfe7a5b)
|
|
|
|
|
|
| |
MFC after: 1 week
(cherry picked from commit 49ebbdb264fe185a685dce846985b95f28320e3f)
|
|
|
|
|
|
|
|
| |
It may help with some issues debugging.
MFC after: 1 week
(cherry picked from commit a69c0964625f4e20a7d5f22d51e036e13eedbeeb)
|
|
|
|
|
|
| |
- s/is is/is/
(cherry picked from commit dfa01f4f98a1343bf375c54a5cd44718c4211bec)
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Admin queues almost always have several ASYNC_EVENT_REQUEST outstanding.
They have no timeouts, but their presence in qpair->outstanding_tr caused
useless timeout callout rearming twice a second.
While there, relax timeout callout period from 0.5s to 0.5-1s to improve
aggregation. Command timeouts are measured in seconds, so we don't need
to be precise here.
Reviewed by: imp
MFC after: 2 weeks
Differential Revision: https://reviews.freebsd.org/D33781
(cherry picked from commit b3c9b6060f9a3525196867d8e812b24fc0bc61e1)
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Reduce traffic to doorbell register when processing multiple completion
events at once. Only write it at the end of the loop after we've
processed everything (assuming we found at least one completion,
even if that completion wasn't valid).
Sponsored by: Netflix
Reviewed by: mav
Differential Revision: https://reviews.freebsd.org/D32470
(cherry picked from commit 2ec165e3f065217ae8d54a2a8235fe1f219805ea)
|
|
|
|
|
|
|
|
|
| |
Restore hotplug warning in recovery state machine. No functional change
other than what message gets printed.
Sponsored by: Netflix
(cherry picked from commit 18dc12bfd2e23ad2ea97db54cb8ee499f6f014da)
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Make sure the completion ID is in the range of [0..num_trackers) since
the values past the end of the act_tr array are never going to be valid
trackers and will lead to pain and suffering if we try to dereference
them to get the tracker or to set the tracker back to NULL as we
complete the I/O.
Sponsored by: Netflix
Reviewed by: mav, chs, chuck
Differential Revision: https://reviews.freebsd.org/D32088
(cherry picked from commit 36a87d0c6fe9d65de23f177ef84000b205f87e39)
|
|
|
|
|
|
|
|
|
|
|
| |
Count the number of times we're asked to process completions, but that
we ignore because the state of the qpair isn't in RECOVERY_NONE.
Sponsored by: Netflix
Reviewed by: mav, chuck
Differential Revision: https://reviews.freebsd.org/D32212
(cherry picked from commit 587aa25525e54ea775298c402acd7a647f9838fb)
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
The proper phase for the qpiar right after reset in the first interrupt
is 1. For it, make sure that we're not still in phase 0. This is an
illegal state to be processing interrupts and indicates that we've
failed to properly protect against a race between initializing our state
and processing interrupts. Modify stat resetting code so it resets the
number of interrpts to 1 instead of 0 so we don't trigger a false
positive panic.
Sponsored by: Netflix
Reviewed by: cperciva, mav (prior version)
Differential Revision: https://reviews.freebsd.org/D32211
(cherry picked from commit 7d5eebe0f4a0f2aa5c8c7dfdd1a9ce1513849da8)
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
An interrupt happens on the admin queue right away after the reset, so
as soon as we enable interrupts, we'll get a call to our interrupt
handler. It is safe to ignore this interrupt if we're not yet
initialized, or to process it if we are. If we are initialized, we'll
see there's no completion records and return. If we're not, we'll
process no completion records and return. Either way, nothing is
processed and nothing is lost.
Until we've completely setup the qpair, we need to avoid processing
completion records. Start the qpair in the waiting recovery state so we
return immediately when we try to process completions. The code already
sets it to 'NONE' when we're initialization is complete. It's safe to
defer completion processing here because we don't send any commands
before the initialization of the software state of the qpair is
complete. And even if we were to somehow send a command prior to that
completing, the completion record for that command would be processed
when we send commands to the admin qpair after we've setup the software
state. There's no good central point to add an assert for this last
condition.
This fixes an KASSERT "received completion for unknown cmd" panic on
boot.
Fixes: 502dc84a8b6703e7c0626739179a3cdffdd22d81
Sponsored by: Netflix
Reviewed by: mav, cperciva, gallatin
Differential Revision: https://reviews.freebsd.org/D32210
(cherry picked from commit fa81f3731d1a2984a28ae44e60d12a0659b8fd2f)
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Keep track of the approximate time commands are 'due' and the next
deadline for a command. twice a second, wake up to see if any commands
have entered timeout. If so, quiessce and then enter a recovery mode
half the timeout further in the future to allow the ISR to
complete. Once we exit recovery mode, we go back to operations as
normal.
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D28583
(cherry picked from commit 502dc84a8b6703e7c0626739179a3cdffdd22d81)
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Once the controller has failed, fail all I/O w/o sending it to the
device. The reset of the nvme driver won't schedule any I/O to the
failed device, and the controller is in an indeterminate state and can't
accept I/O. Fail both at the top end of the sim and the bottom
end. Don't bother queueing up the I/O for failure in a different task.
Reviewed by: chuck
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D31341
(cherry picked from commit 4b977e6dda92fe093ea300f1a91dbcf877b64fa0)
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
If we can't allocate more MSI-X vectors, accept using single shared.
If we can't allocate any MSI-X, try to allocate 2 MSI vectors, but
accept single shared. If still no luck, fall back to shared INTx.
This provides maximal flexibility in some limited scenarios. For
example, vmd(4) does not support INTx and can handle only limited
number of MSI/MSI-X vectors without sharing.
MFC after: 1 week
(cherry picked from commit e3bdf3da769a55f0944d9c337bb4d91b6435f02c)
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
To guard against the ill effects of a spurious interrupt during
construction (or one that was bogusly pending), enable interrupts after
the qpair is completely constructed. Otherwise, we can die with null
pointer dereferences in nvme_qpair_process_completions. This has been
observed in at least one pre-release NVMe drive where the MSIX interrupt
fired while the queue was being created, before we'd started the NVMe
controller card.
The alternative of only turning on the interrupts after the rest was
tried, but was insufficient to work around this bug and made the code
more complicated w/o benefit.
Reviewed by: mav, chuck
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D31182
(cherry picked from commit fc9a0840231770bc7e7dcfe4616babdc6d4389a6)
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Coherently read the phase bit of the status completion record. We loop
over the completion record array, looking for all the transactions in
the same phase that have been completed. In doing that, we have to be
careful to read the status field first, and if it indicates a complete
record, we need to read and process that record. Otherwise, the host
might be overtaken by device when reading this completion record,
leading to a mistaken belief that the record is in phase. This leads to
the code using old values and looking at an already completed entry, which
has no current tracker.
To work around this problem, we read the status and make sure it is in
phase, we then re-read the entire completion record guaranteeing it's
complete, valid, and consistent . In addition we resync the dmatag to
reflect changes since the prior loop for the bouncing dma case.
Reviewed by: jrtc27@, chuck@
Found by: jrtc27 (this fix is based in part on her D30995 fix)
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D31002
(cherry picked from commit aa0ab681ae755e01cd69435fab50f6852f248c42)
|
|
|
|
|
|
|
| |
Make it clearer that the value 0xfffffff is being used to detect the device is
gone. We use it other places in the driver for other meanings.
(cherry picked from commit 9600aa31aa633bbb9e8a56d91a781d5a7ce2bff6)
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
bus_dmamap_sync() ensures that memory that's prepared for PREWRITE can
be DMA'd immediately after it returns. The details differ, but this
mirrors atomic thread release semantics, at least for the buffers
synced.
For non-x86 platforms, bus_dmamap_sync() has the right syncing and
fences. So in the past, wmb() had been omitted for them.
For x86 platforms, the memory ordering is already strong enough to
ensure DMA to the device sees the current contents. As such, we don't
need the wmb() here. It translates to an sfence which is only needed
for writes to regions that have the write combining attribute set or
when some exotic opcodes are used. The nvme driver does neither of
these. Since bus_dmamap_sync() includes atomic_thread_fence_rel, we
can be assured any optimizer won't reorder the bus_dmamap_sync and the
bus_space_write operations. The wmb() was a vestiage of the pre-busdma
version initially committed to the tree.
Reviewed by: kib@, gallatin@, chuck@, mav@
Differential Revision: https://reviews.freebsd.org/D27448
Notes:
svn path=/head/; revision=368352
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
- in nvme_qpair_process_completions() do dma sync before completion buffer
is used.
- in nvme_qpair_submit_tracker(), don't do explicit wmb() also for arm
and arm64. Bus_dmamap_sync() on these architectures is sufficient to ensure
that all CPU stores are visible to external (including DMA) observers.
- Allocate completion buffer as BUS_DMA_COHERENT. On not-DMA coherent systems,
buffers continuously owned (and accessed) by DMA must be allocated with this
flag. Note that BUS_DMA_COHERENT flag is no-op on DMA coherent systems
(or coherent buses in mixed systems).
MFC after: 4 weeks
Reviewed by: mav, imp
Differential Revision: https://reviews.freebsd.org/D27446
Notes:
svn path=/head/; revision=368279
|
|
|
|
|
|
|
|
|
|
|
|
| |
Change occurrences of "selt test" to "self tests in the NVMe header
file.
Reviewed by: imp, mav
MFC after: 1 week
Differential Revision: https://reviews.freebsd.org/D27439
Notes:
svn path=/head/; revision=368275
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
With 4KB page size the 2MB is the maximum we can address with one page PRP.
Going further would require chaining, that would add some more complexity.
On the other side, to reduce memory consumption, allocate the PRP memory
respecting maximum transfer size reported in the controller identify data.
Many of NVMe devices support much smaller values, starting from 128KB.
To do that we have to change the initialization sequence to pull the data
earlier, before setting up the I/O queue pairs. The admin queue pair is
still allocated for full MIN(maxphys, 2MB) size, but it is not a big deal,
since there is only one such queue with only 16 trackers.
Reviewed by: imp
MFC after: 2 weeks
Sponsored by: iXsystems, Inc.
Notes:
svn path=/head/; revision=368132
|
|
|
|
| |
Notes:
svn path=/head/; revision=365189
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
These functions were introduced before UMA started ensuring that freed
memory gets placed in domain-local caches. They no longer serve any
purpose since UMA now provides their functionality by default. Remove
them to simplyify the kernel memory allocator interfaces a bit.
Reviewed by: cem, kib
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D25937
Notes:
svn path=/head/; revision=363834
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Instead of panic after one second of polling, make the normal timeout
handler to activate, reset the controller and abort the outstanding
requests. If all of it won't happen within 10 seconds then something
in the driver is likely stuck bad and panic is the only way out.
In particular this fixed device hot unplug during execution of those
polled commands, allowing clean device detach instead of panic.
MFC after: 1 week
Sponsored by: iXsystems, Inc.
Notes:
svn path=/head/; revision=362337
|
|
|
|
|
|
|
|
| |
MFC after: 1 week
Sponsored by: iXsystems, Inc.
Notes:
svn path=/head/; revision=362282
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
This fixes several Coverity-detected errors in the nvme driver.
CIDs addressed: 1008344, 1009377, 1009380, 1193740, 1305470, 1403975,
1403980
Reviewed by: imp@, vangyzen@
MFC after: 5 days
Sponsored by: Dell EMC Isilon
Differential Revision: https://reviews.freebsd.org/D24532
Notes:
svn path=/head/; revision=360568
|
|
|
|
| |
Notes:
svn path=/head/; revision=359441
|
|
|
|
|
|
|
|
|
|
|
| |
Within command completion processing the callback function may access
DMAed data buffer. Synchronize it before use, not after.
This allows to use NVMe disk on non-DMA coherent arm64 system.
MFC after: 3 weeks
Notes:
svn path=/head/; revision=355774
|
|
|
|
|
|
|
|
|
|
|
|
| |
While there are subtle semantic differences between bool and boolean_t, none of
them matter in these cases. Prefer true/false when dealing with bool
type. Preserve a couple of TRUEs since they are passed into int args into CAM.
Preserve a couple of FALSEs when used for status.done, an int.
Differential Revision: https://reviews.freebsd.org/D20999
Notes:
svn path=/head/; revision=355721
|
|
|
|
|
|
|
|
|
|
|
| |
Don't needlessly pass around qpair pointers when the tracker knows what
qpair it's on. This will simplify code and make it easier to split
submission and completion queues in the future.
Signed-off-by: John Meneghini <johnm@netapp.com>
Notes:
svn path=/head/; revision=355465
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
- For each queue pair precalculate CPU and domain it is bound to.
If queue pairs are not per-CPU, then use the domain of the device.
- Allocate most of queue pair memory from the domain it is bound to.
- Bind callouts to the same CPUs as queue pair to avoid migrations.
- Do not assign queue pairs to each SMT thread. It just wasted
resources and increased lock congestions.
- Remove fixed multiplier of CPUs per queue pair, spread them even.
This allows to use more queue pairs in some hardware configurations.
- If queue pair serves multiple CPUs, bind different NVMe devices to
different CPUs.
MFC after: 1 month
Sponsored by: iXsystems, Inc.
Notes:
svn path=/head/; revision=352630
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
The NVMe standard (1.4) states
>>> 8.6 Doorbell Stride for Software Emulation
>>> The doorbell stride,...is useful in software emulation of an NVM
>>> Express controller. ... For hardware implementations of the NVM
>>> Express interface, the expected doorbell stride value is 0h.
However, hardware in the wild exists with a doorbell stride of 1
(meaning 8 byte separation). This change supports that hardware, as
well as software emulators as envisioned in Section 8.6. Since this is
the fast path, care has been taken to make this computation
efficient. The bit of math to compute an offset for each is replaced
by a memory load from cache of a pre-computed value.
MFC After: 3 days
Reviewed by: scottl@
Differential Revision: https://reviews.freebsd.org/D21514
Notes:
svn path=/head/; revision=351828
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
If device is unplugged from the system (CSTS register reads return
0xffffffff), it makes no sense to send any more recovery requests or
expect any responses back. If there is a detach call in such state,
just stop all activity and free resources. If there is no detach
call (hot-plug is not supported), rely on normal timeout handling,
but when it trigger controller reset, do not wait for impossible and
quickly report failure.
MFC after: 2 weeks
Sponsored by: iXsystems, Inc.
Notes:
svn path=/head/; revision=351352
|
|
|
|
|
|
|
| |
MFC after: 2 weeks
Notes:
svn path=/head/; revision=350553
|
|
|
|
|
|
|
| |
MFC after: 2 weeks
Notes:
svn path=/head/; revision=350529
|
|
|
|
|
|
|
|
|
| |
While we print failure messages on the console, sometimes logs are lost or
overwhelmed. Keeping a count of how many times we've failed retriable commands
helps get a magnitude of the problem.
Notes:
svn path=/head/; revision=350147
|
|
|
|
|
|
|
|
|
| |
Retried commands can indicate a performance degredation of an nvme drive. Keep
track of the number of retries and report it out via sysctl, just like number of
commands an interrupts.
Notes:
svn path=/head/; revision=350146
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
The nvme drive dumps only the most relevant details about a command when it
fails. However, there are times this is not sufficient (such as debugging weird
issues for a new drive with a vendor). Setting hw.nvme.verbose_cmd_dump=1
in loader.conf will enable more complete debugging information about each
command that fails.
Reviewed by: rpokala
Sponsored by: Netflix
Differential Version: https://reviews.freebsd.org/D20988
Notes:
svn path=/head/; revision=350118
|
|
|
|
|
|
|
|
|
|
|
|
| |
completions are not in a consistent state. Cope with the different
places the normal I/O completion polling thread can be interrupted and
then re-entered during a kernel panic + dump.
Reviewed by: jhb and markj (both prior versions)
Differential Revision: https://reviews.freebsd.org/D20478
Notes:
svn path=/head/; revision=348495
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
retries.
When resetting the controller, we abort I/O. Prior to this fix, we
printed a ton of abort messages for I/O that we're going to
retry. This imparts no useful information. Stop printing them unless
our retry count is exhausted. Clarify code for when we don't retry,
and remove useless arg to a routine that's always called with it
as 'true'. All the other debug is still printed (including multiple
reset messages if we have multiple timeouts before the taskqueue
runs the actual reset) so that we know when we reset.
Reviewed by: jimharris@, chuck@
Differential Revision: https://reviews.freebsd.org/D19431
Notes:
svn path=/head/; revision=344955
|
|
|
|
|
|
|
| |
to comment (code already does this)
Notes:
svn path=/head/; revision=344736
|
|
|
|
|
|
|
|
|
|
| |
supporting older kernels. However, all supported versions of FreeBSD
have unmapped I/Os (as do several that have gone EOL), remove it. It's
unlikely the driver would work on the older kernels anyway at this
point.
Notes:
svn path=/head/; revision=344642
|
|
|
|
|
|
|
|
|
| |
supported in years. A number of changes have been made to the driver
that likely wouldn't work on those older versions that aren't properly
ifdef'd and it's project policy to GC such code once it is stale.
Notes:
svn path=/head/; revision=344640
|
|
|
|
|
|
|
| |
MFC after: 1 month
Notes:
svn path=/head/; revision=342546
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
The original NVMe API used bit-fields to represent fields in data
structures defined by the specification (e.g. the op-code in the command
data structure). The implementation targeted x86_64 processors and
defined the bit fields for little endian dwords (i.e. 32 bits).
This approach does not work as-is for big endian architectures and was
changed to use a combination of bit shifts and masks to support PowerPC.
Unfortunately, this changed the NVMe API and forces #ifdef's based on
the OS revision level in user space code.
This change reverts to something that looks like the original API, but
it uses bytes instead of bit-fields inside the packed command structure.
As a bonus, this works as-is for both big and little endian CPU
architectures.
Bump __FreeBSD_version to 1200081 due to API change
Reviewed by: imp, kbowling, smh, mav
Approved by: imp (mentor)
Differential Revision: https://reviews.freebsd.org/D16404
Notes:
svn path=/head/; revision=338182
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Summary:
Some architectures, in this case powerpc64, need explicit synchronization
barriers vs device accesses.
Prior to this change, when running 'make buildworld -j72' on a 18-core
(72-thread) POWER9, I would see controller resets often. With this change, I
don't see these resets messages, though another tester still does, for yet to be
determined reasons, so this may not be a complete fix. Additionally, I see a
~5-10% speed up in buildworld times, likely due to not needing to reset the
controller.
Reviewed By: jimharris
Differential Revision: https://reviews.freebsd.org/D16570
Notes:
svn path=/head/; revision=337273
|
|
|
|
|
|
|
|
|
|
| |
dma_tag_payload should not be destroyed before payload_dma_map, and seems
it should be used there instead of dma_tag to match creation.
Sponsored by: iXsystems, Inc.
Notes:
svn path=/head/; revision=333127
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
On some systems, we're getting timeouts when we use multiple queues on
drives that work perfectly well on other systems. On a hunch, Jim
Harris suggested I poll the completion queue when we get a timeout.
This patch polls the completion queue if no fatal status was
indicated. If it had pending I/O, we complete that request and
return. Otherwise, if aborts are enabled and no fatal status, we abort
the command and return. Otherwise we reset the card.
This may clear up the problem, or we may see it result in lots of
timeouts and a performance problem. Either way, we'll know the next
step. We may also need to pay attention to the fatal status bit
of the controller.
PR: 211713
Suggested by: Jim Harris
Sponsored by: Netflix
Notes:
svn path=/head/; revision=331046
|