Discussion:
Intel NVMe troubles?
Borja Marcos
2016-07-28 10:29:58 UTC
Hi :)

Still experimenting with NVMe drives and FreeBSD, and I think I have run into problems.

I've got a server with 10 Intel DC P3500 NVMe drives, currently running 11-BETA2.

I have updated the firmware on the drives to the latest version (8DV10174) using the Data Center Tools,
and I've formatted them for 4 KB blocks (LBA format #3):

nvmecontrol identify nvme0ns1
Size (in LBAs): 488378646 (465M)
Capacity (in LBAs): 488378646 (465M)
Utilization (in LBAs): 488378646 (465M)
Thin Provisioning: Not Supported
Number of LBA Formats: 7
Current LBA Format: LBA Format #03
LBA Format #00: Data Size: 512 Metadata Size: 0
LBA Format #01: Data Size: 512 Metadata Size: 8
LBA Format #02: Data Size: 512 Metadata Size: 16
LBA Format #03: Data Size: 4096 Metadata Size: 0
LBA Format #04: Data Size: 4096 Metadata Size: 8
LBA Format #05: Data Size: 4096 Metadata Size: 64
LBA Format #06: Data Size: 4096 Metadata Size: 128


ZFS properly detects the 4 KB block size and sets the correct ashift (12). But I found these error messages
while creating a pool (zpool create tank raidz2 /dev/nvd[0-8] spare /dev/nvd9):

Jul 28 13:16:11 nvme2 kernel: nvme0: DATASET MANAGEMENT sqid:6 cid:63 nsid:1
Jul 28 13:16:11 nvme2 kernel: nvme0: LBA OUT OF RANGE (00/80) sqid:6 cid:63 cdw0:0
Jul 28 13:16:11 nvme2 kernel: nvme0: DATASET MANAGEMENT sqid:6 cid:62 nsid:1
Jul 28 13:16:11 nvme2 kernel: nvme0: LBA OUT OF RANGE (00/80) sqid:6 cid:62 cdw0:0
Jul 28 13:16:11 nvme2 kernel: nvme0: DATASET MANAGEMENT sqid:6 cid:61 nsid:1
Jul 28 13:16:11 nvme2 kernel: nvme0: LBA OUT OF RANGE (00/80) sqid:6 cid:61 cdw0:0
Jul 28 13:16:11 nvme2 kernel: nvme0: DATASET MANAGEMENT sqid:6 cid:60 nsid:1
Jul 28 13:16:11 nvme2 kernel: nvme0: LBA OUT OF RANGE (00/80) sqid:6 cid:60 cdw0:0
Jul 28 13:16:11 nvme2 kernel: nvme0: DATASET MANAGEMENT sqid:6 cid:59 nsid:1
Jul 28 13:16:11 nvme2 kernel: nvme0: LBA OUT OF RANGE (00/80) sqid:6 cid:59 cdw0:0
Jul 28 13:16:11 nvme2 kernel: nvme0: DATASET MANAGEMENT sqid:6 cid:58 nsid:1
Jul 28 13:16:11 nvme2 kernel: nvme0: LBA OUT OF RANGE (00/80) sqid:6 cid:58 cdw0:0
Jul 28 13:16:11 nvme2 kernel: nvme0: DATASET MANAGEMENT sqid:6 cid:57 nsid:1
Jul 28 13:16:11 nvme2 kernel: nvme0: LBA OUT OF RANGE (00/80) sqid:6 cid:57 cdw0:0
Jul 28 13:16:11 nvme2 kernel: nvme0: DATASET MANAGEMENT sqid:6 cid:56 nsid:1
Jul 28 13:16:11 nvme2 kernel: nvme0: LBA OUT OF RANGE (00/80) sqid:6 cid:56 cdw0:0

And the same for the rest of the drives [0-9].
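
In case it is useful, here is roughly how I double-checked the ashift ZFS picked; a quick
sketch, assuming the pool name 'tank' from the command above (zdb output wording may vary
between releases):

# ashift as recorded in the cached pool configuration
zdb -C tank | grep ashift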

Should I worry?

Thanks!




Borja.
Jim Harris
2016-07-28 17:25:29 UTC
Post by Borja Marcos
Hi :)
Still experimenting with NVMe drives and FreeBSD, and I think I have run
into problems.
I've got a server with 10 Intel DC P3500 NVMe drives, currently running
11-BETA2.
I have updated the firmware in the drives to the latest version (8DV10174)
using the Data Center Tools.
And I’ve formatted them for 4 KB blocks (LBA format #3)
nvmecontrol identify nvme0ns1
Size (in LBAs): 488378646 (465M)
Capacity (in LBAs): 488378646 (465M)
Utilization (in LBAs): 488378646 (465M)
Thin Provisioning: Not Supported
Number of LBA Formats: 7
Current LBA Format: LBA Format #03
LBA Format #00: Data Size: 512 Metadata Size: 0
LBA Format #01: Data Size: 512 Metadata Size: 8
LBA Format #02: Data Size: 512 Metadata Size: 16
LBA Format #03: Data Size: 4096 Metadata Size: 0
LBA Format #04: Data Size: 4096 Metadata Size: 8
LBA Format #05: Data Size: 4096 Metadata Size: 64
LBA Format #06: Data Size: 4096 Metadata Size: 128
ZFS properly detects the 4 KB block size and sets the correct ashift (12).
But I found these error messages while creating a pool (zpool create tank
raidz2 /dev/nvd[0-8] spare /dev/nvd9):
Jul 28 13:16:11 nvme2 kernel: nvme0: DATASET MANAGEMENT sqid:6 cid:63
nsid:1
Jul 28 13:16:11 nvme2 kernel: nvme0: LBA OUT OF RANGE (00/80) sqid:6
cid:63 cdw0:0
Jul 28 13:16:11 nvme2 kernel: nvme0: DATASET MANAGEMENT sqid:6 cid:62
nsid:1
Jul 28 13:16:11 nvme2 kernel: nvme0: LBA OUT OF RANGE (00/80) sqid:6
cid:62 cdw0:0
Jul 28 13:16:11 nvme2 kernel: nvme0: DATASET MANAGEMENT sqid:6 cid:61
nsid:1
Jul 28 13:16:11 nvme2 kernel: nvme0: LBA OUT OF RANGE (00/80) sqid:6
cid:61 cdw0:0
Jul 28 13:16:11 nvme2 kernel: nvme0: DATASET MANAGEMENT sqid:6 cid:60
nsid:1
Jul 28 13:16:11 nvme2 kernel: nvme0: LBA OUT OF RANGE (00/80) sqid:6
cid:60 cdw0:0
Jul 28 13:16:11 nvme2 kernel: nvme0: DATASET MANAGEMENT sqid:6 cid:59
nsid:1
Jul 28 13:16:11 nvme2 kernel: nvme0: LBA OUT OF RANGE (00/80) sqid:6
cid:59 cdw0:0
Jul 28 13:16:11 nvme2 kernel: nvme0: DATASET MANAGEMENT sqid:6 cid:58
nsid:1
Jul 28 13:16:11 nvme2 kernel: nvme0: LBA OUT OF RANGE (00/80) sqid:6
cid:58 cdw0:0
Jul 28 13:16:11 nvme2 kernel: nvme0: DATASET MANAGEMENT sqid:6 cid:57
nsid:1
Jul 28 13:16:11 nvme2 kernel: nvme0: LBA OUT OF RANGE (00/80) sqid:6
cid:57 cdw0:0
Jul 28 13:16:11 nvme2 kernel: nvme0: DATASET MANAGEMENT sqid:6 cid:56
nsid:1
Jul 28 13:16:11 nvme2 kernel: nvme0: LBA OUT OF RANGE (00/80) sqid:6
cid:56 cdw0:0
And the same for the rest of the drives [0-9].
Should I worry?
Yes, you should worry.

Normally we could use the dump_debug sysctls to help debug this - these
sysctls will dump the NVMe I/O submission and completion queues. But in
this case the LBA data is in the payload, not the NVMe submission entries,
so dump_debug will not help as much as dumping the NVMe DSM payload
directly.
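
For completeness: the dump_debug knobs are per-queue sysctls under each controller.
From memory the OIDs look roughly like the lines below (writing 1 triggers the dump
to the console), but please check sysctl -a | grep nvme for the exact names on your
system:

# dump the admin queue and the first I/O queue of nvme0 (OID names from memory)
sysctl dev.nvme.0.adminq.dump_debug=1
sysctl dev.nvme.0.ioq0.dump_debug=1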

Could you try the attached patch and send output after recreating your pool?

-Jim

Borja Marcos
2016-07-29 08:10:16 UTC
Post by Jim Harris
Yes, you should worry.
Normally we could use the dump_debug sysctls to help debug this - these
sysctls will dump the NVMe I/O submission and completion queues. But in
this case the LBA data is in the payload, not the NVMe submission entries,
so dump_debug will not help as much as dumping the NVMe DSM payload
directly.
Could you try the attached patch and send output after recreating your pool?
Just in case the evil anti-spam ate my answer, I have sent the results to your Gmail account.




Borja.
Jim Harris
2016-07-29 15:44:50 UTC
Post by Borja Marcos
Just in case the evil anti-spam ate my answer, I have sent the results to
your Gmail account.
Thanks Borja.

It looks like all of the TRIM commands are formatted properly. The
failures do not happen until about 10 seconds after the last TRIM to each
drive was submitted, and immediately before TRIMs start to the next drive,
so I'm assuming the failures are for the last few TRIM commands, but I
cannot say for sure. Could you apply patch v2 (attached), which will dump
the TRIM payload contents inline with the failure messages?
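
Once the payload is visible the sanity check is simple: with 488378646 LBAs in the
namespace, the highest valid LBA is 488378645, so any dumped range whose starting LBA
plus length runs past 488378646 would explain the LBA OUT OF RANGE status. A trivial
sketch of that check, with made-up numbers purely for illustration:

# hypothetical range check (start and len are illustrative values, in LBAs)
ns_size=488378646
start=488378640; len=16
[ $((start + len)) -gt $ns_size ] && echo "range runs past end of namespace"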

Thanks,

-Jim
Borja Marcos
2016-08-01 14:38:18 UTC
Post by Jim Harris
It looks like all of the TRIM commands are formatted properly. The
failures do not happen until about 10 seconds after the last TRIM to each
drive was submitted, and immediately before TRIMs start to the next drive,
so I'm assuming the failures are for the last few TRIM commands, but I
cannot say for sure. Could you apply patch v2 (attached), which will dump
the TRIM payload contents inline with the failure messages?
Sure, this is the complete /var/log/messages starting with the system boot. Before booting I destroyed the pool
so that you could capture what happens when booting, zpool create, etc.

Remember that the drives are in LBA format #3 (4 KB blocks). As far as I know that’s preferred to the old 512 byte blocks.

Thank you very much and sorry about the belated response.





Borja.
Michael Loftis
2016-08-01 15:32:10 UTC
FWIW, I've had similar issues with Intel 750 PCIe NVMe drives when
attempting to use 4K blocks on Linux with EXT4 on top of MD RAID1 (software
mirror). I didn't dig into it much because there were too many layers to
reduce at the time, but it looked like the drive misreported the number of
blocks, and a subsequent TRIM command or write of the last sector then
errored. I mention it because, despite the differences, the similarities
stand out: Intel NVMe, LBA format #3 / 4K, and an error hitting a
nonexistent block. It might give someone enough info to figure it out fully.
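
If it is the same kind of misreporting it should be visible by comparing what the
namespace claims against what the block layer exposes - something like this on the
FreeBSD side (a sketch, using nvd0 as an example device):

# namespace size in LBAs as reported by the controller
nvmecontrol identify nvme0ns1 | grep 'Size (in LBAs)'
# media size and sector size as seen by the block layer
# (mediasize in sectors should match the LBA count above)
diskinfo -v /dev/nvd0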
Post by Borja Marcos
Remember that the drives are in LBA format #3 (4 KB blocks). As far as I
know that's preferred to the old 512 byte blocks.
--
"Genius might be described as a supreme capacity for getting its possessors
into trouble of all kinds."
-- Samuel Butler
Jim Harris
2016-08-01 18:49:31 UTC
Post by Borja Marcos
Sure, this is the complete /var/log/messages starting with the system
boot. Before booting I destroyed the pool
so that you could capture what happens when booting, zpool create, etc.
Remember that the drives are in LBA format #3 (4 KB blocks). As far as I
know that’s preferred to the old 512 byte blocks.
Thank you very much and sorry about the belated response.
Hi Borja,

Thanks for the additional testing. This has all of the detail that I need
for now.

-Jim
