ReFS is Microsoft’s new file system that will one day replace NTFS. It offers many awesome new features, particularly if you are using Storage Spaces and lots of disks. It scales beautifully, and has no fixed capacity limitations that matter in this day and age.
I’m no technical expert on ReFS, but we’ve recently run into an issue with an ReFS Cluster Shared Volume in Windows 2019 that was nice and yucky. Essentially, we’d been experiencing some issues with a Windows 2019 Hyper-V Cluster that resulted in storage becoming unavailable. The error generated was a useless generic “An Unexpected Error Occurred.” We opened a support case with Microsoft, and after some basic testing came to the conclusion that the error might have been caused by a specific registry setting, and so brought our storage online.
Except that it didn’t come online. Instead we got hit with a classic ReFS RAW Volume; essentially a disk that Windows could see, but could not mount. After some investigation, it looked like all the data was there, but the metadata was obstructed in some way. Windows provides “really helpful” error messages:
We really wanted to get the data that was on this disk to avoid having to roll back to an earlier backup.
After much anxiety (and googling) we decided to try a few strategies to get the data back. First of all, we tried ReclaiMe. ReclaiMe is a commercial tool that allows you to do data recovery from a number of volumes. ReclaiMe started out really well. It found all the files we were expecting to find on the volume, and displayed them in the tree. We provisioned a new NTFS volume and ran the data recovery; however the majority of the data we recovered was unreadable. We could see that it should be working, but couldn’t understand why it wasn’t.
I did a lot more reading on ReFS. It turns out that MS has several versions of ReFS, and the current version is 3.4 (as of Windows 2019 and the Windows 2016 1803 update). ReclaiMe identified the disk as a ReFS 2 volume. While the differences don’t necessarily explain why it might not work, I had a hunch that the restore wasn’t working correctly because we were on too new a version of ReFS for the ReclaiMe software to handle.
A few weeks ago, Anton Gostev from Veeam wrote about a new, otherwise undocumented tool in Windows 2019 called “refsutil”. This tool provides a mechanism to triage and recover failed ReFS volumes. The post in the Veeam digest indicated that it existed, was probably good, but no one really knows what it does. There is almost no documentation or information about this tool, but it saved the day for us – and so I thought it was worth noting down some useful things we learnt about how to use the tool. Much of our learning was greatly assisted by this article.
You can see all of the options available in ReFSutil for data salvage by running “refsutil salvage” with no options.
Knowing that our volume had data, the first thing we wanted to do was to verify if ReFSutil could see the corruption. Fascinatingly, ReFSutil thought everything was fine:
C:\salvage>refsutil salvage -D E: C:\salvage -x -v
Microsoft ReFS Salvage [Version 10.0.11070]
Copyright (c) 2015 Microsoft Corp.
Local time: 4/21/2019 2:29:57
Option(s) specified: -v -x
ReFS version: 3.4
Boot sector checked.
Superblocks checked.
Checkpoints checked.
No corruption is detected.
Command Completed.
Run time = 7 seconds.
To explain, “-D” says “diagnose” why the volume has failed. “E:” is the drive that was not mounting (our corrupted ReFS volume). “C:\salvage” was the directory where we were storing metadata about the recovery process. “-x” means “unmount the volume” before we start (without it we got access denied errors) and “-v” means be verbose about the output.
As described in the output above, there were no issues on the volume. Yet Windows would not mount it. This gave us a high degree of confidence that we could probably recover the data. It also gave us a high degree of confusion as to what was actually wrong with the volume.
The next step was to run a Quick Scan to pull out all of the metadata we needed, and a list of files that ReFSutil was comfortable we could recover.
C:\salvage>refsutil salvage -QS E: C:\salvage\ -v -x
Microsoft ReFS Salvage [Version 10.0.11070]
Copyright (c) 2015 Microsoft Corp.
Local time: 4/21/2019 2:27:17
Option(s) specified: -v -x
ReFS version: 3.4
Boot sector checked.
Cluster Size: 65536 (0x10000).
Cluster Count: 486601728 (0x1d00f400).
Superblocks checked.
Checkpoints checked.
4363 container table entry pages processed (0 invalid page(s)).
1 container index table entry pages processed (0 invalid page(s)).
Container Table checked.
Processing 1 of 2 object table pages (50%)...
Object Table checked.
Examining identified metadata disk data for versioning and consistency.
9104 disk clusters analyzed (200%)...
Examining volume with signature a0e4914d for salvageable files.
8726 container table entry pages processed (0 invalid page(s)).
2 container index table entry pages processed (0 invalid page(s)).
Validating discovered table roots on volume with signature a0e4914d.
86 table roots validated (100%).
Enumerating files from discovered tables on volume with signature a0e4914d.
86 tables enumerated (100%).
Command Completed.
Run time = 22 seconds.
With regard to the options, “-QS” says perform a “quick scan” to look for files on the disk. There is also a Deep Scan option that will scan on a block-by-block basis for data. We didn’t believe we needed this, as there was no evidence of actual corruption. As before, “E:” is the volume that was not mounting, “C:\salvage” is the location where we were saving our working data, “-x” means unmount the volume before we begin and “-v” means be verbose about the output.
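If you want to script these two steps, the invocations above can be assembled as argument lists. This is only a sketch: the function name and parameters are our own invention, it accepts nothing beyond the modes and switches that appear in this post, and it assumes the order of “-x” and “-v” does not matter (the “Option(s) specified” line reads the same in both runs above).

```python
# Sketch: build the "refsutil salvage" command lines shown in this post.
# Only the modes and switches seen above are accepted; nothing is executed.
KNOWN_MODES = {"-D", "-QS"}


def salvage_args(mode, volume, workdir, dismount=True, verbose=True):
    """Return the argument list for one diagnose (-D) or quick scan (-QS) run."""
    if mode not in KNOWN_MODES:
        raise ValueError(f"mode {mode!r} is not one of the modes used in this post")
    args = ["refsutil", "salvage", mode, volume, workdir]
    if dismount:
        args.append("-x")  # unmount the volume first, as described above
    if verbose:
        args.append("-v")  # verbose output
    return args


if __name__ == "__main__":
    for mode in ("-D", "-QS"):
        print(" ".join(salvage_args(mode, "E:", r"C:\salvage")))
```

The resulting list could be handed to something like Python’s subprocess.run on the affected server, but we would review each command by hand before running anything against a damaged volume.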
This ran successfully, and created a bunch of output in our working directory.
Of these files, the most useful for us was the list of files that ReFSutil thought it could recover. Here’s a sample of what that looks like:
Volume Signature: 0xa0e4914d
...
Identified File: \VMSMB01\VMSMB01_TimeMachine.vhdx Size (0x428400000 Bytes) Volume Signature: 0xa0e4914d Physical LCN: 0x6186a = <0xc586a, 0x0, 0x0, 0x0> Index = 0x2 Last-Modified: 04/16/2019 04:56:56 AM TableId: 0x783'0 VirtualClock: 0x87668 TreeUpdateClock: 0x0
Identified File: \VMSWAN01\Virtual Hard Disks\VMSVWAN01_D.vhdx Size (0xb3e400000 Bytes) Volume Signature: 0xa0e4914d Physical LCN: 0x60265 = <0xc2a65, 0x0, 0x0, 0x0> Index = 0x2 Last-Modified: 03/25/2019 05:03:18 PM TableId: 0x735'0 VirtualClock: 0x64559 TreeUpdateClock: 0x2
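The sizes in this list are written in hexadecimal, which makes it hard to see at a glance how big each file is. Here is a small sketch that pulls the path and size out of each record and prints the size in GiB. It assumes every record looks like the sample above (“Identified File:” followed by the path and a “Size (0x... Bytes)” field); the function names are ours, not part of any tool.

```python
import re

# One record: "Identified File: <path> Size (0x<hex> Bytes) ..."
RECORD = re.compile(
    r"Identified File:\s*(.*?)\s+Size \(0x([0-9a-fA-F]+) Bytes\)", re.S
)


def file_sizes(listing):
    """Return [(path, size_in_bytes)] for every record found in the listing text."""
    return [(path, int(hex_size, 16)) for path, hex_size in RECORD.findall(listing)]


def gib(size_in_bytes):
    """Convert bytes to GiB (2**30 bytes)."""
    return size_in_bytes / 2**30


# The two records from the sample above, trimmed to the fields we parse.
SAMPLE = (
    r"Identified File: \VMSMB01\VMSMB01_TimeMachine.vhdx Size (0x428400000 Bytes) "
    r"Identified File: \VMSWAN01\Virtual Hard Disks\VMSVWAN01_D.vhdx Size (0xb3e400000 Bytes)"
)

if __name__ == "__main__":
    for path, size in file_sizes(SAMPLE):
        print(f"{gib(size):8.2f} GiB  {path}")
```

For the two files in the sample, this works out to roughly 16.6 GiB and 45.0 GiB.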
One of the important things to note is that you can edit this file to build a list of just the files you want to restore. This means you don’t need to restore everything at once, and you can prioritise key systems and data.
Here is an example of restoring just the two files in the list above:
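We edited the list by hand, but for a long list the same trimming could be scripted. The sketch below keeps whole records untouched and only drops the ones whose path does not match. It assumes each record starts with “Identified File:” as in the sample, and that everything before the first record is a header worth keeping; whether refsutil needs that header is something we have not verified. All names here are hypothetical.

```python
import re

MARKER = "Identified File:"


def split_listing(listing):
    """Split the listing text into (header, [records]); each record starts at MARKER."""
    starts = [m.start() for m in re.finditer(re.escape(MARKER), listing)]
    if not starts:
        return listing, []
    ends = starts[1:] + [len(listing)]
    return listing[:starts[0]], [listing[s:e] for s, e in zip(starts, ends)]


def record_path(record):
    """Return the file path of one record, or '' if the record looks unfamiliar."""
    m = re.match(r"Identified File:\s*(.*?)\s+Size \(0x", record, re.S)
    return m.group(1) if m else ""


def subset(listing, wanted):
    """Keep the header plus every record whose path contains one of the wanted strings."""
    header, records = split_listing(listing)
    wanted = [w.lower() for w in wanted]
    keep = [r for r in records if any(w in record_path(r).lower() for w in wanted)]
    return header + "".join(keep)


if __name__ == "__main__":
    sample = (
        "Volume Signature: 0xa0e4914d\n"
        "Identified File: \\VMSMB01\\VMSMB01_TimeMachine.vhdx Size (0x428400000 Bytes)\n"
        "Identified File: \\VMSWAN01\\Virtual Hard Disks\\VMSVWAN01_D.vhdx Size (0xb3e400000 Bytes)\n"
    )
    print(subset(sample, ["VMSMB01"]))
```

The functions work on text only; reading the real list and writing the trimmed copy is left out because we are not certain of the file’s encoding, so check that before scripting it, and always work on a copy.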
C:\salvage>refsutil salvage -SL E: C:\salvage F:\restore C:\salvage\restorefiles3.txt -v
Microsoft ReFS Salvage [Version 10.0.11070]
Copyright (c) 2015 Microsoft Corp.
Local time: 4/21/2019 2:56:57
Option(s) specified: -v
Processing C:\salvage\restorefiles3.txt
8726 container table entry pages processed (0 invalid page(s)).
2 container index table entry pages processed (0 invalid page(s)).
Copying: \\?\F:\restore\volume_a0e4914d\VMSMB01\VMSMB01_TimeMachine.vhdx...Done
Copying: \\?\F:\restore\volume_a0e4914d\VMSWAN01\Virtual Hard Disks\VMSVWAN01_D.vhdx...Done
Command Completed.
Run time = 30125 seconds.
“-SL” means copy all the files in the “Source List”. “E:” is again our corrupted volume. “C:\salvage” contains the metadata we extracted in the “QS” step. “F:\restore” is where we put our recovered copies of the data. “C:\salvage\restorefiles3.txt” is our edited list of the files we wished to restore, and “-v” again means be verbose about the output. Note that we did not pass “-x” on this run, as the reported options show.
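To put the run time in perspective, here is a back-of-envelope calculation using the two file sizes from the list and the 30125 seconds reported above. It assumes the listed sizes are the amount of data actually copied, which we have not confirmed, so treat the result as a rough guide only.

```python
def throughput_mib_per_s(total_bytes, seconds):
    """Average copy rate in MiB per second."""
    return total_bytes / 2**20 / seconds


def hms(seconds):
    """Format a whole number of seconds as hours, minutes and seconds."""
    h, rem = divmod(seconds, 3600)
    m, s = divmod(rem, 60)
    return f"{h}h {m:02d}m {s:02d}s"


# Sizes of the two restored files (from the file list) and the reported run time.
RESTORED_BYTES = 0x428400000 + 0xB3E400000
RUN_TIME_S = 30125

if __name__ == "__main__":
    print("Run time:  ", hms(RUN_TIME_S))
    print("Restored:  ", round(RESTORED_BYTES / 2**30, 1), "GiB")
    print("Throughput:", round(throughput_mib_per_s(RESTORED_BYTES, RUN_TIME_S), 2), "MiB/s")
```

Under that assumption, about 61.6 GiB took a little over 8 hours, or roughly 2 MiB/s. A figure like this is worth keeping in mind when deciding which files to restore first.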
From here, we were able to reattach the disks to the virtual machines, and win!
So what did we learn?
- It’s probably not worth paying for a commercial data recovery tool for most ReFS failure scenarios. The built-in ReFSutil is powerful, current and works even for recent versions of the ReFS file system.
- Even if Microsoft Support tells you everything is OK, it’s probably worth double checking that it actually is. Sometimes they are wrong.
- Make sure you have current/recent backups. (We did, but we wanted to get the most recent data.)
- Think twice before you use ReFS in a cluster file system; the tools and techniques for dealing with problems relating to it are not as robust as those for other file systems – and we still have no root cause/reason for the corruption that occurred.
- Make sure you have tested the escalation process on your MS support case before your engineer goes off-shift, in case it’s magically broken.
(Many thanks to Lachlan, Dave and Dave who were instrumental in the process of puzzling this out.)