I Spent Way Too Much Time Thinking About Checksums

Every so often, someone gives you a task that should be simple, something so straightforward it should require very little effort. That was true for this topic, but the restrictions turned it into a task that took much more time than I'd like to admit. What am I talking about? Checksums!

For those who know me, welcome back. If you are new here, thanks for stopping by. My name is Patrick Burden; I am currently the Digital Collections Archivist for WVU. I spent over 17 years in the IT field before making the jump to archiving.

Before I go down the rabbit hole into this solution, a good place to start is to explain what checksums are. A checksum is a value generated by a math equation: you feed it a series of numbers, and it gives you back a value that acts like a fingerprint. Here's the idea: I'll give you the same numbers, you plug them into the same equation, and then we compare your answer to mine. If they are the same, then we've succeeded!

An image of a fingerprint embedded on a circuit board

As you may have guessed, I am not going to hand you a long list of numbers. Yet when you think about files, that is exactly what they are: at the lowest level, every file you have ever interacted with is a series of ones and zeros. So in theory, if I send you a file and you want to make sure it arrived intact, you can ask me to generate a checksum. As long as we both use the same algorithm, we should get the same result. If the values do not match, the file was likely damaged or altered along the way, and you can ask me to resend it. This is where the usefulness comes in.
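To make that concrete, here is a minimal sketch using Get-FileHash, a command built into Windows PowerShell that we will use properly later in this post. The file paths are made up for the example; the only point is that the two values either match or they don't.

# Hash the copy I sent (hypothetical path)
$mine = Get-FileHash -Algorithm SHA256 -Path "C:\Outgoing\report.pdf"
# Hash the copy you received (hypothetical path)
$yours = Get-FileHash -Algorithm SHA256 -Path "C:\Downloads\report.pdf"
# Matching values mean the two copies are identical, bit for bit
if ($mine.Hash -eq $yours.Hash) { "Match - transfer succeeded" } else { "No match - ask for a resend" }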

A child looking down at a toy; the image is pixelated, making it hard to see

From the digital archives' point of view, checksums verify file integrity. If you have a few files, it's easy to open them now and then to confirm the information still appears correctly. When that number goes into the millions, not so much. It may be tempting to leave digital files alone the way you would a box of paper, but they too can become damaged over time. Catching changes early ensures that you are able to restore the file to its original glory.

In practice, we combine this with backups to determine when a file has become corrupted. When making backups, you want at least three copies of every file. When a checksum no longer matches its known value, the data is likely damaged. The best way to describe this is by using an example:

A Practical Example

Let's say you are writing a research paper and are ready to turn it in. You save the file as a PDF to mark it as a completed assignment. When you make the PDF, you also generate the checksum and save it into a different folder. In this checksum file, you should list the file path, file name, checksum, and the algorithm you're using. From there, you also make backups of the files on your computer: you could automate one to a local hard drive and send the other to a cloud service like SharePoint, OneDrive, or Dropbox.
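If you want to see what recording that information might look like, here is a minimal PowerShell sketch; the paths and file name are hypothetical. Get-FileHash reports the algorithm, the checksum, and the full path (which includes the file name), so exporting its output covers everything on that list.

# Record the algorithm, checksum, and path of the finished PDF (hypothetical paths)
Get-FileHash -Algorithm SHA256 -Path "C:\Users\username\Documents\ResearchPaper.pdf" |
    Export-Csv -Path "C:\Users\username\Documents\Checksums\ResearchPaper-checksum.csv" -NoTypeInformation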

Time moves on, and the professor is unable to mark your assignment since he cannot open the file. This mistake could stop you from graduating. Oh no! You need to resend the file and have it graded to get your degree. Before sending the file, create a checksum and compare it to your saved version. This is when you notice that the sums do not match!

Someone has tampered with your file... or have they? Verify the copy on your local backup: run the checksum on it and compare. Now you have two files with two different checksums, one of which you have verified. But how can you be sure you didn't create an incorrect checksum all those years ago? What if that's why you're getting a false positive?

This is where file three comes in. When you generate the checksum for that copy, you notice that it matches the result from file two. Using logic, you can conclude with confidence that the file on your computer has an issue. From there, you can replace it with one of your backups and send the file away so that you can graduate!
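If you ever wanted to script that three-way check instead of eyeballing it, a rough PowerShell sketch might look like this. The saved checksum file and the three paths (working copy, local backup, cloud backup) are hypothetical stand-ins.

# Load the checksum recorded when the PDF was finished (hypothetical path)
$saved = (Import-Csv "C:\Users\username\Documents\Checksums\ResearchPaper-checksum.csv").Hash
# The three copies: working file, local backup, cloud backup (hypothetical paths)
$copies = "C:\Users\username\Documents\ResearchPaper.pdf",
          "D:\Backup\ResearchPaper.pdf",
          "C:\Users\username\OneDrive\ResearchPaper.pdf"
foreach ($file in $copies) {
    # Recompute the checksum and compare it to the saved value
    if ((Get-FileHash -Algorithm SHA256 -Path $file).Hash -eq $saved) {
        "$file matches the saved checksum"
    } else {
        "$file does NOT match - treat this copy as suspect"
    }
}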

This scenario looks at one file, but with millions, it gets more complicated. In my case, I am working through some technical debt: checksums are a practice we didn't use before, and we need to start applying it now. There is one other golden rule I also need to follow: I am not allowed to change the file names at all. These are the original files, and we need to maintain their names and order. If you were in my shoes, how would you go about doing this?

The Easy Way: PowerShell

Now I am going to make an assumption: you are using a modern version of Windows. Making assumptions is often unwise, but given the OS's popularity and its overlap with digital archiving, chances are I'm right. I also know you want something quick and easy that you can run on your computer right now, with no install needed.

This is where I introduce you to PowerShell. To open the program, locate the search bar in the Start menu, type in powershell, and once the results load, click on the application to run it. You will notice that the window that appears does not have much to look at; that is because it is a text-only interface. Don't panic, all we are going to do is type out the following:

Get-FileHash -Algorithm [A] -Path (Get-ChildItem "[Root Folder]\*.*" -Recurse) | Export-Csv -Path "[Location Path]\[Filename]" -NoTypeInformation

Anything in the square brackets, [ ], will need to be replaced. Let's go over what they mean. First, [A] is the algorithm; for our case, you are going to use one of two options: MD5 or SHA256 (note that PowerShell expects SHA256 without the hyphen). You can choose others, but I have not encountered any of those in practice.

MD5 is simple and became popular early on the internet, but it has a known weakness: different inputs can occasionally produce the same result. SHA-256 is a newer, stronger algorithm; the trade-off is that it produces longer values and takes a bit longer to generate. For routine integrity checks, you will be fine with MD5. If the files are critical or you need protection against deliberate tampering, use SHA-256.
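If you want to see the difference for yourself, hash the same file with both algorithms and compare the length of the values; the path below is a hypothetical example.

# The MD5 value is 32 characters long; the SHA256 value is 64 (hypothetical path)
Get-FileHash -Algorithm MD5 -Path "C:\Users\username\Desktop\report.pdf"
Get-FileHash -Algorithm SHA256 -Path "C:\Users\username\Desktop\report.pdf"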

Next is [Root Folder]; this is the folder we are going to scan. In a File Explorer window, navigate to the folder whose contents you want to scan and click the address bar. It will switch from the usual breadcrumb display to an editable text path, which is what we want. Here is an example of what it should look like for a folder on a desktop.

C:\Users\username\Desktop\Original

Copy that value and paste it in place of [Root Folder]. This tells the computer to find all files in that folder and its sub-folders. That means that if you were to do this with C:\, you would be scanning the entire computer, and it will take some time.
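If you want to preview exactly which files that wildcard and -Recurse will pick up before hashing anything, you can run the inner part of the command on its own; the path here is the desktop example from above.

Get-ChildItem "C:\Users\username\Desktop\Original\*.*" -Recurse | Select-Object FullName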

The next part tells the program to calculate the results and put them into a file. The [Location Path] is the folder you want to save it to, and the [Filename] is what you want to call it. Save it as a .csv file; this lets you sort the columns, which makes searching the results easier later.
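Putting it all together with the example desktop folder above, MD5, and a hypothetical output name, the finished command might look like this:

Get-FileHash -Algorithm MD5 -Path (Get-ChildItem "C:\Users\username\Desktop\Original\*.*" -Recurse) | Export-Csv -Path "C:\Users\username\Desktop\checksums.csv" -NoTypeInformation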

Once you fill in the information, hit Enter. Nothing will happen. Well, that isn't true; it is doing things in the background and not showing you the information. Minimize the window and check on it later. You'll know it's done when you see the file created and can type in PowerShell again. Opening up the file will allow you to see the results in columns for ease of use. Awesome.
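For reference, the columns come straight from Get-FileHash, so each row of the CSV should look something like this; the hash value and path below are made-up illustrations.

"Algorithm","Hash","Path"
"MD5","9e107d9d372bb6826bd81d3542a419d6","C:\Users\username\Desktop\Original\report.pdf"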

But what happens when you are working with file names that are not in a traditional format? PowerShell treats square brackets as wildcard characters, so something like Big [file.doc will cause an error and prevent this command from working. In an ideal world, you could rename the file to add the closing ] and carry on. The issue is that these are the original files and we are not allowed to change them. Well, buckle up, because we are going to use some Linux.

The Involved Way: WSL to the Rescue

Before we get started, I am going to head off the "oh, why didn't you use X" question. I settled on this solution after going through 17 other options. Many of the tools on the Digital POWRR list either don't work, lack support, or can't manage this many files. After many, MANY attempts to get this working with something simple, I had to concede and use Linux.

Now, the idea of having you boot up into a separate OS does sound like a laugh when I want to keep this as simple as possible. The good news is that there is an answer: Windows Subsystem for Linux (WSL) will allow you to run Ubuntu inside Windows. We are going to use this to run the application that will generate the checksums.

The icon of the Windows Terminal

The first step is to install WSL. The good news is that it only takes two commands in PowerShell; open it as an administrator this time, since we are changing a Windows feature.

The first command you are going to want to run is

Enable-WindowsOptionalFeature -Online -FeatureName Microsoft-Windows-Subsystem-Linux

This tells Windows that you want this feature installed on your computer and enabled. The next step is to install the Linux distribution itself.

wsl --install

This will install Ubuntu by default; once done, you are going to want to restart your computer. Once restarted, the Ubuntu installation will pop up, and we are going to want to enter in a username and password. Make it something you are going to remember or write it down somewhere in case you are forgetful.

The next step is to update the system. The install will not be current, and doing so now is going to make life easier later down the line. To do so, enter the following.

sudo apt update && sudo apt upgrade

This will check for updates and then install them once it finds them. You may need to hit Y to continue the process, but once you complete it, close the window and you are good.

If you want to access WSL, you can open Ubuntu from the Start menu like it is its own program, or via the drop-down menu in Windows Terminal. Whichever you choose, make sure you have it up and running.

A picture of Windows Terminal with the selector expanded

The next step is to connect this to your files. At the moment, we are in our own sandbox and are unable to access the files that we want to checksum. To connect the two together, we are going to mount a shared drive. To do this, we are going to enter two commands.

sudo mkdir /mnt/share

sudo mount -t drvfs '[File path]' /mnt/share

The first command makes a new directory (mkdir) that we are going to call share. It will live in the /mnt section of the system, which is where other drives get connected, so keeping it in the same place makes things easier. The second command is where we tell the computer what the share folder is going to point to. The file path could be something as simple as a folder on your computer or a network drive. As with mapping a drive to your computer, you are going to want to enter the full path. Here is an example of one:

\\192.168.1.100\C$\My Folder
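Filled in with that example path, the mount command would look something like this; the single quotes keep the spaces and backslashes from being misread.

sudo mount -t drvfs '\\192.168.1.100\C$\My Folder' /mnt/share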

Once that finishes, you are connected and ready to go.

From here, we are going to want to change the directory (cd) to go to the shared folder; to do so, enter the following.

cd /mnt/share

You are now in the shared folder, which in turn will be the file path we specified earlier. Next, we are going to want to list (ls) the files to make sure we are in the right spot. To do this, enter ls. You should see the folders and files like you would on the computer but in text form. If you want to open a folder, run a cd with the folder name to navigate there. To go up a level, you can type in cd .. to get back to where you were.

Once we are in the folder, it is time to use find and md5sum (or sha256sum) to make our checksum file. Enter the following:

find . -type f \( -not -name "[filename]" \) -exec [md5sum or sha256sum] '{}' \; > [filename]

So let's explain what we are doing. First, we run the find command and search for regular files [find . -type f]. Next, we exclude the file we are about to create, identified by its name [\( -not -name "[filename]" \)]; otherwise find would try to checksum the output file while it is still being written.

For every file it finds, find runs either md5sum or sha256sum, depending on our chosen algorithm [-exec md5sum or -exec sha256sum]. The '{}' is a placeholder that find fills in with each file's path, and the \; marks the end of the command that -exec runs.

Finally, we take the results from standard output (stdout) and redirect them into a file [> [filename]]. If we did not include this last section, the program would still run, but it would print all the checksums to the screen rather than saving them to something we can use later.
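Filled in with MD5 and a hypothetical output name of checksums.txt, the whole command looks like this:

find . -type f \( -not -name "checksums.txt" \) -exec md5sum '{}' \; > checksums.txt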

In this case, we are going to want to save it as a .txt file since the output is going to be a string of text. Once you run this, you will again see nothing. The process is working in the background, and the file will grow as time passes. When the $ prompt returns in the WSL window, you know it has finished.

Some helpful tips

Once you have the checksum file, you can move it to a desired location. Put it in a different folder; this keeps the original files separate and avoids confusion. When you open the file, you are going to see many lines in this format:

[Checksum]  ./[filepath]

The ./ is the location where you started the command. Change this to the full file path you will actually use; this helps others find the file. A double space separates the two values. If you want to change this delimiter, use a regular expression in OpenRefine (\s{2}), or do a find and replace in your text editor of choice.

You could instead save the file with a .md5 or .sha256 extension; some operating systems and tools provide built-in support for those. I left that out because your process will likely need the results in a plain, readable format for comparing files later.
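One bonus of keeping the default output format: if you ever want to re-verify everything in place, you can hand the same file back to md5sum (or sha256sum) with its check flag, run from the folder where you generated it. A quick sketch, assuming the file was named checksums.txt:

md5sum -c checksums.txt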

Go forth and generate those checksums, practice good archiving, and take care.