Storage Wars

Storage Units

Welcome to the new year. I hope you had a wonderful time with friends and family. I helped run a non-profit event. My body is still recovering from the effort to ensure everything worked. I know I said I'd focus on libraries and archiving. But, if there's interest in talking about my time there, let me know.

Today, I want to discuss the tough question we all face: how much space do you need to store everything? I am facing this question myself and have tried to work out the logic behind a realistic answer. Along the way, I realized that we must make sacrifices.

For those who know me, welcome back. If you are new here, thanks for stopping by. My name is Patrick Burden, and I am currently the Digital Collections Archivist for WVU. I spent over 17 years in the IT field before making the jump to archiving.

Why you can't keep getting a bigger drive

No, this is not your server

When I took this role, the team brought this first issue to my attention: everything was on a single drive. If you are a small institution, this may be the way to go. WVU, though, was hitting critical mass. With everything on a single drive, we were unable to perform critical IT tasks to protect our files.

My IT hat is coming on for this. With hardware arrays of hard drives, the size of the drives limits you. The setup you are most likely to be familiar with is the Redundant Array of Independent Disks, aka RAID. You take your data and spread it across many hard drives. If a hard drive fails, you can use math to recover the data onto a replacement drive. This will limit data loss.
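To make the "math" less mysterious, here is a minimal sketch of single-parity recovery, the idea behind RAID 5. It is a toy, not how a real controller is built: the three four-byte blocks are made-up stand-ins for drives, and real arrays stripe data and rotate parity across disks.

```python
from functools import reduce


def xor_blocks(blocks):
    """XOR equal-length byte blocks together, one byte position at a time."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


# Hypothetical contents of three data drives (same length on purpose).
data_blocks = [b"ARCH", b"IVES", b"2024"]

# The parity drive stores the XOR of all the data drives.
parity = xor_blocks(data_blocks)

# Pretend the second drive died: XOR the survivors with the parity block.
survivors = [data_blocks[0], data_blocks[2], parity]
rebuilt = xor_blocks(survivors)

print(rebuilt == data_blocks[1])
```

The rebuild works because XOR-ing a value with itself cancels out, so everything except the missing block drops away. With a single parity block, only one failed drive can be recovered at a time.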

Yet, if your drive is too big, you are unable to do this math. Instead, you rely on the whims of getting larger drives to have this protection. This meant I had to tell everyone the bad news: the drive that holds everything would be split into many parts.

For the most part, people saw this as a reasonable compromise. Some still complained about remembering where to find things on the X, Y, and Z drives.

Finding your magic number

You thought I was going to do a 42 reference here?

From IT’s perspective, they couldn't care less about what goes where. What they want to know is how big the drives are. This way, they can buy something to hold the files and run automated jobs to keep them intact. The most common methods are regular backups and a file recovery system for users.

So in the IT sense, what they want you to do is the following math equation:

Get the current size of files and folders on the network. Add the current year's delta of files to the storage. Or, use the size of the files you created last year.  Multiply that by the length of the contract for their storage vendor. Add the special projects you will do in the coming years. Use previous projects to estimate how much space they will take. Subtract the space from records retention policies or unused files. Voila, you have a number.  Let’s make this an actual equation.

Current Files + (ΔLast Year Files * Length of Contract) + (Number of Special Projects * Average size of files in previous projects) - Size of files due to be removed via Records Retention Policy = Estimated Hard Drive space needed.
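The equation above can be sketched as a small function. The function name, parameter names, and every number in the example are made up for illustration; plug in your own figures, all in the same unit (terabytes here).

```python
def estimate_storage_tb(current_tb, yearly_growth_tb, contract_years,
                        special_projects, avg_project_tb, retention_removals_tb):
    """IT-style estimate: current files, plus growth over the contract,
    plus expected special projects, minus what retention will remove."""
    growth = yearly_growth_tb * contract_years
    projects = special_projects * avg_project_tb
    return current_tb + growth + projects - retention_removals_tb


# Hypothetical institution: 40 TB today, 6 TB of new files per year,
# a 5-year vendor contract, 3 special projects at about 2 TB each,
# and 4 TB scheduled for removal under retention policies.
needed_tb = estimate_storage_tb(40, 6, 5, 3, 2, 4)
print(needed_tb)  # 72
```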

If you look over the above equation, you might have spotted the issue digital archives have. First, I do not know how many special projects are going to come. Donors will come out of the blue and go through the accession process. Once they realize you are going to be the arbiters of their work, they will put a burden on your institution.

Next, the size of the donation is going to vary from person to person. Some are a group of floppy disks, others a burned CD. You could hit the jackpot with a backup of their entire file server!

Many institutions ask for the largest amount they can get because the outcomes are uncertain. This is a bad idea for a few reasons. First, storage costs a LOT of money. Overspending will eat into your budget. You'll waste it on unused infrastructure.

Next, overestimating leaves unused space that can be removed later to cut costs. This downsizing will ALWAYS happen right before a gold mine appears. Then you will realize that you cannot accept it, and not only because you are out of space: without an emergency fund, you cannot revamp the storage again.

Finally, you are going to get lazy with files. With extra space, you won't have to cut away the extra baggage you've been carrying. I cannot tell you how many times I have used WizTree on a drive to find the largest files, only for the culprit to be a recording of something that is available elsewhere. That space is for another donor.
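If you do not have a tool like WizTree handy, a rough equivalent of "show me the biggest files" fits in a few lines. This is a sketch, not a replacement for a proper disk analyzer: the function name is my own, and it simply walks a folder tree and skips anything it cannot read.

```python
import heapq
from pathlib import Path


def largest_files(root, limit=10):
    """Return up to `limit` (size_in_bytes, path) pairs, biggest first."""
    sizes = []
    for path in Path(root).rglob("*"):
        try:
            # Skip folders and symlinks so nothing is counted twice.
            if path.is_file() and not path.is_symlink():
                sizes.append((path.stat().st_size, path))
        except OSError:
            # Unreadable entry (permissions, broken link): ignore it.
            continue
    return heapq.nlargest(limit, sizes, key=lambda pair: pair[0])
```

Point it at a share and print the pairs it returns. The top of that list is usually where the duplicate recordings and forgotten exports are hiding.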

Digital archives are lazy when it comes to storage. They swim in a sea of storage. They overspend their budget to keep their things. Yet, they see no real benefit in maintaining them properly. In reality, we need help from the dark side to solve this problem: records managers.

Joining the dark side

A T-shirt that some in my family legitimately own

The records manager wants to ensure that someone destroys a file once it is no longer needed. Retention schedules are set, and automated jobs take place. The system shows no mercy to files purged from the collective memory. This protects the institution from lawsuits in the future.

When I attended an SAA event, I sat in on a non-recorded talk about how to deal with records managers and the need to claw every single inch of files away from their hands and hoard it like a pile of gold. What may be worthless today could be pivotal tomorrow, so we must save everything from them.

The talk treated the RM like the enemy when in reality they should be a partner. While your goals are not going to align perfectly, you do share the same mission: protect files at all costs.

They have the foresight to remove files that are no longer needed. Archivists are not in the same boat. We must accept that we have to say no to donations. I know that letting something valuable slip through your fingers is rough. But, unless you have a backlog of zero, it is okay to tell donors that you cannot take it right now.

By factoring in size in the pre-acceptance process, we can offload some responsibilities. Or try to come to a solution that does not depend only on our infrastructure.

So ultimately, I revised the equation to be the following:

Current Files + [ (ΔLast Year Files * Length of Contract) * 1.2 ] - Size of files due to be removed via Records Retention Policy = Estimated Hard Drive space needed.
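Here is the revised equation as a quick sketch. As before, the function name, the parameter names, and the example numbers are invented for illustration. The only fixed piece from the equation is the 1.2 multiplier, exposed here as an `overhead` parameter so you can see what it does.

```python
def revised_estimate_tb(current_tb, yearly_growth_tb, contract_years,
                        retention_removals_tb, overhead=1.2):
    """Revised estimate: projected growth gets a 20 percent cushion
    instead of a separate guess about special projects."""
    growth = yearly_growth_tb * contract_years * overhead
    return current_tb + growth - retention_removals_tb


# Hypothetical institution: 40 TB today, 6 TB of new files per year,
# a 5-year contract, and 4 TB due for removal under retention policies.
needed_tb = revised_estimate_tb(40, 6, 5, 4)
print(f"{needed_tb:.1f} TB")
```

Note that the cushion applies only to the growth term, not to the files you already hold, so it scales with how fast you are actually growing.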

When developing the business case, a 20 percent overhead will seem reasonable. This will show you care about the funds. It will lead to better decisions going forward.

With this magic number, you think you have finished, right? Well, now you have another issue to consider. Where are you going to place things?

Gerrymandering your data

No matter how you slice it, the data can be shaped into anything

Once you have the size of the hard drives needed, the next step is to determine how many drives you are going to use.  A drive in this sense is going to be a distinct mapping that you are going to connect to. Let's say, for the sake of argument, that each drive is on a different device. This will make the next steps easier to manage.

Now, when it comes to file management, there are many ways you can slice your files to make logistical sense.  WVU wants to follow the OAIS model, so in this example, I am going to mirror that. Before anyone comments that this is not ideal for digital archiving, hold that thought. I will post about it in the future.

For the sake of this example, I am going to assume you are going to do something like this model. If you use something different in your organization, you will want to slice things in a different way. But, the principles will still apply.

Getting Guidance from your models

You can always count on some things: death, taxes, and the OAIS model being shown in an archives-related topic

In the OAIS world, you must consider three types of packages: the SIP (Submission Information Package), the AIP (Archival Information Package), and the DIP (Dissemination Information Package). S is for when the files are first transferred to the institution. A is for the time you spend preparing the files for the final arrangement. D is for when someone publishes the files to your presentation software. Yes, it does spell out SAD.

We can use this method to split files into smaller chunks. A “drive” acts as a marker, moving through the model from one stage to the next. Let’s begin this with a logical approach. First, you have the donated files. You need to place them into a storage location that is going to be your backlog. No, this cannot be the B: drive for reasons that I do not want to confuse you with.

In an ideal world, we'd all follow the inbox zero method. Our backlogs would be the top priority and remain almost empty at all times. If you are unable to do this (like I am), you will treat this location like other file system servers. Follow the 3-2-1 backup rules from the infrastructure. Aside from obtaining the initial fixity values, you will not do much until you are ready to process it.
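For the initial fixity values, a checksum per file is the usual approach. Below is a minimal sketch using SHA-256 from Python's standard library. The function names are mine, and your institution may prefer a different algorithm or an existing tool, so treat this as an illustration of the idea rather than a recommended workflow.

```python
import hashlib
from pathlib import Path


def sha256_of(path, chunk_size=1024 * 1024):
    """Hash a file in chunks so large files never sit fully in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def fixity_manifest(root):
    """Map each file's path (relative to root) to its SHA-256 value."""
    root = Path(root)
    return {
        path.relative_to(root).as_posix(): sha256_of(path)
        for path in sorted(root.rglob("*"))
        if path.is_file()
    }


def changed_files(old_manifest, new_manifest):
    """List files that are missing or whose checksum no longer matches."""
    return sorted(
        name for name, digest in old_manifest.items()
        if new_manifest.get(name) != digest
    )
```

Save the first manifest when the donation lands in the backlog. When you are finally ready to process it, build a second manifest and pass both to `changed_files` to see whether anything rotted while it waited.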

Now, you could leave the original media on a shelf until you are ready. If you do, you will need to check these drives on a scheduled basis, since a drive that sits doing nothing can develop mechanical issues. This makes it a great option for small institutions willing to take the risk. Again, consult your IT person for advice on this. For me, I would treat this drive as a pit stop on the way to your main systems. So the size should be enough for your backlog with some extra space in case you get new donations.

Going Cold or Merging Together?

Should you freeze your data in a sleep-like state or keep it with your active files?

Next, you are going to want to split the files into two buckets. The first bucket is a 1-to-1 copy of the files you received from the donor, which you place in cold storage. In an ideal world, no one will ever need to touch these files again. Once you make this copy, you can store it in a service like Amazon Glacier, a slow-access storage tier for long-term use. This copy preserves the original order of the files, and you should only need it if everything else in the backup and restore processes fails.

The next area is going to act as your working space. Most of your IT resources will focus here. This area represents the files you will access and those other processes will use.  Can you get away with combining the original order files with this space?  Yes. Be warned, this will complicate other life-cycle processes and use more storage. Splitting them by this line can get you a cheaper cold storage rate. It will optimize for the files you will use most.

Now in the working folder, you can edit and change to your heart's content. You have the original backup if needed. So, now, prepare the files for your Archival Repository Software. It must accept and map them to its back end.  Remember, ArchivesSpace will NOT store the files, only the metadata. So, the best you can do is point the file to its location on the server. Also, specify the presentation system you will use.

Now we enter the optional step. Once the working files are ready, we can pack them up using BagIt and keep the bag as a representation of the files. You could place this on its own drive and use it as your SIP package. Or you can leave the files as is and have the folder be the package unto itself. You can choose either way. Check with your organization on how people will access the files in-house.
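To show what a bag actually contains, here is a bare-bones sketch that builds one by hand: a `data/` payload folder, a `bagit.txt` declaration, and a SHA-256 manifest, following the layout in the BagIt specification (RFC 8493). The function name is my own. It skips `bag-info.txt` and tag manifests, and it reads each file whole, so for real work you would reach for an established tool such as the Library of Congress bagit-python library.

```python
import hashlib
import shutil
from pathlib import Path


def make_minimal_bag(source_dir, bag_dir):
    """Copy source_dir into bag_dir/data and write the two required tag files."""
    bag = Path(bag_dir)
    payload = bag / "data"
    # copytree creates bag_dir/data; it raises an error if data/ already exists.
    shutil.copytree(source_dir, payload)

    lines = []
    for path in sorted(payload.rglob("*")):
        if path.is_file():
            # Simplification: reads the whole file into memory to hash it.
            digest = hashlib.sha256(path.read_bytes()).hexdigest()
            lines.append(f"{digest}  {path.relative_to(bag).as_posix()}\n")

    (bag / "manifest-sha256.txt").write_text("".join(lines), encoding="utf-8")
    (bag / "bagit.txt").write_text(
        "BagIt-Version: 1.0\nTag-File-Character-Encoding: UTF-8\n",
        encoding="utf-8",
    )
    return bag
```

The working folder stays untouched because the payload is a copy. That doubles the space used, which is worth remembering when you size the drive that will hold the bags.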

Going Public

Letting the customers know that they are able to see your collections

The last section is what the general public will access. If a file is restricted, it will not go here. This is also when you are going to want to normalize the files into friendly formats. Your JPEGs, PDF/As, WAVs, and other common formats will not need special software.

The good news is that these systems will use the standard 3-2-1 backup rules and file restore methods. For the IT staff, this means you are limiting the tougher backup policies to locally accessed files. This will prevent them from getting a bunch of headaches.

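So, to summarize, you are going to slice the drives into the following levels: a backlog drive for incoming donations, cold storage for the 1-to-1 original copy, a working space for processing, an optional drive for packaged bags, and a public drive for the files you share.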

I have to use IT’s favorite answer when it comes to your institution, “it depends.” There are too many edge cases to find a one-size-fits-all solution to the storage problem. I hope this gives you a starting point. Please think about your processes. Adapt this method into your budget. It should help you create something that works for you.

With the new year under way, I have been stockpiling ideas to write about here. Expect updates on a more frequent basis as I work on other projects. Until next time, take care.