Addressing the risks related to unstructured data through the use of object stores

Published: Thursday, 15 October 2020 08:14

Unstructured data is proliferating, overwhelming traditional storage architectures and creating both compliance and recovery risks. Matthew Dewey explains why object storage is a promising storage option to help organizations deal with the issue.

Object stores have long found a home in the cloud and inside data centers, becoming long-term repositories for high-value data - and for good reason. But with demand for storage capacity growing exponentially, how can organizations reap the benefits of object stores without costs spiraling out of control?

Data retention and protection are key

Object stores abstract away the location of an object, enabling higher levels of redundancy. This protects against device failure – as well as failures of entire nodes or even entire data centers. Abstracting away object location also enables object stores to scale to sizes and topologies difficult to achieve with file systems. The user of an object store may not know exactly where their data is physically stored. What looks like a single object store may be distributed across locations in multiple cities to achieve greater reliability against natural disasters. This level of durability could tremendously increase the capacity requirements of the underlying hardware, but with smart erasure coding algorithms, it can be achieved using less capacity than by mirroring the data.
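To make the capacity comparison concrete, here is a minimal sketch of the arithmetic. The specific layouts (three full copies versus an erasure code with 8 data shards and 2 parity shards) are illustrative assumptions, not figures from any particular product; both layouts survive the loss of any two devices.

```python
def replicated_capacity(usable_tb: float, copies: int) -> float:
    """Raw capacity needed when every object is stored as `copies` full copies."""
    return usable_tb * copies


def erasure_coded_capacity(usable_tb: float, data_shards: int, parity_shards: int) -> float:
    """Raw capacity needed for a k+m erasure code.

    Every group of `data_shards` data shards carries `parity_shards` extra
    parity shards, so the overhead factor is (k + m) / k.
    """
    return usable_tb * (data_shards + parity_shards) / data_shards


usable = 1000.0  # hypothetical amount of user data, in TB

mirrored = replicated_capacity(usable, copies=3)                              # 3000.0 TB raw
erasure = erasure_coded_capacity(usable, data_shards=8, parity_shards=2)      # 1250.0 TB raw
```

Under these assumptions the erasure-coded layout needs less than half the raw capacity of three-way mirroring for the same tolerance of two lost devices.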

A key issue for many entities deploying object storage is data retention. The retention period for many kinds of data is specified by legal and other compliance constraints. One might expect that data not subject to compliance requirements is likely to be deleted sooner, but some data has value indefinitely. For example, geological and genetic data doesn’t have an expiration date. These kinds of data sets can represent a significant investment and the data never expires.

The cost challenge

Demand for storage capacity is growing at a compound annual growth rate (CAGR) of more than 20 percent. Long-term repositories must become cheaper and deeper without losing durability. That means object stores must lower the total cost of ownership (TCO) of storage. This includes not just the cost of the media, but also the associated expenses of owning the equipment: acquisition, maintenance, power, cooling, the building that houses it and the land that building sits on. The majority of today’s object stores are hard disk-based, which provides good performance and reliability, but the cost of power and physical footprint is significant. To lower TCO, object stores are incorporating tape.
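As a rough illustration of what compound growth at that rate implies, the sketch below projects capacity forward at exactly 20 percent per year. The 100 PB starting point and the five-year horizon are hypothetical values chosen only for the example.

```python
import math


def projected_capacity(current_pb: float, annual_growth: float, years: int) -> float:
    """Capacity after `years` of compound growth at `annual_growth` (0.20 = 20%)."""
    return current_pb * (1 + annual_growth) ** years


def years_to_double(annual_growth: float) -> float:
    """Number of years for capacity demand to double at a constant growth rate."""
    return math.log(2) / math.log(1 + annual_growth)


in_five_years = projected_capacity(100.0, 0.20, 5)  # roughly 249 PB
doubling_time = years_to_double(0.20)               # a little under four years
```

In other words, at 20 percent annual growth a repository roughly doubles in under four years, which is why the cost per stored byte matters so much for long-term retention.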

The power of tape

Tape has lower media costs and, unlike disks, requires minimal power and cooling when not being accessed. Tape tiers are ideal for large amounts of data stored for long periods of time. And for sequential I/O, tape actually outperforms disk for both reading and writing.

But tape works (and fails) differently from disks. The latencies to access data on tape can’t be ignored. Best practice implementations will present tape as a separate tier to allow applications to help manage data access.

Exploiting the full advantages of tape in an object store also requires a deep understanding of how to properly manage and treat it. The object store must account for and survive failure modes that are unique to tape. It must also manage access patterns to reduce tape latencies and wear. And all of the required complexity must be implemented below the object interface, saving the user from experiencing it.

Because tape excels at sequential access, large individual objects will perform best. However, a well-implemented object store will group small objects into larger sequential streams to and from tape.
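One simple way to picture this grouping is a batching step that accumulates small objects until a target stream size is reached and then writes the whole batch in one sequential pass. The sketch below is an assumption about how such a step might look, not a description of any specific product; the function name, the (key, size) representation and the target size are all hypothetical.

```python
from typing import Iterable, Iterator, List, Tuple

ObjectInfo = Tuple[str, int]  # (object key, size in bytes)


def pack_for_tape(objects: Iterable[ObjectInfo], target_bytes: int) -> Iterator[List[ObjectInfo]]:
    """Group objects into batches of at least `target_bytes` each.

    Each yielded batch would be written to tape as one sequential stream.
    The final batch may be smaller than the target.
    """
    batch: List[ObjectInfo] = []
    batch_bytes = 0
    for key, size in objects:
        batch.append((key, size))
        batch_bytes += size
        if batch_bytes >= target_bytes:
            yield batch
            batch, batch_bytes = [], 0
    if batch:
        yield batch  # flush whatever is left over


incoming = [("a", 400), ("b", 700), ("c", 100), ("d", 2000), ("e", 50)]
batches = list(pack_for_tape(incoming, target_bytes=1000))
# batches[0] holds "a" and "b", batches[1] holds "c" and "d", batches[2] holds "e"
```

A real implementation would also have to decide how long to wait for a batch to fill and how to locate an individual object inside a stream when it is recalled.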

With the right expertise, organizations can implement a tape-based object store for long-term data retention while keeping storage costs firmly under control.

The keys to success: data cataloging and management

Object stores are often petabytes or even exabytes in size. Yet individual objects are often in the range of kilobytes to megabytes: there are potentially quadrillions of objects in an exabyte-scale object store. How do we know what is in the data store? How do we identify and select complete subsets of information in a pool of data this vast?
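The scale is easy to check with a line or two of arithmetic. The sketch below assumes decimal units (one exabyte is 10^18 bytes) and uniform object sizes, which is a simplification.

```python
EXABYTE = 10**18  # bytes, decimal units


def object_count(store_bytes: int, avg_object_bytes: int) -> int:
    """Number of objects that fit in a store if every object has the same size."""
    return store_bytes // avg_object_bytes


kilobyte_objects = object_count(EXABYTE, 10**3)  # 10**15, about a quadrillion
megabyte_objects = object_count(EXABYTE, 10**6)  # 10**12, about a trillion
```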

A catalog of the contents of the object store replaces human memory - it is a mechanism for selecting subsets of the data. Data must be classified as it is added to the object store. The initial classification would include the standard attributes and might include domain-specific classification as well. Over time, the uses of data and the information we can extract from it will change and improve. This requires that the classification information be malleable in ways users cannot predict when data is added.
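A minimal sketch of such a catalog might pair a few fixed standard attributes with free-form tags that can be added or changed long after ingest. Everything here (the class names, the attribute set, the tag keys) is hypothetical and only meant to illustrate the idea of malleable classification.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class CatalogEntry:
    key: str
    size_bytes: int
    created: str                                         # standard attribute, ISO date
    tags: Dict[str, str] = field(default_factory=dict)   # open-ended classification


class Catalog:
    def __init__(self) -> None:
        self._entries: Dict[str, CatalogEntry] = {}

    def add(self, entry: CatalogEntry) -> None:
        """Classify an object at ingest time."""
        self._entries[entry.key] = entry

    def reclassify(self, key: str, **tags: str) -> None:
        """Add or overwrite tags later, as new uses for the data emerge."""
        self._entries[key].tags.update(tags)

    def select(self, **tags: str) -> List[str]:
        """Return the keys of all objects whose tags match every given value."""
        return sorted(
            key
            for key, entry in self._entries.items()
            if all(entry.tags.get(name) == value for name, value in tags.items())
        )


catalog = Catalog()
catalog.add(CatalogEntry("survey-001", 4096, "2020-10-15", {"domain": "geology"}))
catalog.add(CatalogEntry("genome-017", 8192, "2020-10-15", {"domain": "genetics"}))

# A classification nobody anticipated at ingest time is attached afterwards.
catalog.reclassify("survey-001", region="north-sea")
matches = catalog.select(domain="geology", region="north-sea")
```

At exabyte scale the catalog itself would need to be a distributed, indexed service rather than an in-memory dictionary, but the separation between fixed attributes and open-ended tags is the point of the sketch.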

What decides which objects go on which storage media? How is that decision made? How do we select the sets of data that are needed for a task? Proper data management ensures the data is where the user needs it, when it is needed. Once an item of interest is identified, the system ensures the data is placed for optimal processing. Conversely, when the data is not in use, it needs to be where it can be kept safe for the lowest cost.
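As a very simplified sketch of such a placement decision, the function below keeps data on disk while it is needed for a task or has been accessed recently, and otherwise sends it to tape. The two-tier model, the function name and the 30-day threshold are assumptions made for illustration; a real policy would weigh many more factors.

```python
def choose_tier(days_since_access: int, needed_for_task: bool, disk_days: int = 30) -> str:
    """Pick a storage tier for an object.

    Data that is about to be processed, or was touched recently, stays on disk
    for fast access. Everything else goes to tape, the lower-cost tier.
    """
    if needed_for_task or days_since_access <= disk_days:
        return "disk"
    return "tape"


active = choose_tier(days_since_access=400, needed_for_task=True)    # "disk"
recent = choose_tier(days_since_access=3, needed_for_task=False)     # "disk"
dormant = choose_tier(days_since_access=400, needed_for_task=False)  # "tape"
```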

Properly cataloged and managed object stores may be the best bet for keeping unstructured data as an affordable asset for the future. 

The author

Matthew Dewey is Technical Director at Quantum.