Overview

This is the User Manual for the Wash U IT Research Information Services (RIS) Storage Service.

...

Product Stage: General Availability

Assumptions

This document assumes that you are a member of the Washington University user community and that your work relates to the research mission of the University. We assume you have a WUSTL Key identity, and that you are, or work for, Research Faculty or Staff. We also assume you are on a local Wash U computer network or that you have access to either the Wash U Medical School VPN or the Danforth VPN. (See How do I know what network I am on? in the FAQ below.)

...

The Storage Service Workflow

The summary of steps to enable and consume a RIS Storage Allocation is as follows:

...

Optionally, consider sub-dividing your Allocation into Project Subdirectories. These are sub-units of your Allocation that might need different access controls. If you have different data sets that you would like to control access to, call those “Projects” and give them a name. Then indicate an Access List of WUSTL Key IDs that are to have read-write access, and optionally a separate Access List for read-only access.

Features And Options

There are a number of features and options related to the RIS Storage Service.

  • Integrated with WUSTL Key ID

  • Integrated with RIS Data Transfer (Globus)

  • Integrated with RIS Compute services (See this FAQ).

  • Snapshots

  • Archive Tier

  • Active Tier with seamless expansion

Storage Tiers: Archive

We use the word “tier” to refer to a “performance level” of the storage service. Currently there are two tiers, “Active” and “Archive”. The Active tier is the standard storage tier you get by default. It is serviced by a number of different storage pieces including fast memory caching etc., but the way an End User should think of it is “Active storage is where I do daily work”. Think of it like “spinning disks” even though it’s more complicated than “just” spinning disks.

...

RIS intends to expand tiers in the future to include a “local” tier that is directly attached to a Compute Service execution node and a “cloud” tier that is connected to cloud services like AWS, GCP, or Azure.

Snapshots

Within the “Active” storage tier there is a directory named “.snapshots” that contains one week of daily snapshots of the Active storage space. If files in your Active space get overwritten, corrupted, or mistakenly deleted, you can copy previous versions out of the .snapshots directory back into Active.
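A minimal sketch of the recovery step described above. The snapshot subdirectory name (daily-example), the file names, and the helper restore_from_snapshot are hypothetical; list your own .snapshots directory to see which snapshots actually exist. The demonstration runs in a temporary directory that stands in for Active space.

```shell
#!/bin/sh
# restore_from_snapshot SNAPSHOT_DIR RELATIVE_PATH ACTIVE_DIR
# Copies one file from a snapshot back into the live directory tree,
# preserving mode and timestamps (cp -p).
# Hypothetical helper for illustration, not an RIS-provided tool.
restore_from_snapshot() (
    snap_dir=$1; rel_path=$2; active_dir=$3
    src="$snap_dir/$rel_path"
    [ -f "$src" ] || { echo "not found in snapshot: $rel_path" >&2; exit 1; }
    mkdir -p "$(dirname "$active_dir/$rel_path")"
    cp -p "$src" "$active_dir/$rel_path"
)

# Scratch directory standing in for Active space (assumed layout).
demo=$(mktemp -d)
mkdir -p "$demo/Active/.snapshots/daily-example/project"
echo "good data" > "$demo/Active/.snapshots/daily-example/project/results.txt"
mkdir -p "$demo/Active/project"
echo "corrupted" > "$demo/Active/project/results.txt"

# Copy the earlier version over the damaged live copy.
restore_from_snapshot "$demo/Active/.snapshots/daily-example" project/results.txt "$demo/Active"
restored=$(cat "$demo/Active/project/results.txt")
echo "$restored"
rm -rf "$demo"
```

Copying out of the snapshot, rather than moving, leaves the snapshot itself untouched.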

Storage Tape Backups

  • The backup policy for both Active and Archive data has been fully vetted and approved by the Office of the Vice Chancellor for Research and the Office for Information Security.

  • The research storage infrastructure has been deemed compliant with all data retention guidelines.

  • Integrated into the storage environment is a high-performance, scalable tape robot that manages a tape library of 18 petabytes, which allows data to be shuttled from live disk to much less expensive tape and back again on demand.

  • For both Active and Archive filesets, data remains on tape indefinitely unless it has been deleted from disk. Active data remains on tape for 90 days after it has been deleted from disk.
    • If the data is never deleted from disk, it remains on tape indefinitely with incremental backups.

    • Data in Archive also remains on dual-copy tape indefinitely unless it is deleted, after which it remains on tape for 10 days.

  • The research storage environment also offers self-service snapshot data recovery for 7 days.

  • The preferred method for retaining completed projects is to request an Archive allocation; once a project is completed, its data can be moved from Active to Archive.

  • If the data needs to be accessed again after being moved to Archive, it can be migrated back to Active, and it will be restored from tape to disk.

  • The preferred method of moving data between Active and Archive is to bundle the data with tar or zip and then use rsync to move it.
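The tar-then-rsync approach from the last bullet could look roughly like the sketch below. The helper archive_project, the directory names, and the staging location are hypothetical; the demonstration uses temporary directories in place of real Active and Archive paths. The cp fallback exists only so the sketch still runs where rsync is not installed.

```shell
#!/bin/sh
# archive_project SRC_DIR DEST_DIR
# Bundles SRC_DIR into one compressed tarball, then copies that single
# file into DEST_DIR. Hypothetical helper for illustration only.
archive_project() (
    src_dir=$1; dest_dir=$2
    name=$(basename "$src_dir")
    staging=$(mktemp -d)
    # -C keeps paths inside the tarball relative to the project's parent.
    tar -czf "$staging/$name.tar.gz" -C "$(dirname "$src_dir")" "$name"
    if command -v rsync >/dev/null 2>&1; then
        rsync -a "$staging/$name.tar.gz" "$dest_dir/"
    else
        # Fallback only so this sketch runs where rsync is not installed.
        cp -p "$staging/$name.tar.gz" "$dest_dir/"
    fi
    rm -rf "$staging"
)

# Scratch directories standing in for Active and Archive (assumed layout).
demo=$(mktemp -d)
mkdir -p "$demo/Active/finished-project" "$demo/Archive"
echo "sample" > "$demo/Active/finished-project/data.txt"

archive_project "$demo/Active/finished-project" "$demo/Archive"

# List the tarball contents to confirm the bundle arrived intact.
listing=$(tar -tzf "$demo/Archive/finished-project.tar.gz")
echo "$listing"
rm -rf "$demo"
```

Bundling first means only one file is moved, instead of every file in the project individually.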

Enabling The Storage Service

  • Visit the RIS Service Desk, then on the left, click the Storage Platform section, and begin a Service Request for a new Allocation by selecting Activate a new storage allocation.

...

Please see our documentation for more information on activating a storage allocation.

Enabling the Archive Tier

Enabling the Archive tier is done simply by asking for it. Put in a Service Desk request and ask that the Archive tier be enabled for a named Storage Allocation.

Getting Connected

How to connect to storage from MacOS

How to connect to storage from Linux

How to connect to storage from Windows

Designing a Storage Layout

When you connect to your Storage Allocation, there is a standard filesystem layout:

...

Info

When creating directories or files, it is best practice to avoid spaces in the name. If you need to separate parts of a name, we highly recommend using dashes (-) or underscores (_).

Linux environments do not handle spaces in names well, and spaces in directory and file names cause problems when interacting with the Compute Platform.

There is a 255-character limit on NTFS file names, so we recommend keeping names concise. This is a hard limit of the system that the Storage/Compute platform uses. Any file to be transferred to Storage/Compute must have a name within this limit, or it cannot be transferred.
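The two naming rules above can be checked before a transfer. This is a minimal sketch, not an RIS tool: the helper check_name and the sample file names are hypothetical.

```shell
#!/bin/sh
# check_name NAME
# Prints "ok" when NAME has no spaces and is at most 255 long;
# otherwise prints the reason and returns non-zero.
# Hypothetical helper for illustration only.
check_name() {
    case $1 in
        *" "*) echo "contains spaces"; return 1 ;;
    esac
    # ${#1} is the length of the name (shells differ on whether
    # non-ASCII names are counted in bytes or characters).
    if [ "${#1}" -gt 255 ]; then
        echo "longer than 255 characters"; return 1
    fi
    echo "ok"
}

check_name "sample_data-2024.csv"
check_name "sample data.csv" || true
```

To scan an existing directory tree, the same function could be called on the base name of each path that find reports.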

Moving Data Into The Storage Service

Info
Compute Data Transfer Policy

Please see our Compute Data Transfer Policy if you will be transferring data to and from your storage allocation using compute1.

CHPC

Instructions for moving data from CHPC

Globus

Instructions for moving data with Globus:

Globus CLI

Instructions for moving data with Globus CLI:

Globus Connect Personal

Instructions for installing and using Globus Connect Personal:

Rclone

gsutil

Instructions for moving data from Google storage with gsutil

Access Control

Instructions for how to manage access to your data in the Storage Service.

Known Limitations

The Storage Service includes a feature set documented in these pages. Each feature or capability has limitations or caveats.

Calculating Free Space

Use SMB to determine free space in a Storage Service Allocation

...

Code Block
$ du -sh --apparent-size /storage1/fs1/ris/Active/

Active Directory Group Management

Members May Be Removed From Groups

...

Code Block
$ getent group storage-ris-itsm-rw
storage-ris-itsm-rw:*:1250923:david.prince,shawn.m.leonard,dhallan,jansen,catherine.morie,tz-kai.lin,sleong,cspohl
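
Because group membership can change, it may be useful to check it from a script. The sketch below parses one line in the getent group format shown above (group name, password field, GID, comma-separated members). The helper is_member is hypothetical, and the sample line is copied from the output above rather than queried live.

```shell
#!/bin/sh
# is_member USER GROUP_LINE
# Succeeds when USER appears in the fourth, comma-separated field of a
# getent-style group line. Hypothetical helper for illustration only.
is_member() {
    members=$(printf '%s\n' "$2" | cut -d: -f4)
    # Wrap the list in commas so only whole names match.
    case ",$members," in
        *",$1,"*) return 0 ;;
        *) return 1 ;;
    esac
}

# Sample line copied from the output above; on a live system you would
# use: line=$(getent group storage-ris-itsm-rw)
line='storage-ris-itsm-rw:*:1250923:david.prince,shawn.m.leonard,dhallan,jansen,catherine.morie,tz-kai.lin,sleong,cspohl'

if is_member jansen "$line"; then echo "jansen: member"; else echo "jansen: not a member"; fi
if is_member jans "$line"; then echo "jans: member"; else echo "jans: not a member"; fi
```

Matching on whole comma-delimited names avoids false positives from partial matches such as jans against jansen.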

Ignoring umask

When any file or directory is created with an inherited Access Control Entry (ACE), the POSIX “umask” is ignored. The umask normally determines the basic traditional POSIX permissions on new files. By default, all folders in an allocation have inheritable permissions and therefore display this behavior. For the permissions on a new file to reflect the umask setting, the file must be created in a directory whose ACLs have been modified to exclude inheritance flags or entries. Documentation from the relevant vendor (IBM) and the IETF (see the NFS ACL RFCs) confirms that this is the intended behavior. One place this might cause an issue is with git repositories that contain permission settings conflicting with the default ACLs.
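
For comparison, the sketch below shows the normal umask behavior in a directory that has no inherited ACEs. It assumes GNU coreutils (stat -c) and an ordinary local scratch directory without default ACLs; the file and directory names are hypothetical.

```shell
#!/bin/sh
# Show how umask shapes permissions in a directory WITHOUT inherited ACEs.
# Assumes GNU coreutils (stat -c) and a scratch directory with no default ACLs.
demo=$(mktemp -d)
(
    cd "$demo" || exit 1
    umask 027            # remove write for group, everything for others
    touch plain-file     # requested 666, masked to 640
    mkdir plain-dir      # requested 777, masked to 750
)
file_mode=$(stat -c %a "$demo/plain-file")
dir_mode=$(stat -c %a "$demo/plain-dir")
echo "file: $file_mode  dir: $dir_mode"
rm -rf "$demo"
```

Inside a Storage Allocation folder with inherited ACEs, the same commands would instead produce permissions derived from the ACL, regardless of the umask value.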

Security Implications of SMB

Protocols like SMB evolve over time as a result of feature changes or security vulnerabilities. We expect users to use SMB3.

Early Access, Design Changes, Implementation and Integration

FAQ