
Working With Data: Data management plans, transferring data, data intensive computing

Big Data refers to data sets so large and complex that they are difficult to process with traditional (desktop) tools. The challenges these data sets pose include capture, curation, storage, search, sharing, transfer, analysis, and visualization. Despite these challenges, the trend toward larger data sets among UChicago research programs is clear.

In response to these trends, RCC has established a technological framework that enables researchers to work on today’s tera- and petascale research problems.

Data Management Plans

Many funding agencies, including the NSF and NIH, require some form of data management plan to supplement funding proposals. The plan describes how the proposal will conform to the agency’s data integrity, availability, sharing, and dissemination policies. RCC can assist researchers with the development of data management plans by providing accurate descriptions of the data storage, versioning, backup, and sharing capabilities that are available to UChicago researchers.

It is imperative that RCC is included in the development of data management commitments that leverage RCC resources. Please contact us (info@rcc.uchicago.edu) to learn more.

The following is an example of text that can be used in your data management plans. It is critical that you contact RCC to review your data management plan if RCC resources are involved.

Note

Data will be stored on the University of Chicago Research Computing Center (RCC) Project Storage Service. Project Storage sits on a petascale GPFS file system managed by professional system administrators. Snapshots (point-in-time views of the file system) are maintained at hourly, daily, and weekly intervals, allowing researchers to independently recover older versions of files at any time. The system is backed up nightly to a tape archive at a separate location. Project Storage is accessible both through the UChicago HPC cluster and through Globus Online (endpoint ucrcc#midway), which provides straightforward, high-performance transfer and sharing capabilities. Files can be shared with any Globus Online user, at UChicago or elsewhere, without the need for an account on the UChicago HPC cluster or other resources.

RCC Project Storage is connected to the UChicago campus backbone network at 10 gigabit/s, and to the UChicago HPC cluster, which is available to researchers for data analysis, at 40 gigabit/s. This professionally managed, reliable, and highly available infrastructure is suitable for capturing, generating, analyzing, storing, sharing, and collaborating on petabytes of research data.

Storage Systems

RCC manages three complementary storage systems, each filling a unique niche in the research computing landscape. Briefly, Home and Project storage is maintained on a 1.5 petabyte file system that is both backed up and version controlled with GPFS snapshots. Scratch storage is an 80 terabyte high-performance shared resource. HSM is a petascale-capable tape system that is useful for storing data that is not expected to be referenced again but must be kept for the lifetime of a project to meet certain requirements or best practices.

Persistent Storage

Persistent storage areas are appropriate for long-term storage. They have both file system snapshots and tape backups for data protection. The two locations for persistent storage are the home and project directories.

Home

Every RCC user has a home directory located at /home/CNetID. This directory is accessible from all RCC compute systems and is generally used for storing frequently used items such as source code, binaries, and scripts. By default, a home directory is accessible only by its owner (mode 0700) and is suitable for storing files that do not need to be shared with others. The standard quota is 10 GB.
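
A quick way to confirm (and, if necessary, restore) the default owner-only permissions is a few lines of Python. This is a minimal sketch; it assumes you run it on an RCC system as the owner of the home directory in question.

    import os
    import stat

    # Check that the home directory is accessible only by its owner (mode 0700),
    # as described above, and tighten the permissions if it is not.
    home = os.path.expanduser("~")               # resolves to /home/CNetID
    mode = stat.S_IMODE(os.stat(home).st_mode)

    if mode != 0o700:
        print(f"{home} has mode {oct(mode)}; resetting to 0700")
        os.chmod(home, 0o700)
    else:
        print(f"{home} is already restricted to its owner (0700)")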

Project

Project storage is generally used for files that are shared by members of a research group. It is accessible from all RCC compute systems. Every research group is granted a 500 GB startup quota, though scaling individual projects to hundreds of terabytes is straightforward on RCC’s 1.5 PB system. Additional storage is available through the Cluster Partnership Program. Contact info@rcc.uchicago.edu to learn more.

Scratch Storage

Scratch space is a high-performance shared resource intended for active calculations and analysis run on the compute clusters. Users are limited to 5 terabytes of scratch space.
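
Because scratch is intended for active work rather than long-term storage, a common pattern is to stage inputs from project storage to scratch, run the calculation there, and copy results back to persistent storage when the job completes. The Python sketch below illustrates that flow; the /project and /scratch paths are placeholders rather than documented mount points, so substitute the locations used by your group.

    import shutil
    from pathlib import Path

    # Hypothetical locations; adjust to your group's project and scratch areas.
    project_dir = Path("/project/my_group/experiment_01")
    scratch_dir = Path("/scratch/my_cnetid/experiment_01")

    # 1. Stage inputs onto the high-performance scratch file system.
    shutil.copytree(project_dir / "inputs", scratch_dir / "inputs")

    # 2. ... run the calculation against scratch_dir / "inputs",
    #        writing output to scratch_dir / "results" ...

    # 3. Copy results back to persistent, backed-up project storage.
    shutil.copytree(scratch_dir / "results", project_dir / "results")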

Storage Performance Considerations

RCC nodes are connected by two network fabrics: Infiniband (40 Gb/s) and Gigabit Ethernet (1 Gb/s). The fastest network available on a compute node is used both for interprocess communication and for reading from and writing to the shared storage systems. Accordingly, the time required to perform file system operations such as moving and copying data varies with the available network bandwidth.

Performance is also influenced by the characteristics of the file system in use. Taken together, these factors can produce orders-of-magnitude differences in the time needed to perform seemingly similar operations. For example, the table below shows the time required to operate on 24 gigabytes of data on project and scratch storage, from GigE nodes and from Infiniband nodes.

                                   Project IB   Scratch IB   Project GigE   Scratch GigE
  24 GB (100 x 240 MB files)           40           20            500            500
  24 GB (100,800 x 245 KB files)     1805          875           1150           1150
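
As the second row suggests, per-file overhead dominates when a data set consists of very many small files. One common mitigation, sketched below in Python, is to bundle small files into a single archive before copying or transferring them, so the file system handles one large file rather than roughly 100,000 small ones. The directory and archive paths are illustrative only.

    import tarfile
    from pathlib import Path

    # Illustrative paths; point these at your own data.
    source_dir = Path("/scratch/my_cnetid/many_small_files")
    archive = Path("/scratch/my_cnetid/many_small_files.tar")

    # Bundle the directory into a single archive (use "w:gz" to compress as well).
    with tarfile.open(archive, "w") as tar:
        tar.add(source_dir, arcname=source_dir.name)

    print(f"Wrote {archive} ({archive.stat().st_size / 1e9:.1f} GB)")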

Purchasing Storage

Project storage can be purchased through the Cluster Partnership Program in units ranging from as small as 1 terabyte to more than 100 terabytes. Contact info@rcc.uchicago.edu to learn more.

Transferring Data to the RCC

The RCC computing infrastructure is connected to the UChicago backbone network at XX Gb/s. This connection is rarely saturated; when you are transferring data to the RCC, the time required to complete a transfer is generally limited by the network bandwidth available to the machine you are transferring from.

The table below indicates the amount of time required to transfer a given amount of data, in a best-case scenario, at the indicated speed.

10 Mb/s roughly corresponds to the speed of a modest home internet connection. The UChicago wireless network is capable of sustained 20-40 Mb/s transfers. Most of the data ports on campus are 100 Mb/s, although in most buildings they can be upgraded to 1 Gb/s with a request to IT Services (INSERT LINK).

             10 Mb/s    100 Mb/s   1 Gb/s    10 Gb/s    100 Gb/s
  1 PB       25 years   2 years    92 days   9 days     22 hours
  100 TB     3 years    92 days    9 days    22 hours   2 hours
  1 TB       9 days     22 hours   2 hours   13 min     1 min
  100 GB     22 hours   2 hours    13 min    1 min      8 sec
  10 GB      1 hour     13 min     1 min     8 sec      <1 sec
  1 GB       13 min     1 min      8 sec     <1 sec     <<1 sec

Transfer times vary, but based on this table it is reasonable to estimate, for example, that transferring 2 TB of data from an external hard drive plugged into a lab workstation connected to UChicago Ethernet (100 Mb/s) will take 40 to 50 hours.
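
These estimates are straightforward to reproduce: best-case transfer time is simply the data size in bits divided by the link speed. The short Python function below ignores protocol overhead and competing traffic, and recovers the 2 TB example above.

    def transfer_time_hours(size_bytes: float, link_bits_per_sec: float) -> float:
        """Best-case transfer time in hours: data size in bits / link speed."""
        return (size_bytes * 8) / link_bits_per_sec / 3600

    # 2 TB over a 100 Mb/s campus data port, as in the example above:
    hours = transfer_time_hours(2e12, 100e6)
    print(f"{hours:.1f} hours")   # ~44 hours, consistent with the 40-50 hour estimate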

Data Transfer Recommendation

The best way to transfer large amounts of data is with the Globus Online (https://www.globusonline.org/) data movement service. Use your CNetID to sign in at globus.rcc.uchicago.edu. Globus offers a number of advantages over traditional Unix tools like scp and rsync for large, regular, or otherwise complex file transfers. These include:

  • automatic retries
  • email notifications when the transfer completes
  • command line tools to automate data transfers
  • increased transfer speed

Of course, common tools such as scp and rsync are also available on any remotely accessible interactive node.
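
For scripted transfers with the traditional tools, a resumable rsync invocation is usually sufficient; the Python sketch below simply wraps one. The hostname, CNetID, and paths shown are placeholders rather than values taken from this page, so substitute the interactive node you normally connect to and your own directories.

    import subprocess

    # Archive mode (-a), verbose output (-v), keep partial files so an
    # interrupted transfer can resume (--partial), and show progress.
    source = "/data/experiment_01/"                                # local directory
    destination = "cnetid@login.rcc.uchicago.edu:/project/my_group/experiment_01/"

    subprocess.run(
        ["rsync", "-av", "--partial", "--progress", source, destination],
        check=True,
    )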

Data Intensive Computing

RCC’s computing infrastructure enables researchers to perform data-intensive computing as well as flop-intensive computing. The elements that are particularly useful for working with large data sets are described below.

Infiniband Network

The nonblocking FDR10 Infiniband network (40 Gb/s) provides up to 5 GB/s of read and write bandwidth to and from the shared storage systems.

Large Shared-Memory Nodes

A number of compute nodes are available with very large shared memory. These shared resources are available through the queue and are otherwise identical to other RCC compute nodes:

  • Two nodes are available with 256 GB of memory each
  • One node has 1 terabyte (1024 GB) of memory

Map Reduce Nodes

Ten compute nodes are available with large (18 terabyte) local storage arrays attached. These shared resources are available [describe access and availability].

Data Visualization

RCC maintains a data visualization laboratory in the Kathleen A. Zar Room of the Crerar Library. The lab supports high-definition 2D as well as stereoscopic 3D visualization. A high-powered workstation in the lab directly mounts the RCC storage systems, providing straightforward access to your research data.