Big Data is a collection of data sets so large and complex that they are difficult to process using traditional (desktop environment) data processing techniques. The challenges of these data sets include the capture, curation, storage, search, sharing, transfer, analysis, and visualization of the elements they contain. Despite these problems, there is a clear trend toward larger data sets among UChicago research programs.
In response to this trend, the RCC has established a technological framework that enables researchers to work on today’s tera- and petascale research problems.
Many funding agencies, including the NSF and NIH, require some form of data management plan to supplement funding proposals. The plan describes how the proposal will conform to the agency’s data integrity, availability, sharing, and dissemination policies. RCC can assist researchers with the development of data management plans by providing accurate descriptions of the data storage, versioning, backup, and sharing capabilities that are available to UChicago researchers.
It is imperative that RCC is included in the development of data management commitments that leverage RCC resources. Please contact us (email@example.com) to learn more.
The following is an example of text that can be used in your data management plans. It is critical that you contact RCC to review your data management plan if RCC resources are involved.
Data will be stored on the University of Chicago Research Computing Center (RCC) Project Storage Service. Project Storage sits on a petascale GPFS file system managed by professional system administrators. Snapshots, which are point-in-time views of the file system, are maintained at hourly, daily, and weekly intervals, allowing researchers to independently recover older versions of files at any time. The system is backed up nightly to a tape archive in a separate location. Project Storage is accessible through both the UChicago HPC cluster and Globus Online (endpoint ucrcc#midway), which provides straightforward high-performance transfer and sharing capabilities to researchers. Files can be shared with any Globus Online user, at UChicago or elsewhere, without the need for an account on the UChicago HPC cluster or other resources.
RCC Project Storage is connected to the UChicago campus backbone network at 10 gigabit/s, and to the UChicago HPC cluster, available to researchers for data analysis, at 40 gigabit/s. This professionally managed, reliable and highly-available infrastructure is suitable for capturing, generating, analyzing, storing, sharing, and collaborating on petabytes of research data.
RCC manages three complementary storage systems that each fill a unique niche in the research computing landscape. Briefly, Home and Project storage are maintained on a 1.5 petabyte file system that is both backed up and version controlled with GPFS snapshots. Scratch storage is an 80 terabyte high-performance shared resource. HSM is a petascale-capable tape system that is useful for storing data that is not expected to be referenced again, but must be kept for the lifetime of a project to meet certain requirements or best practices.
Persistent storage areas are appropriate for long-term storage. They have both file system snapshots and tape backups for data protection. The two locations for persistent storage are the home and project directories.
Every RCC user has a home directory located at /home/CNetID. This directory is accessible from all RCC compute systems and is generally used for storing frequently used items such as source code, binaries, and scripts. By default, a home directory is accessible only by its owner (mode 0700) and is suitable for storing files that do not need to be shared with others. The standard quota is 10 GB.
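As an illustration of what mode 0700 means in practice, the following minimal sketch checks whether a directory grants any permissions to group or other users. The helper name `is_owner_only` is hypothetical and not an RCC tool, and the demonstration runs on a throwaway temporary directory rather than a real home directory. It assumes a POSIX file system.

```python
import os
import stat
import tempfile


def is_owner_only(path: str) -> bool:
    """Return True if group and others have no permissions on path (e.g. mode 0700)."""
    mode = stat.S_IMODE(os.stat(path).st_mode)
    # S_IRWXG and S_IRWXO are the read/write/execute bits for group and others.
    return (mode & (stat.S_IRWXG | stat.S_IRWXO)) == 0


# Demonstrate on a temporary directory (hypothetical stand-in for /home/CNetID).
with tempfile.TemporaryDirectory() as demo_dir:
    os.chmod(demo_dir, 0o700)   # owner-only, like the default home directory
    private = is_owner_only(demo_dir)
    os.chmod(demo_dir, 0o755)   # readable and searchable by everyone
    shared = is_owner_only(demo_dir)

print("0700 is owner-only:", private)
print("0755 is owner-only:", shared)
```

A directory that fails this check is visible to other users, which is usually a sign that the files belong in project storage instead.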
Project storage is generally used for storing files that are shared by members of a research group. It is accessible from all RCC compute systems. Every research group is granted a startup quota of 500 GB, though scaling individual projects to hundreds of terabytes is straightforward on the RCC's 1.5 PB system. Additional storage is available through the Cluster Partnership Program. Contact firstname.lastname@example.org to learn more.
Scratch space is a high-performance shared resource intended to be used for active calculations and analysis run on the compute clusters. Users are limited to 5 terabytes of scratch.
RCC nodes are connected by two network fabrics: Infiniband (40 Gb/s) and Gigabit Ethernet (1 Gb/s). The fastest network available on a compute node is used both for interprocess communication and for reading and writing to the shared storage systems. Accordingly, the time required to perform file system operations such as moving and copying data will vary according to the available network bandwidth.
Performance is additionally influenced by the characteristics of the file system that is used. Taken together, these factors can result in orders-of-magnitude differences in the time required to perform seemingly very similar operations. For example, consider the table below, which indicates the time required to operate on 24 gigabytes of data on project storage and scratch storage, from GigE nodes and Infiniband nodes.
|Data set||Project IB||Scratch IB||Project GigE||Scratch GigE|
|24 GB (100 x 240 MB files)||40||20||500||500|
|24 GB (100,800 x 245 KB files)||1805||875||1150||1150|
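To make the effect of file size concrete, the sketch below computes how much longer the many-small-files case takes than the few-large-files case for each column of the table above. The timings are copied from the table; because only ratios are computed, the result does not depend on the table's time unit, which the source does not state. The function and dictionary names are illustrative only.

```python
# Timings copied from the table above (time unit as in the source table).
TIMINGS = {
    "Project IB":   {"large_files": 40,  "small_files": 1805},
    "Scratch IB":   {"large_files": 20,  "small_files": 875},
    "Project GigE": {"large_files": 500, "small_files": 1150},
    "Scratch GigE": {"large_files": 500, "small_files": 1150},
}


def small_file_slowdown(timings: dict) -> dict:
    """Ratio of small-file time to large-file time for each storage/network pair."""
    return {
        name: row["small_files"] / row["large_files"]
        for name, row in timings.items()
    }


slowdown = small_file_slowdown(TIMINGS)
for name, ratio in slowdown.items():
    print(f"{name}: small files take {ratio:.1f}x as long as large files")
```

The ratios are far larger on the Infiniband columns than on the GigE columns, which suggests that per-file overhead, rather than raw bandwidth, dominates when the same 24 GB is split into many small files.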
Project storage can be purchased through the Cluster Partnership Program in units as small as 1 terabyte, or exceeding 100 terabytes. Contact email@example.com to learn more.
The RCC computing infrastructure is connected to the UChicago backbone network at XX Gb/s. This connection is rarely saturated; when you are transferring data to the RCC, the time required to complete the transfer is generally limited by the network bandwidth available to the machine you are transferring from.
The table below indicates the amount of time required to transfer a given amount of data, in a best-case scenario, at an indicated speed.
10 Mb/s roughly corresponds to the speed of a modest home internet connection. The UChicago wireless network is capable of sustained 20-40 Mb/s transfers. Most of the data ports on campus are 100 Mb/s, although in most buildings they can be upgraded to 1 Gb/s with a request to IT Services (INSERT LINK).
|Data size||10 Mb/s||100 Mb/s||1 Gb/s||10 Gb/s||100 Gb/s|
|1 PB||25 years||2 years||92 days||9 days||22 hours|
|100 TB||2 years||92 days||9 days||22 hours||2 hours|
|1 TB||9 days||22 hours||2 hours||13 min||1 min|
|100 GB||22 hours||2 hours||13 min||1 min||8 sec|
|10 GB||2 hours||13 min||1 min||8 sec||<1 sec|
|1 GB||13 min||1 min||8 sec||<1 sec||<<1 sec|
Transfer times are variable, but it is reasonable to estimate based on this table that, for example, transferring 2 TB of data from an external hard drive that is plugged into a lab workstation on UChicago Ethernet (100 Mb/s) will take 40 to 50 hours.
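The estimates above follow from dividing the number of bits to move by the link speed. The sketch below reproduces that arithmetic, assuming decimal units (1 TB = 10^12 bytes, 1 Mb/s = 10^6 bits per second) and no protocol overhead, so it gives a best-case lower bound rather than a prediction. The function and constant names are illustrative only.

```python
TB = 10**12      # bytes in a terabyte (decimal units, an assumption)
MBIT = 10**6     # bits per second in 1 Mb/s


def transfer_seconds(size_bytes: float, link_bits_per_second: float) -> float:
    """Best-case transfer time: total bits divided by link speed."""
    return size_bytes * 8 / link_bits_per_second


# 2 TB over a 100 Mb/s campus data port.
seconds = transfer_seconds(2 * TB, 100 * MBIT)
print(f"{seconds / 3600:.1f} hours")  # 44.4 hours
```

The result falls inside the 40 to 50 hour range quoted above. Real transfers will usually be slower because of disk speed, protocol overhead, and competing traffic.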
The best way to transfer large amounts of data is with the Globus Online (https://www.globusonline.org/) data movement service. Use your CNet ID to sign in at globus.rcc.uchicago.edu. Globus offers a number of advantages over traditional Unix tools like SCP and rsync for large, regular, or otherwise complex file transfers. These include:
Of course, common tools such as secure copy (scp) and rsync are also available through any remotely accessible interactive node.
The RCC's computing infrastructure enables researchers to perform data-intensive computing as well as flop-intensive computing. Elements that are particularly useful for dealing with large data are described below.
The nonblocking FDR10 40 Gb/s Infiniband network provides reads and writes of up to 5 GB/s to and from the shared storage systems.
Ten compute nodes are available with very large (18 terabytes) local storage arrays attached to the nodes. These shared resources are available [describe access and availability].
RCC maintains a data visualization laboratory in the Kathleen A. Zar Room of the Crerar Library. The lab is capable of high-definition 2D as well as stereoscopic 3D visualizations. A high-power workstation in the lab direct-mounts the RCC storage systems to provide straightforward access to your research data.