Skip to content

Using R on the Midway Cluster

R is a powerful tool for quantitative research and data analysis across various fields.

R on Midway2 and Midway3

Table of Contents


Introduction

R is a powerful tool for quantitative research and data analysis across a variety of fields. On the Midway clusters, R is available as centrally maintained modules, supporting both interactive and batch (HPC) workflows. This guide covers best practices for installing, configuring, and running R, including reproducibility and performance tips for both Midway2 and Midway3.


Available R Modules

To see the list of available R modules:

module avail R

To check available RStudio IDE modules:

module avail rstudio


Environment Configurations

Available R Versions: - R/3.6.3 - R/4.0.3 - R/4.1.0 - R/4.2.0 - R/4.2.1 - R/4.2.1-no-openblas (built with default BLAS; no OpenBLAS/MKL acceleration) - R/4.2.2 - R/4.2.3 - R/4.3.1 - R/4.4.1 (default) - R/4.4.1+oneapi-2024.2.0 (Intel oneAPI build, see below) - R/4.4.2+gcc-13.2.0 (built with GCC 13.2.0, see below)

Build Details (click to expand)
  • Most R modules are built with the system GCC compiler (8.4.1).
  • R/4.2.1-no-openblas is built with the default BLAS (no OpenBLAS or MKL), which may result in slower linear algebra performance.
  • R/4.4.1+oneapi-2024.2.0 is built with Intel oneAPI 2024.2.0 and linked to Intel's Math Kernel Library (MKL). This provides maximum performance for matrix and linear algebra operations, leveraging the latest Intel hardware/software optimizations. The module automatically loads oneapi/2024.2 and sets all relevant paths and environment variables for you. Use this if you need high-performance math.
  • R/4.4.2+gcc-13.2.0 is built with GCC 13.2.0. Use this version if your R library requires a newer GCC toolchain (for example, the arrow package or other modern C++-dependent libraries).

Geospatial Meta-Module: - There is a meta-module for geospatial applications: gis/R-4.2.1. Loading this module will automatically load R/4.2.1 along with all required geospatial libraries (GDAL, GEOS, PROJ, SQLite, udunits) for spatial data analysis. This is the recommended setup for users working with spatial or GIS data on Midway3.

AMD Node Module: - On Midway3 AMD nodes, there is a special module R/4.3.2 (default) built with the AMD Optimizing C/C++ Compiler (aocc-4.1). This module is found under /software/modulefiles-amd and is optimized for AMD hardware. Use this version for best performance on AMD compute nodes.

Available R Versions: - R/2.15 - R/3.0 - R/3.2 - R/3.3 - R/3.3+intel-16.0 (built with Intel 16 compiler) - R/3.4 - R/3.4.1 - R/3.4.3 - R/3.5.1 - R/3.6.1 - R/3.6.3-no-openblas - R/4.0.0 - R/4.0.4 - R/4.1.0 - R/4.1.0-no-openblas - R/4.2.0

Build Details (click to expand)
  • Most R modules are built with the system GCC compiler (10.2.0 for recent versions; older modules may use earlier GCC or Intel compilers).
  • R/3.3+intel-16.0 is built with the Intel 16 compiler.
  • R/3.6.3-no-openblas and R/4.1.0-no-openblas use the default BLAS (no OpenBLAS), which may reduce linear algebra performance.
  • Most modern modules (R/4.1.0, R/4.2.0) are built with gcc/10.2.0.
  • All modules are linked to OpenBLAS for improved matrix and linear algebra performance unless otherwise noted.
  • For exact dependencies and environment, use ml show R/<version>.

Installing R Packages

Step 1: Check GCC Version

Ensure compatibility by verifying the GCC version:

g++ --version

Step 2: Check Disk Space

Ensure you have enough disk space:

quota

Step 3: Load Necessary Modules

For packages like ncdf4 and hdf5r, load the required modules before starting R:

module load hdf5 netcdf

Step 4: Install Packages

Start an R session and use the install.packages function:

install.packages("packageName")


Using the renv Package for Reproducibility

Quick Steps for renv Reproducibility

  1. Load the R module:
    module load R/4.3.1
    
  2. Navigate to your project directory and start R:
    cd /project/pi-cnetid/rproject/
    R
    
  3. Install and initialize renv:
    install.packages("renv")
    renv::init()
    
  4. Install packages as needed:
    install.packages("dplyr")
    # ...etc.
    
    This will create a project-local library and renv.lock file for reproducibility.

Using R for High-Performance Computing (HPC)

You can run R scripts in batch mode using SLURM. Example job script:

#!/bin/bash
#SBATCH --job-name=my_r_job
#SBATCH --account=[your-accountname]
#SBATCH --partition=compute
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=4
#SBATCH --mem=16G
#SBATCH --time=02:00:00

module load R/4.3.1
Rscript my_script.R

Parallel R with MPI (Rmpi)

The Rmpi package is not pre-installed in system R modules. You must install it in your own R user library, and Rmpi must be compiled against the same MPI implementation and version you will use at runtime.

Example SLURM Batch Script for Rmpi Job:

#!/bin/bash
#SBATCH --job-name=my_rmpi_job
#SBATCH --account=[your-accountname]
#SBATCH --partition=compute
#SBATCH --nodes=1
#SBATCH --ntasks=4
#SBATCH --cpus-per-task=1
#SBATCH --mem=16G
#SBATCH --time=02:00:00

module load R/4.3.1 openmpi/5.0.2  # Example: use any available OpenMPI version
mpirun -np 4 Rscript my_rmpi_script.R

Step-by-step instructions: 1. Load R and OpenMPI modules before installing Rmpi:

module load R/4.3.1 openmpi/5.0.2  # Example: use any available OpenMPI version
R
2. Install Rmpi from within R:
install.packages("Rmpi", type = "source")
You may be prompted for MPI library paths; the correct openmpi module must be loaded for detection. 3. Check installation:
library(Rmpi)
mpi.universe.size()
4. Important: Always load the same openmpi module before running your parallel R jobs as you did when installing Rmpi.

Other OpenMPI Versions

To see all installed OpenMPI versions, run:

module avail openmpi

Version Compatibility

  • The openmpi version must match between installation and job execution.
  • If you change R or MPI module versions, reinstall Rmpi to match.

FAQ

Q1: How do I delete all R packages and start over?

rm -rf ~/R ~/.R ~/.rstudio* ~/.Rhistory ~/.Rprofile ~/.RData

Q2: How do I use RStudio on the cluster? - Load the appropriate rstudio module and follow your cluster’s instructions for interactive sessions.


Best Practices

R Module (Midway)

  • Uses centrally maintained, optimized, and regularly updated R installations provided by HPC admins.
  • Usually highly optimized (OpenBLAS).
  • Integrates with system-wide libraries and supports renv for reproducibility.
  • Recommended for most HPC users, especially for performance-critical or collaborative work.

R via Conda

  • Allows creation of isolated environments and user-space installation of R and dependencies.
  • Performance Caveat: Conda R often does not use the highly optimized BLAS/LAPACK libraries available on the cluster, which can result in slower performance for linear algebra-intensive tasks.
  • Use Conda R only if you need a specific version or package unavailable in system modules, or require full environment isolation.
  • If using Conda R, you can still use install.packages() or renv within the environment, but be aware of possible compatibility and performance trade-offs.
Feature/Aspect R Module (Midway) Conda R
BLAS/LAPACK Optimization Usually highly optimized (OpenBLAS) Often default/less optimized BLAS
Performance Best for heavy computation May be slower for matrix operations
Integration Well-integrated with system libraries Isolated, may miss system optimizations
Reproducibility Use with renv or modules Use with renv or Conda YAML
Use Case Recommended for most HPC users For custom/isolated needs only

Troubleshooting

  • Error: Package installation fails due to missing system libraries
  • Solution: Load the necessary modules (e.g., module load hdf5 netcdf) before starting R.
  • Error: R cannot find installed packages
  • Solution: Check your .libPaths() in R and ensure you are using the correct environment/module.
  • Error: R job runs slowly
  • Solution: Use the system R module for optimized BLAS/LAPACK performance. Avoid Conda R for heavy computations unless necessary.
  • Error: Permissions or disk quota exceeded
  • Solution: Check your disk space with quota and clean up files if needed.

Feedback

For questions, suggestions, or to report issues, please contact the RCC support team or submit feedback via the documentation repository.


module avail rstudio

When using RStudio, it is recommended to connect to the RCC cluster via ThinLinc for a smoother experience.

Spatial Packages

The sf package in R provides tools for handling spatial data using simple features. For more information, visit the sf Package on CRAN.

Midway2 Environment

  • Dependencies: GDAL 2.4.1, udunits 2.2
  • R Version: 4.2.0

Midway3 Environment

  • Dependencies: GDAL 3.3.3, udunits 2.2, GEOS 3.9.1, GCC 10.2.0, SQLite 3.36.0
  • R Versions: 4.2.1 and 4.3.1

Module Loading

Before using or installing sf on Midway3, load the necessary modules:

module load gdal/2.4.1 udunits/2.2 R/4.2.0
module load gdal/3.3.3 udunits/2.2 geos/3.9.1 gcc/10.2.0 sqlite/3.36.0 R/4.3.1

Additional Packages

The terra and raster packages are also installed.

Installing R Packages: A Comprehensive Guide

R packages can be distributed as source code or compiled binaries. Source packages must be compiled before installation, typically requiring the GNU Compiler Collection (GCC). Here’s how to get started with R package installation on our HPC clusters.

Check GCC Version

To ensure compatibility, verify the version of GCC (g++) on your system:

$ g++ --version
g++ (GCC) 8.5.0 20210514 (Red Hat 8.5.0-10)
Copyright (C) 2018 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.  There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

Before You Install

  1. Check Disk Space: Ensure you have enough disk space using the quota command:

    $ quota
    

  2. Load Necessary Modules: For packages like ncdf4 and hdf5r, load the required modules before starting R:

    $ module load hdf5 netcdf
    
    For more information on working with netCDF files in R, you can check out this resource: NetCDF with R.

Installing Packages

To install R packages, use the install.packages function within an R session:

install.packages("packageName")

For multiple packages:

install.packages(c("package1", "package2"))

FAQ

  1. Delete All R Packages and Start Over: If you need to reset your R environment, delete all R packages and related files:

    $ rm -rf ~/R ~/.R ~/.rstudio* ~/.Rhistory ~/.Rprofile ~/.RData
    

  2. Update ~/.bashrc: Remove any added or modified environment variables if necessary.

  3. See Where R Packages Are Installed:

    .libPaths()
    Sys.getenv("R_LIBS_USER")
    

  4. List Installed Packages:

    installed.packages()
    

  5. List Default Packages:

    getOption("defaultPackages")
    

  6. Remove a Package:

    remove.packages("packageName")
    

Using the renv R Package to Create a Personal R Project Environment on the Midway Cluster

The renv package in R helps manage project-specific libraries, ensuring consistent package versions across projects. Here’s a step-by-step guide to set up and use renv for your project located at /project/pi-cnetid/rproject/ on the Midway cluster.

Step 1: Load Necessary Modules

Before starting, ensure you have the necessary modules loaded. You can load the R module appropriate for your project:

module load R/4.3.1

Step 2: Set Up the Project Directory

Navigate to your project directory:

cd /project/pi-cnetid/rproject/

Step 3: Initialize renv in Your Project

Start R and initialize renv in your project directory:

R

Inside the R session, run:

# Install renv if not already installed
install.packages("renv")

# Initialize renv in the project directory
renv::init()

This command will set up a new project-specific library in the renv subdirectory of your project directory and create an renv.lock file to record the state of your library.

Step 4: Install Packages within the renv Environment

With renv initialized, install any necessary packages:

# Install required packages
renv::install("dplyr")
renv::install("ggplot2")
# Add more packages as needed

Step 5: Snapshot the Current State of Your Library

After installing the necessary packages, snapshot the current state of your library to renv.lock:

renv::snapshot()

This records the exact versions of the packages you have installed, allowing you to recreate the environment later.

Step 6: Restore Library from renv.lock (Optional)

If you need to recreate the environment on another system or after a system change, you can restore the library:

renv::restore()

This command reads the renv.lock file and installs the specified package versions.

Step 7: Using the Project

Whenever you start working on your project, load the renv environment:

# Activate the renv environment
renv::activate()

This ensures that all package installations and library paths are managed by renv.

Example Workflow

Here’s a complete example workflow from the terminal to R session:

  1. Load R Module:

    module load R/4.3.1
    cd /project/pi-cnetid/rproject/
    

  2. Start R and Initialize renv:

    R
    

  3. Within R Session:

    install.packages("renv")
    renv::init()
    renv::install("dplyr")
    renv::install("ggplot2")
    renv::snapshot()
    q()  # Quit R session
    

  4. Future Sessions:

    renv::activate()
    

By following these steps, you can set up and manage a project-specific R environment on the Midway cluster, ensuring consistency and reproducibility for your R projects.