Introduction to Data Transfer using Globus

Joe Wu, PhD
NCI CCR Bioinformatics Training and Education Program
ncibtep@nih.gov

Globus

"Globus is a high-performance data-transfer and sharing platform that allows you to move large and complex datasets directly between any two applications, systems, or local machines, eliminating the need for downloading and then uploading the data." -- https://www.cuit.columbia.edu/research-data-transfer

Why use Globus

  • Recommended for transferring large quantities of data, including next-generation sequencing (NGS) data.
  • "Globus will manage file transfers, monitor performance, retry failures, recover from faults automatically when possible, and report the status of your data transfer." -- Biowulf. Once a transfer is initiated, the user can walk away from the computer.
  • Fast and secure way to transfer data.

Possible Globus Endpoints

Possible Globus Endpoints (table)

Globus Transfer Between Two Desktops

"If you need to transfer data between two Globus Connect Personal endpoints (e.g. your desktop system and your laptop, or between two desktops), you will need a Globus Plus license. Email staff@hpc.nih.gov to request one. Your Globus Plus license will be terminated when you leave NIH." -- Biowulf

General Steps to Using Globus

  • Have a Biowulf account.
  • Install the Globus desktop client on the local computer. This enables the local computer to serve as a data transfer endpoint.
    • NCI staff should submit a ticket at service.cancer.gov to have the software installed on a government-furnished computer.
    • Staff from other ICs should contact their corresponding computing help desk for software installation.
  • Set up data transfer endpoints.
  • Initiate data transfer.

General Steps to Using Globus Illustrated

Source: https://www.globus.org/data-transfer

Help Resources

Logging into Globus

Use https://www.globus.org/ to sign in to Globus. Google Chrome is recommended.

Accessing Globus from Biowulf HPC OnDemand

Globus can be accessed from Biowulf HPC OnDemand. Just click on any of the user's Biowulf directories under the "Files" tab and then on the "Globus" icon.

Select Affiliation

After clicking "LOG IN" on the Globus page, users will be prompted to select an institutional affiliation. For NIH, type "national institutes of health" in the drop-down menu. Click "Continue" when ready.

Sign in with PIV Card

Next, choose to sign in to Globus with NIH PIV card credentials and enter the PIN when prompted.

Agree to the Terms of Globus and Authenticate

At the next screen, click "I Agree" to accept the terms of Globus.

Globus Landing Page

Globus File Endpoint

"An "endpoint" is one of the two file transfer locations – either the source or the destination – between which files can move. Once a resource (server, cluster, storage system, laptop, or other system) is defined as an endpoint, it will be available to authorized users who can transfer files to or from this endpoint." -- Globus

Globus Collection

The Collections tab provides a table with metadata about the data transfer endpoints that a user has set up.

Globus Collection Table

Clicking the "Collections" tab displays the following table.

Globus Collection Table: Columns Explanation

  • COLLECTION: Contains the name of the endpoint. The example below shows the NIH HPC data transfer endpoint.
  • SUBSCRIBED: When checked, indicates that the endpoint belongs to an organization that has a Globus subscription.
  • HA: Refers to high assurance collections. When checked, indicates that the endpoint is suitable for data such as protected health information (PHI). Biowulf cannot be used for PHI, so this column is not checked.
  • STATUS: Indicates whether the endpoint is ready to use.
  • ROLE: Indicates the user's role on the endpoint, such as whether the user has administrative rights.
  • Click the icon on the far right to open the file transfer manager.

Globus Endpoint Overview

Click on ">" on the far right of the Collection table to see more detailed information regarding an endpoint. The UUID is important and needed for data transfer.
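Because the UUID is usually copied and pasted between pages, a quick format check can catch a truncated or mistyped value before a transfer is requested. The following is a minimal Python sketch using only the standard library; the function name and the example UUID are made up for illustration and do not refer to any real endpoint.

```python
import uuid


def is_valid_endpoint_uuid(text: str) -> bool:
    """Return True if text is a UUID in the usual 8-4-4-4-12 hyphenated form."""
    candidate = text.strip().lower()
    try:
        parsed = uuid.UUID(candidate)
    except ValueError:
        return False
    # uuid.UUID also accepts braces or un-hyphenated hex, so require that the
    # canonical string form matches what was pasted.
    return str(parsed) == candidate


# Made-up placeholder, not a real Globus endpoint.
example_uuid = "01234567-89ab-cdef-0123-456789abcdef"
print(is_valid_endpoint_uuid(example_uuid))
print(is_valid_endpoint_uuid("01234567-89ab-cdef"))
```

This only checks the shape of the string. It cannot tell whether the UUID belongs to the intended endpoint, so the value should still be compared against the endpoint "Overview" page.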

Globus Endpoint to Local Computer

Setting up Globus Local Endpoint (step 1)

Launch the Globus desktop client and choose "Log In".

Setting up Globus Local Endpoint (step 2)

Users may need to allow the Globus desktop client to find local folders. Select "Allow" and sign in to Globus as shown in earlier slides.

Setting up Globus Local Endpoint (step 3)

After re-authenticating, enter a label for the endpoint and click "Allow".

Setting up Globus Local Endpoint (step 4)

Subsequently, provide the name of the collection as well as a description. Hit "Save" when ready.

Setting up Globus Local Endpoint (step 5)

The endpoint appears under the "Administered By You" tab in the collections table.

Data Transfer from Local to Biowulf (step 1)

Click on the "COLLECTIONS" tab and select the "NIH HPC Data Transfer (Biowulf)" endpoint. Select "Transfer or Sync" to start a data transfer. A second file manager window opens. Here, click the magnifying glass to search for the endpoint to transfer to or from. This example will use the local computer endpoint that was set up.

Data Transfer from Local to Biowulf (step 2)

Once the local endpoint ("Joe.Wu.local") is selected, type the path to the file or folder that needs to be transferred. Click "Start" when ready. To transfer from Biowulf instead, click "Start" on the "NIH HPC Data Transfer (Biowulf)" endpoint panel.
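For transfers that are repeated often, it can help to bookmark a file manager link with both endpoints and paths already filled in. The sketch below builds such a link in Python. It assumes the Globus web app accepts the query parameters origin_id, origin_path, destination_id, and destination_path on https://app.globus.org/file-manager; confirm this against the Globus documentation before relying on it. Both UUIDs and both paths are made-up placeholders.

```python
from urllib.parse import urlencode

# Assumed base address of the Globus web app file manager.
FILE_MANAGER_URL = "https://app.globus.org/file-manager"


def file_manager_link(origin_id, origin_path, destination_id, destination_path):
    """Build a file manager link with source and destination pre-selected."""
    query = urlencode(
        {
            "origin_id": origin_id,
            "origin_path": origin_path,
            "destination_id": destination_id,
            "destination_path": destination_path,
        }
    )
    return FILE_MANAGER_URL + "?" + query


# Made-up placeholders for the local endpoint and the Biowulf endpoint.
LOCAL_UUID = "01234567-89ab-cdef-0123-456789abcdef"
BIOWULF_UUID = "fedcba98-7654-3210-fedc-ba9876543210"

link = file_manager_link(
    LOCAL_UUID, "/~/sequencing_run/", BIOWULF_UUID, "/data/username/"
)
print(link)
```

Opening the link only pre-fills the two panels. The transfer itself still has to be started with "Start" as described above.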

Data Transfer from Local to Biowulf (step 3)

A message will appear if the transfer request was successfully submitted.

Data Transfer from Local to Biowulf (step 4)

Click on "ACTIVITY" to view details such as transfer progress. Once the transfer is complete, users will get an email.

Terminating a Transfer

In the activity monitor, users can select a transfer task and subsequently terminate it if needed.

Data Transfer from Local to Biowulf (step 5)

After the transfer completes, refresh the "NIH HPC Data Transfer (Biowulf)" endpoint and click on "LAST MODIFIED" to confirm that the data was successfully transferred.
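Users who want an additional check beyond the file listing can compare checksums of the source files against the transferred copies. The sketch below is one possible way to do this in Python and is not part of the Globus workflow itself: it builds a SHA-256 manifest for a folder and lists the files that are missing or different in a second folder. The function names are hypothetical, and the two temporary folders only stand in for the local folder and its copy on Biowulf.

```python
import hashlib
import tempfile
from pathlib import Path


def sha256_of(path, chunk_size=1024 * 1024):
    """Compute the SHA-256 checksum of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(folder):
    """Map each file's path (relative to folder) to its checksum."""
    root = Path(folder)
    return {
        p.relative_to(root).as_posix(): sha256_of(p)
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def differing_files(source_manifest, destination_manifest):
    """List source files that are missing or different at the destination."""
    return sorted(
        name
        for name, checksum in source_manifest.items()
        if destination_manifest.get(name) != checksum
    )


# Demonstration with two temporary folders standing in for source and copy.
with tempfile.TemporaryDirectory() as src, tempfile.TemporaryDirectory() as dst:
    Path(src, "reads.fastq").write_text("ACGT\n")
    Path(dst, "reads.fastq").write_text("ACGT\n")
    Path(src, "notes.txt").write_text("sample 1\n")  # never copied to dst
    mismatches = differing_files(build_manifest(src), build_manifest(dst))

print(mismatches)  # ['notes.txt']
```

In practice, build_manifest would be run once on the local computer and once on Biowulf, with the two manifests compared afterward. An empty list means every source file has an identical copy at the destination.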

Schedule Data Transfer

Users can also schedule data transfers.

Transfer from NCI CCR Sequencing Facility Data Management Environment: Overview

This example applies to those researchers who utilize the NCI CCR Sequencing Facility for sequencing experiments. The sequencing facility will:

  • Provide a link to their Data Management Environment (DME) for researchers to access their data.
  • Perform many of the analysis steps for the researchers, including quality control (QC).

If not using the NCI CCR Sequencing Facility, please check with the specific core about data management and transfer.

Make a New Folder in Globus

Open the "NIH HPC Data Transfer (Biowulf)" endpoint and create a folder for Globus transfers in the data directory. In the example below, the folder is named "globus_transfers".

Add Guest Collection

Next, go back to the "NIH HPC Data Transfer (Biowulf)" endpoint and click on "Add Guest Collection".

Provide Guest Collection Information

On the next page, supply the directory to link the guest collection to, for example the globus_transfers folder under the Biowulf data directory. Provide a display name and description for the guest collection and click "Create Collection" when done.

Granting Permission for NCI CCR SF DME to Transfer to Guest Collection (step 1)

Next, grant permission for the NCI CCR Sequencing Facility Data Management Environment to share data with the "globus demonstration" collection.

Granting Permission for NCI CCR SF DME to Transfer to Guest Collection (step 2)

  • Keep "/" in the Path box.
  • Be sure to select the option to share with a group.
  • Make sure that permissions are set to read and write.
  • Click "Select a Group" to find the group that will be granted permission to this endpoint.
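For readers who may later script this step, the choices made in the form above (path "/", share with a group, read and write) can be written down as a single access rule. The Python sketch below only builds and checks a dictionary and sends nothing to Globus. The field names are assumed to follow the access rule document of the Globus Transfer API and should be verified against the Globus documentation; the function name and the group UUID are made-up placeholders.

```python
def build_group_access_rule(group_id, path="/", permissions="rw"):
    """Describe a group permission on a guest collection as a dictionary."""
    # Assumption: "r" means read-only and "rw" means read and write.
    if permissions not in ("r", "rw"):
        raise ValueError("permissions must be 'r' or 'rw'")
    # Assumption: folder paths start and end with a slash, as "/" does.
    if not (path.startswith("/") and path.endswith("/")):
        raise ValueError("path must start and end with '/'")
    return {
        "DATA_TYPE": "access",
        "principal_type": "group",
        "principal": group_id,
        "path": path,
        "permissions": permissions,
    }


# Made-up placeholder, not the real sequencing facility group.
rule = build_group_access_rule("01234567-89ab-cdef-0123-456789abcdef")
print(rule)
```

Writing the rule out this way makes it easy to double-check the three settings that matter here: the path is "/", the principal is a group, and the permissions are read and write.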

Granting Permission for NCI CCR SF DME to Transfer to Guest Collection (step 3)

Enter the name of the group that should receive permission for the collection, or start typing and choose from the list of options that appears.

Granting Permission for NCI CCR SF DME to Transfer to Guest Collection (step 4)

The group being granted permission is now listed next to the "Group" column. Click "Add Permission" when ready.

Granting Permission for NCI CCR SF DME to Transfer to Guest Collection (step 5)

When prompted, click "Done" to complete the permission granting process.

Granting Permission for NCI CCR SF DME to Transfer to Guest Collection (step 6)

Look at the "Overview" for the "globus demonstration" collection. Note the UUID. This will be needed when transferring files from the sequencing facility DME to Biowulf.

Getting Data from Sequencing Facility DME (step 1)

On the sequencing facility DME page where data is stored, click on "Browse project data" to browse for specific data or to download all of the data.

Getting Data from Sequencing Facility DME (step 2)

This example will browse for specific data to download. Just click on the "Download" button to the far right of the file content table when ready.

Getting Data from Sequencing Facility DME (step 3)

On the next page, select the Globus radio button under "Transfer Type" and supply the Globus endpoint UUID.

Getting Data from Sequencing Facility DME (step 4)

Users will see the message highlighted in blue in the image below when the transfer request has been successfully submitted.

Getting Data from Sequencing Facility DME (step 5)

Click on "Manage" and then "Download Tasks" to check the data transfer. Its status will change to complete when done.

Getting Data from Sequencing Facility DME - task status

If a task status of "RESTORE_REQUESTED" appears, then it is likely that the dataset has been archived and DME has to retrieve it from a cloud service before transferring it to Biowulf. This process may take approximately 12 hours.

Getting Data from Sequencing Facility DME (step 6)

Users will also receive an email from the sequencing facility DME confirming that the transfer was completed.

Getting Data from Sequencing Facility DME (step 7)

The folder to which the data was transferred is now populated with content.