Usage Guide

This guide covers the main features and use cases of PyRemoteData.

Basic File Operations

Listing Files

from pyremotedata.implicit_mount import IOHandler

with IOHandler() as io:
    files = io.ls()
    print(f"Files in current directory: {files}")

    # List files in specific directory
    files = io.ls("/path/to/directory")
    print(f"Files in /path/to/directory: {files}")

Changing working directory

from pyremotedata.implicit_mount import IOHandler

with IOHandler() as io:
    io.cd("/remote/directory")
    print(f"Working directory: {io.pwd()}")

Downloading Files

from pyremotedata.implicit_mount import IOHandler

with IOHandler() as io:
    # Download single file
    local_path = io.download("/remote/file.txt", "/local")

    # Download directory
    local_path = io.download("/remote/directory", "/local/directory")

Uploading Files

from pyremotedata.implicit_mount import IOHandler

with IOHandler() as io:
    # Upload single file
    io.upload("/local/file.txt", "/remote")

    # Upload directory (use mirror for directories)
    io.upload("/local/directory", "/remote")

Advanced Operations

Batch Operations

Perform operations on multiple files:

from pyremotedata.implicit_mount import IOHandler

with IOHandler() as io:
     files = io.ls("/remote/dataset")

     # Download multiple files at once
     txt_files = [f"/remote/dataset/{file}" for file in files if file.endswith('.txt')]
     local_paths = io.download(txt_files, "/local/dataset")

Synchronizing directories

from pyremotedata.implicit_mount import IOHandler

with IOHandler() as io:
     # Navigate to the directory
     io.cd("my_directory")

     # Synchronize directory to local storage
     io.sync("<local_parent_directory>", progress=True)

Performance Optimization

Why RemotePathIterator?

RemotePathIterator streams many files efficiently by batching and prefetching downloads in a background thread while your main thread consumes files. This is ideal when:

  • You need steady throughput from a high-latency/high-bandwidth SFTP server

  • You process files one-by-one (e.g., parsing, feature extraction)

  • You want automatic local cleanup to avoid filling disks

Basic Pattern

from pyremotedata.implicit_mount import IOHandler, RemotePathIterator

with IOHandler() as io:
    # Build an index of files (persisted remotely unless store=False)
    iterator = RemotePathIterator(
        io_handler=io,
        batch_size=64,          # files per batch
        batch_parallel=10,      # parallel transfers per batch
        max_queued_batches=3,   # prefetch up to 3 batches
        n_local_files=64*3*2,   # keep enough local files to avoid deletion while consuming
        clear_local=True,       # automatically delete after consumption
        # kwargs forwarded to io.get_file_index()
        # store=True (default) creates a folder_index.txt on remote for faster reuse
        # pattern=r"\.jpg$" to filter
    )

    # Optional: change order or subset before iterating
    # iterator.shuffle()
    # iterator.subset(list_of_indices)

    for local_path, remote_path in iterator:
        # Process the file
        process_file(local_path, remote_path)

Controlling Throughput and Memory

  • batch_size: Larger batches reduce command overhead; increase until memory or server limits are hit

  • batch_parallel: More parallel transfers increase network utilization; tune for server fairness and stability

  • max_queued_batches: Prefetch depth; higher values smooth throughput but use more local storage

  • n_local_files: Must exceed batch_size * max_queued_batches. Use 2x that as a safe default

  • clear_local: Enable to automatically remove consumed files and control disk usage

Dataset Splits and Reuse

from pyremotedata.implicit_mount import IOHandler, RemotePathIterator

with IOHandler() as io:
    # RemotePathIterator loops over all files in the
    # current working directory and its subdirectories
    io.cd("/remote/training_data")
    it = RemotePathIterator(io, batch_size=64, batch_parallel=8, max_queued_batches=2)

    # Create non-overlapping splits for sequential use (not parallel)
    train_it, val_it = it.split(proportion=[0.8, 0.2])

    for lp, rp in train_it:
        train_step(lp, rp)

    for lp, rp in val_it:
        validate_step(lp, rp)

Indexing Strategies

RemotePathIterator uses io.get_file_index() underneath. You can speed up repeated runs by persisting the index on the remote folder (default).

with IOHandler() as io:
    # Persisted index (default: store=True); override=True rebuilds it
    it = RemotePathIterator(io, batch_size=64, store=True, override=False)

    # Read-only remote? Disable store (slower)
    it_ro = RemotePathIterator(io, batch_size=64, store=False)

Best Practices

  • Use context managers: Always use with statements to ensure proper cleanup

  • Handle large files: Use streaming for files larger than available memory

  • Batch operations: Group related operations to minimize connection overhead