Train a model on a CosmicAC GPU

In this tutorial, you will train a model on a CosmicAC GPU container. By the end, you will have run a training job on a GPU, seen live loss output in your terminal, and saved a model checkpoint inside the container.

This tutorial uses a simple training script so the focus stays on the CosmicAC workflow. Once you have completed it, you can swap in your own script and follow the same steps.

Before you begin

You will need:

A CosmicAC account
The CosmicAC CLI installed and authenticated. See Installation.

Create a GPU container job

You can create the job from the CLI or the web UI. This tutorial uses the CLI.

Run the interactive setup:

cosmicac jobs init

Follow the prompts. Select the GPU_CONTAINER type, the RTX H100 GPU, and the ubuntu:24.04 container image. For a walkthrough of every prompt, see Create a GPU Container Job.

PyTorch and CUDA are not pre-installed. You will install them yourself inside the container.

Review job.config.json and confirm the settings, then submit the job:

cosmicac jobs create

Check that your container is running:

cosmicac jobs list

Copy the job ID and container ID from the output. Your container is ready when the status shows Running.

Open a shell session

Connect to the container:

cosmicac jobs shell <jobId> <containerId>

Verify the GPU is available:

nvidia-smi

You should see your GPU listed with its VRAM. You are now inside the container.

Install pip and PyTorch

Install pip:

apt-get update && apt-get install -y python3-pip

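Install PyTorch with CUDA support. Ubuntu 24.04 marks the system Python as externally managed, so pip refuses system-wide installs unless you pass --break-system-packages. That is acceptable in a disposable container; alternatively, install into a virtual environment:

pip3 install --break-system-packages torch torchvision --index-url https://download.pytorch.org/whl/cu121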

Verify PyTorch can see the GPU:

python3 -c "import torch; print(torch.cuda.is_available())"

This should print True. If it prints False, check that nvidia-smi shows a GPU before continuing.
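If the one-liner prints False, a slightly longer check can help narrow down the cause. The following is a minimal sketch, not part of the CosmicAC tooling: the helper name describe_gpu is hypothetical, and it relies only on standard PyTorch attributes to report the CUDA version the wheel was built against and the name of the first visible device.

```python
def describe_gpu():
    """Collect basic facts about the PyTorch install and the visible GPU."""
    try:
        import torch
    except ImportError:
        # PyTorch is not installed in this Python environment
        return {"torch_installed": False}

    info = {
        "torch_installed": True,
        "torch_version": torch.__version__,
        # CUDA version the wheel was built against; None for CPU-only wheels
        "built_with_cuda": torch.version.cuda,
        "cuda_available": torch.cuda.is_available(),
    }
    if info["cuda_available"]:
        info["device_count"] = torch.cuda.device_count()
        info["device_name"] = torch.cuda.get_device_name(0)
    return info


for key, value in describe_gpu().items():
    print(f"{key}: {value}")
```

A built_with_cuda value of None means a CPU-only wheel was installed, which points to the install command rather than the container.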

Create the training script

Create a new file called train.py. If nano is not available in the image, install it first with apt-get install -y nano:

nano train.py

Paste the following script into the file. It trains a single linear layer to fit a straight line. The task is simple enough to finish in under a minute, yet it still produces real loss output and a checkpoint:

train.py

import torch
import torch.nn as nn
import os


# Seed for reproducibility
torch.manual_seed(42)

# Generate synthetic data: y = 3x + 1 with a small amount of noise
X = torch.linspace(0, 1, 100).unsqueeze(1)
y = 3 * X + 1 + 0.1 * torch.randn(100, 1)


# Select device — use the GPU if available, otherwise fall back to CPU
device = "cuda" if torch.cuda.is_available() else "cpu"
print(f"Training on: {device}")

X = X.to(device)
y = y.to(device)


# Define the model, optimizer, and loss function
model     = nn.Linear(1, 1).to(device)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
loss_fn   = nn.MSELoss()


# Training loop
for epoch in range(100):

    predictions = model(X)
    loss        = loss_fn(predictions, y)

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

    if (epoch + 1) % 10 == 0:
        print(f"Epoch {epoch + 1}/100 — loss: {loss.item():.4f}")


# Save the trained model weights
os.makedirs("./checkpoints", exist_ok=True)
torch.save(model.state_dict(), "./checkpoints/model.pt")
print("Checkpoint saved to ./checkpoints/model.pt")

Save and exit with Ctrl+X → Y → Enter.

Run the training job

python3 train.py

You should see output similar to the following. The exact loss values may differ:

Training on: cuda
Epoch 10/100 — loss: 0.3241
Epoch 20/100 — loss: 0.1823
Epoch 30/100 — loss: 0.1204
Epoch 40/100 — loss: 0.0961
Epoch 50/100 — loss: 0.0880
Epoch 60/100 — loss: 0.0852
Epoch 70/100 — loss: 0.0843
Epoch 80/100 — loss: 0.0840
Epoch 90/100 — loss: 0.0839
Epoch 100/100 — loss: 0.0839
Checkpoint saved to ./checkpoints/model.pt

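Training on: cuda confirms the GPU is being used. If you see Training on: cpu, run nvidia-smi to confirm the GPU is visible inside the container. If it is, the PyTorch install is the likely cause. Reinstall PyTorch from a wheel index that matches a CUDA version your driver supports; the header of the nvidia-smi output shows the highest CUDA version the driver supports.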
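Once training succeeds, you can confirm that the checkpoint holds what training produced by loading it back and inspecting the learned parameters. This is a minimal sketch: the function name load_line_model is hypothetical, and it assumes the checkpoint was written by the train.py script in this tutorial, so the model is a single nn.Linear(1, 1) layer. Because the data follows y = 3x + 1, the weight and bias should be heading toward 3 and 1, although 100 epochs may not be enough for them to converge fully.

```python
import os

import torch
import torch.nn as nn


def load_line_model(path):
    """Rebuild the tutorial's model and load saved weights into it."""
    model = nn.Linear(1, 1)  # must match the architecture used in train.py
    # map_location="cpu" lets the checkpoint load even without a GPU
    state = torch.load(path, map_location="cpu")
    model.load_state_dict(state)
    model.eval()
    return model


checkpoint_path = "./checkpoints/model.pt"
if os.path.exists(checkpoint_path):
    model = load_line_model(checkpoint_path)
    print(f"weight: {model.weight.item():.3f}, bias: {model.bias.item():.3f}")
    with torch.no_grad():
        # Prediction for x = 0.5; the noise-free target is 3 * 0.5 + 1 = 2.5
        prediction = model(torch.tensor([[0.5]])).item()
    print(f"prediction at x=0.5: {prediction:.3f}")
else:
    print(f"No checkpoint found at {checkpoint_path}; run train.py first")
```

Save this as a separate file next to train.py and run it with python3 from the same directory, so the relative checkpoint path resolves.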

Stop the job

Exit the shell session:

exit

Stop the container to avoid further charges:

cosmicac jobs stop <jobId>

What you have done

You created a CosmicAC GPU container job, installed PyTorch inside the container, ran a training script on the GPU, saved a model checkpoint, and stopped the job.

This tutorial used PyTorch, but you can install any ML framework the same way, such as TensorFlow, JAX, or Hugging Face Transformers. The CosmicAC workflow is the same for any training script. Swap in your own code and follow the same steps.
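When you swap in a longer-running script, you may want a checkpoint that lets you resume training rather than only reload the final weights. The following is a hedged sketch of one common PyTorch pattern, not a CosmicAC feature: the function names save_checkpoint and load_checkpoint and the dictionary keys are illustrative choices, and the demo writes to a temporary directory rather than ./checkpoints.

```python
import os
import tempfile

import torch
import torch.nn as nn


def save_checkpoint(path, model, optimizer, epoch):
    """Save everything needed to resume: weights, optimizer state, epoch."""
    torch.save(
        {
            "epoch": epoch,
            "model_state": model.state_dict(),
            "optimizer_state": optimizer.state_dict(),
        },
        path,
    )


def load_checkpoint(path, model, optimizer):
    """Restore model and optimizer in place; return the next epoch to run."""
    checkpoint = torch.load(path, map_location="cpu")
    model.load_state_dict(checkpoint["model_state"])
    optimizer.load_state_dict(checkpoint["optimizer_state"])
    return checkpoint["epoch"] + 1


# Demo with the same model and optimizer types as the tutorial script
model = nn.Linear(1, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
demo_path = os.path.join(tempfile.mkdtemp(), "resume_demo.pt")

save_checkpoint(demo_path, model, optimizer, epoch=9)
start_epoch = load_checkpoint(demo_path, model, optimizer)
print(f"Resuming from epoch index {start_epoch}")
```

In a real script, you would call save_checkpoint every few epochs inside the training loop and start the loop at the value returned by load_checkpoint.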

Next steps

GPU Container Job — understand how containers, shell access, and job lifecycle work on CosmicAC
Job Management CLI reference — full reference for all job commands
GPU Types — available GPU hardware and VRAM configurations
