Rubin DP1 Feature Generation¶
The pipeline supports running against Rubin Data Preview 1 (DP1) data stored as local parquet files, bypassing the TAP API entirely. This is the recommended approach for large-scale feature generation.
Prerequisites¶
You need three gzip-compressed parquet files downloaded from the Rubin Science Platform and placed in a single directory:
Configuration¶
Tell scope-ml to use local files instead of the TAP API by setting data_path in config.yaml:
Or set the environment variable:
When data_path (or RUBIN_DATA_PATH) is set, all Rubin commands automatically use the local parquet backend (RubinLocalClient) instead of the TAP client. No token is needed for local mode.
Single-Run Feature Generation¶
For a small number of sources (e.g. a cone search or a short object list):
# Cone search
generate-features-rubin --ra 62.0 --dec -37.0 --radius 60 --doCPU
# From a CSV file with an objectId column
generate-features-rubin --objectid-file my_objects.csv --doCPU
Output is written to generated_features_rubin/gen_features_rubin.parquet by default.
Large-Scale Processing with SLURM¶
For processing the full DP1 catalog, use the chunked SLURM workflow:
1. Prepare Chunks¶
Scan the local parquet files, filter to objects with enough detections, and split into chunk CSVs:
prepare-rubin-chunks \
--data-path /path/to/dp1_data/ \
--chunk-size 5000 \
--min-n-lc-points 50 \
--output-dir rubin_chunks
This writes rubin_chunks/chunk_000.csv, rubin_chunks/chunk_001.csv, etc., plus a master list rubin_chunks/all_eligible_objectids.csv.
2. Generate the SLURM Array Script¶
generate-features-rubin-slurm \
--chunk-dir rubin_chunks \
--output-dir rubin_slurm \
--venv /path/to/your/.venv \
--cpus-per-task 8 \
--top-n-periods 50
This writes rubin_slurm/run_rubin_features.sh. Edit the script to adjust partition, account, memory, and module loads for your cluster.
3. Submit the Array Job¶
Each array task processes one chunk and writes generated_features_rubin/gen_features_rubin_<TASK_ID>.parquet.
4. Combine Results¶
After all jobs finish:
combine-rubin-features \
--input-dir generated_features_rubin \
--output generated_features_rubin/dp1_features_combined.parquet
Single-Node Chunked Runner (No SLURM)¶
If you don't have a SLURM cluster, you can run chunks sequentially (or resume after interruption) with:
python tools/run_rubin_chunked.py \
--objectid-file rubin_chunks/all_eligible_objectids.csv \
--doCPU \
--chunk-size 5000 \
--top-n-periods 50
Completed chunks are saved to generated_features_rubin/chunks/ and skipped on restart, so the job is resumable.
CLI Reference¶
| Command | Description |
|---|---|
get-rubin-ids |
Discover object IDs via cone search or read from CSV |
generate-features-rubin |
Generate features for a set of Rubin sources |
prepare-rubin-chunks |
Scan local parquet files and split eligible objects into chunk CSVs |
generate-features-rubin-slurm |
Generate a SLURM array job script from chunk files |
combine-rubin-features |
Merge per-chunk parquet outputs into a single file |