Running Inference¶

Overview¶

Running inference requires the following steps: download IDs of a field, download (or generate) features for all downloaded IDs, then run inference for all available trained models.

get-quad-ids --field <field_number> --whole-field
get-features --field <field_number> --whole-field --impute-missing-features

OR

generate-features --field <field_number> --ccd <ccd_number> --quad <quad_number> --doGPU

The optimal way to run inference is through an inference script generated by running create-inference-script with the appropriate arguments. After creating the script and adding the needed permissions (e.g. using chmod +x), run inference on a field with:

create-inference-script --filename get_all_preds_xgb.sh \
  --group-name ss23 --algorithm xgb \
  --period-suffix ELS_ECE_EAOV --feature-directory generated_features

chmod +x get_all_preds_xgb.sh
./get_all_preds_xgb.sh <field_number>

Requires a models_dnn/ or models_xgb/ folder in the root directory containing the pre-trained models for DNN and XGBoost, respectively.
In a preds_dnn or preds_xgb directory, creates a single .parquet (and optionally .csv) file containing all IDs of the field in the rows and inference scores for different classes across the columns.
If running inference on specific IDs instead of a field/ccd/quad (e.g. on GCN sources), run ./get_all_preds.sh specific_ids.

Note

create-inference-script will raise an error if the inference script filename already exists.
Inference begins by imputing missing features using the strategies specified in the features: section of the config file.

Running Inference on HPC Resources¶

run-inference-slurm and run-inference-job-submission can be used to generate and submit SLURM scripts to run inference for all classifiers in parallel using HPC resources.

Examining Predictions¶

The result of running the inference script will be a parquet file containing some descriptive columns followed by columns for each classification's probability for each source in the field. By default, the file is located as follows:

import pathlib
path_preds = pathlib.Path.home() / "scope/preds_xgb/field_297/field_297.parquet"

SCoPe's read_parquet utility offers an easy way to read the predictions file and provide it as a pandas DataFrame:

from scope.utils import read_parquet
preds = read_parquet(path_preds)

Analyzing Predictions¶

Comparing DNN and XGB Scores¶

After running inference for multiple fields, compare DNN and XGB prediction agreement:

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from scope.utils import read_parquet

# Load predictions for one or more fields
field_list = [487, 563, 777]
path_to_dnn_preds = "preds_dnn"
path_to_xgb_preds = "preds_xgb"

dnn_frames, xgb_frames = [], []
for field in field_list:
    dnn_frames.append(read_parquet(f"{path_to_dnn_preds}/field_{field}/field_{field}.parquet"))
    xgb_frames.append(read_parquet(f"{path_to_xgb_preds}/field_{field}/field_{field}.parquet"))

field_preds_dnn = pd.concat(dnn_frames)
field_preds_xgb = pd.concat(xgb_frames)

# Merge into a single DataFrame
merge_cols = ["_id", "Gaia_EDR3___id", "AllWISE___id", "PS1_DR1___id",
              "ra", "dec", "period", "field", "ccd", "quad", "filter"]
dnn_xgb_preds = pd.merge(field_preds_dnn, field_preds_xgb, on=merge_cols)

DNN vs XGB Histogram¶

def hist_plot(classif):
    """Histogram comparing DNN and XGB score distributions."""
    fig, ax = plt.subplots()
    ax.hist(field_preds_dnn[classif + "_dnn"], bins=50, alpha=0.5, label="DNN")
    ax.hist(field_preds_xgb[classif + "_xgb"], bins=50, alpha=0.5, label="XGB")
    ax.set_xlabel("Score")
    ax.set_ylabel("Count")
    ax.set_title(classif)
    ax.legend()
    return fig

hist_plot("e")

DNN vs XGB Agreement Heatmap¶

def heatmap(classif):
    """2D histogram comparing DNN and XGB predictions."""
    dnn_scores = dnn_xgb_preds[classif + "_dnn"]
    xgb_scores = dnn_xgb_preds[classif + "_xgb"]

    fig, ax = plt.subplots()
    h = ax.hist2d(dnn_scores, xgb_scores, bins=50, cmap="viridis")
    plt.colorbar(h[3], ax=ax)
    ax.set_xlabel("DNN")
    ax.set_ylabel("XGB")
    ax.set_title(classif)

    # Agreement fraction at threshold
    thresh = 0.5
    agree = np.mean(
        (dnn_scores > thresh) == (xgb_scores > thresh)
    )
    return fig, agree

Training Set Label Distribution¶

Count the number of positive examples per class in the training set:

training_set = read_parquet("fritzDownload/training_set.parquet")

threshold = 0.7
counts = {}
for col in training_set.columns:
    if col.startswith(("_", "ra", "dec", "period", "field", "ccd", "quad")):
        continue
    n_positive = (training_set[col] >= threshold).sum()
    if n_positive > 0:
        counts[col] = n_positive

counts_df = pd.DataFrame.from_dict(counts, orient="index", columns=["count"])
counts_df = counts_df.sort_values("count", ascending=False)

Evaluating Training Results¶

After running scope.py assemble_training_stats for DNN and XGB algorithms, load and compare precision/recall:

import json
import glob

classifications = sorted(set(
    c.removesuffix("_dnn") for c in field_preds_dnn.columns if c.endswith("_dnn")
))

# Load stats from assembled JSON files
def load_stats(pattern):
    stats = {}
    for classif in classifications:
        files = glob.glob(pattern.format(classif=classif))
        if files:
            with open(files[0]) as f:
                stats[classif] = json.load(f)
    return pd.DataFrame.from_dict(stats, orient="index")

stats_dnn = load_stats("dnn_revised_stats/{classif}*.json")
stats_xgb = load_stats("xgb_revised_stats/{classif}*.json")

Precision/Recall Scatter Plot¶

fig, ax = plt.subplots(figsize=(6, 5))
ax.plot([0, 1], [0, 1], linestyle="--", color="black")
ax.scatter(
    stats_dnn["recall"], stats_xgb["recall"],
    c=counts_df.reindex(stats_dnn.index)["count"],
    cmap="viridis", edgecolors="k",
)
ax.set_xlabel("DNN Recall")
ax.set_ylabel("XGB Recall")
ax.set_title("DNN vs XGB Recall")
plt.colorbar(ax.collections[0], label="Positive examples")

Feature Importance (XGB)¶

Identify which features are most important across classifiers:

from collections import Counter

top_n = 3
top_features = []
for classif in classifications:
    files = glob.glob(f"xgb_feature_importances/{classif}*.json")
    if not files:
        continue
    with open(files[0]) as f:
        importance = json.load(f)
    features = sorted(importance, key=importance.get, reverse=True)[:top_n]
    top_features.extend(features)

feature_counts = Counter(top_features)
names, freqs = zip(*feature_counts.most_common(20))

fig, ax = plt.subplots(figsize=(10, 8))
ax.barh(names, freqs, color="navy")
ax.set_xlabel(f"Occurrences among top {top_n} features")

Handling Different File Formats¶

When our manipulations of pandas dataframes are complete, we want to save them in an appropriate file format with the desired metadata. Our code works with multiple formats, each of which have advantages and drawbacks:

Comma Separated Values (CSV, `.csv`)¶

In this format, data are plain text and columns are separated by commas. While this format offers a high level of human readability, it also takes more space to store and a longer time to write and read than other formats.

pandas offers the read_csv() function and to_csv() method to perform I/O operations with this format. Metadata must be included as plain text in the file.

Hierarchical Data Format (HDF5, `.h5`)¶

This format stores data in binary form, so it is not human-readable. It takes up less space on disk than CSV files, and it writes/reads faster for numerical data. HDF5 does not serialize data columns containing structures like a numpy array, so file size improvements over CSV can be diminished if these structures exist in the data.

pandas includes read_hdf() and to_hdf() to handle this format, and they require a package like PyTables to work. pandas does not currently support the reading and writing of metadata using the above function and method. See scope/utils.py for code that handles metadata in HDF5 files.

Apache Parquet (`.parquet`)¶

This format stores data in binary form like HDF5, so it is not human-readable. Like HDF5, Parquet also offers significant disk space savings over CSV. Unlike HDF5, Parquet supports structures like numpy arrays in data columns.

While pandas offers read_parquet() and to_parquet() to support this format (requiring e.g. PyArrow to work), these again do not support the reading and writing of metadata associated with the dataframe. See scope/utils.py for code that reads and writes metadata in Parquet files.

Mapping Between Column Names and Fritz Taxonomies¶

The column names of training set files and Fritz taxonomy classifications are not the same by default. Training sets may also contain columns that are not meant to be uploaded to Fritz. To address both of these issues, we use a "taxonomy mapper" file to connect local data and Fritz taxonomies.

This file must currently be generated manually, entry by entry. Each entry's key corresponds to a column name in the local file. The set of all keys is used to establish the columns of interest for upload or download. For example, if the training set includes columns that are not classifications, like RA and Dec, these columns should not be included among the entries in the mapper file. The code will then ignore these columns for the purpose of classification.

The fields associated with each key are fritz_label (containing the associated Fritz classification name) and taxonomy_id identifying the classification's taxonomy system. The mapper must have the following format, also demonstrated in golden_dataset_mapper.json and DNN_AL_mapper.json:

{
  "variable": {
    "fritz_label": "variable",
    "taxonomy_id": 1012
  },
  "periodic": {
    "fritz_label": "periodic",
    "taxonomy_id": 1012
  },
  "CV": {
    "fritz_label": "Cataclysmic",
    "taxonomy_id": 1011
  }
}

Running Automated Analyses¶

The primary deliverable of SCoPe is a catalog of variable source classifications across all of ZTF. Since ZTF contains billions of light curves, this catalog requires significant compute resources to assemble. We may still want to study ZTF's expansive collection of data with SCoPe before the classification catalog is complete. For example, SCoPe classifiers can be applied to the realm of transient follow-up.

It is useful to know the classifications of any persistent ZTF sources that are close to transient candidates on the sky. Once SCoPe's primary deliverable is complete, obtaining these classifications will involve a straightforward database query. Presently, however, we must run the SCoPe workflow on a custom list of sources repeatedly to account for the rapidly changing landscape of transient events. See "Guide for Fritz Scanners" for a more detailed explanation of the workflow itself.

`cron` Job Basics¶

cron runs scripts at specific time intervals in a simple environment. While this simplicity fosters compatibility between different operating systems, the trade-off is that some extra steps are required to run scripts compared to more familiar coding environments (e.g. within scope-env for this project).

To set up a cron job, first run EDITOR=emacs crontab -e. You can replace emacs with your text editor of choice as long as it is installed on your machine. This command will open a text file in which to place cron commands. An example command is as follows:

0 */2 * * * cd scope && ~/miniforge3/envs/scope-env/bin/python ~/scope/gcn_cronjob.py > ~/scope/log_gcn_cronjob.txt 2>&1

Above, the 0 */2 * * * means that this command will run every two hours, on minute 0 of that hour. Time increments increase from left to right; in this example, the five numbers are minute, hour, day (of month), month, day (of week). The */2 means that the hour has to be divisible by 2 for the job to run. Check out crontab.guru to learn more about cron timing syntax.

Next in the line, we change directories to scope in order for the code to access our config.yaml file located in this directory. Then, ~/miniforge3/envs/scope-env/bin/python ~/scope/gcn_cronjob.py is the command that gets run (using the Python environment installed in scope-env). The > character forwards the output from the command into a log file at ~/scope/log_gcn_cronjob.txt. Finally, the 2>&1 suppresses "emails" from cron about the status of your job (unnecessary since the log is being saved to the user-specified file).

Save the text file once you finish modifying it to install the cron job. Ensure that the last line of your file is a newline to avoid issues when running. Your computer may pop up a window to which you should respond in the affirmative in order to successfully initialize the job. To check which cron jobs have been installed, run crontab -l. To uninstall your jobs, run crontab -r.

Additional Details for `cron` Environment¶

Because cron runs in a simple environment, the usual details of environment setup and paths cannot be overlooked. In order for the above job to work, we need to add more information when we run EDITOR=emacs crontab -e. The lines below will produce a successful run (if SCoPe is installed in your home directory):

PYTHONPATH = /Users/username/scope

0 */2 * * * /opt/homebrew/bin/gtimeout 2h ~/miniforge3/envs/scope-env/bin/python ~/scope/gcn_cronjob.py > ~/scope/log_gcn_cronjob.txt 2>&1

In the first line above, the PYTHONPATH environment variable is defined to include the scope directory. Without this line, any code that imports from scope will throw an error, since the user's usual PYTHONPATH variable is not accessed in the cron environment.

The second line begins with the familiar cron timing pattern described above. It continues by specifying a maximum runtime of 2 hours before timing out using the gtimeout command. On a Mac, this can be installed with homebrew by running brew install coreutils. Note that the full path to gtimeout must be specified. After the timeout comes the call to the gcn_cronjob.py script. Note that the usual #!/usr/bin/env python line at the top of SCoPe's Python scripts does not work within the cron environment. Instead, python must be explicitly specified, and in order to have access to the modules and scripts installed in scope-env we must provide a full path like the one above (~/miniforge3/envs/scope-env/bin/python). The line concludes by sending the script's output to a dedicated log file. This file gets overwritten each time the script runs.

Check if `cron` Job is Running¶

It can be useful to know whether the script within a cron job is currently running. One way to do this for gcn_cronjob.py is to run the command ps aux | grep gcn_cronjob.py. This will always return one item (representing the command you just ran), but if the script is currently running you will see more than one item.

Local Feature Generation/Inference¶

SCoPe contains a script that runs local feature generation and inference on sources specified in an input file. Example input files are contained within the tools directory (local_scope_radec.csv and local_scope_ztfid.csv). After receiving either ra/dec coordinates or ZTF light curve IDs (plus an object ID for each entry), the run-scope-local script will generate features and run inference using existing trained models, saving the results to timestamped directories. This script accepts most arguments from generate-features and scope-inference.

Additional Inputs¶

#	Argument	Description
1	`--path-dataset`	Path (from base scope directory or fully qualified) to parquet, HDF5 or CSV file containing specific sources (str)
2	`--cone-radius-arcsec`	Radius of cone search query for ZTF lightcurve IDs, if inputting ra/dec (float)
3	`--save-sources-filepath`	Path to parquet, HDF5 or CSV file to save specific sources (str)
4	`--algorithms`	ML algorithms to run (currently `dnn`/`xgb`)
5	`--group-names`	Group names of trained models (with order corresponding to `--algorithms` input)

Output: current_dt -- formatted datetime string used to label output directories.

Example Usage¶

run-scope-local --path-dataset tools/local_scope_ztfid.csv \
  --doCPU --doRemoveTerrestrial --scale_features min_max \
  --group-names DR16_stats nobalance_DR16_DNN_stats --algorithms xgb

run-scope-local --path-dataset tools/local_scope_radec.csv \
  --doCPU --write_csv --doRemoveTerrestrial \
  --group-names DR16_stats nobalance_DR16_DNN_stats --algorithms xgb dnn

Fritz Tools¶

scope-download-classification¶

Downloads classifications from Fritz and optionally merges with features from Kowalski.

Inputs:

#	Argument	Description
1	`--file`	CSV file containing obj_id and/or ra dec coordinates. Set to "parse" to download sources by group ID
2	`--group-ids`	Target group ID(s) on Fritz for download, space-separated (if CSV file not provided)
3	`--start`	Index or page number (if in "parse" mode) to begin downloading (optional)
4	`--merge-features`	Flag to merge features from Kowalski with downloaded sources
5	`--features-catalog`	Name of features catalog to query
6	`--features-limit`	Limit on number of sources to query at once
7	`--taxonomy-map`	Filename of taxonomy mapper (JSON format)
8	`--output-dir`	Name of directory to save downloaded files
9	`--output-filename`	Name of file containing merged classifications and features
10	`--output-format`	Output format of saved files, if not specified in (9). Must be one of parquet, h5, or csv
11	`--get-ztf-filters`	Flag to add ZTF filter IDs (separate catalog query) to default features
12	`--impute-missing-features`	Flag to impute missing features using `scope.utils.impute_features`
13	`--update-training-set`	If downloading an active learning sample, update the training set with the new classification based on votes
14	`--updated-training-set-prefix`	Prefix to add to updated training set file
15	`--min-vote-diff`	Minimum number of net votes (upvotes - downvotes) to keep an active learning classification. Caution: if zero, all classifications of reviewed sources will be added

Process:

If CSV file provided, query by object IDs or ra, dec
If CSV file not provided, bulk query based on group ID(s)
Get the classification/probabilities/periods of the objects in the dataset from Fritz
Append these values as new columns on the dataset, save to new file
If merge_features, query Kowalski and merge sources with features, saving new file
Fritz sources with multiple associated ZTF IDs will generate multiple rows in the merged feature file
To skip the source download, provide an input CSV file containing columns named obj_id, classification, probability, period_origin, period, ztf_id_origin, and ztf_id
Set --update-training-set to read the config-specified training set and merge new sources/classifications from an active learning group

scope-download-classification --file sample.csv --group-ids 360 361 --start 10 \
  --merge-features True --features-catalog ZTF_source_features_DR16 \
  --features-limit 5000 --taxonomy-map golden_dataset_mapper.json \
  --output-dir fritzDownload --output-filename merged_classifications_features \
  --output-format parquet --get-ztf-filters --impute-missing-features

scope-download-gcn-sources¶

Downloads sources associated with GCN events from Fritz.

Inputs:

#	Argument	Description
1	`--dateobs`	Unique dateObs of GCN event (str)
2	`--group-ids`	Group IDs to query sources, space-separated (all if not specified)
3	`--days-range`	Max days past event to search for sources (float)
4	`--radius-arcsec`	Radius (arcsec) around new sources to search for existing ZTF sources (float)
5	`--save-filename`	Filename to save source IDs/coordinates (str)

Process:

Query all sources associated with GCN event
Get Fritz names, RAs and Decs for each page of sources
Save JSON file in a useful format to use with generate-features --doSpecificIDs

scope-download-gcn-sources --dateobs 2023-05-21T05:30:43

scope-upload-classification¶

Uploads classifications and photometry to Fritz.

Inputs:

#	Argument	Description
1	`--file`	Path to CSV, HDF5 or Parquet file containing ra, dec, period, and labels
2	`--group-ids`	Target group ID(s) on Fritz for upload, space-separated
3	`--classification`	Name(s) of input file columns containing classification probabilities (one column per label). Set to "read" to automatically upload all classes specified in the taxonomy mapper
4	`--taxonomy-map`	Filename of taxonomy mapper (JSON format)
5	`--comment`	Comment to post (if specified)
6	`--start`	Index to start uploading (zero-based)
7	`--stop`	Index to stop uploading (inclusive)
8	`--classification-origin`	Origin of classifications. If "SCoPe" (default), Fritz will apply custom color-coding
9	`--skip-phot`	Flag to skip photometry upload (skips for existing sources only)
10	`--post-survey-id`	Flag to post an annotation for the Gaia, AllWISE or PS1 ID associated with each source
11	`--survey-id-origin`	Annotation origin name for survey_id
12	`--p-threshold`	Probability threshold for posted classification (values must be >= this number to post)
13	`--match-ids`	Flag to match input and existing survey_id values during upload. It is recommended to instead match obj_ids (see next line)
14	`--use-existing-obj-id`	Flag to use existing source names in a column named "obj_id" (a coordinate-based ID is otherwise generated by default)
15	`--post-upvote`	Flag to post an upvote to newly uploaded classifications. Not recommended when posting automated classifications for active learning
16	`--check-labelled-box`	Flag to check the "labelled" box for each source when uploading classifications. Not recommended when posting automated classifications for active learning
17	`--write-obj-id`	Flag to output a copy of the input file with an "obj_id" column containing the coordinate-based IDs for each posted object. Use this file as input for future uploads to add to this column
18	`--result-dir`	Name of directory where upload results file is saved. Default is "fritzUpload" within the tools directory
19	`--result-filetag`	Name of tag appended to the result filename. Default is "fritzUpload"
20	`--result-format`	Result file format; one of csv, h5 or parquet. Default is parquet
21	`--replace-classifications`	Flag to delete each source's existing classifications before posting new ones
22	`--radius-arcsec`	Photometry search radius for uploaded sources
23	`--no-ml`	Flag to post classifications that do not originate from an ML classifier
24	`--post-phot-as-comment`	Flag to post photometry as a comment on the source
25	`--post-phasefolded-phot`	Flag to post phase-folded photometry as comment in addition to time series
26	`--phot-dirname`	Name of directory in which to save photometry plots (str)
27	`--instrument-name`	Name of instrument used for observations (str)

Process:

Include Kowalski host, port, protocol, and token or username+password in config.yaml
Check if each input source exists by comparing input and existing obj_ids and/or survey_ids
Save the objects to Fritz group if new
In batches, upload the classifications of the objects in the dataset to target group on Fritz
Duplicate classifications will not be uploaded to Fritz. If n classifications are manually specified, probabilities will be sourced from the last n columns of the dataset
Post survey_id annotations
(Post comment to each uploaded source)

scope-upload-classification --file sample.csv --group-ids 500 250 750 \
  --classification variable flaring --taxonomy-map map.json \
  --comment confident --start 35 --stop 50 --skip-phot \
  --p-threshold 0.9 --write-obj-id --result-format csv \
  --use-existing-obj-id --post-survey-id --replace-classifications

scope-manage-annotation¶

Manages annotations on Fritz sources (post, update, or delete).

Inputs:

#	Argument	Description
1	`--action`	One of "post", "update", or "delete"
2	`--source`	ZTF ID or path to `.csv` file with multiple objects (ID column "obj_id")
3	`--group-ids`	Target group ID(s) on Fritz, space-separated
4	`--origin`	Name of annotation
5	`--key`	Name of annotation
6	`--value`	Value of annotation (required for "post" and "update" -- if source is a `.csv` file, value will auto-populate from `source[key]`)

Process:

For each source, find existing annotations (for "update" and "delete" actions)
Interact with API to make desired changes to annotations
Confirm changes with printed messages

scope-manage-annotation --action post --source sample.csv \
  --group_ids 200 300 400 --origin revisedperiod --key period

Running Inference¶

Overview¶

Running Inference on HPC Resources¶

Examining Predictions¶

Analyzing Predictions¶

Comparing DNN and XGB Scores¶

DNN vs XGB Histogram¶

DNN vs XGB Agreement Heatmap¶

Training Set Label Distribution¶

Evaluating Training Results¶

Precision/Recall Scatter Plot¶

Feature Importance (XGB)¶

Handling Different File Formats¶

Comma Separated Values (CSV, .csv)¶

Hierarchical Data Format (HDF5, .h5)¶

Apache Parquet (.parquet)¶

Mapping Between Column Names and Fritz Taxonomies¶

Running Automated Analyses¶

cron Job Basics¶

Additional Details for cron Environment¶

Check if cron Job is Running¶

Local Feature Generation/Inference¶

Additional Inputs¶

Example Usage¶

Fritz Tools¶

scope-download-classification¶

scope-download-gcn-sources¶

scope-upload-classification¶

scope-manage-annotation¶

Comma Separated Values (CSV, `.csv`)¶

Hierarchical Data Format (HDF5, `.h5`)¶

Apache Parquet (`.parquet`)¶

`cron` Job Basics¶

Additional Details for `cron` Environment¶

Check if `cron` Job is Running¶