🤺 Evaluation criteria

Evaluation will be performed on held-out test cases of 200 patients. Test cases are split in two subgroups: 100 are drawn from the same hospital as the training cases (University Hospital Tübingen, Germany) and 100 are drawn from a different hospital (University Hospital of LMU in Munich, Germany) with similar acquisition protocols.

A combination of three metrics reflecting the aims and specific challenges of PET lesion segmentation will be used:

  1. Foreground Dice score of segmented lesions
  2. Volume of false positive connected components that do not overlap with positives (=false positive volume)
  3. Volume of positive connected components in the ground truth that do not overlap with the estimated segmentation mask (=false negative volume)

For test cases that contain no positives (no FDG-avid lesions), only metric 2 (false positive volume) will be used.
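The three metrics above can be sketched as follows. This is a minimal illustration, not the official implementation (which is provided in the repository linked below); it assumes binary 3D masks and an optional per-voxel volume factor, and the function names are my own.

```python
import numpy as np
from scipy import ndimage


def dice_score(gt, pred):
    """Metric 1: foreground Dice between binary masks."""
    overlap = np.logical_and(gt, pred).sum()
    denom = gt.sum() + pred.sum()
    return 2.0 * overlap / denom if denom > 0 else 1.0


def false_positive_volume(gt, pred, voxel_vol=1.0):
    """Metric 2: volume of predicted connected components
    that have no overlap with the ground truth."""
    labels, n = ndimage.label(pred)
    voxels = 0
    for i in range(1, n + 1):
        comp = labels == i
        if not np.logical_and(comp, gt).any():
            voxels += comp.sum()
    return voxels * voxel_vol


def false_negative_volume(gt, pred, voxel_vol=1.0):
    """Metric 3: volume of ground-truth connected components
    that are entirely missed by the prediction."""
    labels, n = ndimage.label(gt)
    voxels = 0
    for i in range(1, n + 1):
        comp = labels == i
        if not np.logical_and(comp, pred).any():
            voxels += comp.sum()
    return voxels * voxel_vol
```

Note that metrics 2 and 3 operate on whole connected components: a predicted component counts toward the false positive volume only if it touches no ground-truth voxel at all, and likewise a lesion counts toward the false negative volume only if it is missed entirely.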

Figure: Example of the evaluation. The Dice score measures the overlap between the predicted lesion segmentation (blue) and the ground truth (red). In addition, special emphasis is put on false positives by measuring their volume (i.e., large false positives such as brain or bladder uptake will result in a low score) and on false negatives by measuring their volume (i.e., entirely missed lesions).

A Python script computing these evaluation metrics is provided at https://github.com/lab-midas/autoPET.


📈 Ranking

The submitted algorithms will be ranked according to:

Step 1: Separate rankings will be computed for each metric (metric 1: higher Dice score is better; metrics 2 and 3: lower volume is better)

Step 2: From the three ranking tables, the mean ranking of each participant will be computed as the weighted mean of the individual rankings (metric 1: 50 % weight; metrics 2 and 3: 25 % weight each)

Step 3: In case of equal mean rankings, the achieved Dice score will be used as a tie-breaker.
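The three ranking steps can be sketched as a short aggregation routine. This is an illustrative reading of the scheme, not the organizers' code; the input format (one tuple of Dice, false positive volume, and false negative volume per participant) is my own assumption.

```python
def aggregate_ranking(results):
    """results: dict mapping participant name -> (dice, fp_volume, fn_volume).
    Returns participant names sorted best-first."""
    names = list(results)

    # Step 1: separate ranking per metric (rank 1 = best).
    def ranks(idx, higher_is_better):
        order = sorted(names, key=lambda n: results[n][idx],
                       reverse=higher_is_better)
        return {n: i + 1 for i, n in enumerate(order)}

    r_dice = ranks(0, True)    # higher Dice is better
    r_fp = ranks(1, False)     # lower false positive volume is better
    r_fn = ranks(2, False)     # lower false negative volume is better

    # Step 2: weighted mean of the ranks (Dice 50 %, FP/FN 25 % each).
    mean_rank = {n: 0.5 * r_dice[n] + 0.25 * r_fp[n] + 0.25 * r_fn[n]
                 for n in names}

    # Step 3: tie-break equal mean ranks by the achieved Dice (higher wins).
    return sorted(names, key=lambda n: (mean_rank[n], -results[n][0]))
```

For example, a participant who is first in Dice and false negative volume but second in false positive volume gets a mean rank of 0.5·1 + 0.25·2 + 0.25·1 = 1.25.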