Evaluation Process

Participants are allowed to merge the subsets and train with external datasets.

Participants will provide:

1. The full matrix of pairwise distances between all images in each subset

2. A binary label (0 = authentic, 1 = imitation) for each image in each subset, or a merged CSV file covering all three subsets with the labels “Pizan” or “not-Pizan”, “Tasso” or “not-Tasso”, and “Yaqut” or “not-Yaqut” (see the serialization sketch after this list)

3. Documentation of the classification process, namely which script features influence the distance between images and the final classification
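To make outputs 1 and 2 concrete, the Python sketch below shows one possible way to serialize them for a single subset as CSV files. The file names, column headers, and dummy values are illustrative assumptions, not a prescribed submission format.

```python
import csv

def write_distances(path, image_ids, dist_matrix):
    """Write the full pairwise distance matrix of one subset as a CSV table."""
    with open(path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["image_id"] + image_ids)  # header: one column per image
        for image_id, row in zip(image_ids, dist_matrix):
            writer.writerow([image_id] + [f"{d:.6f}" for d in row])

def write_labels(path, image_ids, labels):
    """Write one binary label (0 = authentic, 1 = imitation) per image."""
    with open(path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["image_id", "label"])
        for image_id, label in zip(image_ids, labels):
            writer.writerow([image_id, label])

# Dummy example for a two-image subset:
ids = ["pizan_001.jpg", "pizan_002.jpg"]
write_distances("pizan_distances.csv", ids, [[0.0, 0.42], [0.42, 0.0]])
write_labels("pizan_labels.csv", ids, [0, 1])
```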

As the task is a crisp classification task, the ranking is based on the average accuracy of output 2, the classification labels.

With:
True Negative (TN): Authentic correctly identified as authentic
False Positive (FP): Authentic incorrectly identified as imitation
False Negative (FN): Imitation incorrectly identified as authentic
True Positive (TP): Imitation correctly identified as imitation

Standard metrics are as follows (see the computation sketch after the list):
Accuracy: Acc = (TP + TN) / (TP + TN + FP + FN)
Precision (Positive Predictive Value): Prec = TP / (TP + FP)
Recall (Sensitivity, True Positive Rate): Rec = TP / (TP + FN)
Specificity (True Negative Rate): Spec = TN / (TN + FP)
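As a reference, here is a minimal Python sketch of these counts and metrics, assuming labels are given as lists of 0/1 integers with 1 = imitation as the positive class; the function names are illustrative.

```python
def confusion_counts(y_true, y_pred):
    """Count TP, TN, FP, FN with label 1 = imitation as the positive class."""
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    tn = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    return tp, tn, fp, fn

def standard_metrics(y_true, y_pred):
    """Accuracy, precision, recall, and specificity as defined above."""
    tp, tn, fp, fn = confusion_counts(y_true, y_pred)
    return {
        "accuracy":    (tp + tn) / (tp + tn + fp + fn),
        "precision":   tp / (tp + fp) if tp + fp else 0.0,
        "recall":      tp / (tp + fn) if tp + fn else 0.0,
        "specificity": tn / (tn + fp) if tn + fp else 0.0,
    }

print(standard_metrics([0, 1, 1, 0, 1], [0, 1, 0, 1, 1]))
```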


The overall ranking is based on the Unweighted Average Recall (UAR), also known as Balanced Accuracy or macro-averaged recall. This metric ensures fair evaluation across subsets of different sizes by computing accuracy for each subset independently and then averaging.

Let S be the total number of subsets. Then the Overall Accuracy is:

\[
\text{Overall Accuracy} = \frac{1}{S} \sum_{i=1}^{S} \frac{1}{N_i} \sum_{j=1}^{N_i} \mathbf{1}\{\hat{y}_j^{(i)} = y_j^{(i)}\}
\]

where:

● N_i = number of samples in subset i,

● ŷ_j^(i) = predicted label for sample j in subset i,

● y_j^(i) = true label,

● 1{·} = 1 if the prediction is correct, 0 otherwise.
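A minimal sketch of this per-subset averaging follows, assuming plain Python lists of 0/1 labels per subset; the subset assignments and dummy values are illustrative.

```python
def subset_accuracy(y_true, y_pred):
    """Fraction of correctly predicted samples within one subset."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def overall_accuracy(subsets):
    """Unweighted mean of per-subset accuracies, independent of subset size."""
    per_subset = [subset_accuracy(y_true, y_pred) for y_true, y_pred in subsets]
    return sum(per_subset) / len(per_subset)

# Dummy example with three subsets of different sizes (e.g. Pizan, Tasso, Yaqut):
subsets = [
    ([0, 1, 1], [0, 1, 0]),
    ([1, 0, 1, 0], [1, 0, 1, 1]),
    ([0, 0, 1, 1, 1], [0, 0, 1, 1, 0]),
]
print(overall_accuracy(subsets))  # mean of 2/3, 3/4, and 4/5
```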


We will not apply asymmetric costs between false positives and false negatives. While in forensic authentication tasks accepting a forgery (FN) is typically considered more severe than rejecting an authentic work (FP), there is no established consensus in the literature on the appropriate cost ratio for writer identification tasks. For transparency and comparability, we therefore use symmetric evaluation.

All patches and images are weighted equally in the evaluation, regardless of differences in textual content length. This design choice prioritizes:

- Simplicity and transparency of evaluation
- Comparability across diverse submissions
- Avoidance of subjective weighting schemes

We acknowledge that longer texts may provide more discriminative information, but leave content-aware weighting to future iterations of the competition.

Distance-Based Retrieval Metrics (Diagnostic)

In addition to classification accuracy, we evaluate the quality of the learned distance metric through retrieval metrics. These metrics help diagnose whether errors stem from poor feature representations or suboptimal classification thresholds.


● Top-k Retrieval Accuracy

For each image in a subset, we rank all other images by their distance (in ascending order) and check whether at least one image of the same class appears among the k nearest neighbors.

Top-k Accuracy is defined as:

\[
\text{Top-}k\ \text{Accuracy} = \frac{1}{N} \sum_{i=1}^{N} \mathbf{1}\{\exists\, j \in \text{top-}k(i) : y_j = y_i\}
\]

where:

- N = total number of images in the subset
- top-k(i) = set of k nearest neighbors to image i (excluding i itself)
- y_i = true label of image i
- 1{·} = indicator function (1 if at least one same-class image is in top-k)

We report Top-1, Top-3, and Top-5 Accuracy, averaged across all subsets:

\[
\overline{\text{Top-}k} = \frac{1}{S} \sum_{i=1}^{S} \text{Top-}k\ \text{Accuracy}_i
\]

Although the task is a crisp classification task, top-k accuracy complements the classification accuracy: a high top-k accuracy combined with a low classification accuracy indicates that the distance metric is good but the classification threshold needs tuning.
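The retrieval metric can be computed directly from the submitted distance matrices. The sketch below assumes one square distance matrix (as nested lists) and the true labels per subset; it is an illustrative implementation, not the official scoring script.

```python
def topk_accuracy(dist, labels, k):
    """Top-k retrieval accuracy for one subset.

    dist:   square matrix of pairwise distances (dist[i][j] = distance from i to j)
    labels: true class label per image
    k:      number of nearest neighbors to inspect (excluding the query itself)
    """
    n = len(labels)
    hits = 0
    for i in range(n):
        # Rank all other images by ascending distance to image i and keep the k nearest.
        neighbors = sorted((j for j in range(n) if j != i),
                           key=lambda j: dist[i][j])[:k]
        # Count a hit if at least one neighbor shares the true label of image i.
        hits += any(labels[j] == labels[i] for j in neighbors)
    return hits / n

def mean_topk(subsets, k):
    """Average top-k accuracy over all subsets."""
    return sum(topk_accuracy(dist, labels, k) for dist, labels in subsets) / len(subsets)

# Toy example: images 0 and 2 share a class and are each other's nearest neighbor.
dist = [[0.0, 0.9, 0.2],
        [0.9, 0.0, 0.8],
        [0.2, 0.8, 0.0]]
print(topk_accuracy(dist, [1, 0, 1], k=1))  # 2 of 3 queries succeed -> ~0.67
```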


Ranking Impact

These metrics are NOT used for the primary ranking but will be:
- Reported on the leaderboard for analysis
- Used as tie-breakers if UAR scores are equal after all other tie-breaking criteria, applied in the order Top-1, Top-3, Top-5 (see the ranking sketch after this list)
- Included in the competition report to identify different system strengths
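For illustration, the tie-breaking order can be expressed as a lexicographic sort key over (UAR, Top-1, Top-3, Top-5); the team names and scores below are hypothetical.

```python
# Hypothetical leaderboard entries: (team, UAR, Top-1, Top-3, Top-5).
entries = [
    ("team_a", 0.91, 0.88, 0.93, 0.95),
    ("team_b", 0.91, 0.90, 0.92, 0.96),
    ("team_c", 0.89, 0.95, 0.97, 0.98),
]

# Rank primarily by UAR; ties are broken by Top-1, then Top-3, then Top-5 (all descending).
ranking = sorted(entries, key=lambda e: (e[1], e[2], e[3], e[4]), reverse=True)
for rank, (team, uar, t1, t3, t5) in enumerate(ranking, start=1):
    print(rank, team, uar, t1, t3, t5)  # team_b outranks team_a via the Top-1 tie-break
```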
