Output Files¶
Understanding the files generated by StringSight analysis.
When you run explain() or label() with an output_dir, StringSight saves comprehensive results across multiple file formats. This guide explains each output file and how to use them.
Core Output Files¶
1. clustered_results.parquet¶
Primary results file with all analysis data
This is the main output file containing your original data enriched with extracted properties and cluster assignments.
import pandas as pd
# Load complete results
df = pd.read_parquet("results/clustered_results.parquet")
# Key columns added by StringSight:
print(df.columns)
# ['question_id', 'model', 'model_response', # Original data
# 'property_description', 'property_evidence', # Extracted properties
# 'property_description_cluster_id', # Cluster assignments
# 'property_description_cluster_label'] # Human-readable cluster names
# Example: Find all responses in a specific cluster
cluster_data = df[df['property_description_cluster_label'] == 'Detailed Technical Explanations']
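Because everything lives in one DataFrame, quick aggregations are one-liners. As a minimal sketch (using only the columns listed above), here is a cross-tabulation of cluster labels by model:
import pandas as pd
df = pd.read_parquet("results/clustered_results.parquet")
# Count how often each model exhibits each behavioral cluster
cluster_by_model = pd.crosstab(df['property_description_cluster_label'], df['model'])
print(cluster_by_model.head(10))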
Use this file for:
- Interactive analysis and visualization
- Building custom dashboards
- Statistical analysis of results
- Feeding into downstream ML pipelines
2. Metrics DataFrames (JSONL)¶
Model- and cluster-level metrics as DataFrames
These files are optimized for frontend and analysis workflows:
- model_cluster_scores_df.jsonl — Per model-cluster metrics
- cluster_scores_df.jsonl — Per cluster aggregated metrics
- model_scores_df.jsonl — Per model aggregated metrics
import pandas as pd
model_cluster = pd.read_json("results/model_cluster_scores_df.jsonl", lines=True)
cluster_scores = pd.read_json("results/cluster_scores_df.jsonl", lines=True)
model_scores = pd.read_json("results/model_scores_df.jsonl", lines=True)
# Example: top clusters for a given model
gpt4 = model_cluster[model_cluster["model"] == "gpt-4"]
print(gpt4.sort_values("proportion", ascending=False).head(10)[["cluster", "proportion"]])
Use these files for:
- Model leaderboards and rankings
- Performance comparisons
- Quality assessment reports
- Automated model selection
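For example, a simple leaderboard can be built from model_scores_df.jsonl. Note that the metric column name ("score" below) is a placeholder: inspect the DataFrame's columns for the metrics your run actually produced.
import pandas as pd
model_scores = pd.read_json("results/model_scores_df.jsonl", lines=True)
# Inspect the available metric columns first; names depend on the metrics configuration
print(model_scores.columns.tolist())
# Rank models by a metric column ("score" is a placeholder name)
leaderboard = model_scores.sort_values("score", ascending=False)
print(leaderboard[["model", "score"]].to_string(index=False))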
3. full_dataset.json¶
Complete dataset for reanalysis and caching
Contains the entire PropertyDataset object with all conversations, properties, clusters, and metadata.
from stringsight.core.data_objects import PropertyDataset
# Load complete dataset
dataset = PropertyDataset.load("results/full_dataset.json")
# Access all components:
print(f"Conversations: {len(dataset.conversations)}")
print(f"Properties: {len(dataset.properties)}")
print(f"Clusters: {len(dataset.clusters)}")
print(f"Models: {dataset.all_models}")
# Rerun metrics with different parameters
from stringsight import compute_metrics_only
clustered_df, new_stats = compute_metrics_only(
    "results/full_dataset.json",
    method="single_model",
    output_dir="results_updated/"
)
Use this file for:
- Recomputing metrics without re-extracting properties
- Debugging and troubleshooting
- Building analysis pipelines
- Sharing complete analysis state
Additional Output Files¶
Processing Stage Files¶
Property Extraction:
- raw_properties.jsonl - Raw LLM responses before parsing
- extraction_stats.json - API call statistics and timing
- extraction_samples.jsonl - Sample inputs/outputs for debugging
JSON Parsing:
- parsed_properties.jsonl - Successfully parsed property objects
- parsing_stats.json - Parsing success/failure statistics
- parsing_failures.jsonl - Failed parsing attempts for debugging
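If parsing failures occur, parsing_failures.jsonl can be inspected directly. A minimal sketch (the exact fields per record depend on your run):
import json
# Print the first five failed parsing attempts
with open("results/parsing_failures.jsonl") as f:
    for i, line in enumerate(f):
        print(json.loads(line))
        if i >= 4:
            break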
Validation:
- validated_properties.jsonl - Properties that passed validation
- validation_stats.json - Validation statistics
Clustering:
- embeddings.parquet - Property embeddings (if include_embeddings=True)
- clustered_results_lightweight.jsonl - Results without embeddings
- summary_table.jsonl - Cluster summary statistics
Metrics:
- model_cluster_scores_df.jsonl - Per model-cluster performance (DataFrame JSONL)
- cluster_scores_df.jsonl - Aggregate cluster metrics (DataFrame JSONL)
- model_scores_df.jsonl - Aggregate model metrics (DataFrame JSONL)
Summary Files¶
summary.txt - Human-readable analysis summary
StringSight Results Summary
==================================================
Total conversations: 1,234
Total properties: 4,567
Models analyzed: 8
Fine clusters: 23
Coarse clusters: 8
Model Rankings (by average quality score):
1. gpt-4: 0.847
2. claude-3: 0.832
3. gemini-pro: 0.801
...
Working with Output Files¶
Loading Results for Analysis¶
import pandas as pd
# Quick analysis workflow
df = pd.read_parquet("results/clustered_results.parquet")
model_scores = pd.read_json("results/model_scores_df.jsonl", lines=True)
# Analyze cluster distributions
cluster_counts = df['property_description_cluster_label'].value_counts()
print("Top behavioral patterns:")
print(cluster_counts.head(10))
# Compare models within specific clusters
for cluster in cluster_counts.head(5).index:
    cluster_data = df[df['property_description_cluster_label'] == cluster]
    model_dist = cluster_data['model'].value_counts()
    print(f"\n{cluster}:")
    print(model_dist)
Rerunning Analysis¶
from stringsight import compute_metrics_only
# Recompute metrics with different parameters
clustered_df, model_stats = compute_metrics_only(
    input_path="results/full_dataset.json",
    method="single_model",
    metrics_kwargs={
        'compute_confidence_intervals': True,
        'bootstrap_samples': 1000
    },
    output_dir="results_with_ci/"
)
Building Custom Visualizations¶
# Interactive visualization with plotly
import plotly.express as px
import pandas as pd
# Load results
df = pd.read_parquet("results/clustered_results.parquet")
# Build interactive filters
selected_models = ["gpt-4", "claude-3"]  # Filter by model
selected_clusters = df['property_description_cluster_label'].value_counts().head(10).index  # Top 10 clusters by frequency
# Filter and display
filtered_df = df[
    (df['model'].isin(selected_models)) &
    (df['property_description_cluster_label'].isin(selected_clusters))
]
# Aggregate counts, then plot cluster frequency per model
counts = filtered_df.groupby(['model', 'property_description_cluster_label']).size().reset_index(name='count')
fig = px.bar(counts, x='property_description_cluster_label', y='count',
             color='model', barmode='group',
             title='Model Behavior Comparison')
fig.show()
File Format Details¶
Parquet vs JSON vs JSONL¶
Parquet (.parquet)
- Binary format, fastest loading
- Preserves data types
- Best for analysis and large datasets
- Use: pd.read_parquet()
JSON (.json)
- Human-readable structure
- Good for configuration and metadata
- Use: json.load()
JSONL (.jsonl)
- Newline-delimited JSON
- Streamable for large datasets
- Each line is a JSON object
- Use: pd.read_json(..., lines=True)
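In practice, the three formats load like this (a short sketch using file names from this guide):
import json
import pandas as pd
# Parquet: typed and columnar, fastest for analysis
df = pd.read_parquet("results/clustered_results.parquet")
# JSON: a single structured object
with open("results/extraction_stats.json") as f:
    stats = json.load(f)
# JSONL: one JSON object per line; streamable in chunks for large files
scores = pd.read_json("results/model_scores_df.jsonl", lines=True)
for chunk in pd.read_json("results/model_cluster_scores_df.jsonl", lines=True, chunksize=10_000):
    pass  # process each chunk incrementally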
Best Practices¶
1. File Organization¶
results/
├── clustered_results.parquet        # Primary analysis file
├── model_cluster_scores_df.jsonl    # Per model-cluster metrics (DF JSONL)
├── cluster_scores_df.jsonl          # Per cluster metrics (DF JSONL)
├── model_scores_df.jsonl            # Per model metrics (DF JSONL)
├── full_dataset.json                # Complete state
├── summary.txt                      # Human summary
├── embeddings.parquet               # Embeddings (optional)
└── stage_outputs/                   # Detailed processing files
    ├── parsed_properties.jsonl
    ├── validation_stats.json
    └── ...
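A quick existence check against this layout can catch incomplete runs early. A minimal sketch (adjust the list to your configuration):
from pathlib import Path
expected = [
    "clustered_results.parquet",
    "model_cluster_scores_df.jsonl",
    "cluster_scores_df.jsonl",
    "model_scores_df.jsonl",
    "full_dataset.json",
    "summary.txt",
]
results = Path("results")
missing = [name for name in expected if not (results / name).exists()]
print("Missing files:", missing or "none")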
2. Version Control¶
- Include summary.txt and the metrics JSONL files in version control
- Use .gitignore for large binary files like embeddings
- Tag important analysis runs
3. Reproducibility¶
- Save the exact command/parameters used
- Keep full_dataset.json for reanalysis
- Document any post-processing steps
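One lightweight way to capture run parameters is to write them next to the outputs. A sketch (the keys mirror whatever arguments you actually passed to explain()):
import json
# Record the exact parameters used for this run alongside its outputs
run_params = {
    "method": "single_model",
    "task_description": "Evaluate customer support responses ...",
    "output_dir": "results/customer_support",
}
with open("results/customer_support/run_params.json", "w") as f:
    json.dump(run_params, f, indent=2)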
Next Steps¶
- Use the quickstart guide to generate these files
- Learn about explain() and label() functions
- Explore visualization options
Task Descriptions¶
Task descriptions let you steer property extraction toward a specific domain or evaluation goal. When provided, StringSight formats a task-aware system prompt (for both single_model and side_by_side variants) using templates from stringsight/prompts/extractor_prompts.py.
Example:
clustered_df, model_stats = explain(
    df,
    method="single_model",
    task_description=(
        "Evaluate customer support responses for empathy, clarity, "
        "resolution accuracy, and policy adherence."
    ),
    output_dir="results/customer_support",
)
Default Task Description (WebDev Arena helper)¶
When running scripts/run_webdev_arena.py, the following default task description is used unless overridden with --task_description (or disabled with --no_task_description):
Each model is given a user prompt to generate a web development project.
When looking for interesting properties of responses, consider the following (note these are not exhaustive):
1. **Code Quality**: Correctness, best practices, security vulnerabilities, and adherence to modern web standards
2. **Completeness**: Whether the implementation fully addresses the user's requirements and includes necessary dependencies
3. **User Experience**: UI/UX quality, accessibility, responsiveness, and visual appeal
4. **Maintainability**: Code organization, documentation, comments, and readability
5. **Functionality**: Whether the code would actually work as intended, proper error handling, and edge case coverage
6. **Performance**: Efficient implementations, loading times, and resource usage
7. **Stylistic Choices**: The model's choices in terms of language, formatting, layout, and style
8. **User interpretation**: If given vague instructions, what design choices does the model make to fulfill the user's requirements?
9. **Safety**: Whether the model's response contains vulnerabilities, or whether it generates content that another model would consider unsafe or harmful.
For full prompt templates, see stringsight/prompts/extractor_prompts.py.