Output Files

Understanding the files generated by StringSight analysis.

When you run explain() or label() with an output_dir, StringSight saves comprehensive results across multiple file formats. This guide explains each output file and how to use them.
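
For reference, a minimal call that produces these files might look like the sketch below. The import path and the input file name are assumptions for illustration; your input DataFrame just needs the conversation columns StringSight expects (e.g. question_id, model, model_response).

from stringsight import explain  # import path assumed for this sketch
import pandas as pd

# Placeholder input file; any DataFrame holding your conversation data works
df = pd.read_parquet("my_conversations.parquet")

clustered_df, model_stats = explain(
    df,
    method="single_model",
    output_dir="results/",  # every file described below is written here
)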

Core Output Files

1. clustered_results.parquet

Primary results file with all analysis data

This is the main output file containing your original data enriched with extracted properties and cluster assignments.

import pandas as pd

# Load complete results
df = pd.read_parquet("results/clustered_results.parquet")

# Key columns added by StringSight:
print(df.columns)
# ['question_id', 'model', 'model_response',           # Original data
#  'property_description', 'property_evidence',       # Extracted properties  
#  'property_description_cluster_id',                  # Cluster assignments
#  'property_description_cluster_label']               # Human-readable cluster names

# Example: Find all responses in a specific cluster
cluster_data = df[df['property_description_cluster_label'] == 'Detailed Technical Explanations']

Use this file for:

  • Interactive analysis and visualization
  • Building custom dashboards
  • Statistical analysis of results
  • Feeding into downstream ML pipelines
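
As a quick sketch of the statistical-analysis use case, the snippet below cross-tabulates cluster membership per model using only the columns shown above:

import pandas as pd

df = pd.read_parquet("results/clustered_results.parquet")

# Share of each model's extracted properties that falls into each cluster
cluster_share = pd.crosstab(
    df["model"],
    df["property_description_cluster_label"],
    normalize="index",
)
print(cluster_share.round(3))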

2. Metrics DataFrames (JSONL)

Model- and cluster-level metrics as DataFrames

These files are optimized for frontend and analysis workflows:

  • model_cluster_scores_df.jsonl — Per model-cluster metrics
  • cluster_scores_df.jsonl — Per cluster aggregated metrics
  • model_scores_df.jsonl — Per model aggregated metrics

import pandas as pd

model_cluster = pd.read_json("results/model_cluster_scores_df.jsonl", lines=True)
cluster_scores = pd.read_json("results/cluster_scores_df.jsonl", lines=True)
model_scores = pd.read_json("results/model_scores_df.jsonl", lines=True)

# Example: top clusters for a given model
gpt4 = model_cluster[model_cluster["model"] == "gpt-4"]
print(gpt4.sort_values("proportion", ascending=False).head(10)[["cluster", "proportion"]])

Use these files for:

  • Model leaderboards and rankings
  • Performance comparisons
  • Quality assessment reports
  • Automated model selection
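
For example, a simple leaderboard can be built from model_scores_df.jsonl. The metric column name used here ("score") is an assumption for illustration; check model_scores.columns for the metric names your run actually produces:

import pandas as pd

model_scores = pd.read_json("results/model_scores_df.jsonl", lines=True)

# Rank models by an aggregate metric column ("score" is assumed; adjust as needed)
leaderboard = model_scores.sort_values("score", ascending=False)
print(leaderboard.head(10))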

3. full_dataset.json

Complete dataset for reanalysis and caching

Contains the entire PropertyDataset object with all conversations, properties, clusters, and metadata.

from stringsight.core.data_objects import PropertyDataset

# Load complete dataset
dataset = PropertyDataset.load("results/full_dataset.json")

# Access all components:
print(f"Conversations: {len(dataset.conversations)}")
print(f"Properties: {len(dataset.properties)}")  
print(f"Clusters: {len(dataset.clusters)}")
print(f"Models: {dataset.all_models}")

# Rerun metrics with different parameters
from stringsight import compute_metrics_only
clustered_df, new_stats = compute_metrics_only(
    "results/full_dataset.json",
    method="single_model",
    output_dir="results_updated/"
)

Use this file for:

  • Recomputing metrics without re-extracting properties
  • Debugging and troubleshooting
  • Building analysis pipelines
  • Sharing complete analysis state

Additional Output Files

Processing Stage Files

Property Extraction:

  • raw_properties.jsonl — Raw LLM responses before parsing
  • extraction_stats.json — API call statistics and timing
  • extraction_samples.jsonl — Sample inputs/outputs for debugging

JSON Parsing:

  • parsed_properties.jsonl — Successfully parsed property objects
  • parsing_stats.json — Parsing success/failure statistics
  • parsing_failures.jsonl — Failed parsing attempts for debugging

Validation:

  • validated_properties.jsonl — Properties that passed validation
  • validation_stats.json — Validation statistics

Clustering:

  • embeddings.parquet — Property embeddings (if include_embeddings=True)
  • clustered_results_lightweight.jsonl — Results without embeddings
  • summary_table.jsonl — Cluster summary statistics

Metrics:

  • model_cluster_scores_df.jsonl — Per model-cluster performance (DataFrame JSONL)
  • cluster_scores_df.jsonl — Aggregate cluster metrics (DataFrame JSONL)
  • model_scores_df.jsonl — Aggregate model metrics (DataFrame JSONL)
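
A typical debugging pass over these stage files, sketched below, reads the parsing statistics and inspects any failed parses. The paths assume the stage_outputs/ layout shown later in this guide, and the exact fields inside each file may differ between versions:

import json
import pandas as pd

# Overall parsing success/failure counts
with open("results/stage_outputs/parsing_stats.json") as f:
    parsing_stats = json.load(f)
print(parsing_stats)

# Individual failed parsing attempts, one JSON object per line
failures = pd.read_json("results/stage_outputs/parsing_failures.jsonl", lines=True)
print(failures.head())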

Summary Files

summary.txt - Human-readable analysis summary

StringSight Results Summary
==================================================

Total conversations: 1,234
Total properties: 4,567  
Models analyzed: 8
Fine clusters: 23
Coarse clusters: 8

Model Rankings (by average quality score):
  1. gpt-4: 0.847
  2. claude-3: 0.832
  3. gemini-pro: 0.801
  ...

Working with Output Files

Loading Results for Analysis

import pandas as pd

# Quick analysis workflow
df = pd.read_parquet("results/clustered_results.parquet")
model_scores = pd.read_json("results/model_scores_df.jsonl", lines=True)

# Analyze cluster distributions
cluster_counts = df['property_description_cluster_label'].value_counts()
print("Top behavioral patterns:")
print(cluster_counts.head(10))

# Compare models within specific clusters  
for cluster in cluster_counts.head(5).index:
    cluster_data = df[df['property_description_cluster_label'] == cluster]
    model_dist = cluster_data['model'].value_counts()
    print(f"\n{cluster}:")
    print(model_dist)

Rerunning Analysis

from stringsight import compute_metrics_only

# Recompute metrics with different parameters
clustered_df, model_stats = compute_metrics_only(
    input_path="results/full_dataset.json",
    method="single_model", 
    metrics_kwargs={
        'compute_confidence_intervals': True,
        'bootstrap_samples': 1000
    },
    output_dir="results_with_ci/"
)

Building Custom Visualizations

# Interactive visualization with plotly
import plotly.express as px
import pandas as pd

# Load results
df = pd.read_parquet("results/clustered_results.parquet")

# Build interactive filters
selected_models = ["gpt-4", "claude-3"]  # Filter by model
selected_clusters = df['property_description_cluster_label'].value_counts().head(10).index  # 10 most common clusters

# Filter and display
filtered_df = df[
    (df['model'].isin(selected_models)) & 
    (df['property_description_cluster_label'].isin(selected_clusters))
]

# Count property occurrences per model and cluster, then plot
plot_df = (
    filtered_df.groupby(['model', 'property_description_cluster_label'])
    .size()
    .reset_index(name='count')
)

fig = px.bar(plot_df, x='property_description_cluster_label', y='count',
             color='model', barmode='group',
             title='Model Behavior Comparison')
fig.show()

File Format Details

Parquet vs JSON vs JSONL

Parquet (.parquet)

  • Binary format, fastest loading
  • Preserves data types
  • Best for analysis and large datasets
  • Use: pd.read_parquet()

JSON (.json)

  • Human-readable structure
  • Good for configuration and metadata
  • Use: json.load()

JSONL (.jsonl)

  • Newline-delimited JSON
  • Streamable for large datasets
  • Each line is a JSON object
  • Use: pd.read_json(..., lines=True)
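
The loaders mentioned above, side by side (paths follow the example layout in the next section):

import json
import pandas as pd

# Parquet: typed, columnar, fastest for analysis
df = pd.read_parquet("results/clustered_results.parquet")

# JSON: configuration and metadata files
with open("results/stage_outputs/validation_stats.json") as f:
    validation_stats = json.load(f)

# JSONL: one JSON object per line, streamable
cluster_scores = pd.read_json("results/cluster_scores_df.jsonl", lines=True)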

Best Practices

1. File Organization

results/
├── clustered_results.parquet      # Primary analysis file
├── model_cluster_scores_df.jsonl  # Per model-cluster metrics (DF JSONL)
├── cluster_scores_df.jsonl        # Per cluster metrics (DF JSONL)
├── model_scores_df.jsonl          # Per model metrics (DF JSONL)
├── full_dataset.json              # Complete state
├── summary.txt                    # Human summary
├── embeddings.parquet             # Embeddings (optional)
└── stage_outputs/                 # Detailed processing files
    ├── parsed_properties.jsonl
    ├── validation_stats.json
    └── ...

2. Version Control

  • Include summary.txt and the metrics JSONL files (model_scores_df.jsonl, cluster_scores_df.jsonl, model_cluster_scores_df.jsonl) in version control
  • Use .gitignore for large binary files like embeddings
  • Tag important analysis runs

3. Reproducibility

  • Save the exact command/parameters used (see the sketch after this list)
  • Keep full_dataset.json for reanalysis
  • Document any post-processing steps
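
One lightweight way to capture the parameters, assuming you drive runs from Python, is to write them next to the outputs. params.json is an arbitrary name for this sketch, not a file StringSight creates itself:

import json
import os

output_dir = "results/customer_support"
params = {
    "method": "single_model",
    "task_description": "Evaluate customer support responses for empathy, clarity, ...",
    "output_dir": output_dir,
}

os.makedirs(output_dir, exist_ok=True)
with open(os.path.join(output_dir, "params.json"), "w") as f:
    json.dump(params, f, indent=2)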

Next Steps

Task Descriptions

Task descriptions let you steer property extraction toward a specific domain or evaluation goal. When provided, StringSight formats a task-aware system prompt (for both single_model and side_by_side variants) using templates from stringsight/prompts/extractor_prompts.py.

Example:

clustered_df, model_stats = explain(
    df,
    method="single_model",
    task_description=(
        "Evaluate customer support responses for empathy, clarity, "
        "resolution accuracy, and policy adherence."
    ),
    output_dir="results/customer_support",
)

Default Task Description (WebDev Arena helper)

When running scripts/run_webdev_arena.py, the following default task description is used unless overridden with --task_description (or disabled with --no_task_description):

Each model is given a user prompt to generate a web development project.

When looking for interesting properties of responses, consider the following (note these are not exhaustive):
1. **Code Quality**: Correctness, best practices, security vulnerabilities, and adherence to modern web standards
2. **Completeness**: Whether the implementation fully addresses the user's requirements and includes necessary dependencies
3. **User Experience**: UI/UX quality, accessibility, responsiveness, and visual appeal
4. **Maintainability**: Code organization, documentation, comments, and readability
5. **Functionality**: Whether the code would actually work as intended, proper error handling, and edge case coverage
6. **Performance**: Efficient implementations, loading times, and resource usage
7. **Stylistic Choices**: The model's choices in terms of language, formatting, layout, and style
8. **User interpretation**: If given vague instructions, what design choices does the model make to try to fulfill the user's requirements?
9. **Safety**: Whether the model's response contains vulnerabilities or if it generates content that another model would consider unsafe or harmful.

For full prompt templates, see stringsight/prompts/extractor_prompts.py.
