
Performance Tuning

Optimize StringSight for speed, cost, and quality based on your requirements.

Quick Wins

Use Cheaper Models

from stringsight import explain

# Cost-effective configuration
clustered_df, model_stats = explain(
    df,
    model_name="gpt-4.1-mini",              
    embedding_model="all-MiniLM-L6-v2",     # Free local model
    min_cluster_size=15,                     # Smaller clusters = more clusters
    use_wandb=False                          # Disable W&B logging (default True)
)
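
To gauge spend before committing to a full run, a rough token-cost estimate helps. The sketch below is a back-of-the-envelope calculation using the gpt-4.1-mini rates from the table further down; the average token counts per conversation are placeholder assumptions you should replace with your own numbers.

# Rough cost estimate before running explain() on the full dataset.
# Per-token rates come from the model table below; the average token
# counts per conversation are assumptions -- substitute your own.
INPUT_RATE = 0.70 / 1_000_000    # $ per input token (gpt-4.1-mini)
OUTPUT_RATE = 2.80 / 1_000_000   # $ per output token (gpt-4.1-mini)

avg_input_tokens = 1_500          # assumed tokens sent per conversation
avg_output_tokens = 300           # assumed tokens returned per conversation

n_conversations = len(df)
estimated_cost = n_conversations * (
    avg_input_tokens * INPUT_RATE + avg_output_tokens * OUTPUT_RATE
)
print(f"Estimated LLM cost for {n_conversations} conversations: ${estimated_cost:.2f}")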

Use Local Embeddings

# Local sentence-transformers (free, no API calls)
clustered_df, model_stats = explain(
    df,
    embedding_model="all-MiniLM-L6-v2",  # Fast, good quality
    # or, for higher quality at lower speed:
    # embedding_model="all-mpnet-base-v2",
)
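
To confirm the local model is available (and trigger the one-time download) before a long run, you can load it directly with sentence-transformers. This is a standalone check, not part of the StringSight API.

# Pre-download and smoke-test the local embedding model so the first
# explain() call does not stall on the model download.
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")
vectors = model.encode(["quick smoke test"], show_progress_bar=False)
print(vectors.shape)  # (1, 384) for all-MiniLM-L6-v2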

Sample Large Datasets

# Analyze subset for initial exploration
from stringsight.dataprep import sample_prompts_evenly

df_sample = sample_prompts_evenly(
    df,
    sample_size=1000,  # Sample 1000 prompts
    method="single_model",
    random_state=42
)

clustered_df, model_stats = explain(df_sample)
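
If sample_prompts_evenly does not fit your setup, a plain-pandas approximation is to sample at the prompt level and keep every row belonging to the sampled prompts, so no prompt is partially represented. The "prompt" column name below is an assumption about your dataframe, not a StringSight requirement.

# Plain-pandas alternative: sample whole prompts rather than rows.
# Assumes your dataframe has a "prompt" column (adjust as needed).
unique_prompts = df["prompt"].drop_duplicates()
sampled_prompts = unique_prompts.sample(
    n=min(1000, len(unique_prompts)), random_state=42
)
df_sample = df[df["prompt"].isin(sampled_prompts)]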

Model Selection Trade-offs

| Model | Cost (per 1M tokens, input / output) | Speed | Quality | Best For |
|---|---|---|---|---|
| gpt-4.1 | $3.50 / $14.00 | Slow | Excellent | Production |
| gpt-4.1-mini | $0.70 / $2.80 | Medium | Very Good | Balanced |
| gpt-4.1-mini | $0.60 / $1.80 | Fast | Good | Development |
| gpt-4.1-nano | $0.20 / $0.80 | Very Fast | Decent | Large-scale |
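
One pattern that follows from these trade-offs is to pick the model per environment: cheaper models while developing, the strongest model in production. The environment-variable name below is purely illustrative, not something StringSight reads.

# Pick model_name based on an environment flag (illustrative only).
import os

MODEL_BY_ENV = {
    "dev": "gpt-4.1-nano",     # fast iteration, lowest cost
    "staging": "gpt-4.1-mini",
    "prod": "gpt-4.1",         # highest quality
}
model_name = MODEL_BY_ENV.get(os.environ.get("STRINGSIGHT_ENV", "dev"), "gpt-4.1-mini")

clustered_df, model_stats = explain(df, model_name=model_name)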

Embedding Models

| Model | Cost (per 1M tokens) | Speed | Quality |
|---|---|---|---|
| text-embedding-3-large | $0.13 | Medium | Excellent |
| text-embedding-3-small | $0.02 | Fast | Very Good |
| all-MiniLM-L6-v2 | Free | Very Fast | Good |
| all-mpnet-base-v2 | Free | Medium | Very Good |
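
To see where a local model falls on the speed axis for your hardware, time it on a small batch of texts. This uses sentence-transformers directly and is independent of StringSight.

# Time local embedding throughput on a small synthetic batch.
import time
from sentence_transformers import SentenceTransformer

texts = ["short sample text for throughput timing"] * 512
model = SentenceTransformer("all-MiniLM-L6-v2")

start = time.perf_counter()
model.encode(texts, batch_size=64, show_progress_bar=False)
elapsed = time.perf_counter() - start
print(f"~{len(texts) / elapsed:.0f} texts/sec with all-MiniLM-L6-v2 on this machine")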

Clustering Optimization

Adjust Cluster Size

# Larger clusters = faster, fewer clusters
clustered_df, model_stats = explain(
    df,
    min_cluster_size=50  # vs default 30
)

Disable Dimensionality Reduction

from stringsight.clusterers import HDBSCANClusterer

clusterer = HDBSCANClusterer(
    disable_dim_reduction=True,  # Skip dimensionality reduction
    min_cluster_size=30
)

Parallelization

Increase Workers

# More parallel API calls (if rate limits allow)
clustered_df, model_stats = explain(
    df,
    max_workers=32  # vs default 16
)
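
A rough way to choose max_workers is to work backwards from your provider's rate limit: with R requests per minute allowed and an average request taking t seconds, about R * t / 60 requests can be in flight at once. The numbers below are placeholders, not StringSight defaults.

# Back-of-the-envelope worker count from a requests-per-minute limit.
requests_per_minute = 500    # your provider's RPM limit (placeholder)
avg_request_seconds = 4.0    # observed average LLM call latency (placeholder)

max_in_flight = int(requests_per_minute * avg_request_seconds / 60)
max_workers = max(1, min(max_in_flight, 64))   # keep within a sane cap
print(f"Suggested max_workers: {max_workers}")

clustered_df, model_stats = explain(df, max_workers=max_workers)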

Batch Processing

# Process large datasets in batches
import pandas as pd

batch_size = 1000
results = []

for i in range(0, len(df), batch_size):
    batch = df.iloc[i:i + batch_size]
    result, _ = explain(batch, output_dir=f"results/batch_{i}")
    results.append(result)

# Combine results
final_df = pd.concat(results, ignore_index=True)
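
For long runs it can be worth making the batch loop resumable, so an interrupted job skips batches whose output directory already exists. The existence check below is a simple convention layered on top of the output_dir argument, not behavior StringSight provides.

# Resume-safe variant: skip batches that already produced output.
from pathlib import Path

results = []
for i in range(0, len(df), batch_size):
    out_dir = Path(f"results/batch_{i}")
    if out_dir.exists():
        continue  # already processed in an earlier run; reload from disk if needed
    batch = df.iloc[i:i + batch_size]
    result, _ = explain(batch, output_dir=str(out_dir))
    results.append(result)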

Caching

# Cache expensive operations
clustered_df, model_stats = explain(
    df,
    extraction_cache_dir=".cache/extraction",
    clustering_cache_dir=".cache/clustering",
    metrics_cache_dir=".cache/metrics"
)
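
Caches grow across runs, so check how much disk they use and clear them when results should be recomputed. The snippet below only assumes the directory paths passed above.

# Report on-disk size of each cache directory passed to explain().
from pathlib import Path

for cache_dir in (".cache/extraction", ".cache/clustering", ".cache/metrics"):
    path = Path(cache_dir)
    if path.exists():
        size_mb = sum(f.stat().st_size for f in path.rglob("*") if f.is_file()) / 1e6
    else:
        size_mb = 0.0
    print(f"{cache_dir}: {size_mb:.1f} MB on disk")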

Memory Management

For Large Datasets

# Reduce memory usage
clustered_df, model_stats = explain(
    df,
    include_embeddings=False,  # Don't include embeddings in output
    min_cluster_size=50,        # Fewer clusters
    use_wandb=False             # Reduce logging overhead (default True)
)
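
Before tuning, measure how much memory the input dataframe itself uses; pandas can report this directly.

# Measure the input dataframe's memory footprint before tuning.
mem_mb = df.memory_usage(deep=True).sum() / 1e6
print(f"Input dataframe: {len(df)} rows, {mem_mb:.1f} MB in memory")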

Chunk Processing

# Process in chunks to avoid OOM
for i, chunk in enumerate(pd.read_csv("large_file.csv", chunksize=5000)):
    result, _ = explain(chunk, output_dir=f"results/chunk_{i}")

Benchmarks

Typical performance on common hardware:

| Dataset Size | gpt-4.1 | gpt-4.1-mini | Local Embeddings | Total Time |
|---|---|---|---|---|
| 100 conversations | 2 min | 1 min | 10 sec | ~3 min |
| 1,000 conversations | 15 min | 8 min | 30 sec | ~16 min |
| 10,000 conversations | 2.5 hours | 1.3 hours | 5 min | ~2.6 hours |

Benchmarks measured on an M1 Mac with 32 GB RAM and 16 parallel workers.
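
Your numbers will vary with hardware, rate limits, and dataset shape, so time a small run on your own setup before extrapolating. A minimal timing wrapper looks like this.

# Time a small run on your own hardware before extrapolating.
import time

start = time.perf_counter()
clustered_df, model_stats = explain(df_sample)   # e.g. the 1,000-prompt sample from above
elapsed = time.perf_counter() - start
print(f"Processed {len(df_sample)} rows in {elapsed / 60:.1f} minutes")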

Next Steps