Automated Multi-Provider LLM Benchmarking with GitHub Actions
By Nino, Senior Tech Editor
In the rapidly evolving landscape of Large Language Models (LLMs), developers face a critical challenge: token efficiency. When building real-time signal analysis tools—such as those used for stock market monitoring, IoT sensor tracking, or blockchain event indexing—feeding raw time-series data into an LLM can be prohibitively expensive. The serialization format you choose directly impacts your bottom line. To solve this, we built an automated, multi-provider benchmark system that evaluates data formats across major providers like OpenAI, Anthropic, and DeepSeek using n1n.ai.
The Problem: The High Cost of Verbosity
Time-series data is structurally simple: it consists of timestamps and values. However, the industry-standard serialization format, JSON, is notoriously verbose for this use case. In a JSON array of objects, the keys (e.g., "timestamp", "price") are repeated for every single data point. In the world of LLMs, every character counts toward the token total. For a developer using high-performance models via n1n.ai, reducing this repetition isn't just a technical optimization; it is a significant cost-saving measure.
We discovered that while CSV is better than JSON, it still carries redundant information. To push the boundaries of efficiency, we developed TSLN (Time-Series Lean Notation). This format leverages temporal regularity and delta encoding to compress data before it ever hits the API endpoint. But to prove TSLN's superiority, we needed a robust, reproducible, and transparent benchmarking system.
Architectural Overview: The GitOps Pipeline
Our benchmark system is designed to be "set and forget." It runs every two weeks, tests four data formats across four major LLM providers, and publishes results directly to a live dashboard. The architecture relies on three core pillars:
- Execution Layer: GitHub Actions handles the scheduling and environment setup.
- Logic Layer: A Python-based runner that interfaces with LLM APIs (optimized via n1n.ai for stability).
- Presentation Layer: A Next.js frontend that fetches static JSON results from the repository.
The Data Formats Under Test
We compare four distinct serialization strategies:
- JSON: The baseline. Full object notation.
- CSV: Header row with comma-separated values.
- TOON (Token-Oriented Object Notation): A pipe-delimited format designed to minimize whitespace tokens.
- TSLN (Time-Series Lean Notation): Our custom compact format that uses a single base timestamp and an interval.
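To make the size difference concrete, here is a rough side-by-side of three of the formats for the same three data points (illustrative values; TOON's exact grammar is not reproduced here):

```python
import json

# Three sample points at one-minute intervals (hypothetical values)
points = [("2024-01-01T09:00:00Z", 150.0),
          ("2024-01-01T09:01:00Z", 151.0),
          ("2024-01-01T09:02:00Z", 152.0)]

# JSON repeats every key for every point
as_json = json.dumps([{"t": t, "v": v} for t, v in points])
# CSV states the keys once, but still repeats the full timestamp
as_csv = "t,v\n" + "\n".join(f"{t},{v}" for t, v in points)
# TSLN stores one base timestamp plus an interval, then only the values
as_tsln = "t:2024-01-01T09:00:00Z|i:60|v:" + ",".join(str(v) for _, v in points)

for name, s in [("json", as_json), ("csv", as_csv), ("tsln", as_tsln)]:
    print(f"{name}: {len(s)} chars")
```

Even at three points, TSLN is the shortest string; the gap widens with series length because its per-point cost is just the value itself.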
Implementing the Python Benchmark Runner
The core of the system is a Python script that generates sample data and estimates costs based on real-world pricing for models like GPT-4o mini, Claude 3.5 Sonnet, and DeepSeek-V3. Here is how we generate the TSLN format vs. the JSON baseline:
```python
import json

def generate_benchmark_data(format_name: str, count: int = 100):
    if format_name == "json":
        # Advance the clock one minute per point; hour/minute arithmetic
        # handles rollover past :59 (a plain zero-padded minute would not).
        return json.dumps([
            {"t": f"2024-01-01T{9 + i // 60:02d}:{i % 60:02d}:00Z", "v": 150.0 + i}
            for i in range(count)
        ])
    elif format_name == "tsln":
        # Base time | interval in seconds | values
        values = [str(150.0 + i) for i in range(count)]
        return "t:2024-01-01T09:00:00Z|i:60|v:" + ",".join(values)
    # ... other formats (CSV, TOON) logic
```
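Decoding is the mirror image: because TSLN carries a single base timestamp and a fixed interval, the consumer can expand the stream back into explicit (timestamp, value) pairs. A minimal sketch of such a parser (the function name `parse_tsln` is our own; it is not part of any published TSLN library):

```python
from datetime import datetime, timedelta

def parse_tsln(payload: str):
    """Expand a TSLN string back into (timestamp, value) pairs."""
    # Split "t:...|i:...|v:..." into a field map, splitting on the first colon only
    fields = dict(part.split(":", 1) for part in payload.split("|"))
    base = datetime.fromisoformat(fields["t"].replace("Z", "+00:00"))
    interval = timedelta(seconds=int(fields["i"]))
    # The i-th value sits at base + i * interval
    return [(base + i * interval, float(v))
            for i, v in enumerate(fields["v"].split(","))]

rows = parse_tsln("t:2024-01-01T09:00:00Z|i:60|v:150.0,151.0,152.0")
```

The round trip is lossless as long as the series is regularly sampled, which is exactly the assumption TSLN exploits.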
Calculating Token Costs
Tokenization varies by provider. While OpenAI uses tiktoken, Anthropic has its own logic. To provide a fair comparison, we use a normalized heuristic (~4 characters per token) and apply the latest pricing tiers. For developers looking for the most competitive rates, n1n.ai provides a unified gateway to compare these costs in real-time.
```python
def run_single_benchmark(provider: str, model: str, format_name: str, data: str):
    input_tokens = len(data) / 4  # ~4 characters per token; heuristic for comparison only
    # Current market rates (USD per 1M input tokens)
    pricing = {
        "openai": 0.15,     # gpt-4o-mini
        "anthropic": 0.80,  # claude-3-haiku
        "deepseek": 0.14,   # deepseek-v3
        "google": 0.075,    # gemini-1.5-flash
    }
    cost = (input_tokens / 1_000_000) * pricing.get(provider, 1.0)
    return {"provider": provider, "format": format_name, "cost": cost}
```
Automating with GitHub Actions
To ensure our benchmarks stay relevant as models update (like the transition from GPT-4 to OpenAI o3 or DeepSeek-V2 to V3), we use a GitHub Actions workflow. This workflow not only runs the code but also commits the results back to the repository as a static JSON file. This "GitOps" approach eliminates the need for a database.
```yaml
name: LLM Benchmark Cron
on:
  schedule:
    # 1st and 15th of each month; cron cannot express a true 14-day cycle,
    # and "*/14" in the day-of-month field would fire on days 1, 15, and 29.
    - cron: '0 0 1,15 * *'
  workflow_dispatch:

jobs:
  run-benchmarks:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Set up Python
        uses: actions/setup-python@v5
        with: { python-version: '3.11' }
      - name: Run Script
        env:
          N1N_API_KEY: ${{ secrets.N1N_API_KEY }}
        run: python run_benchmarks.py
      - name: Commit Results
        run: |
          git config user.name "GitHub Actions"
          git config user.email "actions@github.com"
          git add public/data/results.json
          git commit -m "Update benchmarks [automated]" || echo "No changes to commit"
          git push
```
Visualizing the Results with Recharts
The frontend is built with Next.js and Recharts. Because the data is stored as a static JSON file in the public folder, the site is incredibly fast and can be hosted on platforms like Railway or Vercel with zero backend overhead.
Our latest results show a staggering difference in efficiency:
| Format | Avg Tokens | Cost per 100k Points | Savings vs JSON |
|---|---|---|---|
| JSON | 1,397 | $0.0404 | Baseline |
| CSV | 698 | $0.0202 | 50.0% |
| TSLN | 177 | $0.0052 | 87.3% |
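The savings column follows directly from the token counts in the table; a quick check of the arithmetic:

```python
# Token counts from the results table above
tokens = {"json": 1397, "csv": 698, "tsln": 177}
baseline = tokens["json"]

# Savings vs. JSON = 1 - (format tokens / baseline tokens)
savings = {fmt: round(100 * (1 - n / baseline), 1) for fmt, n in tokens.items()}
print(savings)
```

This reproduces the 50.0% figure for CSV and 87.3% for TSLN.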
Pro Tip: Tokenization and Numbers
One common mistake developers make is assuming that all numbers are tokenized equally. In many LLM tokenizers, single digits are one token, but multi-digit numbers might be split unexpectedly. By using TSLN, we reduce the total character count, which statistically lowers the probability of "token fragmentation." When you access models via n1n.ai, you can experiment with these formats across different providers to see which tokenizer handles your specific data distribution most efficiently.
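One way to see why delta encoding helps: storing the first value followed by successive differences usually yields much shorter number strings, and fewer characters means fewer tokens under any tokenizer. This is an illustrative sketch, not the exact TSLN delta scheme:

```python
# Hypothetical value stream with a large, slowly-changing magnitude
values = [15000.5, 15001.0, 15001.5, 15002.5]

# Store the first value, then successive differences (rounded to avoid float noise)
deltas = [values[0]] + [round(b - a, 6) for a, b in zip(values, values[1:])]

raw = ",".join(str(v) for v in values)      # "15000.5,15001.0,15001.5,15002.5"
packed = ",".join(str(d) for d in deltas)   # "15000.5,0.5,0.5,1.0"
print(len(raw), len(packed))
```

The packed form carries the same information in far fewer characters, because the deltas drop the repeated high-order digits.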
Conclusion
Building a multi-provider benchmark isn't just about finding the cheapest model; it's about optimizing how you talk to that model. By automating this process with GitHub Actions and utilizing a unified API aggregator like n1n.ai, you can maintain a high-performance RAG or signal analysis system without breaking the bank.
Get a free API key at n1n.ai