§ Product

Alternative Data Signal Engine

A production pipeline that ingests one or more unconventional datasets, normalizes against a fund-internal schema, and serves processed factor scores, back-tested signals, and event-time alerts to the research stack.

Engagement
8–14 week build · ongoing data ops
Built for
Quant PMs · CIOs · Research bench leads
§ Problem

The data you want lives outside the canonical feeds, and the pipeline that ingests it cleanly — point-in-time, backfilled, idempotent — doesn't yet exist.

What this is

A custom pipeline for funds that have identified a specific alt-data signal hypothesis and need the production-grade plumbing to test, deploy, and operate against it. The engagement covers four bands:

  • Ingestion. Schema validation at the boundary, idempotent writes, replay from any checkpoint, backfill of the historical depth your backtest infra needs.
  • Normalization. Point-in-time alignment (no look-ahead), survivorship correction, vendor-format → fund-internal schema mapping. Universe resolution against your master security file.
  • Modeling. Feature engineering and the factor scoring layer, walk-forward backtested against your existing research infra. Baseline signal generator that your bench can take, replace, or extend.
  • Attribution. The data joins that let post-hoc analysis distinguish signal from market beta, sector beta, and the four or five factors the signal was supposed to be orthogonal to.
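
As an illustration of the point-in-time alignment in the Normalization band, here is a minimal Python sketch. The record layout (period, known_at, value), the sample figures, and the `as_of` helper are hypothetical, not the engagement's actual schema. The idea is only that a backtest run for a given decision date sees the rows that had been published by then, and a restatement becomes visible only once it lands.

```python
from datetime import date

# Hypothetical vendor rows: (period described, date the fund could first see it, value).
RECORDS = [
    (date(2024, 1, 31), date(2024, 2, 20), 101.0),  # January figure, published 20 Feb
    (date(2024, 2, 29), date(2024, 3, 19), 97.5),   # February figure, published 19 Mar
    (date(2024, 1, 31), date(2024, 4, 2), 103.2),   # January figure, restated 2 Apr
]

def as_of(records, decision_date):
    """Return the latest known value per period, using only rows visible on decision_date."""
    visible = {}
    # Process rows in the order they became known, so later knowledge overwrites earlier.
    for period, known_at, value in sorted(records, key=lambda r: r[1]):
        if known_at <= decision_date:
            visible[period] = value
    return visible

# A backtest dated 1 March sees the original January figure, not the April restatement.
snapshot_march = as_of(RECORDS, date(2024, 3, 1))
snapshot_april = as_of(RECORDS, date(2024, 4, 2))
```

Keeping both the original and the restated row, rather than overwriting in place, is what makes the look-ahead check possible at all.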

How it's built

Default stack: Python (Polars / pandas for hot paths, DuckDB for ad-hoc queries), Postgres or your fund's existing warehouse for canonical storage, Prefect or Airflow for orchestration. ML modeling in scikit-learn / LightGBM, with a slim PyTorch layer when neural feature extraction is on-thesis (it usually isn't for tabular alt-data). The stack adapts; the bar doesn't.
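
To make "schema validation at the boundary" and "idempotent writes" concrete, here is a minimal sketch. It uses Python's built-in `sqlite3` as a stand-in for Postgres so that it runs anywhere. The `obs` table, the field names, and the `(vendor_id, obs_date)` key are assumptions chosen for illustration, not the fund-internal schema.

```python
import sqlite3

# Hypothetical boundary schema: field name -> required Python type.
REQUIRED = {"vendor_id": str, "obs_date": str, "value": float}

def validate(row):
    """Reject a row at the boundary rather than letting bad types reach storage."""
    for field, typ in REQUIRED.items():
        if not isinstance(row.get(field), typ):
            raise ValueError(f"bad or missing field: {field}")
    return row

def ingest(conn, rows):
    """Upsert keyed on (vendor_id, obs_date), so replaying a batch changes nothing."""
    conn.executemany(
        "INSERT INTO obs (vendor_id, obs_date, value) "
        "VALUES (:vendor_id, :obs_date, :value) "
        "ON CONFLICT (vendor_id, obs_date) DO UPDATE SET value = excluded.value",
        [validate(r) for r in rows],
    )
    conn.commit()

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE obs (vendor_id TEXT, obs_date TEXT, value REAL, "
    "PRIMARY KEY (vendor_id, obs_date))"
)
batch = [{"vendor_id": "v1", "obs_date": "2024-01-31", "value": 101.0}]
ingest(conn, batch)
ingest(conn, batch)  # replay of the same batch after a simulated retry
row_count = conn.execute("SELECT COUNT(*) FROM obs").fetchone()[0]
```

Because the write is keyed, a retry or a replay from an earlier checkpoint leaves one row per key instead of duplicates.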

What you get

  • A production pipeline. Wired to your orchestrator. Idempotent.
  • Backfilled history to the depth required for walk-forward.
  • A baseline signal generator with documented hyperparameters and tracked performance.
  • Attribution data joins.
  • Runbooks for the ops your team will own after handoff.
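
For readers unfamiliar with the term, the walk-forward evaluation referred to above can be sketched in a few lines of Python. The rolling (rather than expanding) training window, the window sizes, and the `walk_forward` name are assumptions for illustration; the real split would follow your backtest infra's conventions.

```python
def walk_forward(n_obs, train_size, test_size):
    """Yield (train, test) index ranges; each test window sits strictly after its training window."""
    start = 0
    while start + train_size + test_size <= n_obs:
        train = range(start, start + train_size)
        test = range(start + train_size, start + train_size + test_size)
        yield train, test
        start += test_size  # roll forward by one test window

# Ten observations, four to fit on, two to evaluate on, rolled forward each step.
splits = [
    (list(train), list(test))
    for train, test in walk_forward(n_obs=10, train_size=4, test_size=2)
]
```

The number of usable splits grows with the length of the history, which is why backfill depth is scoped against the walk-forward design.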
§ How we engage

Engagement is shape, not list.

Length and price are functions of the data and the destination. The shape below is the typical engagement.

Length
8–14 week build · ongoing data ops

Scoped during the discovery call against the actual data and the operation it integrates with.

Lead
Bogdan

Principal engineer. Architecture and most code ship through one keyboard.

Cadence
Async, weekly

Written updates between, calls when the decision needs the room.

Bar
Production

Async correctness, capacity under burst, observability at every boundary.

§ Questions

What buyers ask about this one.

  • We already have a data engineering team. What's different about this engagement?

    If your team has the bandwidth and the alt-data plumbing reps, this isn't the right fit. We come in when there's a specific dataset and a specific signal hypothesis but no clean path from one to the other — and the team you'd normally lean on is busy keeping the core feeds healthy. We deliver the path, then hand the operation back.

  • What datasets have you actually integrated?

    Credit-card transaction panels (consumer and trade-credit), satellite imagery (vegetation indices, parking-lot counts), shipping manifests, foot-traffic, a handful of vertical-specific feeds we won't name. The integration shape is the same regardless — schema validation, idempotency, replay, backfill — so a dataset we haven't shipped doesn't change the engagement materially.

  • Do you produce the signals or just the pipeline?

    Both. The default deliverable is the pipeline plus a baseline signal generator that walks forward against your backtest infra. Funds with a research bench take the pipeline and replace the signal layer with their own. Funds without one keep ours.

  • What's the relationship to the Trade-Credit Score and Foot-Traffic Dashboard products?

    Those are ready-made data products: subscribe, get the score. This engine is the custom build for funds that want a unique dataset integrated end-to-end. The two patterns coexist on purpose: pre-built where the dataset is shared, custom where the edge requires that it isn't.

  • Pricing?

    Scoped against the dataset, the historical depth required, and the destination. The discovery call covers all three. No retainer.

§ The next step

If the deliverable matches the gap, the next step is one call.

We'll scope length and price against your data and the operation it integrates with. No retainer, no fishing.

Bogdan and team · async-first · OP—2026