Alternative Data Signal Engine
A production pipeline that ingests one or more unconventional datasets, normalizes them against a fund-internal schema, and serves processed factor scores, back-tested signals, and event-time alerts to the research stack.
- Engagement
- 8–14 week build · ongoing data ops
- Built for
- Quant PMs · CIOs · Research bench leads
The data you want lives outside the canonical feeds, and the pipeline that ingests it cleanly — point-in-time, backfilled, idempotent — doesn't yet exist.
What this is
A custom pipeline for funds that have identified a specific alt-data signal hypothesis and need the production-grade plumbing to test, deploy, and operate against it. The engagement covers four bands:
- Ingestion. Schema validation at the boundary, idempotent writes, replay from any checkpoint, backfill of the historical depth your backtest infra needs.
- Normalization. Point-in-time alignment (no look-ahead), survivorship correction, vendor-format → fund-internal schema mapping. Universe resolution against your master security file.
- Modeling. Feature engineering and the factor scoring layer, walk-forward backtested against your existing research infra. Baseline signal generator that your bench can take, replace, or extend.
- Attribution. The data joins that let post-hoc analysis distinguish signal from market beta, sector beta, and the four or five factors the signal was supposed to be orthogonal to.
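To make the point-in-time requirement in the Normalization band concrete, here is a minimal sketch of an as-of lookup. It assumes each vendor record carries both the period it describes and the date it became available; the record layout, the `RECORDS` sample, and the `as_of` helper are hypothetical illustrations, not the engagement's actual schema.

```python
from bisect import bisect_right
from datetime import date

# Hypothetical vendor records: (period described, date the value became known, value).
# A revision to a period arrives later, with a later known_at date.
RECORDS = [
    (date(2024, 1, 31), date(2024, 2, 10), 100.0),  # first release for January
    (date(2024, 1, 31), date(2024, 3, 5), 104.0),   # later revision of January
    (date(2024, 2, 29), date(2024, 3, 12), 110.0),  # first release for February
]


def as_of(records, decision_date):
    """Return the latest value per period that was known on or before decision_date."""
    # Sort by availability date so the list can be cut at the decision date.
    ordered = sorted(records, key=lambda r: r[1])
    known_dates = [r[1] for r in ordered]
    cutoff = bisect_right(known_dates, decision_date)
    snapshot = {}
    for period, _known_at, value in ordered[:cutoff]:
        snapshot[period] = value  # later revisions overwrite earlier ones
    return snapshot
```

A backtest that asks `as_of(RECORDS, date(2024, 2, 15))` sees only the first January release, never the March revision, which is the property that rules out look-ahead.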
How it's built
Default stack: Python (Polars / Pandas for hot paths, DuckDB for ad-hoc), Postgres or your fund's existing warehouse for canonical storage, Prefect or Airflow for orchestration. ML modeling in scikit-learn / LightGBM, with a slim PyTorch layer when neural feature extraction is on-thesis (it usually isn't for tabular alt-data). The stack adapts — the bar doesn't.
What you get
- A production pipeline. Wired to your orchestrator. Idempotent.
- Backfilled history to the depth required for walk-forward.
- A baseline signal generator with documented hyperparameters and tracked performance.
- Attribution data joins.
- Runbooks for the ops your team will own after handoff.
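As one illustration of what the attribution joins make possible, the sketch below splits a signal's returns into a market-beta component and a residual using a single-factor least-squares fit. The function name and the single-factor simplification are assumptions made for the example; a real attribution would cover sector and style factors as well.

```python
def market_beta_split(signal_returns, market_returns):
    """Split signal returns into market-beta and residual parts (single-factor OLS)."""
    n = len(signal_returns)
    if n != len(market_returns) or n < 2:
        raise ValueError("need two equal-length series with at least two points")
    mean_s = sum(signal_returns) / n
    mean_m = sum(market_returns) / n
    # Sample covariance and variance share the same divisor, so it cancels in beta.
    cov = sum((m - mean_m) * (s - mean_s) for m, s in zip(market_returns, signal_returns))
    var = sum((m - mean_m) ** 2 for m in market_returns)
    if var == 0:
        raise ValueError("market returns have zero variance")
    beta = cov / var
    alpha = mean_s - beta * mean_m
    # What remains after removing the fitted market exposure.
    residuals = [s - (alpha + beta * m) for s, m in zip(signal_returns, market_returns)]
    return beta, alpha, residuals
```

If the residual series carries no return of its own, the signal was mostly repackaged beta.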
Engagement is shape, not list.
Length and price are functions of the data and the destination. The shape below is the typical engagement.
- Length
- 8–14 week build · ongoing data ops
- Scoped during the discovery call against the actual data and the operation it integrates with.
- Lead
- Bogdan
- Principal engineer. Architecture and most code ships through one keyboard.
- Cadence
- Async, weekly
- Written updates between, calls when the decision needs the room.
- Bar
- Production
- Async correctness, capacity under burst, observability at every boundary.
Products this composes with.
Same suite, or vertical-specialized versions in another.
- Same suite · Hedge Fund Suite
Trade-Credit & Supply-Chain Score
A monthly score and supporting attribution data on every covered public company — vendor payment cadence, trade-credit balance trajectory, supplier-concentration risk, supply-chain network deltas. Delivered via API and SFTP.
- Same suite · Hedge Fund Suite
Consumer Spending & Foot-Traffic Dashboard
Weekly dashboards combining anonymized card spend (US consumer) and foot-traffic (mapped to ticker via store-location databases) — earnings-window trend detection for the names and themes consumer funds trade.
- Same suite · Hedge Fund Suite
Cross-Lingual News & Filings Intelligence
Intraday sentiment scores and event extractions for tickers, sectors, and macro themes, sourced from non-English news, regulatory filings, and social — covering the languages where your existing stack goes dark.
What buyers ask about this one.
We already have a data engineering team. What's different about this engagement?
If your team has the bandwidth and the alt-data plumbing reps, this isn't the right fit. We come in when there's a specific dataset and a specific signal hypothesis but no clean path from one to the other — and the team you'd normally lean on is busy keeping the core feeds healthy. We deliver the path, then hand the operation back.
What datasets have you actually integrated?
Credit-card transaction panels (consumer and trade-credit), satellite imagery (vegetation indices, parking-lot counts), shipping manifests, foot-traffic, a handful of vertical-specific feeds we won't name. The integration shape is the same regardless — schema validation, idempotency, replay, backfill — so a dataset we haven't shipped doesn't change the engagement materially.
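A minimal sketch of that integration shape, covering validation at the boundary and an idempotent write. SQLite stands in for the warehouse here purely so the example is self-contained; the table, the field names, and the helper functions are hypothetical. Postgres accepts the same `ON CONFLICT ... DO UPDATE` form.

```python
import sqlite3

# Hypothetical boundary schema: field name -> required Python type.
REQUIRED_FIELDS = {"vendor_id": str, "as_of": str, "ticker": str, "value": float}


def validate(row):
    """Reject rows that are missing a field or carry the wrong type."""
    for name, expected in REQUIRED_FIELDS.items():
        if not isinstance(row.get(name), expected):
            raise ValueError(f"bad or missing field: {name}")
    return row


def open_store(path=":memory:"):
    """Create the table keyed on the natural key of an observation."""
    conn = sqlite3.connect(path)
    conn.execute(
        """CREATE TABLE IF NOT EXISTS observations (
               vendor_id TEXT, as_of TEXT, ticker TEXT, value REAL,
               PRIMARY KEY (vendor_id, as_of, ticker))"""
    )
    return conn


def ingest(conn, rows):
    """Upsert on the natural key so replaying a batch leaves the table unchanged."""
    clean = [validate(r) for r in rows]  # validate the whole batch before writing
    with conn:  # one transaction per batch
        conn.executemany(
            """INSERT INTO observations (vendor_id, as_of, ticker, value)
               VALUES (:vendor_id, :as_of, :ticker, :value)
               ON CONFLICT (vendor_id, as_of, ticker)
               DO UPDATE SET value = excluded.value""",
            clean,
        )
    return len(clean)
```

Because the write is keyed on the observation rather than on arrival order, replay from a checkpoint and historical backfill can reuse the same `ingest` path.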
Do you produce the signals or just the pipeline?
Both. The default deliverable is the pipeline plus a baseline signal generator that walks forward against your backtest infra. Funds with a research bench take the pipeline and replace the signal layer with their own. Funds without one keep ours.
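For readers unfamiliar with the term, a walk-forward evaluation can be sketched as a rolling sequence of train and test windows in which every test window sits strictly after its training window. The function name and the fixed window sizes are assumptions for illustration; the actual scheme depends on your backtest infra.

```python
def walk_forward_splits(n_obs, train_size, test_size):
    """Yield (train, test) index ranges; each test window follows its training window."""
    if train_size <= 0 or test_size <= 0:
        raise ValueError("window sizes must be positive")
    start = 0
    while start + train_size + test_size <= n_obs:
        train = range(start, start + train_size)
        test = range(start + train_size, start + train_size + test_size)
        yield train, test
        start += test_size  # roll forward by one test window


# Example: 10 observations, train on 4, test on the next 2.
for train, test in walk_forward_splits(10, 4, 2):
    print(list(train), "->", list(test))
```

The model is refit on each training window and scored only on observations it has not seen, which is what keeps the tracked performance honest.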
What's the relationship to the Trade-Credit Score and Foot-Traffic Dashboard products?
Those are ready-made data products: subscribe, get the score. This engine is the custom build for funds that want a unique dataset integrated end-to-end. The two patterns coexist on purpose: pre-built where the dataset is shared, custom where the edge requires that it not be.
Pricing?
Scoped against the dataset, the historical depth required, and the destination. The discovery call covers all three. No retainer.
If the deliverable matches the gap, the next step is one call.
We'll scope length and price against your data and the operation it integrates with. No retainer, no fishing.
Bogdan and team · async-first · OP—2026