Cross-Lingual News & Filings Intelligence
Intraday sentiment scores and event extractions for tickers, sectors, and macro themes, sourced from non-English news, regulatory filings, and social media, covering the languages where your existing stack goes dark.
- Engagement: 6–10 week build · monthly model retraining
- Built for: Macro PMs · Event-driven PMs · Multi-strat research leads
RavenPack, Bloomberg, AlphaSense: every fund already has English-language news sentiment. The edge is in the Chinese filings, the Brazilian regulator notices, and the Japanese local-press stories that run three days before the wire picks them up.
What this is
A sentiment and event-extraction layer for the languages your English-language stack can't read. Three layers:
- Source ingestion. Per-language sources — business press, regulatory filings (CSRC, KRX, HKEX, etc.), select social — wired into a normalized pipeline with deduplication and source-level reliability scoring.
- Per-language NER + scoring. Language-specific named-entity recognition for the names that matter to your universe — tickers, executives, regulators, brands. Sentiment and event classification with models fine-tuned per language, not translated-then-scored.
- Delivery. Intraday scores and event-time alerts via API. Snippets surfaced alongside scores so an analyst can audit the source.
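To make the ingestion layer concrete, here is a minimal sketch of deduplication and source-level reliability weighting. It is an illustration under stated assumptions, not the production pipeline: the `Item` fields, the source ids, and the weighting rule (a reliability-weighted mean) are all hypothetical, and the fingerprint only catches exact reprints after whitespace and case normalization.

```python
import hashlib
import re
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class Item:
    source: str         # hypothetical source id, e.g. "wire_a"
    text: str           # article or filing text, in the source language
    sentiment: float    # model output in [-1, 1], assumed already computed
    reliability: float  # source-level reliability weight in [0, 1]


def fingerprint(text: str) -> str:
    """Hash of whitespace-normalized, case-folded text (exact reprints only)."""
    normalized = re.sub(r"\s+", " ", text).strip().casefold()
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def dedupe(items: List[Item]) -> List[Item]:
    """Keep one copy per fingerprint, preferring the most reliable source."""
    best: Dict[str, Item] = {}
    for item in items:
        key = fingerprint(item.text)
        if key not in best or item.reliability > best[key].reliability:
            best[key] = item
    return list(best.values())


def weighted_sentiment(items: List[Item]) -> Optional[float]:
    """Reliability-weighted mean sentiment; None when there is no usable weight."""
    total = sum(i.reliability for i in items)
    if total == 0:
        return None
    return sum(i.sentiment * i.reliability for i in items) / total
```

Deduplicating before aggregation keeps a story reprinted across ten outlets from counting ten times. A real pipeline would likely need near-duplicate detection on top of this, since syndicated copies are rarely byte-identical.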
How it's built
For each supported language: an NER model fine-tuned on financial-domain text, an event-classification head, a sentiment head, and a source-reliability scoring layer. Backbone: one transformer encoder per language (a multilingual XLM-RoBERTa-class model for most, a language-specific model where the data justifies it). Inference is served behind a thin FastAPI layer. Backtest infra runs against an internal labeled set, which each engagement extends to cover the fund's universe.
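The per-language setup can be pictured as a small registry. The sketch below is illustrative only: `xlm-roberta-base` is a public multilingual checkpoint used here as a stand-in for "XLM-RoBERTa-class", the language-specific names are placeholders rather than real model identifiers, and which languages get their own backbone is an assumption made for the example.

```python
from dataclasses import dataclass
from typing import Dict, Tuple

# Public multilingual checkpoint, standing in for the default backbone.
DEFAULT_BACKBONE = "xlm-roberta-base"

# Placeholder names (NOT real model ids) for languages assumed to justify
# a dedicated backbone in this example.
LANGUAGE_SPECIFIC: Dict[str, str] = {
    "ja": "placeholder-japanese-encoder",
    "zh": "placeholder-chinese-encoder",
}


@dataclass(frozen=True)
class LanguageStack:
    language: str
    backbone: str
    # Every language gets the same three task heads on top of its backbone.
    heads: Tuple[str, ...] = ("ner", "event", "sentiment")


def stack_for(language: str) -> LanguageStack:
    """Pick the language-specific backbone if one is registered, else the default."""
    backbone = LANGUAGE_SPECIFIC.get(language, DEFAULT_BACKBONE)
    return LanguageStack(language=language, backbone=backbone)
```

Under these assumptions, `stack_for("pt")` falls back to the multilingual backbone while `stack_for("ja")` picks the dedicated one; adding a language is a registry entry plus the fine-tuned heads.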
What you get
- The source list curated for your universe.
- A scoring API — intraday sentiment, event classification, entity-level scores.
- Snippets surfaced alongside scores (translated for the analyst, scored in source).
- Backtest infra tied to your existing research stack.
- A monthly model-retraining schedule and the labeled-data process behind it.
Engagement is shape, not list.
Length and price are functions of the data and the destination. The shape below is the typical engagement.
- Length: 6–10 week build · monthly model retraining. Scoped during the discovery call against the actual data and the operation it integrates with.
- Lead: Bogdan. Principal engineer. Architecture and most code ships through one keyboard.
- Cadence: Async, weekly. Written updates between, calls when the decision needs the room.
- Bar: Production. Async correctness, capacity under burst, observability at every boundary.
Products this composes with.
From the same suite, or vertical-specialized versions in another suite.
- Same suite · Hedge Fund Suite
Alternative Data Signal Engine
A production pipeline that ingests one or more unconventional datasets, normalizes against a fund-internal schema, and serves processed factor scores, back-tested signals, and event-time alerts to the research stack.
- Same suite · Hedge Fund Suite
Prediction-Market Alpha Layer
A clean feed of prediction-market probabilities mapped to your existing macro and event-driven framework — Fed move probabilities, geopolitical risk markers, election-implied probabilities, joined to the equity sector exposures and macro positions they should influence.
What buyers ask about this one.
We have RavenPack. Why would we add this?
RavenPack is English-first. The serious edge sits where everyone else is dark — Chinese consumer brands picked up on Weibo, Brazilian commodity exporters covered in Portuguese press, Korean conglomerate filings in Korean. The product is built to cover those gaps, not duplicate the English coverage you already pay for.
Which languages are supported?
Chinese (Simplified + Traditional), Japanese, Korean, Portuguese, Spanish, Russian, German, French — depth varies. The engagement starts by picking the two or three you actually trade in and going deep, rather than spreading thin. The named-entity recognition layer is the bottleneck for each new language, not the translation.
What's the source list?
Per-language curated. The first engagement defines it against the universe you trade. For Chinese: top business press (Caixin, 21st Century Business Herald), regulatory filings (CSRC, HKEX), Weibo and Zhihu signal where useful. Per-language equivalents elsewhere. The curation matters more than the breadth.
How do you handle translation drift between source and signal?
We don't translate then score. We score in the source language with a language-specific model, then surface the relevant snippet (translated) alongside the score so the analyst can audit. Translation as evidence, not as input.
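A minimal sketch of that ordering, with hypothetical names throughout: `scorers` maps a language code to a per-language scoring function, and `translate` is whatever translation step produces the analyst-facing snippet. Neither is a real library API. The point is only that the scorer receives the source text and the translation is attached afterwards.

```python
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass(frozen=True)
class ScoredSnippet:
    language: str
    source_text: str      # what the model saw
    score: float          # computed from source_text only
    translated_text: str  # shown to the analyst, never fed to the scorer


def score_then_translate(
    language: str,
    source_text: str,
    scorers: Dict[str, Callable[[str], float]],
    translate: Callable[[str, str], str],
) -> ScoredSnippet:
    # Score in the source language with the language-specific model.
    score = scorers[language](source_text)
    # Translate only to produce audit evidence for the analyst.
    translated = translate(source_text, language)
    return ScoredSnippet(language, source_text, score, translated)
```

Because the translation never reaches the scorer, a mistranslated snippet can mislead the reader but cannot move the score.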
Pricing?
Scoped against language depth and source-list size. Discovery call covers both.
If the deliverable matches the gap, the next step is one call.
We'll scope length and price against your data and the operation it integrates with. No retainer, no fishing.
Bogdan and team · async-first · OP—2026