Vietnamese audio, video, and broadcast archives transcribed, scene-segmented, and metadata-tagged for retrieval. Diacritics corrected by native Vietnamese annotators. Pilot character error rate (CER) under 3% on news content, measured before human correction.
For library indexing, internal search, and any downstream use that needs the source material structured but not yet enriched.
Output formats: JSONL · SRT · VTT · Parquet
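For readers who want to check a CER figure against their own transcripts, the following is a minimal sketch of one common definition: character-level edit distance divided by reference length. It is an assumption about how such a number can be computed, not a description of how the pilot figure was measured. The function names `edit_distance` and `cer` are illustrative, and the NFC normalization step is an added assumption so that composed and decomposed Vietnamese diacritics compare as equal.

```python
import unicodedata


def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance between two strings, counted in characters."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cer(reference: str, hypothesis: str) -> float:
    """Character error rate: edit distance divided by reference length."""
    # Assumption: normalize to NFC so diacritics are single code points.
    ref = unicodedata.normalize("NFC", reference)
    hyp = unicodedata.normalize("NFC", hypothesis)
    if not ref:
        raise ValueError("reference must be non-empty")
    return edit_distance(ref, hyp) / len(ref)


# A transcript that drops two diacritics from a nine-character reference.
score = cer("phát sóng", "phat song")
```

Under this definition a dropped diacritic counts as one substitution, so diacritic handling weighs directly on the score.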
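Example, raw input: …bản tin chưa được phân loại … vùng … ngày phát sóng? … trùng lặp dữ liệu … VTV VOV ??? (…unclassified news bulletin … region … broadcast date? … duplicate data … VTV VOV ???)
Example, annotated output: VTV24 [ORG] đưa tin tại Nha Trang [GEO] ngày 14·03·2026 [DATE] (VTV24 [ORG] reported from Nha Trang [GEO] on 14·03·2026 [DATE])
topic: public-health · region: South-Central · rag_ready: true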
Tier 2
LLM Training Datasets
Tier 1 plus structured entities, topic classification, and RAG-ready passages with provenance metadata. Tuned for pretraining and fine-tuning of Vietnamese language models.
We retain only the methodology. The datasets, taxonomy, and edge-case handling are yours under contract.
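As an illustration of what a Tier 2 record could look like in the JSONL output, here is a minimal sketch. The keys `topic`, `region`, and `rag_ready` and the labels ORG, GEO, and DATE come from the annotated sample on this page. Everything else is an assumption made for illustration: the `entities` and `provenance` layout, the character offsets, the helper names `tag` and `build_record`, and the `source_id` value. The delivered schema may differ.

```python
import json


def tag(text: str, surface: str, label: str) -> dict:
    """Locate an entity's surface form in the text and record its span."""
    start = text.find(surface)
    if start < 0:
        raise ValueError(f"{surface!r} not found in text")
    return {"text": surface, "label": label,
            "start": start, "end": start + len(surface)}


def build_record(text, entities, topic, region, source_id):
    """Assemble one passage record (hypothetical schema)."""
    return {
        "text": text,
        "entities": entities,
        "topic": topic,
        "region": region,
        "rag_ready": True,
        # Hypothetical provenance layout; real fields are not specified here.
        "provenance": {"source_id": source_id},
    }


text = "VTV24 đưa tin tại Nha Trang ngày 14·03·2026"
record = build_record(
    text,
    entities=[
        tag(text, "VTV24", "ORG"),
        tag(text, "Nha Trang", "GEO"),
        tag(text, "14·03·2026", "DATE"),
    ],
    topic="public-health",
    region="South-Central",
    source_id="hypothetical-source-001",
)

# ensure_ascii=False keeps Vietnamese diacritics readable in the JSONL line.
line = json.dumps(record, ensure_ascii=False)
```

One record per line keeps the file streamable, and the character offsets let a consumer recover each entity by slicing the `text` field.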
Tier 2 plus expert validation by independent Vietnamese subject-matter experts. Source-grounded question-and-answer pairs with full citation chains. Contentious historical content cross-validated by two experts.
Built for LLM evaluation, benchmark publication, and any setting where the cost of an unverified answer is high.