I Tried Sarvam Vision: Is It the Best Model for Extracting Nepali Text?
Sabin Ranabhat
Published March 18, 2026 • Updated March 19, 2026 • 5 min read

Back Story
I was building a RAG workflow for our non-profit organization.
Most of our source files are image-based PDFs, so I needed reliable OCR before generating embeddings. Since this is a non-profit project, cost was also a major concern, and my first preference was a free and open-source approach.
Problem
Like many people, I started with Tesseract and its Python wrapper, pytesseract. I also tested PaddleOCR, another popular open-source alternative.
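For reference, a Tesseract baseline is only a few lines. Here is a minimal sketch, assuming the Nepali traineddata is installed (e.g. `tesseract-ocr-nep`); the lang-code mapping helper is my own convenience, not part of pytesseract:

```python
def tesseract_lang(code: str) -> str:
    # Map BCP-47 style codes (the form Sarvam uses, e.g. "ne-IN")
    # to Tesseract's three-letter traineddata names.
    return {"ne-IN": "nep", "hi-IN": "hin", "en-IN": "eng"}.get(code, "eng")

def ocr_page(image_path: str, code: str = "ne-IN") -> str:
    # Imports are local so the helper above works even without
    # Tesseract or Pillow installed.
    from PIL import Image
    import pytesseract
    return pytesseract.image_to_string(Image.open(image_path), lang=tesseract_lang(code))
```

This is the kind of one-call-per-page loop I started from before moving on.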
I tried PaddleOCR-VL-1.5, which was fast but produced output that was almost complete gibberish. I also tried PaddleOCR-VL, which was noticeably slower but gave far better results than VL-1.5.
Even though PaddleOCR-VL was much better than VL-1.5, the output was still not consistent enough for my production-style Nepali RAG use case.
That is where Sarvam AI came in. I chose it because it gave me more consistent extraction, plus image-context explanations and structured metadata that were easier to use in my RAG pipeline.
Introduction to Sarvam
Sarvam Vision is Sarvam AI's document intelligence model for extracting text and structure from PDFs, scans, and image-based documents. If your workflow involves multilingual records, government forms, academic material, or historical archives, this is one of the more relevant India-first models to keep on your radar.
What is Sarvam Vision?
Sarvam Vision is a 3B parameter vision-language model that powers Sarvam AI's document intelligence pipeline. According to Sarvam's public documentation, it is designed for high-accuracy OCR, layout preservation, and structured extraction across 22 Indian languages plus English.
Model Specifications
- Model size: 3B parameters
- Input formats: PDF, PNG, JPG, ZIP
- Output formats: HTML and Markdown
- Delivery format: Processed output is delivered as a ZIP file
- Languages: 23 languages (22 Indian + English)
Note: Sarvam's docs describe language support as 22 Indian languages plus English, and Nepali is also supported with ne-IN. I am still not sure why Nepali is grouped under Indian languages in their wording.
Solution
The approach was simple: use Sarvam Vision for OCR.
The only constraint I faced was PDF size per request (10 pages at a time in my testing). So I split a larger PDF into chunks, processed each chunk separately, and then merged the outputs.
I tested this pipeline on a 44-page sample document.
All tests in this post were run in March 2026.
Workflow I Used
- Split one large PDF into 10-page chunks.
- Run document intelligence on each chunk with `language="ne-IN"` and `output_format="md"`.
- Download each `output.zip` and rename it per chunk.
- Merge the extracted Markdown (or JSON) for downstream embedding.
Code:

```python
from sarvamai import SarvamAI
from PyPDF2 import PdfReader, PdfWriter
import os

def split_pdf(input_path, output_dir, chunk_size=10):
    reader = PdfReader(input_path)
    total_pages = len(reader.pages)
    os.makedirs(output_dir, exist_ok=True)
    file_paths = []
    for start in range(0, total_pages, chunk_size):
        writer = PdfWriter()
        end = min(start + chunk_size, total_pages)
        for i in range(start, end):
            writer.add_page(reader.pages[i])
        output_path = os.path.join(output_dir, f"chunk_{start+1}_to_{end}.pdf")
        with open(output_path, "wb") as f:
            writer.write(f)
        file_paths.append(output_path)
    return file_paths

client = SarvamAI(api_subscription_key="your-api-token")
chunks = split_pdf("/path/to/document.pdf", "chunks")

for i, chunk in enumerate(chunks):
    job = client.document_intelligence.create_job(
        language="ne-IN",
        output_format="md"
    )
    print(f"Job created: {job.job_id}")

    job.upload_file(chunk)
    print(f"File uploaded: {chunk}")

    job.start()
    print("Job started")

    status = job.wait_until_complete()
    print(f"Job completed with state: {status.job_state}")

    metrics = job.get_page_metrics()
    print(f"Page metrics: {metrics}")

    job.download_output("./output.zip")
    os.rename("./output.zip", f"./output_{i}.zip")
    print(f"Output saved to ./output_{i}.zip")
```
Note: In the sample code above, I skipped the final merge step to keep the example focused. You can find code here: https://gist.github.com/sawin0/34a69847983058e7790fc40a318ee9db
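The merge step I skipped can be sketched roughly like this, using only the standard library. It assumes each chunk's zip contains a Markdown file (a `document.md`, possibly nested in a folder); check the actual layout of your downloaded zips before relying on it:

```python
import zipfile

def merge_markdown(zip_paths, merged_path="merged.md"):
    # Concatenate the extracted Markdown from each chunk zip, in chunk order.
    parts = []
    for zip_path in zip_paths:
        with zipfile.ZipFile(zip_path) as zf:
            # Find Markdown files regardless of their folder inside the zip.
            md_names = sorted(n for n in zf.namelist() if n.endswith(".md"))
            for name in md_names:
                parts.append(zf.read(name).decode("utf-8"))
    merged = "\n\n".join(parts)
    with open(merged_path, "w", encoding="utf-8") as f:
        f.write(merged)
    return merged
```

Calling `merge_markdown(["output_0.zip", "output_1.zip", ...])` then gives you one file to feed into chunking and embedding.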
Result
Each output.zip contains:
- `document.md` with extracted content for the chunk
- `metadata/` JSON files with block-level details
```json
{
  "page_num": 1,
  "image_width": 2481,
  "image_height": 3507,
  "created_at": "2026-03-18T07:15:03.919966+00:00",
  "blocks": [
    {
      "block_id": "20260318_ed3141f3-f2a8-4d3f-a24e-330d8c17bdfb_1_block_000",
      "coordinates": {
        "x1": 95.2095718383789,
        "y1": 18.83642578125,
        "x2": 2471.339111328125,
        "y2": 729.4833984375
      },
      "layout_tag": "header",
      "confidence": 0.8720703125,
      "reading_order": 1,
      "text": "श्री स्वामी मागत समिति, दिव्य\nआध्यात्मिक सन्देश\nशरद्पूर्णिमा विशेषाङ्क\nम.क्षे.हु.नि.द.नं. ०७/०६७/६८\nदर्ता नं. ५६-०६२/६३\nवर्ष २१, (पूर्णाङ्क १९२) E-mail: shyamashyamdhamthimi@gmail.com / Website : www.divineclubworldwide.org २०८२"
    },
    {
      "block_id": "20260318_ed3141f3-f2a8-4d3f-a24e-330d8c17bdfb_1_block_001",
      "coordinates": {
        "x1": 108.0,
        "y1": 732.0,
        "x2": 2481.0,
        "y2": 3335.0
      },
      "layout_tag": "image",
      "confidence": 0.8528319001197815,
      "reading_order": 2,
      "text": "यो तस्बिरमा एक व्यक्ति ध्यान मुद्रामा बसेको देखिन्छ। उनले लामो कपाल राखेका छन् र घाँटीमा फूलको माला लगाएका छन्। उनको घाँटीमा एउटा सानो टिक्ली पनि देखिन्छ। पृष्ठभूमिमा एउटा उज्यालो घेरा छ, जसले उनीमाथि एक प्रकारको आभा सिर्जना गरेको छ। यो तस्बिरको रङ सेपिया (sepia) छ, जसले यसलाई पुरानो शैलीको देखाउँछ।"
    }
  ]
}
```
In my testing, there were still a few incorrect words, but overall extraction quality was around 95%+ for the full PDF. Another useful capability is image-region explanation in the metadata output, which I found especially valuable.
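Because each block carries `layout_tag`, `confidence`, and `reading_order`, filtering the metadata for RAG ingestion is straightforward. A minimal sketch, using only field names that appear in the sample above (the confidence threshold and the `[image]` prefix are my own choices):

```python
def blocks_to_text(page, min_confidence=0.5):
    # Keep reasonably confident blocks, in reading order; tag image
    # descriptions so they can be embedded or weighted separately.
    kept = [b for b in page["blocks"] if b.get("confidence", 0) >= min_confidence]
    kept.sort(key=lambda b: b.get("reading_order", 0))
    lines = []
    for b in kept:
        prefix = "[image] " if b.get("layout_tag") == "image" else ""
        lines.append(prefix + b["text"])
    return "\n\n".join(lines)
```

Run over every `page_*.json` in `metadata/`, this gives cleaner text than the raw Markdown alone, and keeps the image explanations attached to the right page.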
Usage and Cost
After extraction, I checked the Sarvam dashboard:
- Total spend shown: `₹111` for 74 pages
- Effective cost/page shown: `~₹1.5/page`
I am not sure why the dashboard showed 74 pages while my sample had 44 pages, so that is something to validate on your side during testing.
Later, I found the current pricing note in Sarvam's docs for the Document Intelligence API:
- Free: `₹0/page`
- Status: currently free to use
So, treat the dashboard spend in this experiment as an observed usage metric, not necessarily a final billed amount. In my run, I did not pay out of pocket.
My Take
- If you are working with image-based Nepali documents, Sarvam Vision is worth trying.
- Accuracy was good enough for practical RAG ingestion in my case.
- Sarvam Vision explains image context, which can be a great addition for RAG.
- If you want a free and open-source solution, go with PaddleOCR-VL. It gave around 90% accuracy for me and is still an awesome baseline.
- As of this writing, Document Intelligence is listed as free (`₹0/page`) in the docs, but always verify the latest pricing before production use.