I Tried Sarvam Vision: Is It the Best Model for Extracting Nepali Text?
Sabin Ranabhat
Published March 18, 2026 • Updated March 19, 2026 • 5 min read

Back Story
I was building a RAG workflow for our non-profit organization.
Most of our source files are image-based PDFs, so I needed reliable OCR before generating embeddings. Since this is a non-profit project, cost was also a major concern, and my first preference was a free and open-source approach.
Problem
Like many people, I started with Tesseract and its Python wrapper, pytesseract. I also tested PaddleOCR, another popular open-source alternative.
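For reference, a Tesseract baseline is only a few lines. Here is a minimal sketch, assuming the Nepali traineddata is installed (e.g. `tesseract-ocr-nep`); the lang-code mapping helper is my own convenience, not part of pytesseract:

```python
def tesseract_lang(code: str) -> str:
    # Map BCP-47 style codes (the form Sarvam uses, e.g. "ne-IN")
    # to Tesseract's three-letter traineddata names.
    return {"ne-IN": "nep", "hi-IN": "hin", "en-IN": "eng"}.get(code, "eng")

def ocr_page(image_path: str, code: str = "ne-IN") -> str:
    # Imports are local so the helper above works even without
    # Tesseract or Pillow installed.
    from PIL import Image
    import pytesseract
    return pytesseract.image_to_string(Image.open(image_path), lang=tesseract_lang(code))
```

This is the kind of one-call-per-page loop I started from before moving on.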
I tried PaddleOCR-VL-1.5, which was fast but produced output that was almost complete gibberish. I also tried PaddleOCR-VL, which was noticeably slower but gave far better results than VL-1.5.
Even though PaddleOCR-VL was much better than VL-1.5, the output was still not consistent enough for my production-style Nepali RAG use case.
That is where Sarvam AI came in. I chose it because it gave me more consistent extraction, plus image-context explanations and structured metadata that were easier to use in my RAG pipeline.
Introduction to Sarvam
Sarvam Vision is Sarvam AI's document intelligence model for extracting text and structure from PDFs, scans, and image-based documents. If your workflow involves multilingual records, government forms, academic material, or historical archives, this is one of the more relevant India-first models to keep on your radar.
What is Sarvam Vision?
Sarvam Vision is a 3B parameter vision-language model that powers Sarvam AI's document intelligence pipeline. According to Sarvam's public documentation, it is designed for high-accuracy OCR, layout preservation, and structured extraction across 22 Indian languages plus English.
Model Specifications
- Model size: 3B parameters
- Input formats: PDF, PNG, JPG, ZIP
- Output formats: HTML and Markdown
- Delivery format: Processed output is delivered as a ZIP file
- Languages: 23 languages (22 Indian + English)
Note: Sarvam's docs describe language support as 22 Indian languages plus English, and Nepali is also supported with ne-IN. I am still not sure why Nepali is grouped under Indian languages in their wording.
Solution
The approach was simple: use Sarvam Vision for OCR.
The only constraint I faced was PDF size per request (10 pages at a time in my testing). So I split a larger PDF into chunks, processed each chunk separately, and then merged the outputs.
I tested this pipeline on a 44-page sample document.
All tests in this post were run in March 2026.
Workflow I Used
- Split one large PDF into 10-page chunks.
- Run document intelligence on each chunk with `language="ne-IN"` and `output_format="md"`.
- Download each `output.zip` and rename it per chunk.
- Merge the extracted Markdown (or JSON) for downstream embedding.
Code:

```python
from sarvamai import SarvamAI
from PyPDF2 import PdfReader, PdfWriter
import os

def split_pdf(input_path, output_dir, chunk_size=10):
    reader = PdfReader(input_path)
    total_pages = len(reader.pages)
    os.makedirs(output_dir, exist_ok=True)
    file_paths = []
    for start in range(0, total_pages, chunk_size):
        writer = PdfWriter()
        end = min(start + chunk_size, total_pages)
        for i in range(start, end):
            writer.add_page(reader.pages[i])
        output_path = os.path.join(output_dir, f"chunk_{start+1}_to_{end}.pdf")
        with open(output_path, "wb") as f:
            writer.write(f)
        file_paths.append(output_path)
    return file_paths

client = SarvamAI(api_subscription_key="your-api-token")
chunks = split_pdf("/path/to/document.pdf", "chunks")

for i, chunk in enumerate(chunks):
    job = client.document_intelligence.create_job(
        language="ne-IN",
        output_format="md"
    )
    print(f"Job created: {job.job_id}")

    job.upload_file(chunk)
    print(f"File uploaded: {chunk}")

    job.start()
    print("Job started")

    status = job.wait_until_complete()
    print(f"Job completed with state: {status.job_state}")

    metrics = job.get_page_metrics()
    print(f"Page metrics: {metrics}")

    job.download_output("./output.zip")
    os.rename("./output.zip", f"./output_{i}.zip")
    print(f"Output saved to ./output_{i}.zip")
```
Note: In the sample code above, I skipped the final merge step to keep the example focused. You can find code here: https://gist.github.com/sawin0/34a69847983058e7790fc40a318ee9db
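The merge step I skipped can be sketched roughly like this, using only the standard library. It assumes each chunk's zip contains a Markdown file (a `document.md`, possibly nested in a folder); check the actual layout of your downloaded zips before relying on it:

```python
import zipfile

def merge_markdown(zip_paths, merged_path="merged.md"):
    # Concatenate the extracted Markdown from each chunk zip, in chunk order.
    parts = []
    for zip_path in zip_paths:
        with zipfile.ZipFile(zip_path) as zf:
            # Find Markdown files regardless of their folder inside the zip.
            md_names = sorted(n for n in zf.namelist() if n.endswith(".md"))
            for name in md_names:
                parts.append(zf.read(name).decode("utf-8"))
    merged = "\n\n".join(parts)
    with open(merged_path, "w", encoding="utf-8") as f:
        f.write(merged)
    return merged
```

Calling `merge_markdown(["output_0.zip", "output_1.zip", ...])` then gives you one file to feed into chunking and embedding.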
Result
Each output.zip contains:
- `document.md` with extracted content for the chunk
- `metadata/` JSON files with block-level details
```json
{
  "page_num": 1,
  "image_width": 2481,
  "image_height": 3507,
  "created_at": "2026-03-18T07:15:03.919966+00:00",
  "blocks": [
    {
      "block_id": "20260318_ed3141f3-f2a8-4d3f-a24e-330d8c17bdfb_1_block_000",
      "coordinates": {
        "x1": 95.2095718383789,
        "y1": 18.83642578125,
        "x2": 2471.339111328125,
        "y2": 729.4833984375
      },
      "layout_tag": "header",
      "confidence": 0.8720703125,
      "reading_order": 1,
      "text": "श्री स्वामी मागत समिति, दिव्य\nआध्यात्मिक सन्देश\nशरद्पूर्णिमा विशेषाङ्क\nम.क्षे.हु.नि.द.नं. ०७/०६७/६८\nदर्ता नं. ५६-०६२/६३\nवर्ष २१, (पूर्णाङ्क १९२) E-mail: shyamashyamdhamthimi@gmail.com / Website : www.divineclubworldwide.org २०८२"
    },
    {
      "block_id": "20260318_ed3141f3-f2a8-4d3f-a24e-330d8c17bdfb_1_block_001",
      "coordinates": {
        "x1": 108.0,
        "y1": 732.0,
        "x2": 2481.0,
        "y2": 3335.0
      },
      "layout_tag": "image",
      "confidence": 0.8528319001197815,
      "reading_order": 2,
      "text": "यो तस्बिरमा एक व्यक्ति ध्यान मुद्रामा बसेको देखिन्छ। उनले लामो कपाल राखेका छन् र घाँटीमा फूलको माला लगाएका छन्। उनको घाँटीमा एउटा सानो टिक्ली पनि देखिन्छ। पृष्ठभूमिमा एउटा उज्यालो घेरा छ, जसले उनीमाथि एक प्रकारको आभा सिर्जना गरेको छ। यो तस्बिरको रङ सेपिया (sepia) छ, जसले यसलाई पुरानो शैलीको देखाउँछ।"
    }
  ]
}
```
In my testing, there were still a few incorrect words, but overall extraction quality was around 95%+ for the full PDF. Another useful capability is image-region explanation in the metadata output, which I found especially valuable.
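Because each block carries `layout_tag`, `confidence`, and `reading_order`, filtering the metadata for RAG ingestion is straightforward. A minimal sketch, using only field names that appear in the sample above (the confidence threshold and the `[image]` prefix are my own choices):

```python
def blocks_to_text(page, min_confidence=0.5):
    # Keep reasonably confident blocks, in reading order; tag image
    # descriptions so they can be embedded or weighted separately.
    kept = [b for b in page["blocks"] if b.get("confidence", 0) >= min_confidence]
    kept.sort(key=lambda b: b.get("reading_order", 0))
    lines = []
    for b in kept:
        prefix = "[image] " if b.get("layout_tag") == "image" else ""
        lines.append(prefix + b["text"])
    return "\n\n".join(lines)
```

Run over every `page_*.json` in `metadata/`, this gives cleaner text than the raw Markdown alone, and keeps the image explanations attached to the right page.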
Usage and Cost
After extraction, I checked the Sarvam dashboard:
- Total spend shown: `₹111` for 74 pages
- Effective cost/page shown: `~₹1.5/page`
I am not sure why the dashboard showed 74 pages while my sample had 44 pages, so that is something to validate on your side during testing.
Later, I found the current pricing note in Sarvam's docs for the Document Intelligence API:
- Free: `₹0/page`
- Status: currently free to use
So, treat the dashboard spend in this experiment as an observed usage metric, not necessarily a final billed amount. In my run, I did not pay out of pocket.
My Take
- If you are working with image-based Nepali documents, Sarvam Vision is worth trying.
- Accuracy was good enough for practical RAG ingestion in my case.
- Sarvam Vision explains image context, which can be a great addition for RAG.
- If you want a free and open-source solution, go with PaddleOCR-VL. It gave around 90% accuracy for me and is still an awesome baseline.
- As of this writing, Document Intelligence is listed as free (`₹0/page`) in the docs, but always verify the latest pricing before production use.