19/04/2026
Nepal has decades of government records (land titles, court judgments, budget reports, economic analyses, gazette notices, and census data) stored as PDFs. Most of these documents use legacy Nepali fonts (Preeti, Kantipur, Sagarmatha) that map Devanagari glyphs onto ASCII code points. When a standard extraction tool reads the text layer, it returns garbage: g]kfn instead of नेपाल. No error is raised and no warning is emitted; the output is silently wrong. Other documents are scanned images with no text layer at all. And even when characters are extracted correctly, mainstream tools have no understanding of Nepal’s document schemas: Bikram Sambat dates, ward and VDC hierarchies, kittaa land parcel notation, or NPR currency.
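To make the legacy-font problem concrete, here is a minimal Python sketch of what decoding involves. The five-entry table covers only the characters in the g]kfn example, and the function name preeti_to_unicode is hypothetical; this is an illustration of the idea, not LamiSema's actual converter or API.

```python
# Illustrative fragment of a Preeti-to-Unicode table (assumption: only the
# five keys needed for the example; a real table has well over a hundred
# entries, including multi-character sequences).
PREETI_SAMPLE_MAP = {
    "g": "\u0928",  # न
    "]": "\u0947",  # े (vowel sign e)
    "k": "\u092A",  # प
    "f": "\u093E",  # ा (vowel sign aa)
    "n": "\u0932",  # ल
}


def preeti_to_unicode(text: str) -> str:
    """Map each ASCII byte to its Devanagari code point.

    Characters missing from the sample table are passed through unchanged.
    """
    return "".join(PREETI_SAMPLE_MAP.get(ch, ch) for ch in text)


print(preeti_to_unicode("g]kfn"))  # नेपाल
```

A one-to-one lookup is enough for this word, but a full converter also has to reorder characters (for example, the short-i vowel sign is typed before its consonant in these fonts) and assemble conjuncts, which is why a plain character substitution is not sufficient in general.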
LamiSema addresses this end to end: it detects the encoding type first, routes each document to the matching extraction strategy, recovers document structure, and extracts structured meaning using Nepal’s own administrative and legal vocabulary.
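The detect-then-route step can be sketched as follows. This is a rough illustration under stated assumptions: the function classify_page, the route labels, and the font-name heuristic are all hypothetical and are not taken from LamiSema's code or documentation.

```python
# Legacy font families named above; matching on embedded font names is an
# assumed heuristic, not necessarily how LamiSema detects encodings.
LEGACY_FONT_HINTS = ("preeti", "kantipur", "sagarmatha")


def has_devanagari(text: str) -> bool:
    """True if any character falls in the Devanagari block U+0900..U+097F."""
    return any("\u0900" <= ch <= "\u097F" for ch in text)


def classify_page(text: str, font_names: list[str]) -> str:
    """Pick an extraction route for one page.

    text       -- the page's extracted text layer (may be empty)
    font_names -- font names embedded in the page, e.g. "ABCDEF+Preeti"
    """
    if not text.strip():
        return "ocr"  # scanned image: no text layer to read
    if any(hint in name.lower() for name in font_names for hint in LEGACY_FONT_HINTS):
        return "legacy_font"  # ASCII bytes that need font-specific decoding
    if has_devanagari(text):
        return "unicode"  # text layer is already valid Devanagari
    return "latin"  # plain Latin text, e.g. an English-language report


print(classify_page("g]kfn", ["ABCDEF+Preeti"]))  # legacy_font
```

Checking for an empty text layer first matters because a scanned page carries no font information at all. The order of the remaining checks is a design choice of this sketch: font names are consulted before the text itself because legacy-encoded text is indistinguishable from ordinary ASCII by code point alone.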
Learn more: https://lamisema.readthedocs.io/en/latest/