
by nhirschfeld 531 days ago

Thanks for asking!

It's both. The OCR part is ofc CPU bound, but the entire text extraction involves reading files, or writing and then reading files.

Without async, these simply block.

As for efficiency - if you're working in an async application context you have to "asyncify" these operations or suffer the consequences.
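A minimal sketch of what "asyncifying" a blocking file read can look like, assuming Python 3.9+ and the standard library's `asyncio.to_thread`. The names `read_file_blocking`, `read_file_async`, and `demo` are hypothetical, chosen for illustration; this is not the library's actual code.

```python
import asyncio
import tempfile
from pathlib import Path


def read_file_blocking(path: Path) -> bytes:
    # Ordinary blocking read. Called directly inside a coroutine,
    # this would stall the event loop until the read finishes.
    return path.read_bytes()


async def read_file_async(path: Path) -> bytes:
    # Run the blocking call in a worker thread so the event loop
    # stays free to schedule other tasks in the meantime.
    return await asyncio.to_thread(read_file_blocking, path)


async def demo() -> bytes:
    # Write a small sample file, then read it back without blocking the loop.
    with tempfile.TemporaryDirectory() as tmp:
        path = Path(tmp) / "sample.txt"
        path.write_bytes(b"hello")
        return await read_file_async(path)


if __name__ == "__main__":
    print(asyncio.run(demo()))
```

The read itself is not faster this way; the point is only that other coroutines can keep running while it is in progress.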


skavi 530 days ago

in that case, what’s the deal with extract_bytes being async? i’m not incredibly familiar with python, but i’d expect a “byte string” to be in memory.


nhirschfeld 530 days ago

You still need to write it to file to process it via pandoc/tesseract etc.

There are alternative options to tesseract ofc.
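A rough sketch of why a bytes-in function can still be async: the bytes are written to a temporary file, and an external tool is run on that file as a subprocess. This assumes Python 3.9+. The names `write_temp_file`, `run_tool_on_bytes`, and `demo` are hypothetical, and the command in `demo` is a stand-in that uses the Python interpreter itself rather than a real pandoc or tesseract invocation.

```python
import asyncio
import os
import sys
import tempfile


def write_temp_file(content: bytes) -> str:
    # Blocking helper: write the bytes to a temp file and close it,
    # so the external tool can open it by path.
    fd, path = tempfile.mkstemp(suffix=".bin")
    with os.fdopen(fd, "wb") as handle:
        handle.write(content)
    return path


async def run_tool_on_bytes(content: bytes, command: list[str]) -> bytes:
    # The file write happens in a worker thread; the tool runs as an
    # async subprocess with the temp file path appended to the command.
    path = await asyncio.to_thread(write_temp_file, content)
    try:
        process = await asyncio.create_subprocess_exec(
            *command,
            path,
            stdout=asyncio.subprocess.PIPE,
            stderr=asyncio.subprocess.PIPE,
        )
        stdout, stderr = await process.communicate()
        if process.returncode != 0:
            raise RuntimeError(stderr.decode(errors="replace"))
        return stdout
    finally:
        # Always clean up the temp file, even if the tool failed.
        await asyncio.to_thread(os.unlink, path)


async def demo() -> bytes:
    # Stand-in "tool" (an assumption for illustration): a Python one-liner
    # that reads the file named by its first argument and upper-cases it.
    script = (
        "import sys; "
        "sys.stdout.write(open(sys.argv[1], 'rb').read().decode().upper())"
    )
    return await run_tool_on_bytes(b"hello", [sys.executable, "-c", script])


if __name__ == "__main__":
    print(asyncio.run(demo()))
```

Both the file write and the wait on the subprocess are points where a synchronous version would block, which is the case being made for the async signature.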


LoganDark 530 days ago

> You still need to write it to file to process it via pandoc/tesseract etc.

This sounds... I guess Pythonic? Sheesh.
