If you're trying to handle text "in the wild" rather than in scanned documents, the keyword to search for is "scene text". Most papers focus on either detection/localization (finding where the text is in the image) or recognition (reading the actual content from an already-cropped text image).
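To make the detection/recognition split concrete, here is a minimal Python sketch of how the two stages are usually chained. All names here (`Box`, `crop`, `read_scene_text`, and the `detect`/`recognize` callables) are hypothetical and not taken from any particular paper or library; the image is a plain list of pixel rows so the example runs without dependencies.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence, Tuple

# Stand-in for a real image array: rows of pixel values (assumption for this sketch).
Image = Sequence[Sequence[int]]


@dataclass(frozen=True)
class Box:
    """Axis-aligned text region: top-left corner plus size, in pixels."""
    x: int
    y: int
    w: int
    h: int


def crop(image: Image, box: Box) -> List[List[int]]:
    """Cut the region described by `box` out of `image`."""
    rows = image[box.y:box.y + box.h]
    return [list(row[box.x:box.x + box.w]) for row in rows]


def read_scene_text(
    image: Image,
    detect: Callable[[Image], List[Box]],
    recognize: Callable[[List[List[int]]], str],
) -> List[Tuple[Box, str]]:
    """Stage 1 (detection) finds text regions; stage 2 (recognition) reads each crop."""
    return [(box, recognize(crop(image, box))) for box in detect(image)]
```

In practice `detect` and `recognize` would each wrap a trained model; keeping them as separate callables mirrors the way the literature treats them as separate problems, so either stage can be swapped out independently.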
Here are some current state-of-the-art detection papers, with code where available:
Note that this paper is from 2010 and thus, while quite influential in its time, is far from the current state of the art. The stroke width transform method that it introduced is simply not as good as current deep-learning-based methods.
For an overview, see this survey from 2016 (slightly out of date, but the field is moving very fast):