I guess one of the bigger challenges when it comes to unstructured data is identifying URLs. Is there a canonical way of identifying a URL embedded in text? Is it an impossible problem?
This regexp lets you parse a valid URI, but it also matches a lot of things that are not URIs at all.
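To make that concrete, here is a rough sketch, assuming the regexp in question is the one from RFC 3986, Appendix B. The helper name `split_uri` is made up for illustration. Every component in that pattern is optional, so it splits a known URI into parts but accepts plain prose just as happily:

```python
import re

# Regexp from RFC 3986, Appendix B. It is meant for splitting a string that
# is already known to be a URI reference, not for validating or finding one.
RFC3986_APPENDIX_B = re.compile(
    r"^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?"
)

def split_uri(text):
    """Hypothetical helper: return the five components Appendix B names."""
    m = RFC3986_APPENDIX_B.match(text)  # always matches, even the empty string
    return {
        "scheme": m.group(2),
        "authority": m.group(4),
        "path": m.group(5),
        "query": m.group(7),
        "fragment": m.group(9),
    }

print(split_uri("http://www.ics.uci.edu/pub/ietf/uri/#Related"))
print(split_uri("not a URI at all"))  # parsed as a bare path, no scheme
```

The second call comes back with everything in `path` and `None` for the scheme, which is why the pattern is no use for detection on its own.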
The URI language is of course regular, so it would be possible to construct a regexp that matches only URIs. But naively applying such a regexp wouldn't work in practice, because many punctuation characters are allowed in URIs. For example, single quotes are allowed, so in this Python code the regexp would match too much:
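A minimal sketch of that kind of over-match (not the original snippet; the pattern `PATTERN` and the sample string are made up for illustration, and the character class is only a loose approximation of what RFC 3986 permits):

```python
import re

# Hypothetical permissive pattern: a scheme, "://", then any run of
# characters that RFC 3986 allows somewhere in a URI (unreserved, reserved
# and "%"). Note that both "'" and ")" are in the allowed set.
URI_CHARS = r"[A-Za-z0-9\-._~:/?#\[\]@!$&'()*+,;=%]"
PATTERN = re.compile(r"[A-Za-z][A-Za-z0-9+.\-]*://" + URI_CHARS + "+")

source = "print('http://example.com/')"
match = PATTERN.search(source)

# The match runs past the intended URL and swallows the closing quote and
# parenthesis of the surrounding Python code.
print(match.group(0))  # http://example.com/')
```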
I see. The standard allows most of the characters we normally use to surround URIs. It sure does look like a difficult problem then, and one that a regexp alone can't solve.
https://tools.ietf.org/html/rfc3986#appendix-B
Edit: I see it's not quite that simple. However, I still think that with some stricter matching requirements this could work.
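One way "stricter matching" could look, as a heuristic sketch rather than anything the RFC prescribes: grab a generous candidate, then trim trailing punctuation and unbalanced closing parentheses. The names `CANDIDATE`, `TRAILING` and `find_urls`, and the restriction to http/https, are my own assumptions.

```python
import re

# Generous candidate: http(s) scheme, then anything up to whitespace,
# angle brackets or a double quote.
CANDIDATE = re.compile(r"https?://[^\s<>\"]+")

# Characters that are legal in a URI but far more likely to be surrounding
# prose or code when they appear at the very end of a match.
TRAILING = ".,;:!?'\""

def find_urls(text):
    """Heuristic URL finder: match generously, then trim the tail."""
    urls = []
    for m in CANDIDATE.finditer(text):
        url = m.group(0)
        while url:
            if url[-1] in TRAILING:
                url = url[:-1]
            elif url[-1] == ")" and url.count(")") > url.count("("):
                # Drop a ")" only when it has no matching "(" in the URL.
                url = url[:-1]
            else:
                break
        urls.append(url)
    return urls

print(find_urls("print('http://example.com/')"))
print(find_urls("See https://en.wikipedia.org/wiki/URL_(disambiguation)."))
```

The trade-off is that a URI which really does end in a quote or a period gets truncated, so this buys precision on typical text by giving up strict conformance.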