by roel_v
4789 days ago
"* HTML semantics are non-existent

These are all relatively easy to fix, I believe."

How? For example, how would you tell which <span>s (or whatever this converter emits) are headings, page headers/footers, a ToC, or a preface? IMO this is an AI-hard problem, and even the 'simple' approximation (statistics) is very hard because of the wide variety in inputs: a model trained on a corpus of multi-column journal articles will most likely not work at all for books (although I haven't tried and would love to be proven wrong).

Use case: a working (i.e., semantics-preserving) pdf-to-epub converter. This would, imho, be a killer product / service.