Let's cut to the chase: PDFs are evil. If you've done any kind of work with PDFs--be it reading them, parsing their text, OCR, etc.--you know what I mean.

An xkcd comic poking fun at PDFs

After pausing on the Authtable.com project for a while, I came back to it recently only to realize the layout of text in documents has completely changed, breaking almost all of the text parsing/extraction functions I implemented (and the various features built on top of them). In the real world--with actual, paying customers--this would lead to hours, maybe days, of downtime.

As anyone familiar with DDD supports in NJ knows, these dang PDFs don't read top-to-bottom, left-to-right (or is it left-to-right, top-to-bottom? Eh, you know what I mean). Long text shifts from a column on one page to the same column on the next page. Text gets literally split in half by random line breaks. I could go on. These changes have made me question the viability of continuing to work on this project.
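To make the column problem concrete, here is a rough sketch of the kind of reordering I mean. It is not the actual Authtable.com code. It assumes whatever extraction library you use already hands back each chunk of text with its page number and top-left coordinates. `Block`, `reading_order`, and the `column_gap` threshold are all made-up names and values for illustration.

```python
from dataclasses import dataclass


@dataclass
class Block:
    """One chunk of extracted text (hypothetical shape, not a real library type)."""
    text: str
    x0: float   # left edge of the block
    y0: float   # top edge of the block (smaller = higher on the page)
    page: int   # 1-based page number


def reading_order(blocks, column_gap=50.0):
    """Group blocks into columns by left edge, then read each column
    top to bottom across ALL pages before moving to the next column."""
    columns = []  # each entry: [left edge of the column, [blocks in it]]
    for block in sorted(blocks, key=lambda b: b.x0):
        # A block joins the current column if its left edge is close enough
        # to that column's left edge; otherwise it starts a new column.
        if columns and block.x0 - columns[-1][0] <= column_gap:
            columns[-1][1].append(block)
        else:
            columns.append([block.x0, [block]])

    ordered = []
    for _, column_blocks in columns:
        # Within a column, follow the text from page to page, top to bottom.
        ordered.extend(sorted(column_blocks, key=lambda b: (b.page, b.y0)))
    return ordered


if __name__ == "__main__":
    sample = [
        Block("Goal 1 continues here", 40, 100, 2),
        Block("Outcome A", 300, 100, 1),
        Block("Goal 1 starts", 40, 100, 1),
        Block("Outcome B", 302, 100, 2),
    ]
    for block in reading_order(sample):
        print(block.text)
```

The catch, of course, is that this only works while the columns stay where you expect them. The moment the layout changes, a hard-coded threshold like `column_gap` is exactly the sort of thing that breaks.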

In reality, this app is probably better suited as an internal line-of-business application that can be maintained better. I could open-source this project, but do any provider agencies in NJ have developers on staff? I know many that don't even bother with in-house IT, so I'm not sure they'd be interested in taking on developing and maintaining a web application.

I am satisfied with the work I've done on this project already; I just don't know what its future is. I could abandon some of the features that are break-prone and just extract text and build some neat things around that (AI anyone?). There are some amazing PDF tools already out there like Paperless-ngx and Stirling-PDF, but the same barrier applies. I still think there is value here (particularly in just extracting text and having a searchable database); I'm just not sure how much.

Anyway, just jotting my thoughts down—part 1/n of building in public. Happy Dread-of-Winter Tuesday!


