Digital textbook extraction

TokyoMonsterTrucker@lemmy.dbzer0.com · 1 year ago

Digital textbook extraction

FlyForABeeGuy@lemmy.dbzer0.com · 1 year ago

I had a few books like that which were only viewable directly on a scummy academic publisher’s website. No PDF or usable files. I’m currently far from home, so I can’t tell you exactly what program I used. But I noticed that every page was downloaded into my temporary files as image data (a cached version of the page). So I had to manually flip a few pages, download them one by one, and name them correctly. I’ll look on my PC to try to find the program that did that when I’m back.
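The "name them correctly" step is the tedious part, and it can be scripted. Below is a rough sketch, assuming the page images have already been saved into one folder in reading order, so that file modification time reflects page order. The function name `rename_pages_in_order`, the `page_` prefix, and the extension list are made up for illustration and are not taken from any tool mentioned in this thread.

```python
from pathlib import Path


def rename_pages_in_order(folder, prefix="page_",
                          exts=(".png", ".jpg", ".jpeg", ".webp")):
    """Rename image files in `folder` to page_0001.ext, page_0002.ext, ...

    Assumption: files were saved in reading order, so sorting by
    modification time (file name as a tie-breaker) gives page order.
    Files with other extensions are left untouched.
    """
    folder = Path(folder)
    images = sorted(
        (p for p in folder.iterdir()
         if p.is_file() and p.suffix.lower() in exts),
        key=lambda p: (p.stat().st_mtime, p.name),
    )
    renamed = []
    for number, path in enumerate(images, start=1):
        target = folder / f"{prefix}{number:04d}{path.suffix.lower()}"
        if target != path:
            # Refuse to overwrite anything that already has the target name.
            if target.exists():
                raise FileExistsError(f"{target} already exists")
            path.rename(target)
        renamed.append(target.name)
    return renamed


# Example call (the folder name is hypothetical):
# rename_pages_in_order("saved_pages")
```

Zero-padded numbers keep the files sorted correctly in any file manager, which matters once a book runs past a few hundred pages.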

theoware@sh.itjust.works · 1 year ago

Sounds like you could also use an image downloader browser extension for that.

TokyoMonsterTrucker@lemmy.dbzer0.com · 1 year ago

Sounds promising! Please let me know what you find.

FlyForABeeGuy@lemmy.dbzer0.com · 1 year ago

It was MZCacheView, but the same author made one for Chrome and a general one. But theoware is probably right, a browser extension could also do it!

TokyoMonsterTrucker@lemmy.dbzer0.com · 1 year ago

Looks like this particular publisher has anticipated cache sniffing. No dice.

theoware@sh.itjust.works · 1 year ago

You can try printing the page.

TokyoMonsterTrucker@lemmy.dbzer0.com · 1 year ago

I’m looking for something a bit more detailed. I’d like to auto-scrape the entire book.

KevonLooney@lemm.ee · 1 year ago

Why don’t you simply open the book in a virtual machine like VMware and hit print? It can print to a PDF.

TokyoMonsterTrucker@lemmy.dbzer0.com · 1 year ago

I can print pages to PDF without a VM. The problem with printing is that these books are over 1000 pages, so I need to automate a good chunk of the process. Ideally, I’d like to capture the XML text for the PDF as well, since it will look much better and I won’t have to manually crop 1000 PDFs with annoying borders.

KevonLooney@lemm.ee · 1 year ago

Yeah, I believe you can do that by printing to a non-existent printer and then finding the file image waiting in the print queue. I don’t know if it works on Windows 11 but it used to work pretty well.

theoware@sh.itjust.works · 1 year ago

Then this method probably won’t work for you.