Thanks to Andy, and others, for the suggestion of HTTrack. While it didn’t download the whole site cleanly, being able to see what it was doing helped me realize what problem I was running into.
It seems that this particular site serves its 404 page in place of any page that doesn’t exist, rather than redirecting to a separate 404 address. For example, if you went to brokenlink.html in the root of the site, the address would stay the same, but the page that loaded would be the contents of the 404. That wasn’t where the problem lay, as it’s fairly common. (I think the 404 on this site does the same thing.) The problem was that the links and images on the 404 page used relative paths. If you are in the root of the site, it works great. If you’re loading, say, badfolder/brokenlink.html and get the error page in place of that, none of the images load and the links are bad, because they were written as if the page were sitting in the site root, which it isn’t.
If you’re thinking ahead with me here, you’ve probably already realized what happens when a spider that is grabbing pages and following links hits this. It gets the error page, follows the link that points to another page that doesn’t exist in the relative path from the current folder, and gets the error page again. Rinse and repeat, to infinity and beyond.
No wonder the downloads just kept getting larger and larger until they crashed out of memory.
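To make the loop concrete, here is a minimal sketch of what the spider ends up doing. The host name and the relative link on the error page ("about/contact.html") are made up for illustration, and this only mimics how a crawler resolves relative links; it is not what HTTrack actually runs.

```python
from urllib.parse import urljoin

# Hypothetical relative link found on the site's 404 page.
ERROR_PAGE_LINK = "about/contact.html"


def follow_error_links(start_url, hops):
    """Resolve the error page's relative link against each URL in turn."""
    urls = [start_url]
    for _ in range(hops):
        # Each missing page returns the same 404 content, so the same
        # relative link gets resolved against an ever deeper folder.
        urls.append(urljoin(urls[-1], ERROR_PAGE_LINK))
    return urls


for url in follow_error_links("http://example.com/badfolder/brokenlink.html", 3):
    print(url)
```

Every address in that chain is different from the one before it, so a spider that only skips URLs it has already visited never notices it is going in circles.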
On the plus side, I think that between the various attempts we probably have everything we needed in the first place, so I don’t have to try to do this again. On the other hand, I’d like to figure out just how I can tell HTTrack, or any other tool, to stop itself from getting into this loop. Any ideas?
It’s been a long time since I’ve used it, but I think you can limit it to X hops so it will bomb out after 5 links (for example). Limiting it to something like 20 links may help?
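A rough sketch of the hop-limit idea, in case it helps. The tiny in-memory "site", the error page link and the crawl function are all invented for the example; I believe HTTrack exposes something similar as its maximum mirroring depth setting, but this is not its code.

```python
from collections import deque
from urllib.parse import urljoin

# Hypothetical site: pages that really exist, with the links they contain.
REAL_PAGES = {
    "http://example.com/index.html": ["about.html", "badfolder/brokenlink.html"],
    "http://example.com/about.html": ["index.html"],
}
# Hypothetical relative link served on the 404 page for anything else.
ERROR_PAGE_LINKS = ["about/contact.html"]


def get_links(url):
    """Return the page's links, or the 404 page's links if it doesn't exist."""
    return REAL_PAGES.get(url, ERROR_PAGE_LINKS)


def crawl(start_url, max_depth):
    """Breadth-first crawl that stops following links after max_depth hops."""
    seen = {start_url}
    queue = deque([(start_url, 0)])
    while queue:
        url, depth = queue.popleft()
        if depth >= max_depth:
            continue  # hop limit reached: record the page, follow nothing
        for href in get_links(url):
            target = urljoin(url, href)
            if target not in seen:
                seen.add(target)
                queue.append((target, depth + 1))
    return seen


print(len(crawl("http://example.com/index.html", 3)))
```

The limit doesn’t fix the bad links, it just caps how far down the chain of error pages the spider can go, so the download ends instead of growing until it runs out of memory.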