Advice for Dealing with Broken URLs

My NANO writing project has several appendices of stuff that didn’t fit into the main chapters. Here’s the section on recovering broken URLs, still a work in progress but hopefully useful. Other advice on recovering broken links is welcome in the comments:

Appendix E – What to do when URLs/websites break.

The resources in this book are mostly web-based. This was a deliberate choice on my part. Admittedly, it was easier for me to browse the web and make comments than to do a bunch of interlibrary loans and spend quality time at the National Archives. There are benefits for you also. The main benefit is the same as mine: desktop research in most cases. We are mostly an immediate-gratification sort of society, and I am not too proud to cater to it when quality information is available.

Web resources have their drawbacks, the most dramatic of which is link rot. Some URLs are more durable than others, but chances are good that at some point you are going to visit a site or ebook from this book and get a 404 (File Not Found) error. In most cases, especially with US federal government materials, this is not the end of the world. There are several techniques that you can use to find the missing resource. There are more options for finding documents than for finding entire websites, and this appendix will help you find both.

Finding documents published on the web with a broken URL:
1) Try trimming the URL to find a linking page. Sometimes when you get a 404 error on a document, the item has just been moved around the website. Finding the page that lists the title of your document will help you find your way back.

For this method, let’s use:

Manual of the Medical Department (MANMED), NAVMED P-117 at http://www.med.navy.mil/directives/Pages/NAVMEDP-MANMED.aspx

Let’s pretend the full URL leads you to a 404 error. What we do now is to start trimming the URL back to each forward slash. So our URL would become:

http://www.med.navy.mil/directives/Pages/ – Sometimes this would be enough, but as of November 2011, this led to a page that says “Access denied. You do not have permission to perform this action or access this resource.” Undeterred, we clip Pages/ from the URL, leaving us with:

http://www.med.navy.mil/directives/ – This URL rewards us with a page called “Navy Medicine Directives.” Looking at the left-hand column, we see MANMED listed. Click on that and we are back to our original document.
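If you are comfortable with a little scripting, the trimming walk-through above can be sketched in a few lines of Python. This is only an illustration, not a tool the book depends on; the function name trimmed_urls is my own invention, and the sketch simply lists the candidate addresses without visiting them.

```python
from urllib.parse import urlsplit, urlunsplit


def trimmed_urls(url):
    """List shorter versions of url, trimmed back to each forward slash.

    The result runs from the longest candidate to the site's top level.
    The function only builds addresses; it does not check that they work.
    """
    parts = urlsplit(url)
    # Break the path into its pieces, ignoring empty strings that come
    # from leading or trailing slashes.
    segments = [s for s in parts.path.split("/") if s]
    candidates = []
    # Drop one trailing piece at a time until only the host is left.
    for keep in range(len(segments) - 1, -1, -1):
        path = "/" + "/".join(segments[:keep])
        if not path.endswith("/"):
            path += "/"
        # Any query string or fragment is discarded along with the piece.
        candidates.append(urlunsplit((parts.scheme, parts.netloc, path, "", "")))
    return candidates


if __name__ == "__main__":
    broken = "http://www.med.navy.mil/directives/Pages/NAVMEDP-MANMED.aspx"
    for candidate in trimmed_urls(broken):
        print(candidate)
```

For the MANMED address, this lists the Pages/ URL, then the directives/ URL, then the top level of the site, which is the same order we tried them by hand.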

2) Try going to the top level of a website and use its search box (if available). Because of varied security practices, the URL trimming technique will sometimes result in failure even if the document is still available on that website. In these cases, you’ll want to go to the top level of the site and search for your document title in the search box that is almost always in the top right-hand corner of the page.

Keeping with our MANMED example, we go to http://www.med.navy.mil and, seeing a search box, type in MANMED. This brings up a list of results, of which the MANMED manual is the top hit. Click on it and you’re back to your original document.

This technique might not work for you for two reasons: either the document really has been removed from the website, or the document is still there but the site’s search engine has not been updated. In the second case you’ll see a promising search result, but when you click on it, you get the same 404 error you started out with. But we still have three more tricks up our sleeve for finding the elusive document.

3) Try using the Internet Archive’s Wayback Machine to view the document. The Internet Archive is a nonprofit that periodically takes snapshots of the web. Its “Wayback Machine,” available at www.archive.org, allows one to view websites as they existed in the past unless the site owners have disabled archiving. For technical reasons, some other pages and files don’t get collected either. Explaining why that is so is beyond the scope of this book.

Popping our MANMED URL of http://www.med.navy.mil/directives/Pages/NAVMEDP-MANMED.aspx into the Wayback Machine, we are presented with a calendar. Dates with blue circles indicate available crawled content. We select August 23, 2010 and are presented with a table of contents. Somewhat frustratingly, we don’t get the entire document. But at least the table of contents can help us decide if we want to pursue the document further.

An example of a document that has been successfully preserved on the Wayback Machine is the Alaska Department of Natural Resources document Time-Saving Tips for Prospective Gold Seekers, where an archived copy of the full document can be found at http://web.archive.org/web/20070226025703/http://www.dnr.state.ak.us/mlw/factsht/mine_fs/timesavi.pdf. The Wayback Machine is always worth a check.
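The archived address above follows a visible pattern: http://web.archive.org/web/, then a 14-digit date and time, then the original URL. Here is a minimal Python sketch that builds such an address. The function name wayback_url is hypothetical. Using * in place of the date to reach the calendar view is my understanding of how the site behaves rather than something documented in this appendix, so treat it as an assumption.

```python
def wayback_url(original_url, timestamp="*"):
    """Build a Wayback Machine address for original_url.

    timestamp is a string of digits such as "20070226025703"
    (year, month, day, hour, minute, second), matching the pattern in
    the archived Alaska example. The default "*" is an assumption: it
    is meant to request the calendar of all captures.
    """
    if timestamp != "*" and not timestamp.isdigit():
        raise ValueError("timestamp must be digits only, or '*'")
    return f"http://web.archive.org/web/{timestamp}/{original_url}"


if __name__ == "__main__":
    # Rebuild the archived address of the Alaska gold-seekers document.
    print(wayback_url(
        "http://www.dnr.state.ak.us/mlw/factsht/mine_fs/timesavi.pdf",
        "20070226025703",
    ))
    # Calendar-style lookup for the MANMED page (assumed behavior).
    print(wayback_url(
        "http://www.med.navy.mil/directives/Pages/NAVMEDP-MANMED.aspx"
    ))
```

Pasting the original URL into the Wayback Machine’s own search box does the same job; the sketch is mainly handy if you have a long list of dead links to check.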

4) Use Google or your favorite search engine to see if another site has a copy of the document you are looking for. Remember the Abu Ghraib Taguba Report on Torture? Even though it was technically classified, it was copied widely. The same is true for ordinary government documents, and sadly, for copyrighted materials.

Going back to our MANMED example, let’s go to Google and do the search navy MANMED. Because Google personalizes searches, your results might not match mine. On the first page of results I got, I saw three promising hits not from the Navy. One was a simple list of files and otherwise content free. Another site linked back to the Navy site, so not good for our example. But the third, a listing from a company called Brookside, had a copy of the MANMED manual on their own server at http://www.brooksidepress.org/Products/ManMed/Manmed.htm. They were nice and specified that they had pulled this from the Navy and that people should go to the Navy for the latest version. Not all sites will tell you where the official version was supposed to be. Using this method, you may also run the risk of picking up a document that was altered from the original version, though that hasn’t happened to me.

As an aside, while http://www.brooksidepress.org/Products/ManMed/Manmed.htm is not a government site, it is a useful URL to pick apart. As of November 2011, if you trim off ManMed/Manmed.htm to get http://www.brooksidepress.org/Products/, you’ll find yourself in a directory listing leading to all sorts of medical publications and product pages. Try a few and see what happens.

5) Go to WorldCat and see if there is a physical copy or library-digitized copy of your document. If you’ve gone through all of the steps above and still have found nothing, it probably means your document is no longer on the open web. You still have one final trick up your sleeve – WorldCat at worldcat.org. Think of WorldCat as a worldwide card catalog or global book location service.

Once you get into WorldCat, there will be five usual possibilities:

  1. A physical copy is available somewhere in the world and you can ask your local library to do an Interlibrary Loan to get it for you.
  2. A physical copy is available somewhere in the world, but the holding library will not lend it out. In these cases you can either visit the library, wherever it is, or ask to have a chapter or two and/or the table of contents copied for you. Some libraries will honor this and others will not.
  3. An electronic copy is available for free but is in a library repository that Google doesn’t index. In these cases, just click on the link provided and you’re home free.
  4. An electronic copy is available, but licensing restrictions keep you from seeing it. In some cases, interlibrary loan might help you. Often it won’t. Restrictions on copying copyrighted digital materials are far harsher than those on their print counterparts.
  5. There are no hits in WorldCat. If you strike out here after carrying out the steps above, you’re pretty much out of luck. If it’s a federal report, you MIGHT be able to get something through a Freedom of Information Act (FOIA) request, but it may be a long and expensive undertaking. For more information about FOIA and ways to request information through it, see the National Security Archive’s FOIA page at http://www.gwu.edu/~nsarchiv/nsa/foia.html.

Continuing with our example of MANMED, we ought to use the spelled-out title, Manual of the Medical Department, just in case there are foreign language hits from MANMED. Adding Navy would be optional and done only if you are drowning in non-Navy hits. As it turns out, a search on “Manual of the Medical Department” in WorldCat brings up 24 hits and the first page seems to be all Navy. Some are older editions, which might have their own usefulness from a writer’s point of view.

Examining the first record at http://www.worldcat.org/oclc/2607685, we find that the title is held by 41 libraries. If a library name is underlined, you can see where at that library the item is. Chances are you won’t need to. Just print off the page with the catalog record and take it to your library to request an interlibrary loan.

There might be other options though. Let’s click on the “back” button of your browser and go back to the search results. See the left-hand column? It’s a set of facets that let you narrow in on your desired item. There’s format, author, year, language and topic. Notice that there is an e-book option under format. Click on the check box to see the four e-book results. Ironically, in this particular case, none of them is the manual you are looking for. But there are some links to older editions from the 1910s and 1880s that could be useful in historical stories.

If there’s an e-book option, it’s worth checking out because in some cases it will be a case of immediate gratification.

Finding websites that have just plain vanished:

In my experience, there are two ways of finding vanished websites: going to the Wayback Machine using method 3 above, and using your favorite search engine. If you’ve got a printout or clearly remember the title of the website, put that title in quotes and search for it. If the website domain has changed (e.g., instead of http://www.med.navy.mil, the site is now http://medicine.navy.mil or http://medicine.dod.mil/navy), then your title search, if unique enough, will bring you to the new website address.
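For anyone checking many titles, the quoted-title search can also be turned into a link with a short Python sketch. The function name quoted_search_url is hypothetical, and the Google address format used here (search?q=) is an assumption on my part, not something this appendix documents. Swap in your favorite search engine’s address as needed.

```python
from urllib.parse import urlencode


def quoted_search_url(title, engine="https://www.google.com/search"):
    """Build a search link for an exact-phrase search on a site title.

    The default engine address is an assumption; any engine that takes
    its query in a "q" parameter can be substituted.
    """
    # Wrapping the title in double quotes asks for the exact phrase.
    query = f'"{title}"'
    # urlencode escapes the quotes and spaces so the link is valid.
    return engine + "?" + urlencode({"q": query})


if __name__ == "__main__":
    print(quoted_search_url("Navy Medicine Directives"))
```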

If neither the Wayback Machine nor the search engine method above brings back your website, then it is probably just plain gone. If you’ve discovered another method to recover broken or removed websites, I’d love to hear about it.

