Enable noindex, nofollow search engine behavior by default for PDF files uploaded to web servers

Context

To help users understand content delivered on the web, it usually makes sense to describe what they will be downloading before they download it. Many PDF files are indexed by search engines, allowing a user to find the content via search and load the document directly, but without any surrounding context. This can be problematic: a document may be out of date, describe a procedure no longer in use, or exist only for historical purposes.

Blocking direct access to PDFs from search engines does not block the content itself; in most cases it still lives on the landing page that links to the PDF, and that page would continue to appear in search results.

Implementing this change should reduce the number of requests to remove PDFs from Google and other search engines, and should create a more positive experience for site visitors.

This can be done at the server or site level by adding the following to either the Apache configuration (conf) file or a .htaccess file:

<Files ~ "\.pdf$">

Header set X-Robots-Tag "noindex, nofollow"

</Files>
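
Once the change is deployed, the directive can be confirmed by inspecting the response headers for a sample PDF, for example with curl (the URL below is only illustrative):

curl -I https://www.wwu.edu/files/example.pdf

The response should include a header line reading X-Robots-Tag: noindex, nofollow.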


Status

Consequences

Doing this would potentially delist many PDF files from search engines. Our current inventory of most sites within the wwu.edu domain lists 267 internal PDFs. The vast majority are within individual sites where we could set up exclusions if it is necessary for them to remain listed. There are 4,600+ PDF files listed as external; reviewing the list shows many are part of *.wwu.edu, but none appear to be stand-alone content that would not be better served by being accessed through a webpage.

We want a mechanism in place to allow certain PDFs to stay indexed during this transition.

Google's X-Robots-Tag documentation lists implementation details at the bottom of the page. It will be important to test whether a single-file setup would override a blanket rule that applies noindex to all .pdf files.
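
As a starting point for that testing, one possible exclusion mechanism is a more specific per-file section that removes the blanket header; whether it actually takes precedence depends on Apache's configuration merge order and must be confirmed. The file name below is only a placeholder:

# Placed in the .htaccess of the site that needs the exception
# (or later in the server configuration). File name is a placeholder.
<Files "example-keep-indexed.pdf">
    # Removing the header restores default indexing behavior for this file,
    # assuming this section is merged after the blanket FilesMatch rule.
    Header unset X-Robots-Tag
</Files>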