What Is the Wayback Machine And How Does It Work?

The Wayback Machine is a digital archive of Internet content, consisting of snapshots of web pages across time. The frequency of web page snapshots is variable, so not all web site updates are recorded. There are sometimes intervals of several weeks or even years between snapshots. Web page snapshots usually become available and searchable on the Internet more than 6 months after they are archived. Kivu uses information archived in the Wayback Machine in its computer forensics investigations.
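To see how uneven snapshot coverage can be, the gaps between consecutive snapshots can be measured. The sketch below assumes the 14-digit timestamps (YYYYMMDDhhmmss) that appear in Wayback Machine snapshot URLs; the function name and the sample timestamps are illustrative only.

```python
from datetime import datetime, timedelta
from typing import List

# Assumption: timestamps use the 14-digit YYYYMMDDhhmmss form seen in snapshot URLs.
TIMESTAMP_FORMAT = "%Y%m%d%H%M%S"


def snapshot_gaps(timestamps: List[str]) -> List[timedelta]:
    """Return the time elapsed between each pair of consecutive snapshots."""
    times = sorted(datetime.strptime(ts, TIMESTAMP_FORMAT) for ts in timestamps)
    return [later - earlier for earlier, later in zip(times, times[1:])]


if __name__ == "__main__":
    # Made-up sample timestamps, not real archive data.
    sample = ["20100101000000", "20100115000000", "20120115000000"]
    for gap in snapshot_gaps(sample):
        print(f"{gap.days} days between snapshots")
```

A long gap means that any changes made to the site during that period may not be reflected in the archive at all.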

The Wayback Machine was founded in 1996 by Brewster Kahle and Bruce Gilliat, who were also the founders of a company known as Alexa Internet, now an Amazon company. Alexa is a search engine and analytics company that serves as a primary aggregator of Internet content sources (domains) for the Wayback Machine. Individuals may also upload and publish a web page to the Wayback Machine for archiving.

Content accumulated within the Wayback Machine’s repository is collected using spidering or web-crawling software. The Wayback Machine’s spidering software identifies a domain, often derived from Alexa, and then follows a series of rules to catalog and retrieve content. The content is captured and stored as web pages.

The snapshots available for a specific domain can be viewed by using the Uniform Resource Locator (URL) formula in the table below. In the URL formula, the placeholder DOMAIN.COM is replaced with the domain name of interest, and the resulting URL is entered into the browser’s address field.
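As a sketch of that substitution, the helper below builds a snapshot listing URL for a domain. It assumes the commonly used pattern https://web.archive.org/web/*/DOMAIN.COM; the base address and the function name are assumptions made for illustration and are not taken from this article.

```python
# Assumption: the public Wayback Machine address and the "*" wildcard that
# requests the list of all snapshots for a domain.
WAYBACK_BASE = "https://web.archive.org/web"


def snapshot_index_url(domain: str) -> str:
    """Build the URL that lists the archived snapshots for a domain."""
    domain = domain.strip().lower()
    # Drop a scheme if a full URL was pasted in place of a bare domain.
    for prefix in ("http://", "https://"):
        if domain.startswith(prefix):
            domain = domain[len(prefix):]
    return f"{WAYBACK_BASE}/*/{domain.rstrip('/')}"


if __name__ == "__main__":
    print(snapshot_index_url("DOMAIN.COM"))
```

The resulting string can be pasted into a browser to reach the listing of snapshots for the domain.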


The Wayback Machine does not record everything on the Internet

A web site’s robots.txt file identifies rules for spidering its content. If a domain does not permit crawling, the Wayback Machine does not index the domain’s content. In place of content, the Wayback Machine records a “no crawl” message in its archive snapshot for the domain.
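To illustrate how robots.txt rules are evaluated, the sketch below uses Python's standard urllib.robotparser module to test whether a crawler may fetch a URL. The sample rules and the user-agent name ia_archiver (the name historically associated with the Alexa crawler) are assumptions for illustration; this is not the Wayback Machine's own code.

```python
from urllib.robotparser import RobotFileParser


def is_crawl_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Check a URL against robots.txt rules supplied as text."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Hypothetical robots.txt: blocks one crawler entirely, and blocks
# every other crawler only from the /private/ directory.
SAMPLE_ROBOTS = """\
User-agent: ia_archiver
Disallow: /

User-agent: *
Disallow: /private/
"""

if __name__ == "__main__":
    page = "http://example.com/index.html"
    print(is_crawl_allowed(SAMPLE_ROBOTS, "ia_archiver", page))
    print(is_crawl_allowed(SAMPLE_ROBOTS, "otherbot", page))
```

With rules like these, a crawler identifying itself as ia_archiver would be refused every page on the site, while other crawlers would be refused only the /private/ directory.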

The Wayback Machine does not capture content as a user would see it in a browser. Instead, it extracts content from where it is stored on a server, often as HTML files. For each web page, the Wayback Machine captures content that is stored directly in the page and, if possible, content that is stored in related external files (e.g., image files).

The Wayback Machine searches web pages in a domain by following hyperlinks to other content within the same domain. Hyperlinks to content outside of the domain are not indexed. The Wayback Machine may not capture all content within the same domain. In particular, dynamic web pages may be missing content, as spidering may not be able to retrieve all software code, images, or other files.
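The same-domain rule can be sketched in a few lines. The example below collects the hyperlinks on a page and keeps only those that resolve to the same host as the page itself. It is a simplified illustration using Python's standard library, not the Wayback Machine's actual crawler, and the names are hypothetical. Comparing exact host names is itself a simplification: it treats www.example.com and example.com as different domains.

```python
from html.parser import HTMLParser
from typing import List
from urllib.parse import urljoin, urlparse


class LinkCollector(HTMLParser):
    """Collect the href value of every <a> tag in a page."""

    def __init__(self) -> None:
        super().__init__()
        self.hrefs: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def same_domain_links(page_url: str, html: str) -> List[str]:
    """Return absolute links that stay on the same host as page_url."""
    collector = LinkCollector()
    collector.feed(html)
    base_host = urlparse(page_url).hostname
    links = []
    for href in collector.hrefs:
        absolute = urljoin(page_url, href)  # resolve relative links
        if urlparse(absolute).hostname == base_host:
            links.append(absolute)
    return links


if __name__ == "__main__":
    sample_html = (
        '<a href="/about.html">About</a>'
        '<a href="http://other.org/page">Elsewhere</a>'
        '<a href="news/1.html">News</a>'
    )
    print(same_domain_links("http://example.com/index.html", sample_html))
```

In this sample, the link to other.org is dropped, which mirrors how content outside the domain is left out of the crawl.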

The Wayback Machine works best at cataloging standard HTML pages. However, there are many cases where it does not catalog all content within a web page, and a web page may appear incomplete. Images that are restricted by a robots.txt file appear gray. Dynamic content such as Flash applications or content that relies on server-side code may not be collected.

The Wayback Machine may attempt to compensate for the missing content by linking to other sources originating from the same domain. One method of substituting missing content is linking to similar content in other Wayback Machine snapshots. A second method is linking to web pages on the “live” web, that is, web pages currently available at the source domain. There are also cases where the Wayback Machine displays an “X”, such as for missing images, or presents what appears to be a blank web page.

HTML or other source code is also archived

The Wayback Machine may capture the links associated with the page content but not acquire all of the content needed to fully re-create a web page. In the case of a blank archived web page, for example, HTML and other software code can be examined to determine the contents of the page. A review of the underlying HTML code might reveal that the page content is a movie or a Flash application. (Underlying software code can be examined using the “View Source” functionality within a browser.)
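A manual review of the source can be supplemented with a small script. The sketch below scans saved HTML for tags that commonly embed movies or Flash applications and reports the file each one points to. The tag list and function names are assumptions chosen for illustration; a real page may embed media in other ways, such as through scripts.

```python
from html.parser import HTMLParser
from typing import List, Optional, Tuple

# Assumption: tags that commonly carry embedded movies or applications.
MEDIA_TAGS = {"embed", "object", "video", "audio", "iframe"}


class MediaFinder(HTMLParser):
    """Record each media-related tag and the resource it references."""

    def __init__(self) -> None:
        super().__init__()
        self.found: List[Tuple[str, Optional[str]]] = []

    def handle_starttag(self, tag, attrs):
        if tag in MEDIA_TAGS:
            attr_map = dict(attrs)
            # <object> uses "data"; most other tags use "src".
            source = attr_map.get("src") or attr_map.get("data")
            self.found.append((tag, source))


def find_embedded_media(html: str) -> List[Tuple[str, Optional[str]]]:
    """Return (tag, source) pairs for embedded media found in the HTML."""
    finder = MediaFinder()
    finder.feed(html)
    return finder.found


if __name__ == "__main__":
    sample_html = (
        "<html><body>"
        '<object data="intro.swf" type="application/x-shockwave-flash"></object>'
        '<embed src="movie.mov">'
        "</body></html>"
    )
    for tag, source in find_embedded_media(sample_html):
        print(tag, source)
```

A page that renders blank in the archive but yields an entry such as an object tag pointing to a .swf file was most likely built around a Flash application that was not captured.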

Wayback Machine data is archived in the United States

The Wayback Machine archives are stored in a Santa Clara, California data center. For disaster recovery purposes, a copy of the Wayback Machine is mirrored to Bibliotheca Alexandrina in Alexandria, Egypt.

