Web scraping is a general term for any of various means of extracting content from a website over HTTP in order to transform that content into another format suitable for use in another context.
Scraper sites
A typical application of web scraping is a web crawler that copies content from one or more existing websites in order to generate a scraper site. The result can range from fair use excerpts or reproductions of text and other content to plagiarized content. In some instances, plagiarized content is used as an illicit means of increasing traffic and advertising revenue. The typical scraper website generates revenue using Google AdSense, hence the term ‘Made For AdSense’ or MFA website.
Web scraping differs from screen scraping in that a website is not really a visual screen but live HTML/JavaScript-based content with a graphical interface in front of it. Web scraping therefore does not work at the visual interface, as screen scraping does, but on the underlying object structure (the Document Object Model) of the HTML and JavaScript.
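As a minimal sketch of working on the document structure rather than on rendered pixels, the following Python example parses a small page into a DOM tree and reads link targets and link text from its element nodes. The page content and the function name extract_links are hypothetical, and the sketch assumes well-formed (XHTML-style) markup, because the standard-library xml.dom.minidom parser rejects the loosely formed HTML found on many real sites.

```python
from xml.dom import minidom

# Hypothetical, well-formed sample page used only for illustration.
PAGE = """<html><body>
<h1>Listings</h1>
<ul id="items">
<li class="item"><a href="/a">First</a></li>
<li class="item"><a href="/b">Second</a></li>
</ul>
</body></html>"""


def extract_links(markup):
    """Return (href, text) pairs for every <a> element in the markup."""
    doc = minidom.parseString(markup)  # build a DOM tree from the markup
    results = []
    for anchor in doc.getElementsByTagName("a"):
        # Collect only the text nodes that are direct children of the link.
        text = "".join(
            node.data
            for node in anchor.childNodes
            if node.nodeType == node.TEXT_NODE
        )
        results.append((anchor.getAttribute("href"), text.strip()))
    return results


if __name__ == "__main__":
    for href, text in extract_links(PAGE):
        print(href, text)
```

The scraper never needs to know how the list is drawn on screen; it only walks the element tree.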
Web scraping also differs from screen scraping in that screen scraping typically occurs many times from the same dynamic screen “page”, whereas web scraping occurs only once per web page, over many different static web pages. Recursive web scraping, which follows links to other pages across many websites, is called “web harvesting”. Web harvesting is necessarily performed by software called a bot, “webbot”, “crawler”, “harvester” or “spider”, with similar arachnological analogies used to refer to other creepy-crawly aspects of its functions. Web harvesters are typically demonised, while “webbots” are often typecast as benevolent.
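The recursive link-following described above can be sketched as a breadth-first traversal. This is an illustrative sketch, not a production crawler: the names LinkParser, harvest and FAKE_WEB are hypothetical, and the fetch argument is assumed to be any callable that returns a page's HTML or None. A dictionary stands in for the network here, so the example runs without making any HTTP requests.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkParser(HTMLParser):
    """Collect the href value of every <a> tag fed to the parser."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def harvest(start_url, fetch, max_pages=100):
    """Visit pages breadth-first, following links; each page is visited once."""
    seen = {start_url}
    queue = deque([start_url])
    visited = []
    while queue and len(visited) < max_pages:
        url = queue.popleft()
        html = fetch(url)
        if html is None:  # page could not be retrieved; skip it
            continue
        visited.append(url)
        parser = LinkParser()
        parser.feed(html)
        for href in parser.links:
            target = urljoin(url, href)  # resolve relative links
            if target not in seen:
                seen.add(target)
                queue.append(target)
    return visited


# Hypothetical in-memory "web" standing in for real HTTP fetches.
FAKE_WEB = {
    "http://example.com/": '<a href="/a">A</a> <a href="/b">B</a>',
    "http://example.com/a": '<a href="/b">B</a> <a href="/">home</a>',
    "http://example.com/b": '<a href="/missing">gone</a>',
}

if __name__ == "__main__":
    print(harvest("http://example.com/", FAKE_WEB.get))
```

The seen set is what keeps the traversal from looping when pages link back to each other, and max_pages bounds the work done on a large site.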
There are legal web scraping sites that provide free content and are commonly used by webmasters looking to populate a hastily made site with web content, often in order to profit by some means from the traffic the articles will hopefully bring. This content does not help the ranking of the site in search engine results, because the content is not original to that page.[1] Original content is a priority of search engines. Use of free articles usually requires a link back to the free article site, as well as to any links provided by the author. Wikipedia.org (particularly the English Wikipedia) is a common target for web scraping.
Legal issues
Although scraping is against the terms of use of some websites, the enforceability of these terms is unclear. Outright duplication of original content may infringe copyright, but the U.S. Supreme Court ruled in Feist Publications v. Rural Telephone Service that duplication of facts is allowable. Also, in a February 2006 ruling, the Danish Maritime and Commercial Court (Copenhagen) found that systematic crawling, indexing and deep linking by the portal site ofir.dk of the real estate site Home.dk did not conflict with Danish law or with the database directive of the European Union.
U.S. courts have acknowledged that users of “scrapers” or “robots” may be held liable for committing trespass to chattels, which involves a computer system itself being considered personal property upon which the user of a scraper is trespassing. However, to succeed on a claim of trespass to chattels, the plaintiff must demonstrate that the defendant intentionally and without authorization interfered with the plaintiff’s possessory interest in the computer system and that the defendant’s unauthorized use caused damage to the plaintiff. Not all cases of web spidering brought before the courts have been considered trespass to chattels.
In Australia, the Spam Act 2003 outlaws some forms of web harvesting.
Technical measures to stop bots
A webmaster can use various measures to stop or slow a bot. Some techniques include:
- Blocking an IP address. This will also block all browsing from that address.
- Adding entries to robots.txt. Well-behaved applications, such as Google's crawler, adhere to these entries; badly behaved ones can simply ignore them.
- Blocking bots by the identity they declare. Well-behaved bots state who they are (for example ‘googlebot’) and can be blocked on that basis. Unfortunately, malicious bots may declare that they are a normal browser.
- Monitoring for excess traffic, and blocking clients that generate it.
- Using tools that verify that a real person is accessing the site, such as a CAPTCHA.
- Using carefully crafted JavaScript, which sometimes blocks bots.
- Using a honeypot or other method to identify the IP addresses of automated crawlers.
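Three of the measures listed above (IP blocking, blocking by declared identity, and excess traffic monitoring) can be combined in a single request check. The following is a rough sketch only: the class name BotFilter, its parameters, the threshold values and the sample addresses are all assumptions chosen for illustration, and a real site would more likely rely on its web server or firewall for this.

```python
from collections import defaultdict, deque


class BotFilter:
    """Decide whether to serve a request, using three simple measures."""

    def __init__(self, blocked_ips=(), blocked_agents=(),
                 max_requests=5, window_seconds=10.0):
        self.blocked_ips = set(blocked_ips)
        self.blocked_agents = tuple(a.lower() for a in blocked_agents)
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        self._recent = defaultdict(deque)  # ip -> timestamps of recent requests

    def allow(self, ip, user_agent, now):
        """Return True if the request at time `now` (seconds) should be served."""
        # Measure 1: explicit IP block (blocks all browsing from that address).
        if ip in self.blocked_ips:
            return False
        # Measure 2: declared identity; a malicious bot can evade this by lying.
        declared = user_agent.lower()
        if any(name in declared for name in self.blocked_agents):
            return False
        # Measure 3: excess traffic within a sliding time window.
        hits = self._recent[ip]
        while hits and now - hits[0] >= self.window_seconds:
            hits.popleft()  # forget requests older than the window
        hits.append(now)
        return len(hits) <= self.max_requests


if __name__ == "__main__":
    bot_filter = BotFilter(blocked_agents=["examplebot"], max_requests=3)
    for second in range(5):
        print(second, bot_filter.allow("192.0.2.1", "Mozilla/5.0", second))
```

The current time is passed in explicitly so that the behaviour is easy to test; a live deployment would typically supply time.monotonic() instead.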