Thursday, August 29, 2013

USE OF ONLINE DATA IN THE BIG DATA ERA: LEGAL ISSUES RAISED BY THE USE OF WEB CRAWLING AND SCRAPING TOOLS FOR ANALYTICS PURPOSES

Story originally appeared on Bloomberg News.

In 2010, Pete Warden, a software engineer living in Colorado, developed a software program to “crawl” publicly accessible Facebook pages and “scrape” (i.e., collect) information relating to Facebook’s members. Within hours of deploying his software, the application had visited approximately 500 million pages and collected information related to approximately 220 million Facebook users – including users’ names, location information, friends and interests. Using this dataset, which Mr. Warden offered to release in anonymized form for research purposes, he created a graphical analysis of the regional and relationship patterns among Facebook’s members. The cost of this exercise: about $100. The results: more than 500,000 visits to Mr. Warden’s website, national media coverage, and cease-and-desist warnings from Facebook, which perceived Mr. Warden’s collection of data from its webpages as a violation of its terms of use prohibiting automated access to the website without the company’s permission. Ultimately, in order to avoid a potential legal dispute, Mr. Warden abandoned his plan to release the information he collected, and agreed to delete all copies of the dataset.1 Summing up his experience, he later quipped, “Big data? Cheap. Lawyers? Not so much.”2

AUTOMATED WEB CONTENT GATHERERS

The use of web crawlers, scrapers and other automated tools for gathering online content has long been a feature of the Internet (to the extent “long” can be used to describe the history of the Internet). For example, search engines use web crawling “bots” or “spiders” to continuously visit billions of webpages to create relevant and accurate search results, and the Internet Archive – a non-profit digital library that archives historical versions of publicly accessible webpages – has since 1996 used web crawling tools to create a historical record of the Internet comprising 10 quadrillion bytes of data. Others have used similar tools to offer services that compete with or complement the offerings of the scraped websites – including uses of these tools to aggregate news content, and to monitor and facilitate purchases of airline and concert tickets (with or without the permission or involvement of the scraped website). As Mr. Warden’s experience suggests, the use of these tools pits the interests of website owners in protecting, controlling and profiting from the content they provide against the interests of others who seek to gather and use that content for other purposes (be they harmful, helpful or irrelevant to the website owner). Not surprisingly, the use of these tools has spurred litigation under a variety of theories, including copyright infringement, breach of contract (e.g., website terms of use), “hot news” misappropriation, trespass to chattels, and criminal statutes prohibiting unauthorized access to a computer system or website.

With the advent of Big Data – the increasingly widespread practice of using advanced data analytics to identify trends and patterns in extremely large datasets collected from a variety of sources – the potential applications for scraped data, and the benefits associated with analysis of that data, have increased exponentially. Whereas past cases involving unauthorized web crawling and scraping often involved simple copying and republication of website content in direct competition with the scraped website, the growing use of advanced data analytics is giving rise to instances where the connection between the data analytics service and the scraped website is attenuated and not directly competitive. Nevertheless, the online content that may be scraped from websites is among such businesses’ most valuable data, and website owners understandably go to great lengths to protect it.

Given both the tremendous value of and Big Data-driven demand for Internet-based information, and the relative ease with which such information can be compiled using automated data collection tools such as that deployed by Mr. Warden, it is likely that future cases relating to web crawling and scraping will focus on the legal issues raised by automated data gathering for analytics purposes – and on what theories a website owner may invoke to protect any factual data so collected and what theories a data collector may use to justify such collection. Few courts, however, have directly addressed the legal issues raised by Big Data or the collection of data for related purposes, leaving uncertain the legal environment faced by website owners wishing to protect the data on their websites, and by those who would gather such data for analytics purposes. Without taking sides – and while recognizing that the legal landscape relating to the Internet is constantly evolving, with previously challenged technologies such as search engines now recognized as nearly per se legitimate while others such as peer-to-peer networks have continually been subject to scrutiny – this article seeks to outline the legal issues such parties may face. In doing so, this article will consider the legal theories that have been applied in prior cases relating to the use of web crawling and scraping tools in other contexts, and will identify issues relating to whether claims under these theories are likely to succeed in connection with disputes relating to automated data collection for Big Data and analytics purposes.

LEGAL THEORIES RELATED TO AUTOMATED ONLINE DATA COLLECTION

A. COPYRIGHT INFRINGEMENT.

The Copyright Act protects original expressions that are fixed in a tangible medium, including mediums such as computer memory or a web server.3 These protections extend not only to original expressions such as images contained on a website, but also the underlying code that enables the display of any content on the website – including facts displayed on a website that are not otherwise entitled to copyright protection. Accordingly, because web crawling and scraping tools generally index information on a targeted webpage regardless of whether the tool seeks to obtain copyrighted content or unprotected facts4, courts have recognized claims for copyright infringement in connection with the use of web crawling and scraping tools.5

Because some courts have recognized that such activities may infringe a website owner’s copyrights, the focus in such cases is generally on whether the web crawling or scraping at issue is a fair use of the copyrighted content. For example, in Kelly v. Arriba Soft Corp., the defendant search engine conceded that its display of low-resolution “thumbnail” copies of high-resolution photographs constituted reproduction of those photographs, but argued that such display was a transformative, fair use of the copied photographs. The Ninth Circuit agreed – holding that the search engine’s display of low-resolution photographs to facilitate the general public’s access to information on the Internet was highly transformative of, and did not provide a substitute for, the plaintiff’s high-resolution photographs whose purpose was primarily artistic.6 Notably, the fact that such use was for a commercial purpose did not bar the court’s finding that the search engine made a fair use of plaintiff’s copyrighted photographs.7 In contrast, in Associated Press v. Meltwater Holdings U.S., Inc., the court found that an online news aggregator that provided its subscribers with nearly 500-character excerpts of copyrighted articles scraped from the websites of the Associated Press’s licensees did not engage in a fair use of those articles. The court distinguished the news aggregator’s services from those at issue in Kelly on the grounds that the news aggregator did not facilitate the general public’s access to information on the Internet, but instead only provided word-for-word excerpts of the copied articles to the aggregator’s paying customers without transforming that content in any way.8 The court further held that the aggregator’s use of that content to generate analytics relating to the online news sources it covered, while potentially transformative in and of itself, did not render the aggregator’s excerpting transformative insofar as the analytics and excerpting were separate and distinct services.9

While even incidental reproduction of copyrighted webpage material may give rise to copyright liability, courts have also recognized that such reproduction may constitute a fair use of the protected content. For example, in Ticketmaster Corp. v. Tickets.com, Inc., the defendant argued that the momentary copying of Ticketmaster’s webpages by its spiders for the purpose of extracting factual information concerning concert times, ticket prices, and venues that defendant then posted to its website constituted a fair use. The court agreed. In so finding, the court emphasized that the copying was momentary, the effect on the market value of the copyrighted material was “nil”, and that the “amount and substantiality” of the material used was negligible insofar as defendant did not reproduce the copyrighted material on its webpage. Further, the court observed that the central purpose of the Copyright Act – i.e., “to secure a fair return for an author’s creative labor and to stimulate artistic creativity for the general good” – would not be served by restricting defendant from momentarily copying Ticketmaster’s webpages for the purpose of obtaining non-protected, factual information.10

In addition to the fair use defense, courts have also considered whether a plaintiff’s copyright claims are subject to implied license or estoppel defenses based on its failure to deploy the “robots.txt” protocol to deter unwanted web crawling or scraping. The robots.txt protocol is an industry-standard convention that a website may deploy to instruct cooperating web crawlers generally, or certain web crawlers specifically, to voluntarily refrain from accessing all or part of the website.11 In Parker v. Yahoo, Inc., the court held that the plaintiff’s failure to deploy the protocol granted Yahoo an implied license to create cache copies of his website where plaintiff was aware that Yahoo – which has a policy of not creating cache copies of websites that deploy the protocol – would do so in the absence of the protocol.12 Conversely, in Meltwater, the court rejected the defendants’ implied license and estoppel defenses based on the Associated Press’s purported failure to require its licensees to deploy the protocol. The court distinguished Parker on several grounds, including that the defendants reserved the right to ignore the protocol if deployed. The court further emphasized that the defendants’ arguments, if accepted, would shift the burden of preventing infringement to the copyright owner, and would threaten the “openness of the Internet” by forcing copyright owners to choose between deploying the protocol and deterring all web crawlers (including search engines which may help users locate the website), and refraining from doing so and losing the right to prevent unauthorized use of their protected content.13
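To make the mechanics concrete, the following is a minimal sketch of how a cooperating crawler might consult a robots.txt file before fetching a page, using Python’s standard urllib.robotparser module. The crawler names (ExampleAnalyticsBot, SomeOtherBot), the example.com URLs and the rules themselves are hypothetical and are not drawn from any case discussed here. The sketch shows only that compliance is voluntary and happens on the crawler’s side: nothing in the protocol technically prevents a crawler from skipping the check.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: one named crawler is asked to stay out entirely,
# and every other crawler is asked to stay out of /members/ only.
ROBOTS_TXT = """\
User-agent: ExampleAnalyticsBot
Disallow: /

User-agent: *
Disallow: /members/
"""


def build_parser(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text supplied as a string (no network access)."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


parser = build_parser(ROBOTS_TXT)

# A cooperating crawler asks before each request whether it may fetch the URL.
blocked_bot = parser.can_fetch(
    "ExampleAnalyticsBot", "https://www.example.com/index.html"
)
generic_public = parser.can_fetch(
    "SomeOtherBot", "https://www.example.com/index.html"
)
generic_members = parser.can_fetch(
    "SomeOtherBot", "https://www.example.com/members/profile"
)

print(blocked_bot, generic_public, generic_members)
```

Whether a given crawler performs a check of this kind as a matter of practice or policy is the sort of fact the courts in Parker and Meltwater treated as relevant to the implied license and estoppel defenses.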

With respect to future cases involving use of scraped content for analytics purposes, courts are likely to follow a similar analysis driven by the facts of the specific case. Issues regarding whether the copying is momentary, whether the information extracted is factual, the effect on the market value of the copyrighted material, and the amount and substantiality of the material used are likely to be key issues in these cases. Courts are further likely to focus on whether the object of the Copyright Act – “to secure a fair return for an author’s creative labor and to stimulate artistic creativity for the general good” – would be served by prohibiting the challenged conduct. Courts are also likely to consider, in the context of defenses to copyright claims, the specific circumstances relating to a website’s deployment of the robots.txt protocol, including whether the defendant has a practice or policy of complying with the protocol if deployed.

B. BREACH OF CONTRACT.

Most commercial websites contain terms of use that provide that access and/or use of the website is premised on the user’s agreement to such terms.14 A claim sometimes made in cases regarding web crawling or scraping is that the defendant violated the terms of use by crawling and scraping content. While these cases have explored somewhat novel uses of technology, they often turn on fundamental issues of contract15 – including whether the targeted website’s terms of use are enforceable as against the defendant, whether the conduct complained of violates those terms, and whether any such violation causes any compensable damages. These cases suggest that use of such tools to gather data may give rise to a claim for breach of contract, while also demonstrating the potential hurdles to prevailing on such claims. These issues are discussed in turn.

1. ENFORCEABILITY OF WEBSITE TERMS OF USE.

As is the general rule with any contract, a website’s terms of use will generally be deemed enforceable if mutually agreed to by the parties. In determining whether such mutual agreement exists, courts look to whether the terms of use constitute a “clickwrap” agreement – which typically requires that a visitor indicate her agreement by clicking an “I accept” icon before accessing the website – or a “browsewrap” agreement – pursuant to which the user is provided with notice of the website’s terms of use, and informed that use of the website constitutes agreement to those terms.16 Clickwrap agreements, because they require a user to formally indicate his knowledge and awareness of the terms of use, are generally found enforceable.17 Browsewrap agreements have also generally been found enforceable where the defendant has actual knowledge of the terms of use or constructive knowledge of such terms.18 Actual knowledge is sometimes demonstrated by evidence that a defendant was advised of its violations of the terms of use via a cease-and-desist letter from plaintiff.19 Constructive knowledge is sometimes found where a website’s terms of use are prominently or conspicuously displayed on the website, such as where a hyperlink to those terms is underlined and set forth in distinctively colored text.20

Regardless of whether a website’s terms of use are clickwrap or browsewrap, the defendant’s failure to read those terms is generally found irrelevant to the enforceability of its terms.21 One court disregarded arguments that awareness of a website’s terms of use could not be imputed to a party who accessed that website using a web crawling or scraping tool that is unable to detect, let alone agree to, such terms.22 Similarly, one court imputed knowledge of a website’s terms of use to a defendant who had repeatedly accessed that website using such tools.23 Nevertheless, these cases are, again, intensely fact-driven, and courts have also declined to enforce terms of use where a plaintiff has failed to sufficiently establish that the defendant knew or should have known of those terms (e.g., because the terms are inconspicuous), even where the defendant repeatedly accessed a website using web crawling and scraping tools.24

Enforceability is likely to remain a contested issue in this area, with content providers citing clickwrap agreements and actual knowledge of terms, and those using crawling and scraping tools arguing a lack of mutual assent to such terms.

2. TERMS OF USE THAT MAY PROHIBIT AUTOMATED DATA COLLECTION.

The terms of use for websites frequently include clauses prohibiting access or use of the website by web crawlers, scrapers or other robots, including for purposes of data collection. Courts have recognized causes of action for breaches of contract based on the use of web crawling or scraping tools in violation of such provisions.25

Also common are terms of use that limit visitors to personal and/or non-commercial use of a website. For example, in Southwest Airlines Co. v. BoardFirst, LLC, the plaintiff airline alleged that the defendant violated its terms of use restricting access to Southwest’s website for “personal, non-commercial purposes” by offering a commercial service that helped Southwest’s customers take advantage of the company’s “open” seating policy and check-in process to obtain priority seating in the front of the plane. The court granted Southwest’s motion for summary judgment on its breach of contract claim, finding that the defendant’s conduct directly contravened Southwest’s prohibition on commercial uses of Southwest’s website.26

Cases addressing the purported violations of these terms tend to hinge on the precise language of the contractual provisions at issue, and the scope of the agreement between the parties that can be ascertained from that language. Thus, for example, in Southwest, the court rejected defendant’s argument that Southwest’s terms of use were too ambiguous to be enforced against defendant where those terms specifically prohibited use of the website “for the purpose of checking [c]ustomers in online or attempting to obtain for them a boarding pass in any certain boarding group.” Defendant’s services, which helped Southwest’s customers obtain priority seating, fell “within the heart of this proscription.”27 In contrast, in TrueBeginnings, LLC v. Spark Network Servs., Inc., the court found that the defendant did not violate the terms of service of plaintiff’s dating website – which limited use of the “website and related services” to a visitor’s “sole, personal use” – by visiting the website to obtain evidence for use in a patent infringement action against plaintiff. In so holding, the court analyzed the entirety of plaintiff’s terms of use, including those prohibiting use of web crawlers or spiders to gather data from the website, to determine that they related to use of the website’s dating services. Defendant’s use of the website to gather evidence for use in a patent lawsuit did not involve unauthorized uses of the dating services, and thus did not breach plaintiff’s terms of use.28

Terms of use designed to prevent reproduction of website content also raise issues regarding whether claims to enforce such terms are preempted by the Copyright Act. Courts have generally declined to find such claims preempted, reasoning that terms of use restricting the manner by which a website can be accessed or used go beyond the protections provided under the Copyright Act. For example, in Internet Archive v. Shell, the Internet Archive sought dismissal on preemption grounds of the plaintiff’s claim for breach of contract relating to Internet Archive’s crawling and indexing of plaintiff’s website in violation of terms of use that prohibited any copying of plaintiff’s website for a “commercial or financial purpose.” The court rejected Internet Archive’s preemption argument, finding that Internet Archive’s alleged agreement to refrain from use of the material on plaintiff’s website “for commercial or financial purposes … lie[s] well beyond the protections [the website owner] receives through the Copyright Act”29 (which, as discussed, allows for limited use of copyrighted content, even for a commercial purpose, if sufficiently transformative or unlikely to provide a substitute for the copyrighted work). The court reached this conclusion despite the fact that the Internet Archive is a non-profit entity – apparently on the basis of disputed allegations that Internet Archive’s copying of the content at issue allowed it to “acquir[e] … grant awards, donations, … and the expectation of acquiring additional intellectual property.”30

These cases suggest that future contractual disputes relating to web crawling or scraping for analytics purposes based on terms of use violations will likely focus on the proscriptions on automated data collection that are set forth in those terms of use.

3. DAMAGES RELATING TO UNAUTHORIZED DATA COLLECTION.

The cases discussed above establish that website terms of use may be enforced against any party who accesses or uses a website in violation of those terms, and that, if sufficiently clear and unambiguous, those terms may prohibit any automated data collection from the website. However, a breach of contract claim also requires a showing of damages. To date, few of the cases involving breaches of contract relating to website terms of use have been decided on the merits. As a result, the issue of damages in such cases has received scant attention in reported case law. Those cases that have addressed the damages issue acknowledge the challenges and showing required to establish damages relating to violations of website terms of use.

For example, in Southwest Airlines, the court granted summary judgment to Southwest on its breach of contract claim based on its finding that Southwest sufficiently demonstrated that defendant’s services allowed Southwest customers to avoid the online check-in process, thereby decreasing web traffic to Southwest’s website. By decreasing that traffic, the defendant deprived Southwest of valuable selling and advertising opportunities, and also interfered with Southwest’s brand-building opportunities. Nonetheless, while Southwest established that it suffered some form of harm from the defendant’s breach of the terms of use, the court declined to award any damages – finding that calculation of damages was “impossible.” Though it declined to award any damages, the court granted a permanent injunction in connection with Southwest’s breach of contract claim.31

Indeed, because damages relating to violations of website terms of use may in some circumstances be difficult if not impossible to quantify, some courts have looked to liquidated damages provisions as an estimate of such damages. In MySpace, Inc. v. TheGlobe.com, Inc., MySpace alleged that the defendant used an automated script to send spam e-mails from various MySpace accounts established by defendant in violation of MySpace’s terms of service providing that “MySpace is for … personal use … only and may not be used in connection with any commercial endeavors,” and which prohibited “any automated use of the system” or “transmission of … spam[].” MySpace’s terms also provided that users agreed to pay $50 for each item of spam sent in violation of MySpace’s terms as “a[n] … estimation of such harm.” The court granted MySpace’s motion for summary judgment on its breach of contract claim, and found that – because MySpace’s actual damages from defendant’s conduct were impracticable or extremely difficult to determine – liquidated damages of $50 per spam message were a reasonable measure of damages.32

The issue of damages is, of course, an intensely factual determination, but it should be noted that this issue is likely to play a key role in these cases in the future – with content owners trying to either quantify actual damages or establish the applicability of liquidated damages provisions, and those who use crawling and scraping tools arguing the impossibility of establishing such amounts. Based on the difficulty in establishing damages, content owners may also seek injunctive relief in such cases.

C. COMPUTER FRAUD AND ABUSE ACT.

Courts have also considered whether web crawling or scraping in breach of a website’s terms of service constitutes a violation of the Computer Fraud and Abuse Act (“CFAA”), which prohibits access to a computer, website, server or database either “without authorization” or in a way that “exceeds authorized access” of the computer.33 While these terms have been variously defined, in essence, a person who accesses a computer “without authorization” does so without any permission at all, while a person “exceeds authorized access” where she “has permission to access the computer, but accesses information on the computer that the person is not entitled to access.”34 So long as a computer is publicly accessible, and not protected by password or other security measures, courts have declined to find any access of the website to be “without authorization.”35 Conversely, a CFAA claim may lie where a computer or website is protected from unauthorized access, either by technical measures or even explicit warnings in a cease-and-desist letter.36

Courts are split, however, as to whether access of a website in a manner prohibited by its terms of use “exceeds authorized access” of the website in violation of the CFAA. For example, in an early case on this topic, a federal court in Virginia granted summary judgment on AOL’s CFAA claim based on the defendant’s admission that it harvested email addresses from AOL’s website in violation of its terms of use.37 Several years later, in 2003, the Court of Appeals for the First Circuit seemingly agreed with this theory by stating in dicta that “[a] lack of authorization could be established by an explicit statement on a website restricting access.”38

These decisions, however, have been greeted with skepticism by later courts and commentators.39 For example, in 2012, the Ninth Circuit held in an en banc decision captioned U.S. v. Nosal that “the phrase ‘exceeds authorized access’ in the CFAA does not extend to violations of use restrictions,” but rather concerns “hacking – the circumvention of technological access barriers.”40 In reaching this decision, the Ninth Circuit emphasized the legislative history of the CFAA, noting that it was enacted in 1984 “primarily to address the growing problem of computer hacking.”41 The court further discussed the absurd results that would follow from criminalizing violations of website terms of use – e.g., on dating websites that purport to require honest self-descriptions, describing “yourself as ‘tall, dark and handsome,’ when you’re actually short and homely, will earn you a handsome orange jumpsuit” – and observed that doing so would allow for ever-shifting grounds for criminal liability, as website terms of use are subject to change at any time, in any way, at the website owner’s complete discretion. Thus, “behavior that wasn’t criminal yesterday can become criminal today without an act of Congress, and without any notice whatsoever.”42

While the current trend appears to be to reject broad theories that allow terms of use violations to be used as a basis to establish criminal liability under the CFAA (or analogous state statutes), this is still an unresolved area in most circuits – and one that will likely be argued further in crawling and scraping cases.

D. HOT NEWS MISAPPROPRIATION.

In addition to asserting copyright claims based on incidental reproduction of copyrighted webpage material, numerous plaintiffs have asserted claims for hot news misappropriation relating to scraping of purely factual information. “Hot news” misappropriation – once a claim that existed under the federal common law, but which now exists only under the laws of five states43 – provides a cause of action where a party reproduces factual, time-sensitive information that was gathered at the effort and expense of another party, and thereby deprives the gathering party of the commercial value of that information. Thus, for example, in Int’l News Serv. v. Associated Press, the Supreme Court in 1918 recognized a claim under federal common law for hot news misappropriation in connection with a wire service’s re-publication of breaking news gathered by the Associated Press, which thereby deprived the Associated Press of the news value of its reporting.44 The court justified its decision as protecting the “quasi-property” rights of profit seeking entrepreneurs who gathered time-sensitive information from those who would free-ride on the efforts of those entrepreneurs.45

Since hot news misappropriation generally concerns factual information rather than content that is subject to copyright protection, it is generally found not to be preempted by the Copyright Act.46 However, courts have recognized hot news misappropriation as an extremely narrow claim that survives preemption only in very narrow circumstances that mirror the circumstances in Int’l News Serv. For example, in Barclays Capital Inc. v. Theflyonthewall.com, Inc., financial services firms alleged claims for copyright infringement and hot news misappropriation against a news aggregation website that reported on investment recommendations issued by the firms to their clients who paid to receive those recommendations before they became generally known to the investment community. On appeal from a denial of the defendant’s motion to dismiss the hot news claim, the court found that plaintiff’s claim was preempted by the Copyright Act. In so finding, the court emphasized that the plaintiffs’ claim lacked an “indispensable element of an INS ‘hot news’ claim,” i.e., “free-riding by a defendant on a plaintiff’s product, enabling the defendant to produce a directly competitive product for less money because it has lower costs.”47 Rather, though the defendant’s conduct potentially threatened plaintiffs’ businesses, the defendant was actually breaking news generated by the plaintiffs’ recommendations (and attributing the recommendations to plaintiffs), rather than merely repackaging news that had been reported by plaintiffs.48

The Barclays case suggests the difficulty of stating a valid hot news misappropriation claim against a party engaged in automated data collection for purposes of data analytics. In many factual scenarios, scraping of information would not appear to qualify as “free-riding” within the meaning of INS so long as the scraper did not attempt to pass the information off as his own without attribution to the content provider. Indeed, many factual circumstances would appear similar to the recommendations at issue in Barclays, where the information is only valuable because it was attributed to the source. The fact that data analytics often involves the use of information to create entirely new insights (including in combination with information from other sources) suggests further difficulties in establishing the requisite “free-riding,” which under Barclays involves demonstrating that the underlying information was used to produce a directly competitive product.

E. TRESPASS TO CHATTELS.

Courts have also recognized, in certain narrow circumstances, that unauthorized use of web crawling or scraping tools can give rise to a trespass to chattels claim, which “lies where an intentional interference with the possession of personal property has proximately cause[d] injury.”49 For example, in eBay, Inc. v. Bidder’s Edge, Inc., eBay brought a trespass to chattels claim against the defendant, an online auction aggregation service that scraped auction information from eBay’s website using spiders that accessed the website approximately 100,000 times per day in violation of eBay’s terms of service and in defiance of cease-and-desist demands from eBay. eBay also moved to preliminarily enjoin the defendant from accessing its website. In granting that motion, and finding that eBay was likely to prevail on its trespass to chattels claim, the court relied on the fact that defendant’s spiders consumed a portion – albeit very small – of eBay’s server and server capacity, and thereby “deprived eBay of the ability to use that portion of its personal property for its own purposes.”50

In contrast, where tangible interference is absent, or is no more than theoretical or de minimis, courts have declined to recognize claims for trespass to chattels relating to the use of web crawling or scraping tools. For example, in Tickets.com, the court granted summary judgment dismissing Ticketmaster’s trespass to chattels claim because Ticketmaster failed to present any evidence that its competitor’s scraping of its website either caused physical harm to Ticketmaster’s servers or otherwise impeded Ticketmaster’s use or utility of its servers. In so holding, the court criticized the decision of the eBay court, and required a showing of “some tangible interference with the use or operation of the computer being invaded by the spider.”51 Later courts have generally agreed with the holding in Tickets.com.52

To the extent that Tickets.com presents the prevailing statement of law, and evidence of a tangible interference with a computer or server is necessary to state a claim for trespass to chattels based on unauthorized web crawling or scraping, courts are likely in the future to focus on evidence of tangible interference with systems.53

CONCLUSION

As indicated above, the legal landscape relating to web crawling and scraping is still taking shape – particularly insofar as few courts have considered claims based on crawling or scraping for analytics purposes. Further, because most cases involving the use of web crawling and scraping tools in other contexts have been highly fact-specific, it is difficult to identify bright-line rules for determining when use of such tools for analytics purposes is likely to give rise to liability. Nonetheless, the cases discussed above suggest a number of issues that should be considered both by website owners and by those who seek to perform analytics using data gathered from web-based sources.

These issues include (1) the language of the terms of use or service, and whether such terms address access to the website through automated means, use of any data collected through such means, and use of the website for anything other than the user’s personal, non-commercial use; (2) the enforceability of the terms of use, for example, whether they are presented to the user through a clickwrap mechanism that requires the user to indicate his or her assent to those terms as opposed to a browsewrap agreement, or on a terms of use page that can be reached through a conspicuous link on every other page on the website and which indicates that any use of the website is subject to the user’s agreement to those terms; (3) use of technological tools to deter unwanted crawling or scraping, including but not limited to the robots.txt protocol; (4) whether the website owner will license or authorize uses of content; (5) whether access to the website is protected such that a claim under the CFAA or California Penal Code Section 502 may be alleged; and (6) the extent to which the website content is protected by copyright.

Ultimately, while the claims and theories that may be advanced in connection with the use of web crawling and scraping tools for analytics purposes have yet to be deeply explored by courts, this is likely a temporary state of affairs. Given the increasing number and availability of tools for aggregation and analysis of content in the Big Data era, courts will ultimately be required to address these complicated issues.