Data scraping (or web scraping) is a method of obtaining large amounts of data from different websites or public sources (more on this below). There are many ways to accomplish this, and it isn’t uncommon to use several methods in conjunction to get the best data or to verify and clean it. First, however, it is best to discuss some of the legal considerations of using data scraping to obtain the information you need.
Generally, if information is publicly available on the internet, it can be scraped legally. The current legal precedent was first set in 2017 by the U.S. District Court for the Northern District of California, when it granted a preliminary injunction in favor of hiQ Labs in its lawsuit with LinkedIn (owned by Microsoft Corp.). hiQ Labs was scraping data from publicly available LinkedIn profiles to help businesses predict how likely their own employees were to quit, and therefore their talent turnover. Judge Edward Chen wrote in his decision that:
“the balance of hardships tips sharply in hiQ’s favor. hiQ has demonstrated there are serious questions on the merits. In particular, the Court is doubtful that the Computer Fraud and Abuse Act may be invoked by LinkedIn to punish hiQ for accessing publicly available data; the broad interpretation of the CFAA advocated by LinkedIn, if adopted, could profoundly impact open access to the Internet, a result that Congress could not have intended when it enacted the CFAA over three decades ago.”
LinkedIn appealed this decision, but in 2019 the Ninth Circuit Court of Appeals agreed with the lower court.
“The CFAA is adopted to prevent deliberate intrusion on someone else’s computer — in particular, computer hacking,” the court said. In essence, the law applies to private information on private computers or hacking as is commonly understood. However, the court’s decision doesn’t give would-be scrapers a complete green light.
LinkedIn may still attempt to appeal the circuit court’s decision to the Supreme Court. For the time being, however, it seems that if you don’t have to agree to terms or authenticate to enter a website, you are likely on stable legal ground.
If, however, you agree to open a profile or need to authenticate yourself to gain access to the website (such as when you log in to Facebook), then you are agreeing to the terms and conditions of that site. If the terms make it clear that you are not allowed to scrape the site’s data, then you are not only in violation of your contract with that website but may also be opening yourself up to other legal issues.
It is less clear what the legal implications are if you simply scrape a site whose terms prohibit users from scraping data, without ever logging in or explicitly accepting those terms. While most terms and conditions say that by using the website you agree to them, it’s unclear whether that has much legal standing. In my opinion, if you are using a site, you are agreeing to its terms and should respect that site’s or company’s wishes. It’s usually better to be safe than sorry.
Coral Digital primarily employs two methods, although there are certainly many others.
The first method is to use a scraping tool (we use and love Dataminer) that extracts data from websites and downloads it as a spreadsheet or another local file type. This is fast and effective, but it is only good for scraping the same data from the same type of pages. For example, a list of names and email addresses that are all in the same format is a good candidate for scraping. This method can be used for more complicated data, but generally the data must be in a uniform format. It is the quickest and cheapest method, but it struggles when the data is spread across different places, which makes the next method necessary.
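To make the “uniform format” requirement concrete, here is a minimal sketch in Python using only the standard library. It is not how Dataminer works internally; the sample HTML, the `contact` and `name` class names, and the `ContactParser`, `scrape_contacts`, and `to_csv` names are all assumptions made up for illustration. The point is only that when every record shares the same markup, a few rules are enough to turn a page into spreadsheet rows.

```python
import csv
import io
from html.parser import HTMLParser

# Hypothetical sample page: every contact uses exactly the same markup.
SAMPLE_HTML = """
<ul>
  <li class="contact"><span class="name">Ada Example</span>
      <a href="mailto:ada@example.com">email</a></li>
  <li class="contact"><span class="name">Bob Sample</span>
      <a href="mailto:bob@example.com">email</a></li>
</ul>
"""


class ContactParser(HTMLParser):
    """Collects one {name, email} row per <li class="contact"> element."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._current = None   # row being built, or None outside a contact
        self._in_name = False  # True while inside <span class="name">

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "li" and attrs.get("class") == "contact":
            self._current = {"name": "", "email": ""}
        elif self._current is not None:
            href = attrs.get("href") or ""
            if tag == "span" and attrs.get("class") == "name":
                self._in_name = True
            elif tag == "a" and href.startswith("mailto:"):
                self._current["email"] = href[len("mailto:"):]

    def handle_data(self, data):
        if self._in_name and self._current is not None:
            self._current["name"] += data

    def handle_endtag(self, tag):
        if tag == "span":
            self._in_name = False
        elif tag == "li" and self._current is not None:
            self._current["name"] = self._current["name"].strip()
            self.rows.append(self._current)
            self._current = None


def scrape_contacts(html):
    """Parse the page and return a list of {name, email} dictionaries."""
    parser = ContactParser()
    parser.feed(html)
    return parser.rows


def to_csv(rows):
    """Serialize the rows as CSV text, ready to open as a spreadsheet."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=["name", "email"])
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


rows = scrape_contacts(SAMPLE_HTML)
print(to_csv(rows))
```

If one of the pages used different markup for its contacts, these rules would silently miss those records, which is exactly the limitation described above.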
The second method is to hire people to do repetitive tasks. This can be done on any of several different sites (we primarily use MTurk) and through several different companies. This method can be used to find similar kinds of data that may be stored in myriad places. While it is incredibly powerful, it is, at the end of the day, carried out by humans, who may make mistakes or deliberately not complete the task correctly. These mistakes can be compensated for by having multiple people do the exact same task so that the results can be compared (generally, if they match, they are more likely to be correct). However, having two people do each task costs twice as much money (there are always trade-offs)!
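The comparison step can be sketched in a few lines of Python. This is an illustrative assumption about how such a check might look, not a description of MTurk’s own tooling: the `reconcile` function, the `min_agreement` parameter, the task IDs, and the sample answers are all hypothetical. Answers are normalized (trimmed and lowercased) before comparison so that trivial differences don’t count as disagreement.

```python
from collections import Counter


def reconcile(answers_by_task, min_agreement=2):
    """Split tasks into accepted answers and tasks that need another look.

    answers_by_task maps a task ID to the list of answers submitted by
    different workers. An answer is accepted only if at least
    `min_agreement` workers gave the same normalized value.
    """
    accepted = {}
    disputed = []
    for task_id, answers in answers_by_task.items():
        normalized = [answer.strip().lower() for answer in answers]
        if not normalized:
            disputed.append(task_id)
            continue
        value, count = Counter(normalized).most_common(1)[0]
        if count >= min_agreement:
            accepted[task_id] = value
        else:
            disputed.append(task_id)
    return accepted, disputed


# Hypothetical submissions: two workers per task.
submissions = {
    "task-1": ["ada@example.com", "Ada@Example.com "],
    "task-2": ["bob@example.com", "rob@example.com"],
}

accepted, disputed = reconcile(submissions)
print(accepted)  # {'task-1': 'ada@example.com'}
print(disputed)  # ['task-2']
```

Disputed tasks can then be sent to an additional worker or checked by hand, so the extra cost is concentrated on the records where the workers actually disagreed.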
Most often, we employ the first method to get a list of places where the harder-to-find data lives and then farm that list out using the second method.
There you have it: data scraping, both its (current) legality and the general strategy for using it to accumulate data. Scraping data lives in an interesting grey area. You should carefully consider the implications of scraping data before you attempt to do so, and consider whether there are other ways to obtain the data you want, such as paying for it directly. If you are interested in data scraping or have a project that you would like to discuss with us, please set up a time to chat!
The information provided does not, and is not intended to, constitute legal advice; instead, all information, content, and materials available on this site are for general informational purposes only.