Scrapebox is a fantastically useful tool for a wide variety of businesses and a lot of different purposes. Small businesses can use it to scrape data on their competitors and their primary keywords. Larger businesses can use it to scrape product details, gather aggregate data for research, or even just gather data on an audience of users from Twitter or the like.
It’s also a dangerous tool to use. Scrapebox knows no human rules. It does what you tell it to do. If what you tell it to do happens to violate the terms and conditions of the site you’re scraping, you could find your IP blocked or your account banned. To a site getting hammered by data requests, Scrapebox can look very much like a DDoS attack, and sites today take those very seriously. So what settings should you use to ensure ethical, safe usage of the tool?
First of all, let’s talk about what Scrapebox is. If you already use Scrapebox and are well aware, skip to the next section. They bill the software as “the Swiss Army Knife of SEO” because it’s a multi-faceted and multi-purpose tool. It’s a scraper, as you might expect. You can point it at a webpage and pull data off that page, and point it at a list of URLs and pull data off of all of them.
Since Scrapebox is primarily an automation tool, it relies heavily on web proxies. A proxy is an intermediary server that forwards your traffic under its own IP address. Proxies are extremely useful for avoiding IP bans and rate limits. For example, if you want to scrape the top 10 Google search results for a list of 1,000 keywords, Scrapebox can do that very quickly. However, after a certain number of fast hits, Google recognizes that one IP address is making too many rapid requests to its servers. It puts a captcha in front of the software, which stalls it out.
By using 1,000 different IP addresses – or just 200 in rotation, or what have you – Scrapebox traffic is much slower on any individual IP address. Google no longer thinks one person is making a thousand different requests in ten minutes; they think 200 people are making 5 requests each in the span of ten minutes. It’s a much, much more reasonable amount of traffic, and it’s something Google won’t even blink at.
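The arithmetic above is easy to sketch. This is not Scrapebox's internal code (which is closed-source); it's a minimal Python illustration of round-robin proxy rotation, with made-up `example.com` proxy addresses standing in for a real proxy list:

```python
import itertools

# Hypothetical proxy list -- substitute addresses from your provider.
PROXIES = [f"http://proxy{i}.example.com:8080" for i in range(200)]

# itertools.cycle gives simple round-robin rotation: each request
# goes out through the next proxy in the list, so no single IP
# carries more than its share of the traffic.
proxy_pool = itertools.cycle(PROXIES)

# Simulate 1,000 requests spread across the 200 proxies and count
# how many land on each IP.
counts = {}
for _ in range(1000):
    proxy = next(proxy_pool)
    counts[proxy] = counts.get(proxy, 0) + 1

# 1,000 requests / 200 proxies = 5 requests per IP -- the
# "200 people making 5 requests each" scenario described above.
assert all(c == 5 for c in counts.values())
```

Real rotation schemes often randomize the order rather than cycling strictly, which makes the pattern even harder to detect; the load-spreading math works out the same either way.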
The point is that Scrapebox gives you access to immense amounts of data that you normally can't access quickly. Many sites provide data APIs you can use to pull data; Facebook, for example, has its Graph API. If you have a Facebook app with API access, you can pull limited types of data. If you don't have API access, or you want data that the Facebook API won't supply, you can use Scrapebox to get it.
Scrapebox circumvents API restrictions: it can get data APIs won't provide, it gets around rate limits, and it can perform automated, multi-step tasks to fetch data that would take several repeated calls and rounds of filtering through a standard API.
Scrapebox has a bunch of different functions you can use to scrape different kinds of data in different scenarios. Several of those features, and another half-dozen or more, are add-ons for the base Scrapebox application.
There are a lot of add-ons, which cost more than just the base license to use. The base license, by the way, is $100. They claim it’s a limited time deal, $97 instead of $197, but I’ve never actually seen it for full price.
I’m morally obligated at this point to give you the usual spiel about ethical use of technology.
Many of the features of Scrapebox technically violate the terms of service of a site. Google, for example, has “you will not attempt to circumvent API limitations” as part of their developer terms of service. The main reasons for this are financial, of course. If a site is selling API access, they don’t want people to use third party software to get the same data and bypass paying for it. Additionally, scraping uses up server resources, which can be expensive. For smaller servers, it can even use up available bandwidth, shutting down the site for legitimate users.
This is all par for the course with scraping. It's why you use proxy IPs: to minimize the chance of getting caught. It's all technically in violation of various sites' terms, but very rarely will one scraper be so egregious and so obvious as to get caught.
The actual legality of scraping data is currently a hotly contested issue. There are several ongoing court cases to determine what is and isn’t legal, in fact. This site has a good rundown of the current circumstances.
Other features of Scrapebox are potentially even more dangerous. Widespread automatic posting of blog comments, for example, is very much a spam technique. Even if you’re trying to be reasonable and valuable with your comments, you still end up posting a lot of barely-useful or spammy comments when you aren’t giving them individual attention. Scrapebox can spin content, but it doesn’t have an AI, it doesn’t operate on machine learning, and it can’t contextually make good comments. These kinds of features not only put you on widespread blacklists like Akismet, they can earn your brand an extremely negative reputation.
Scrapebox is, at the end of the day, just a tool. If you use it in a restrained and ethical manner, you can get a lot of value out of it. On the other hand, if you push it to its fullest ability, you take on a lot of risk, and you have no one to blame but yourself.
Many of you didn’t come here for the ethics lecture or the rundown of what the program can do, of course. You came here because the title promised you settings you can actually use. Well, you’ve hit “page down” enough to come to the right section, I suppose.
First of all, you should talk to the proxy provider you're using. Some proxies will only support one connection or request at a time. Some will be unlimited. Some have limits around 10, 50, or 100 simultaneous connections. These are the threads you're sending through each proxy.
If you set your threads too high, your proxies will start getting banned or caught by rate-limit captchas. If you set them higher than a server admin allows, you might get your access to those proxies revoked, if you're using a private proxy list. It's generally a good idea to start small and work your way up. You don't need to get your data excessively quickly, after all; you can always run the program overnight.
If your proxy provider will tell you their maximum thread count, just use a number lower than that. If they have no maximum, use a number that is reasonable for your internet connection and for your purposes.
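As a rough rule of thumb, your total connection setting is your per-proxy limit (with some headroom) multiplied by your proxy count. This helper is purely illustrative; the numbers and the safety margin are assumptions to adjust for your own provider, not anything built into Scrapebox:

```python
# Assumed figures -- check your proxy provider's actual limits.
MAX_THREADS_PER_PROXY = 10   # provider's stated per-proxy connection cap
SAFETY_MARGIN = 0.5          # start at half the cap and work your way up
NUM_PROXIES = 20             # size of your proxy list

def safe_thread_count(per_proxy_limit, num_proxies, margin):
    """Suggest a total connections setting: stay comfortably under
    the provider's limit on every proxy in the rotation."""
    per_proxy = max(1, int(per_proxy_limit * margin))
    return per_proxy * num_proxies

total = safe_thread_count(MAX_THREADS_PER_PROXY, NUM_PROXIES, SAFETY_MARGIN)
# 5 threads per proxy x 20 proxies = 100 total connections
print(total)
```

If the first overnight run completes without captchas or dropped proxies, nudge the margin up; if proxies start getting flagged, dial it back down.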
Second, if possible, use a backconnect proxy. A normal proxy is a single server with a single IP that you forward your traffic through. A backconnect proxy is a swarm of different machines and IPs: your traffic enters the pool, leaves it through one of those machines to fetch your data, then comes back through to you.
The primary benefit of a backconnect proxy swarm is randomness. If you have 10 proxies you rotate through, a site like Google can still detect the same behavior in the same pattern coming from 10 different IPs and can link them all together to see what you’re really doing. If you have 10 different machines in a backconnect swarm, there is a much lower chance of a regular pattern appearing. The larger the swarm, the less of a pattern there can possibly be. You can read more about backconnect proxies here.
If you’re scraping results based on keywords, you should use as many different keyword variations as is reasonable.
You can pay Scrapebox for their add-on that will suggest keywords for you, but that costs extra money. Instead, you can use a site like this one to give you a huge list of keyword variations for free. That specific site will start with one keyword and give you every autocomplete variation starting with popular options, then going through the alphabet. If it finishes the alphabet and you haven’t stopped it, it will start with the first keyword generated and repeat the process using that keyword, and on down the list for as long as you want to let it run. You can generate thousands of keywords in under a minute.
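The alphabet-walk technique that site uses is straightforward to replicate. Below is a hedged sketch of the expansion logic only; a real tool would send each variant to a search engine's autocomplete endpoint and collect the suggestions, which this example omits to keep it self-contained:

```python
import string

def expand_keyword(seed):
    """Produce the autocomplete-style query variants for one seed:
    'seed a', 'seed b', ... 'seed z', the way suggestion scrapers
    walk the alphabet."""
    return [f"{seed} {letter}" for letter in string.ascii_lowercase]

def expand_list(seeds, rounds=1):
    """Breadth-first expansion: feed each generated variant back in
    as a new seed, for as many rounds as you let it run."""
    queue = list(seeds)
    results = []
    for _ in range(rounds):
        next_queue = []
        for seed in queue:
            variants = expand_keyword(seed)
            results.extend(variants)
            next_queue.extend(variants)
        queue = next_queue
    return results

variants = expand_list(["proxy server"], rounds=1)
print(len(variants))  # 26 variants from one seed in one round
```

One seed yields 26 queries in a single round, and each additional round multiplies by 26 again, which is how these tools reach thousands of keywords in under a minute.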
If you just want article data, link data, keyword data, or anything that isn’t Google-specific, you might also consider scraping Bing instead. There are two reasons for this. Firstly, Bing is a lot less strict with scraping than Google. They don’t care as much about rate limits or about bot blocking, and their automatic processes don’t work as hard to stop it. It is, in essence, easier to scrape Bing.
Secondly, Bing likely just uses a lot of Google’s results directly. Google even published proof of this in 2011, to which Microsoft basically replied “so what?” There’s a pretty good chance that the data you get from Bing is reasonably congruent with data from Google. As long as the occasional mistake is fine, you can get perfectly usable data from Bing.
One Scrapebox-specific setting is the sheer number of search engines it will scan. It covers Google, Yahoo, and Bing, of course. It also covers Rambler, BigLobe, Goo, Blekko, Ask, Clusty, and dozens more.
If you don't need data from those search engines, or if the results you get from specific engines are low volume or low value, uncheck those engines and stop scanning them. It's a waste of processor cycles, power, and bandwidth to keep scanning them when the data is useless to you.
Finally, customize your timeout settings. If you're using backconnect proxies or a private proxy list, you can set the timeout to something low, like 15-30 seconds. Shorter timeouts allow for faster data harvesting, but at the same time, they can overload proxies and get you temporarily shoved off the proxy. Public proxies, which are generally slower to begin with, should have longer timeout periods; a 30-90 second range is recommended here.
If you’re using a limited proxy list or you know you’re going to be scraping a large volume of data from a picky site like Facebook or Google directly, use a longer timeout, generally 90 seconds. This helps ensure that you’re not going to be caught by the captchas and filtered. It will harvest data more slowly, but more reliably.
Do you use Scrapebox? If so, what are your ideal settings? Feel free to let me know; it can be good to see what works for others.