Published: Jul 2, 2024 by Isaac Johnson
I think ScrapingAnt landed in my “things to check out” list from a web advert. I recently had some cycles to sign up and give it a go. I’ll explore the WebUI as well as using ScrapingAnt with curl.
I also found ScrapingBee which, while it seems like a copy, actually came out shortly before ScrapingAnt. It too offers web scraping as a service. We’ll sign up and try it as well.
Lastly, we’ll compare prices of these two small SaaS services before wrapping up.
Scraping Ant Signup
Let’s start with the free tier
I can sign up with GitHub or Google. I’ll use the latter
I now have 10k API credits to use; after that, I’ll need to pay for a plan.
Let’s give it a try.
Say I want to track when new GDG events are happening in my area
I’ll start by entering that URL into the Request Generator. For now, I’ll leave it as a curl example
The request result cost 10 credits and just showed me the source HTML
Perhaps I need a JS Snippet?
I asked Gemini for help
I plugged it in, but didn’t see any result after way more than a minute
Copilot gave me a different suggestion
fetch('https://gdg.community.dev/gdg-twin-cities/')
  .then(response => response.text())
  .then(html => {
    const parser = new DOMParser();
    const doc = parser.parseFromString(html, 'text/html');
    const events = doc.querySelectorAll('.event-item'); // Use the correct selector for event items
    events.forEach(event => {
      const title = event.querySelector('.event-title').innerText; // Use the correct selector for the title
      const date = event.querySelector('.event-date').innerText; // Use the correct selector for the date
      console.log(`Title: ${title}, Date: ${date}`);
    });
  })
  .catch(error => {
    console.error('Error fetching or parsing the content:', error);
  });
Via API key
I can dump the output to a file
$ curl --request GET --url 'https://api.scrapingant.com/v2/general?url=https%3A%2F%2Fgdg.community.dev%2Fgdg-twin-cities%2F&x-api-key=d198b446e87d4126b93932a030233da1' > t.o
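Since the target URL has to be percent-encoded into the query string, I could also let curl do that encoding for me with -G and --data-urlencode; this should produce the same request as above, just with less hand-encoding:

$ curl --request GET 'https://api.scrapingant.com/v2/general' -G \
    --data-urlencode 'url=https://gdg.community.dev/gdg-twin-cities/' \
    --data-urlencode 'x-api-key=d198b446e87d4126b93932a030233da1' > t.o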
I could also parse out the specific parts I wanted, such as the new Sessions and dates:
$ curl --request GET --url 'https://api.scrapingant.com/v2/general?url=https%3A%2F%2Fgdg.community.dev%2Fgdg-twin-cities%2F&x-api-key=d198b446e87d4126b93932a030233da1' | grep -oE '"event_type_title":"[^"]+"|start_date":"[^"]+"|title":"[^"]+"' | awk -F':' '{print $2}' | tr -d '"'
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
100 478k 100 478k 0 0 23799 0 0:00:20 0:00:20 --:--:-- 118k
GDG Twin Cities
Workshop / Study Group
2024-06-26T21
Build with AI
Speaker Session / Tech Talk
2024-06-26T14
GDG Cloud Southlake #34
Workshop / Study Group
2024-06-22T13
Build With AI, Fine Tuning Gemini
Workshop / Study Group
2024-06-19T21
Build with AI
Speaker Session / Tech Talk
2024-05-29T21
GDG Cloud Southlake #33
Executive Director of Technology Experience and CRM
Founding Organizer
Team Member
Application Developer, Consultant
Co Organizer
Tech Lead
Co Organizer
Co Organizer
GDG Twin Cities
GDG Twin Cities
Executive Director of Technology Experience and CRM
Founding Organizer
Team Member
Application Developer, Consultant
Co Organizer
Tech Lead
Co Organizer
Co Organizer
Workshop / Study Group
2024-06-26T21
Build with AI
Speaker Session / Tech Talk
2024-06-26T14
GDG Cloud Southlake #34
Workshop / Study Group
2024-06-22T13
Build With AI, Fine Tuning Gemini
Workshop / Study Group
2024-06-19T21
Build with AI
Speaker Session / Tech Talk
2024-05-29T21
GDG Cloud Southlake #33
Now, at this time, I can use the direct URL just as easily
$ curl --request GET --url 'https://gdg.community.dev/gdg-twin-cities/' | grep -oE '"event_type_title":"[^"]+"|start_date":"[^"]+"|title":"[^"]+"' | awk -F':' '{print $2}' | tr -d '"'
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
100 470k 100 470k 0 0 723k 0 --:--:-- --:--:-- --:--:-- 722k
GDG Twin Cities
Workshop / Study Group
2024-06-26T21
Build with AI
Speaker Session / Tech Talk
2024-06-26T14
GDG Cloud Southlake #34
Workshop / Study Group
2024-06-22T13
Build With AI, Fine Tuning Gemini
Workshop / Study Group
2024-06-19T21
Build with AI
Speaker Session / Tech Talk
2024-05-29T21
GDG Cloud Southlake #33
Executive Director of Technology Experience and CRM
Founding Organizer
Team Member
Application Developer, Consultant
Co Organizer
Tech Lead
Co Organizer
Co Organizer
GDG Twin Cities
GDG Twin Cities
Executive Director of Technology Experience and CRM
Founding Organizer
Team Member
Application Developer, Consultant
Co Organizer
Tech Lead
Co Organizer
Co Organizer
Workshop / Study Group
2024-06-26T21
Build with AI
Speaker Session / Tech Talk
2024-06-26T14
GDG Cloud Southlake #34
Workshop / Study Group
2024-06-22T13
Build With AI, Fine Tuning Gemini
Workshop / Study Group
2024-06-19T21
Build with AI
Speaker Session / Tech Talk
2024-05-29T21
GDG Cloud Southlake #33
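Incidentally, the page evidently embeds the same data twice, which is why the list comes back duplicated. A quick (if blunt) cleanup is to drop repeated lines while keeping their order, though note this also collapses legitimately repeated values like “Co Organizer”:

$ curl --request GET --url 'https://gdg.community.dev/gdg-twin-cities/' | grep -oE '"event_type_title":"[^"]+"|start_date":"[^"]+"|title":"[^"]+"' | awk -F':' '{print $2}' | tr -d '"' | awk '!seen[$0]++'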
We can send our request out from various proxy locations
For instance, if I send the request from China, I find I cannot access the local GDG page
$ curl --request GET \
  --url 'https://api.scrapingant.com/v2/extended?url=https%3A%2F%2Fgdg.community.dev%2Fgdg-twin-cities%2F&x-api-key=d198b446e0233da1&proxy_country=CN'
{"detail":"This site can’t be reached"}
But then I see my own site can be reached:
$ curl --request GET --url 'https://api.scrapingant.com/v2/extended?url=https%3A%2F%2Ffreshbrewed.science&x-api-key=d198b446e87d4126b93932a030233da1&proxy_country=CN'
{"html":"<!DOCTYPE html><html class=\"fontawesome-i2svg-active fontawesome-i2svg-complete\"><head>\n <meta charset=\"utf-8\">\n <meta name=\"viewport\" content=\"width=device-width, initial-scale=1\">\n <title>Fresh/Brewed - Fresh/Brewed</title>\n <style type=\"text/css\">svg:not(:root).svg-inline--fa{overflow:visible}.svg-inline--fa{display:inline-block;font-size:inherit;height:1em;overflow:visible;vertical-align:-.125em}.svg-inline--fa.fa-lg{vertical-align:-.225em}.svg-inline--fa.fa-w-1{width:.0625em}.svg-inline--fa.fa-w-2{width:.125em}.svg-inline--fa.fa-w-3{width:.1875em}.svg-inline--fa.fa-w-4{width:.25em}.svg-inline--fa.fa-w-5{width:.3125em}.svg-inline--fa.fa-w-6{width:.375em}.svg-inline--fa.fa-w-7{width:.4375em}.svg-inline--fa.fa-w-8{width:.5em}.svg-inline--fa.fa-w-9{width:.5625em}.svg-inline--fa.fa-w-10{width:.625em}.svg-inline--fa.fa-w-11{width:.6875em}.svg-inline--fa.fa-w-12{width:.75em}... snip....
Pricing
We can see there are various tiers of pricing today
Most requests I’ve run take 10 credits, so the lowest tier, ‘Enthusiast’, with its “100,000 API Credits”, roughly translates to 10k requests for US $20/mo.
Scraping Bee
We can sign up for ScrapingBee in much the same way. It has a similar layout, and for a bit I suspected it was the same site with different design aesthetics.
We can use our API key with curl like we did before
$ curl --request GET --url "https://app.scrapingbee.com/api/v1/?api_key=VGJOYRYFNXG4C4X1NFXFOHLP716IUVJJKQH1KWT4M194BQW1KTFKGF9TOJCN2ERAYN0R4D9DB46NUQ54&url=https%3A%2F%2Ffreshbrewed.science"
<!DOCTYPE html><html class="fontawesome-i2svg-pending"><head>
<meta charset="utf-8"> ... snip ...
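ScrapingBee can also skip JavaScript rendering for static pages, which should cost fewer credits per request; as I read the docs that’s the render_js parameter (worth double-checking against their current API reference):

$ curl --request GET --url "https://app.scrapingbee.com/api/v1/?api_key=VGJOYRYFNXG4C4X1NFXFOHLP716IUVJJKQH1KWT4M194BQW1KTFKGF9TOJCN2ERAYN0R4D9DB46NUQ54&render_js=false&url=https%3A%2F%2Ffreshbrewed.science"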
We can use Geolocation to set the country code (albeit as a premium proxy)
For example, pulling from Germany:
$ curl --request GET --url "https://app.scrapingbee.com/api/v1/?api_key=VGJOYRYFNXG4C4X1NFXFOHLP716IUVJJKQH1KWT4M194BQW1KTFKGF9TOJCN2ERAYN0R4D9DB46NUQ54&premium_proxy=true&country_code=de&url=https%3A%2F%2Ffreshbrewed.science"
<!DOCTYPE html><html class="fontawesome-i2svg-pending"><head>
<meta charset="utf-8"> ... snip ..
We can also use an even more expensive (by way of credits) option, a “Stealth Proxy”, meant to bypass blocks
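I didn’t try it, but based on their docs it should just be another query parameter; treat the stealth_proxy name below as my reading of the documentation rather than something I verified:

$ curl --request GET --url "https://app.scrapingbee.com/api/v1/?api_key=VGJOYRYFNXG4C4X1NFXFOHLP716IUVJJKQH1KWT4M194BQW1KTFKGF9TOJCN2ERAYN0R4D9DB46NUQ54&stealth_proxy=true&url=https%3A%2F%2Ffreshbrewed.science"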
They also have “Pure Header Forwarding” that can let us pass headers.
For instance, say I wanted to send OTLP tracing spans to New Relic from different locations.
I could use:
curl -X POST --url "https://app.scrapingbee.com/api/v1/?api_key=VGJOYRYFNXG4C4X1NFXFOHLP716IUVJJKQH1KWT4M194BQW1KTFKGF9TOJCN2ERAYN0R4D9DB46NUQ54&url=http%3A%2F%2Fotlp.nr-data.net%2Fheaders%3Fjson&forward_headers_pure=true" \
  -H "Spb-api-key: xxxxxxxxxxxxxxxxxxxxxNRAL" --data @span.json
Note: I did show my key above, but I rotated it after writing, so it is now outdated
Pricing
We can see ScrapingBee pricing
So basic queries would be roughly 15k/mo for $50.
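As a rough back-of-the-envelope comparison against the Ant’s Enthusiast tier (about 10k requests for $20) covered earlier:

$ echo "15000 / 50" | bc   # ScrapingBee: requests per US$1
300
$ echo "10000 / 20" | bc   # ScrapingAnt: requests per US$1
500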
Companies
I see a Crunchbase entry for ScrapingBee pointing to France with 1-10 employees. Looking at the profile of Pierre de Wulf, one of the co-founders, it has been around since July 2019.
I was surprised to find that ScrapingAnt, which according to LinkedIn has 3 employees, is the creation of Oleh Kulyk and has been around since roughly April 2020.
While they are similarly named, I think that’s more happenstance than anything.
I found this Scrapeway breakdown that compares the two.
Comparing
I might consider writing AWS Lambdas to do the same thing. For AWS, the free tier gives 1 million requests and 400,000 GB-seconds of compute a month, which works out to roughly 888 hours at the smallest 128 MB memory size.
Google also makes Cloud Run free for the first 180k vCPU-seconds a month.
Summary
I actually think both tools offer a compelling product. ScrapingAnt has the cheapest offering at $19/mo, but it can charge for a blocked request, so that cheap price might not be comparing like for like. ScrapingBee has a larger feature set but lacks unlimited concurrency, so the Ant would be better if one has to scale out in parallel often.
Since I use Python and JavaScript most often, it helps that both are supported as first-class languages by Ant and Bee. Both let me send requests out of different countries. Since neither lets you carry over credits, whatever you pay for is what you get in the month.
If price is the main point, the Bee gives about 300 requests for US$1 compared to about 500 for the same price from the Ant. As I mentioned, one can also code your own with Lambda, Cloud Run, or Azure Functions as well.