Scrape the web with Crawlee
Crawlee is an open source web scraping and browser automation library for Node.js, designed for productivity. It is made by Apify, the popular web scraping and automation platform.
Crawlee is the successor to the Apify SDK and emerged from Apify's labs after four years of development. Although the Apify SDK was itself open source, its name led users to think that its functionality was limited to the Apify platform, which was not true. For this reason, the Apify SDK has been split into two libraries, Crawlee and Apify SDK. Crawlee retains all the tools related to crawling and scraping, while the Apify SDK continues to exist but keeps only the Apify-specific features.
A lot of work has gone into making the library customizable. For example, you can start with simple HTTP-based scraping and later move to browser-based automation, with Playwright or Puppeteer doing the work under the hood. You can also configure it to avoid being blocked, using auto-generated human-like fingerprints, headless browsers and proxy rotation.
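To make the idea of proxy rotation concrete, here is a minimal round-robin sketch in plain TypeScript. It is not Crawlee's own implementation and does not use the Crawlee API. The ProxyRotator class and the proxy URLs are hypothetical names invented for this illustration.

```typescript
// Conceptual sketch of round-robin proxy rotation (not Crawlee's implementation).
// ProxyRotator and the example.com proxy URLs are hypothetical.
class ProxyRotator {
    private readonly proxyUrls: string[];
    private next = 0;

    constructor(proxyUrls: string[]) {
        if (proxyUrls.length === 0) {
            throw new Error('At least one proxy URL is required');
        }
        this.proxyUrls = proxyUrls;
    }

    // Return the next proxy in the list, wrapping around at the end.
    nextProxy(): string {
        const url = this.proxyUrls[this.next];
        this.next = (this.next + 1) % this.proxyUrls.length;
        return url;
    }
}

const rotator = new ProxyRotator([
    'http://proxy-1.example.com:8000',
    'http://proxy-2.example.com:8000',
]);

// Each outgoing request would be sent through a different proxy in turn.
const picked = [rotator.nextProxy(), rotator.nextProxy(), rotator.nextProxy()];
```

With two proxies, the third request goes back to the first proxy. A real crawler would typically also retire proxies that get blocked, which this sketch leaves out.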
Crawlee Features
– Single interface for HTTP and headless browser crawling
– Persistent queue for URLs to crawl (breadth-first and depth-first)
– Pluggable storage of tabular data and files
– Automatic scaling with available system resources
– Integrated proxy rotation and session management
– Customizable life cycles with hooks
– CLI to start your projects
– Configurable routing, error handling and retries
– Dockerfiles ready to deploy
– Written in TypeScript with generics
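The difference between breadth-first and depth-first crawling, mentioned in the queue feature above, comes down to where newly discovered URLs are placed in the queue. The following is a conceptual sketch in plain TypeScript, not Crawlee code. The crawlOrder function and the small link graph are made up for illustration, with the graph standing in for the links found on each page.

```typescript
// Conceptual sketch: breadth-first vs depth-first ordering of a URL queue.
// LinkGraph, crawlOrder and the sample site are hypothetical illustration data.
type LinkGraph = Record<string, string[]>;

function crawlOrder(graph: LinkGraph, start: string, mode: 'breadth' | 'depth'): string[] {
    const pending: string[] = [start];
    const seen = new Set<string>([start]);
    const visited: string[] = [];

    while (pending.length > 0) {
        const url = pending.shift() as string;
        visited.push(url);

        // Collect links from this page that have not been queued before.
        const fresh: string[] = [];
        for (const link of graph[url] ?? []) {
            if (!seen.has(link)) {
                seen.add(link);
                fresh.push(link);
            }
        }

        if (mode === 'breadth') {
            pending.push(...fresh); // new links wait behind everything already queued
        } else {
            pending.unshift(...fresh); // new links jump to the front of the queue
        }
    }
    return visited;
}

const site: LinkGraph = {
    '/': ['/a', '/b'],
    '/a': ['/a/1', '/a/2'],
    '/b': ['/b/1'],
};

const breadthFirst = crawlOrder(site, '/', 'breadth');
// /, /a, /b, /a/1, /a/2, /b/1
const depthFirst = crawlOrder(site, '/', 'depth');
// /, /a, /a/1, /a/2, /b, /b/1
```

Breadth-first visits every page at one level before going deeper, while depth-first follows one branch to its end before returning to its siblings.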
HTTP Crawling
– HTTP2 support without configuration, even for proxies
– Automatic generation of browser-like headers
– Replication of browser TLS fingerprints
– Built-in fast HTML parsers: Cheerio and JSDOM
– Yes, you can also scrape JSON APIs
Real Browser Crawling
– JavaScript rendering and screenshots
– Headless and headful support
– Zero-config generation of human-like fingerprints
– Automatic browser management
– Use Playwright and Puppeteer with the same interface
– Chrome, Firefox, WebKit and many more
If you have Node.js installed, you can try out Crawlee by running the command below and choosing one of the available templates for your crawler.
npx crawlee create my-crawler
Then choose from the list of TypeScript or JavaScript templates and press Enter on the one you like.
After that, it will automatically generate the boilerplate code and install all the dependencies you need to get started.
Once the installation is complete, navigate to your new project folder and you will see that it already contains several files: a Dockerfile, a package.json and, if you chose TypeScript, a TypeScript configuration file. Run
npm start
and you are good to go.
Crawlee is open source and works anywhere, but since it is developed by Apify, it is easy to set up on the Apify platform and run in the cloud, which is Apify's main focus.
More information
Crawlee.dev
Crawlee on GitHub
Related Articles
Headless Chrome and the Puppeteer library for scraping and testing the web