Submitting a POST web form, scraping with cheerio, and hosting on AWS Lambda.
- In this article, I will show how to submit a web form and scrape its response.
- Benefits of hosting a scraper on AWS Lambda
- Normally, if you search for scraping with Node.js, you will be pointed to browser-automation tools like Puppeteer, but for some use cases like this one, cheerio is enough.
“Almost everything you can do with a browser (save for interpolating and running JavaScript) can be done with simple Linux tools. Cheerio and other libraries offer elegant Node APIs for fetching data via HTTP requests and scraping, if that’s your end goal.”
Benefits of using cheerio:
1. Performance improvement (in my use case, request time decreased from 2000–5000 ms to 50–500 ms)
2. No extra resource usage from running a browser
3. Puppeteer must be handled carefully, since only a limited number of instances can run at a time; there is no such constraint with request-promise + cheerio
Submitting a web form with a POST request
We will be using a URL built for scraping practice — Testing Ground
Sending the POST request
If we open this URL, we see some instructions followed by a form you can submit.
- How to send the POST request?
We need to analyze how the client side sends this POST request to its server. So we fill in the form, open the network tab, and click Login. Open the topmost request, and we see this:
URL: http://testing-ground.scraping.pro/login?mode=login
So the browser sends a POST request to this URL, and the status received is 200 OK. Scrolling down…
FormData!
So we see the form sends the username as “usr” and the password as “pwd”.
- Sending our own POST request using request-promise
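The request we observed in the network tab can be sketched as follows. The credentials here (“admin”/“12345”) are placeholder sample values for the testing ground; only the field names usr and pwd come from the form data we inspected.

```javascript
// Recreate the login request seen in the network tab:
// a POST to the login URL with form fields usr and pwd.
const options = {
  method: 'POST',
  uri: 'http://testing-ground.scraping.pro/login?mode=login',
  // `form` makes request-promise send an application/x-www-form-urlencoded body
  form: { usr: 'admin', pwd: '12345' },
};

function login() {
  // required lazily so the options above can be inspected without the package installed
  const rp = require('request-promise');
  return rp(options); // resolves with the response body: the page's full HTML
}
```

Calling login() resolves with the HTML the server returns for those credentials.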
Now, the response we get is the entire HTML of the website!
Scraping the data using cheerio
Cheerio : Fast, flexible & lean implementation of core jQuery designed specifically for the server.
Usage is similar to jQuery: after loading the response HTML, $ gives us jQuery-style access to the whole document.
When we select “.error”, we match every element with the error class and can retrieve the text inside it.
If we look at our console, at the bottom we see the text of every HTML element with the class “error”.
What if we wanted a particular element?
- To get the text of a particular HTML element, we open the page, inspect the element, and copy its selector.
For example: to get only “ACCESS DENIED”, shown when “a” is sent as both the user and the password.
Then we use this selector with cheerio and voila!
Benefits of AWS Lambda
Beyond the headline benefit (“AWS Lambda lets you run code without provisioning or managing servers. You pay only for the compute time you consume”), Lambda also changes its IP address, which makes it especially useful in cases where the IP might get blocked by the target server.
Cases when the IP changes:
1. When the code or configuration changes
2. After a period of inactivity (the interval seems random, sometimes even less than 10 minutes)
3. When multiple requests are sent simultaneously, Lambda spawns more than one instance, so multiple IPs are seen
Lambda runs your code in containers, and new containers are spawned in the cases above, which is why the IP changes. Remember, you can’t depend on a container being reused, since it’s Lambda’s prerogative to create a new one instead.
“AWS Lambda is not the same as an EC2 instance: traffic will appear to come from certain IP addresses, but there is no way to configure which IP address is used, meaning the requests will not always be sent from the same IP.”
Deploying: https://www.twilio.com/blog/2017/09/serverless-deploy-nodejs-application-on-faas-without-pain.html
- This approach works great for particular use cases like the one above, where browser automation can be avoided. Otherwise, Puppeteer or another browser tool might be the solution!
Entire Node.js script, for reference.
References:
1. https://docs.browserless.io/blog/2018/06/04/puppeteer-best-practices.html