How to Scrape a Dynamic Webpage with Node-RED?

Linxiu Jiang
4 min read · Feb 5, 2021

1. Introduction

In the last blog, I introduced how to scrape a dynamic web page (bilibili.com) with Python. In this blog, let me show you a new tool: Node-RED. I will again scrape information from bilibili.com.

To prepare, install Node-RED on your laptop. If this is your first time installing it, do not worry: it is simple to set up. Just follow the official guide: https://nodered.org/docs/getting-started/local#installing-with-npm

Ok, now we can start our story.

2. Scrape a static webpage

First, let us start by scraping a static webpage, say the homepage of bilibili.com. By static I mean that you do not have to scroll to trigger new responses: the whole DOM is returned in a single request. Our target is to get all the users shown on the homepage, as marked with a red rectangle.

First step: build a flow that lets the client connect to the server. This is very simple; I use inject, http request, html (parse), and debug nodes:
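For reference, the wiring above can be sketched as a minimal flow export. The node ids, the `.up-name` selector, and the node names are my own placeholders, not the article's actual flow; positions and other properties are omitted:

```json
[
  { "id": "n1", "type": "inject", "name": "start", "wires": [["n2"]] },
  { "id": "n2", "type": "http request", "name": "GET homepage",
    "method": "GET", "ret": "txt", "url": "https://www.bilibili.com", "wires": [["n3"]] },
  { "id": "n3", "type": "html", "name": "select users",
    "tag": ".up-name", "ret": "text", "wires": [["n4"]] },
  { "id": "n4", "type": "debug", "name": "output", "wires": [] }
]
```

In the html node's export, the CSS selector is stored in the `tag` property; in the editor it is the Selector field.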

Second step: create a suitable CSS selector to pick out the HTML elements you want. If you are not familiar with CSS selectors (as I was not), refer to this article: https://www.w3schools.com/cssref/css_selectors.asp

We can see the output on the console. Our scraper only scrapes the static page content, without the responses the page loads asynchronously.

3. Scrape a dynamic webpage

Now I would like to get the other responses that the page loads asynchronously. Here are the steps to scrape a dynamic web page:

First step: build the Node-RED flow. I use inject, http request, json, function, and debug nodes:

I will explain each node afterwards.

Second step: get the URL from the source code of the web page and change the critical information inside it.

url = api.bilibili.com/x/web-interface/dynamic/region?ps=50&rid=1

In this demo, the critical parameter in the URL is “ps”, the number of items to display; for bilibili, it can be at most 50. It is worth mentioning that we no longer use the URL of the bilibili portal itself.
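If you prefer to build this URL in a function node placed before the http request node, a minimal sketch looks like this. The function name is my own; the “ps” and “rid” values come from the URL above:

```javascript
// Build the API URL in a function node (sketch; inside Node-RED, `msg` is
// passed to the function body and `return msg` forwards it to the next node).
// "ps" is the number of items to return (bilibili caps it at 50);
// "rid" is the region/category id.
function buildUrl(msg) {
  const params = new URLSearchParams({ ps: "50", rid: "1" });
  msg.url = "https://api.bilibili.com/x/web-interface/dynamic/region?" + params.toString();
  return msg; // the http request node uses msg.url when its URL field is left empty
}
```

Setting `msg.url` like this keeps the flow flexible: you can change “ps” or “rid” in one place without editing the http request node.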

Third step: send a request with this URL, optionally through a proxy, to the server.

I still use the http request node to send the request to the server.

Fourth step: receive the data from the server's response and process it.

I use the json node and a function node to handle the data. The json node ensures that the payload has been parsed into a JSON object. The function node extracts the data we want.

The data structure looks like this:

The data contain 50 objects, and each object has many attributes, such as aid, tid, videos, owner, and so on. I want to extract the data about owner and videos, so the code is as follows:

I traverse each JSON object in the response array, extract the fields I want into a new JSON object, and push that object into the result array.
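The traversal can be sketched like this. The exact item shape is an assumption based on the attributes mentioned above (aid, tid, videos, owner), so adjust the field names to the real response:

```javascript
// Extract the wanted fields from the array of video objects.
// The item shape here is hypothetical; check the real API response.
function extractVideoInfo(items) {
  const result = [];
  for (const item of items) {
    result.push({
      aid: item.aid,       // video id
      videos: item.videos, // number of parts in the video
      owner: item.owner,   // uploader info object, treated as opaque here
    });
  }
  return result;
}
```

In the function node, this would run against the array inside `msg.payload` and the result would be assigned back to `msg.payload` before returning.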

After this, I can get all the data about videos and the owner:

Final step: store the data in files or feed it into a database for further analysis.

We have already covered this step in the previous article; please check https://jiang-linx-5844.medium.com/scraping-web-data-with-tor-7287544cddbe for more information.

4. Closing

In this article, I introduced Node-RED through a simple scraper demo. If you have any questions, feel free to comment or contact me on LinkedIn: linkedin.com/in/linxiu-frances-jiang-961986117
