First of all, web scraping is the practice of fetching and extracting content and data from an existing website with the help of bots. In this blog, we will be using the Puppeteer package to scrape data. Puppeteer is a Node.js library that provides a high-level API for controlling headless Chrome, and it is widely used as an automation tool.
Get started
In this blog, we are going to scrape book details from the books.toscrape.com demo website.
-To get started, I assume that you have already set up a new NestJS project.
If you already have a NestJS project running, you can go ahead and install the puppeteer package and the necessary dependencies as follows (nest-puppeteer is optional here, since the code below calls Puppeteer directly):
$npm install puppeteer
$npm install nest-puppeteer
-After that, create scraper.service.ts and scraper.controller.ts files in your src directory, or simply run the following command:
$nest g resource scraper
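The generator also creates and registers a scraper.module.ts for you. If you created the files by hand instead, the module that wires the controller and service together should look roughly like this:
//scraper.module.ts
import { Module } from '@nestjs/common';
import { ScraperController } from './scraper.controller';
import { ScraperService } from './scraper.service';
@Module({
  controllers: [ScraperController],
  providers: [ScraperService],
})
export class ScraperModule {}
If you created the module manually, remember to add ScraperModule to the imports array of app.module.ts.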
-Inside scraper.service.ts, write the scraping logic as follows:
//scraper.service.ts
import { Injectable } from '@nestjs/common';
import * as puppeteer from 'puppeteer';
@Injectable()
export class ScraperService {
  async scrape() {
    const url = 'https://books.toscrape.com/';
    // Launch a browser; set headless: true to run without opening a window
    const browser = await puppeteer.launch({
      headless: false,
      defaultViewport: null,
    });
    const page = await browser.newPage();
    await page.goto(url);
    // Use Puppeteer to scrape the website; the callback runs inside the page context
    const bookData = await page.evaluate((url) => {
      const bookPods = Array.from(document.querySelectorAll('.product_pod'));
      const data = bookPods.map((book: any) => ({
        title: book.querySelector('h3 a').getAttribute('title'),
        rating: book.querySelector('.star-rating').classList[1],
        price: book.querySelector('.product_price .price_color').innerHTML,
        imgSrc: url + book.querySelector('img').getAttribute('src'),
        // jQuery is not available inside page.evaluate, so use plain DOM APIs
        stock: book.querySelector('.instock.availability').textContent.trim(),
      }));
      return data;
    }, url);
    console.log(bookData);
    await browser.close();
    // Return the data so the controller can send it as the response
    return bookData;
  }
}
Now, your scraper.controller.ts should have the following code, which defines the route:
//scraper.controller.ts
import { Controller, Get } from '@nestjs/common';
import { ScraperService } from './scraper.service';
@Controller('scraper')
export class ScraperController {
  constructor(private readonly scraperService: ScraperService) {}
  @Get()
  async scrape() {
    return this.scraperService.scrape();
  }
}
You can run the app and scrape the data with the following command:
$npm run start
And in your browser, hit the following endpoint:
localhost:PORT/scraper
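If everything is wired up correctly, the endpoint should respond with an array of book objects shaped roughly like this (the values are illustrative and the image path is truncated):
[
  {
    "title": "A Light in the Attic",
    "rating": "Three",
    "price": "£51.77",
    "imgSrc": "https://books.toscrape.com/media/cache/...",
    "stock": "In stock"
  },
  ...
]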
That is it for basic web scraping. If you want to make it more interesting and store the scraped data in a database using Prisma, you can modify the code in scraper.service.ts. Before that, I assume that you are familiar with Prisma and relations among Prisma models. If not, visit my previous blogs.
If you are already familiar, you can go ahead and continue:
In schema.prisma, you must have a model named Book as follows:
generator client {
  provider = "prisma-client-js"
}
datasource db {
  provider = "postgresql"
  url      = env("DATABASE_URL")
}
model Book {
  id     Int    @id @default(autoincrement())
  title  String
  rating String
  price  String
  imgSrc String
  stock  String
}
You have to run a migration so the database stays in sync with the model.
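You can do that with Prisma's migrate command (the migration name below is just an example):
$npx prisma migrate dev --name add-book-model
The modified service below also injects a PrismaService. If you do not already have one, a minimal version, assuming it lives at src/prisma/prisma.service.ts and is provided through a PrismaModule that ScraperModule imports, looks roughly like this:
//prisma.service.ts
import { Injectable, OnModuleInit } from '@nestjs/common';
import { PrismaClient } from '@prisma/client';
@Injectable()
export class PrismaService extends PrismaClient implements OnModuleInit {
  // Connect to the database when the module is initialized
  async onModuleInit() {
    await this.$connect();
  }
}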
Now, the last step is to modify our scraper.service.ts file by adding a for loop that saves each book. Your scraper.service.ts file should look like this:
import { Injectable } from '@nestjs/common';
import * as puppeteer from 'puppeteer';
import { PrismaService } from 'src/prisma/prisma.service';
@Injectable()
export class ScraperService {
  constructor(private readonly prismaService: PrismaService) {}
  async scrape() {
    const url = 'https://books.toscrape.com/';
    const browser = await puppeteer.launch({
      headless: false,
      defaultViewport: null,
    });
    const page = await browser.newPage();
    await page.goto(url);
    // Use Puppeteer to scrape the website; the callback runs inside the page context
    const bookData = await page.evaluate((url) => {
      const bookPods = Array.from(document.querySelectorAll('.product_pod'));
      const data = bookPods.map((book: any) => ({
        title: book.querySelector('h3 a').getAttribute('title'),
        rating: book.querySelector('.star-rating').classList[1],
        price: book.querySelector('.product_price .price_color').innerHTML,
        imgSrc: url + book.querySelector('img').getAttribute('src'),
        // jQuery is not available inside page.evaluate, so use plain DOM APIs
        stock: book.querySelector('.instock.availability').textContent.trim(),
      }));
      return data;
    }, url);
    // Save each scraped book to the database
    for (const book of bookData) {
      await this.prismaService.book.create({
        data: {
          title: book.title,
          rating: book.rating,
          price: book.price,
          imgSrc: book.imgSrc,
          stock: book.stock,
        },
      });
    }
    console.log(bookData);
    await browser.close();
    return bookData;
  }
}
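Inserting rows one at a time is fine for the books on a single page. If you scrape more data, you could replace the for loop with Prisma's createMany, which batches all the inserts into a single query (supported on PostgreSQL), for example:
await this.prismaService.book.createMany({ data: bookData });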
You can run it again with:
$npm run start:dev
Hit the endpoint localhost:PORT/scraper in your browser. You can then check your data in Prisma Studio:
$npx prisma studio