<noscript> elements are not easy to handle while parsing HTML. In this guide, I show how to parse noscript elements with Cheerio while building a mental model of how they work.
Requirements
Before we begin, make sure you have Node.js and Cheerio installed in your project. You can install Cheerio by running the following command in your project directory:
# NPM
npm i -D cheerio
# Yarn
yarn add -D cheerio
Getting Started
Mental Model
- Cheerio's default parser keeps the content of noscript as raw text rather than child elements, so treat noscript as an HTML document in itself that gets its own Cheerio selector
- You have to manage every noscript Cheerio selector separately from the page's Cheerio selector
- You can switch between the two and use .html() to replace or add HTML through either selector
Code
Now, let's walk through the steps to parse <noscript> elements with Cheerio:
1. Import the necessary dependencies:
// File: index.js
const { load } = require('cheerio');
2. Load the HTML content that contains the <noscript> elements:
// File: index.js
const html = `<html>
<body>
<noscript>
This is a noscript element.
<img id="noscript-image-1" src="/assets/img-1.png" />
<img id="noscript-image-2" src="/assets/img-2.png" />
</noscript>
</body>
</html>`;
const $ = load(html);
3. Treat each noscript element as an independent HTML document:
// File: index.js
const noscriptElements = $("noscript");
noscriptElements.each((index, element) => {
  // Re-parse the noscript content so it gets its own selector, $2
  const $2 = load($(element).html());
});
4. Inside the loop, query the content of the noscript element through $2:
// File: index.js (inside the .each() callback from step 3, where $2 is in scope)
const imgElements = $2("img");
imgElements.each((index, imgElement) => {
  const src = $2(imgElement).attr("src");
  const id = $2(imgElement).attr("id");
  console.log("Image Source:", src);
  console.log("Image ID:", id);
});
5. Full Code
// File: index.js
const { load } = require("cheerio");
const html = `<html>
<body>
<noscript>
This is a noscript element.
<img id="noscript-image-1" src="/assets/img-1.png" />
<img id="noscript-image-2" src="/assets/img-2.png" />
</noscript>
</body>
</html>`;
const $ = load(html);
const noscriptElements = $("noscript");
noscriptElements.each((index, element) => {
  // Re-parse the noscript content so it gets its own selector, $2
  const $2 = load($(element).html());
  const imgElements = $2("img");
  imgElements.each((imgIndex, imgElement) => {
    const src = $2(imgElement).attr("src");
    const id = $2(imgElement).attr("id");
    console.log("Image Source:", src);
    console.log("Image ID:", id);
  });
});
Summary
In the code, we first locate all <noscript> elements using the $('noscript') selector. Then, we iterate over each element, treat its content as its own HTML, and parse it again with Cheerio. With these two similar-looking selectors (here $ and $2), it becomes easier to switch between the noscript HTML and the original HTML containing the noscript elements.
The steps above demonstrate how to locate and access elements inside <noscript> tags using Cheerio.
You can adjust the code to handle multiple elements or perform additional processing based on your requirements.