Friday, May 26 2023

Parsing noscript Elements using Cheerio in Node.js

Rishi Raj Jain @rishi_raj_jain_

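<noscript> elements are not easy to handle while parsing HTML. By default, Cheerio's parser keeps the content of a noscript element as raw text rather than as child elements, so a selector such as $('noscript img') finds nothing. In this guide, I talk about how to parse noscript elements with Cheerio, while establishing a mental model of how they work.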

Requirements

Before we begin, make sure you have Node.js and Cheerio installed in your project. You can install Cheerio by running the following command in your project directory:

# NPM
npm i -D cheerio

# Yarn
yarn add -D cheerio

Getting Started

Mental Model

  • Treat the content of each noscript element as an HTML document in its own right, with its own Cheerio selector
  • Manage the noscript Cheerio selectors separately from the page's Cheerio selector
  • Switch between the two as needed, and use .html() to replace or add HTML through either selector

Code

Now, let's start with the steps to parse noscript elements with Cheerio:

1. Import the necessary dependencies:

// File: index.js

const { load } = require('cheerio')

2. Load the HTML content that contains the noscript elements:

// File: index.js

const html = `<html>
    <body>
        <noscript>
            This is a noscript element.
            <img id="noscript-image-1" src="/assets/img-1.png" />
            <img id="noscript-image-2" src="/assets/img-2.png" />
        </noscript>
    </body>
</html>`

const $ = load(html)

3. Treat each noscript element as an independent HTML document:

// File: index.js

const noscriptElements = $('noscript')

noscriptElements.each((index, element) => {
  const $2 = load($(element).html())
})

4. Inside the loop, query the noscript content through $2:

// File: index.js

// This snippet goes inside the .each() callback from step 3, where $2 is in scope
const imgElements = $2('img')

imgElements.each((index, imgElement) => {
    const src = $2(imgElement).attr('src')
    const id = $2(imgElement).attr('id')
    console.log('Image Source:', src)
    console.log('Image ID:', id)
})

5. Full Code

// File: index.js

const { load } = require('cheerio')

const html = `<html>
    <body>
        <noscript>
          This is a noscript element.
          <img id="noscript-image-1" src="/assets/img-1.png" />
          <img id="noscript-image-2" src="/assets/img-2.png" />
        </noscript>
    </body>
</html>`

const $ = load(html)

const noscriptElements = $('noscript')

noscriptElements.each((index, element) => {
  const $2 = load($(element).html())
  const imgElements = $2('img')
  imgElements.each((index, imgElement) => {
    const src = $2(imgElement).attr('src')
    const id = $2(imgElement).attr('id')
    console.log('Image Source:', src)
    console.log('Image ID:', id)
  })
})

Summary

In the code, we first locate all noscript elements using the $('noscript') selector. Then, we iterate over each noscript element, treat its content as standalone HTML, and parse it again with Cheerio. With two similar-looking selectors (here $ and $2), it becomes easier to switch between the noscript HTML and the original HTML that contains the noscript elements.

The steps above demonstrate how to locate and access img elements inside noscript tags using Cheerio.

You can adjust the code to handle multiple noscript elements or perform additional processing based on your requirements.
