How-to create a Python scraper

Latest revision as of 12:58, 18 August 2024

This page is a high-level outline of the considerations involved in creating a new scraper that produces ZIM files usable in Kiwix or other compatible readers.

Developing the scraper

  1. Decide what resource you want to create a scraper for.
    1. If the resource is a website, check to see if https://zimit.kiwix.org/ works.
    2. Make sure none of the existing scrapers work for your use-case.
  2. Decide how you want to implement the scraper, write a proposal, and submit it as a request in the zim-requests repository so the community can give you feedback and create a repository for you if needed (Example). Some questions you might want to answer in the request are:
    • Information about the resource you want to scrape.
      • In particular, evidence that the future scraper will be in line with openZIM's content purpose and publishing policies
        • Sometimes this is obvious, sometimes it needs some explanation, especially regarding content licensing.
        • If the scraper is not in line with these requirements, this is not a show-stopper, but we will probably not host it under the openzim umbrella and will probably dedicate less effort to supporting you (you are still free to create a scraper; our whole software stack is free software).
    • Why create a new scraper versus using one that already exists (or increasing the capabilities of a closely related existing scraper).
    • A rough sketch of your proposal.
  3. Implement the scraper using the Python bootstrap repository as a basis.

Best practices

This section lists best practices covering different aspects of developing a scraper.

Development

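A Python scraper should ideally match other openZIM scrapers by:

  • adhering to openZIM's Contribution Guidelines
  • implementing openZIM's Python bootstrap, conventions and policies
  • being hosted on GitHub under the openzim organization
  • using utilities from python-scraperlib (zimscraperlib on PyPI) to create the ZIM and provide a consistent experience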

Scraper User Experience

The scraper will mostly be used by individuals who curate their own ZIM collections, and Zimfarm maintainers. To help them, the scraper should:

  • be configurable with CLI flags, especially for ZIM metadata (title, description, tags, ...) and filename
  • validate all flags and metadata as early as possible to avoid spending time fetching online resources only to find they cannot produce a ZIM
  • implement proper logging with various log levels (error, warning, info, debug)
  • create logs that are easy to understand, e.g. avoid filenames in logs by using the logging package in zimscraperlib
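
As a rough illustration of the first three points, the sketch below uses only the Python standard library. The flag names, the validate_metadata helper and the 80-character description limit are assumptions made for this example, not an openZIM API; a real scraper would typically rely on zimscraperlib for logging and metadata checks.

```python
import argparse
import logging

logger = logging.getLogger("example-scraper")  # hypothetical scraper name

MAX_DESCRIPTION_LENGTH = 80  # assumed limit, adjust to the conventions you follow


def parse_args(argv=None):
    """Expose ZIM metadata and the output filename as CLI flags."""
    parser = argparse.ArgumentParser(prog="example-scraper")
    parser.add_argument("--title", required=True, help="ZIM title")
    parser.add_argument("--description", required=True, help="ZIM description")
    parser.add_argument("--tags", default="", help="semicolon-separated ZIM tags")
    parser.add_argument("--zim-file", default="output.zim", help="ZIM filename")
    parser.add_argument("--debug", action="store_true", help="enable debug logs")
    return parser.parse_args(argv)


def validate_metadata(args):
    """Return a list of problems; an empty list means the flags look usable."""
    problems = []
    if not args.title.strip():
        problems.append("title must not be empty")
    if len(args.description) > MAX_DESCRIPTION_LENGTH:
        problems.append(
            f"description is longer than {MAX_DESCRIPTION_LENGTH} characters"
        )
    if not args.zim_file.endswith(".zim"):
        problems.append("ZIM filename must end with .zim")
    return problems


def main(argv=None):
    args = parse_args(argv)
    logging.basicConfig(
        level=logging.DEBUG if args.debug else logging.INFO,
        format="[%(asctime)s] %(levelname)s: %(message)s",
    )
    # Validate before any network access so bad input fails within seconds
    problems = validate_metadata(args)
    for problem in problems:
        logger.error(problem)
    if problems:
        return 1
    logger.info("Metadata is valid, starting to fetch content")
    return 0
```

Returning an exit code from main keeps the validation step easy to test and lets the caller decide how to terminate the process.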

Supporting Zimfarm

Scrapers should be good citizens when operating in shared infrastructure such as Zimfarm by:

  • re-encoding images and videos so that the final ZIM size is (by default at least) moderate
  • caching re-encoded assets on an S3 bucket (we can provide you with a dev bucket on request) to avoid wasting compute time
  • avoiding unnecessary filesystem access, i.e. preferring to add items to the ZIM on the fly rather than staging them on the filesystem and then creating the ZIM
  • consuming as few resources as possible (CPU time, disk IOs, disk space, memory, network, ...)
  • implementing a task progress JSON file so Zimfarm can report status
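
A minimal sketch of the task progress file from the last point, using only the standard library. The "done" and "total" key names and the write_progress helper are assumptions made for this example; check an existing scraper for the exact format Zimfarm expects.

```python
import json
import os
import tempfile
from pathlib import Path


def write_progress(progress_file: Path, done: int, total: int) -> None:
    """Write the current progress as JSON, replacing the file atomically."""
    payload = {"done": done, "total": total}  # assumed key names
    # Write to a temporary file in the same directory, then rename it, so a
    # reader never sees a half-written JSON document
    fd, tmp_name = tempfile.mkstemp(dir=progress_file.parent, suffix=".tmp")
    with os.fdopen(fd, "w", encoding="utf-8") as fh:
        json.dump(payload, fh)
    os.replace(tmp_name, progress_file)
```

The scraper would call write_progress after each processed item or batch, so the file always reflects the latest state.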

How to develop a nice UI to run inside the ZIM

The original scrapers use Jinja2 to render HTML files dynamically and add them to the ZIM. We are currently migrating to another approach in which the UI running inside the ZIM is a Vue.js project. We are not yet certain which approach is best. Vue.js makes it possible to quickly build very dynamic interfaces in a clean way, whereas the Jinja2 approach usually relied on ad hoc JavaScript built on jQuery and similar libraries. However, Vue.js probably supports a more limited set of browsers and imposes a steeper learning curve on scraper contributors. The Freecodecamp and Youtube scrapers already use the Vue.js approach. The Kolibri scraper has begun to migrate, but that work is still in a v2 branch.

Adding legacy browser support with vite-plugin-legacy

The @vitejs/plugin-legacy plugin adds support, in production builds, for legacy browsers that lack newer features. The code block below shows how to edit vite.config.js to support browsers that fully support ES6.

// vite.config.js
import { defineConfig } from 'vite'
import vue from '@vitejs/plugin-vue'
import legacy from '@vitejs/plugin-legacy'

export default defineConfig({
  // Add the plugin and configure as shown below
  plugins: [
    vue(),
    legacy({
      targets: ['fully supports es6'],
      modernPolyfills: true
    })
  ]
})

Adding ogv.js as fallback for video.js in Vue.js

To use ogv.js as a fallback in Video.js for playing WebM videos in older browsers that don't support WebM (like Safari), follow these steps:

  1. Install ogv and vite-plugin-static-copy
  2. Add the viteStaticCopy plugin to your vite.config.js:
    // vite.config.js
    import { defineConfig } from 'vite'
    import { viteStaticCopy } from 'vite-plugin-static-copy'
    
    export default defineConfig({
      plugins: [
        // ..
        viteStaticCopy({
          targets: [
            {
              src: 'node_modules/ogv/dist/*',
              dest: 'assets/ogvjs'
            }
          ]
        })
      ],
    })
    
    The vite-plugin-static-copy plugin copies the OGV library files to the assets folder in the production build, as Video.js needs these files at runtime.
  3. Copy the videojs-ogvjs.js file from the YouTube scraper (zimui/src/plugins/videojs-ogvjs.js) into your project. This custom videojs-ogvjs tech plugin enables Video.js to use ogv.js as a fallback in Vue.js.
  4. Import the tech plugin in your project and modify your Video.js options:
    import { ref } from 'vue'
    import '@/plugins/videojs-ogvjs.js'
    
    const videoOptions = ref({
      // ...
      techOrder: ['html5', 'ogvjs'], // Set ogvjs as fallback to html5
      ogvjs: {
        base: './assets/ogvjs'
      },
      sources: [
        {
          src: videoURL,
          type: videoFormat // Make sure to specify the correct video format (E.g. - "video/webm")
        }
      ],
    })
    
    Note: Ensure that the ogvjs "base" path here matches the "dest" path you set for viteStaticCopy in step 2.

Polyfilling WebP Images

To polyfill WebP images you can use webp-hero.

If you are using <img> HTML elements, use the code below to do the polyfilling:

<script setup>
import { onMounted, ref } from 'vue'
import { WebpMachine } from 'webp-hero'
import 'webp-hero/dist-cjs/polyfills.js'

// Template ref bound to the <img> element below
const thumbnailRef = ref(null)

// Polyfill after the img element is loaded into the DOM
onMounted(async () => {
  if (!thumbnailRef.value) return
  try {
    const webpMachine = new WebpMachine()
    // polyfillImage is asynchronous; await it so failures reach the catch block
    await webpMachine.polyfillImage(thumbnailRef.value)
  } catch (error) {
    console.error('Error polyfilling WebP:', error)
  }
})
</script>

<template>
<img
  ref="thumbnailRef"
  src="path-to-img"
/>
</template>

If you are using a special image component like <v-img> from a UI library like Vuetify, refer to the webp-hero helper function (zimui/src/plugins/webp-hero.ts) and its usage (zimui/src/components/video/VideoCard.vue) in the YouTube scraper.

Indexing Custom Content for Suggestions/Search

Since we are not using HTML templates with the Vue.js UI, the content stored in ZIM files is no longer indexed automatically. To index such content manually, refer to the code below:

import logging

from libzim.writer import IndexData
from zimscraperlib.zim import Creator, StaticItem

logger = logging.getLogger(__name__)

class CustomIndexData(IndexData):
    """Custom IndexData class to allow for custom title and content"""

    def __init__(self, title: str, content: str):
        self.title = title
        self.content = content

    def has_indexdata(self):
        return True

    def get_title(self):
        return self.title

    def get_content(self):
        return self.content

    def get_keywords(self):
        return ""

    def get_wordcount(self):
        return len(self.content.split()) if self.content else 0


def add_custom_item_to_zim_index(
    title: str, content: str, fname: str, zimui_redirect: str, creator: Creator
):
    """Add a custom item to the ZIM index"""

    redirect_url = f"../index.html#/{zimui_redirect}"
    html_content = (
        f"<html><head><title>{title}</title>"
        f'<meta http-equiv="refresh" content="0;URL=\'{redirect_url}\'" />'
        f"</head><body></body></html>"
    )

    item = StaticItem(
        title=title,
        path="index/" + fname,
        content=bytes(html_content, "utf-8"),
        mimetype="text/html",
    )
    item.get_indexdata = lambda: CustomIndexData(title, content)

    logger.debug(f"Adding {fname} to ZIM index")
    creator.add_item(item)

add_custom_item_to_zim_index takes a title and a content body and stores them in the ZIM as an HTML file that redirects to the provided URL (zimui_redirect). The item then shows up in search results and suggestions in the Kiwix readers.
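
Titles come from scraped content and may contain characters such as & or <. The sketch below builds the same redirect stub as the function above, with HTML escaping added; build_redirect_page is a hypothetical helper name, not part of zimscraperlib.

```python
import html


def build_redirect_page(title: str, zimui_redirect: str) -> bytes:
    """Build the HTML stub that redirects a search hit to a Vue.js route."""
    redirect_url = f"../index.html#/{zimui_redirect}"
    page = (
        # Escape the title and URL so special characters cannot break the markup
        f"<html><head><title>{html.escape(title)}</title>"
        f'<meta http-equiv="refresh" '
        f"content=\"0;URL='{html.escape(redirect_url)}'\" />"
        "</head><body></body></html>"
    )
    return page.encode("utf-8")
```

The returned bytes can be passed as the content of the StaticItem in place of the inline string formatting.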

Changing title of the Vue.js UI

The title of the UI is set in the index.html file in the Vue.js UI directory. To change it to the title of the scraped content, you can rewrite the file's content with a regular expression just before adding it to the ZIM, as follows:

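import re

index_html_path = self.zimui_dist / "index.html"  # Path where index.html is located
html_content = index_html_path.read_text(encoding="utf-8")

# Replace the title using a regular expression. A function is used as the
# replacement so that digits or backslashes in the title are inserted literally.
new_html_content = re.sub(
    r"(<title>)(.*?)(</title>)",
    lambda match: f"{match.group(1)}{self.title}{match.group(3)}",
    html_content,
    flags=re.IGNORECASE,
)

# Add the modified file to the ZIM
self.zim_file.add_item_for(
    path="index.html",  # assumed path of the UI entry point inside the ZIM
    content=new_html_content,
    mimetype="text/html",
    is_front=True,
)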
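
Because the title comes from scraped content, the replacement should treat it as literal text. The following is a minimal standalone sketch of the same substitution that can be tested without a ZIM creator; set_html_title is a hypothetical helper name, and escaping the title with html.escape is an addition assumed to be desirable here.

```python
import html
import re

TITLE_RE = re.compile(r"(<title>)(.*?)(</title>)", flags=re.IGNORECASE | re.DOTALL)


def set_html_title(html_content: str, title: str) -> str:
    """Replace the content of the first <title> element with an escaped title."""
    safe_title = html.escape(title)
    # A function replacement inserts the title literally. In a replacement
    # string, "\1" followed by a title starting with a digit would be parsed
    # as a different escape (another group number or an octal code), and
    # backslashes in the title would be interpreted as escapes too.
    return TITLE_RE.sub(
        lambda match: f"{match.group(1)}{safe_title}{match.group(3)}",
        html_content,
        count=1,
    )
```

For example, set_html_title(page, "2024 talks") keeps the leading digits of the title intact.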

Additional resources