Post-build DOM manipulation with pyquery

I suspect this isn't really how you're supposed to do things, but hey, it works for me.

Modern JavaScript toolchains have a habit of confusing me. Digging deep enough into package.json & node_modules I can sometimes figure out why something is behaving the way it is, but that's not my idea of fun.

In my current side project I'm using Choo, and actually quite enjoying it, as far as JavaScript frameworks go. But I found that Bankai wasn't behaving as required when mounting my Choo app onto a non-default index.html (specifically, one that defined <meta> tags in the <head>), so I went looking for a way to change that.

I also needed to add a <noscript> tag, to make for a more graceful failure mode when a user loads the application with JavaScript disabled.

Now, I could have made some changes to Bankai & addressed it there, but... it's part of a JavaScript toolchain, it sounded a bit complicated, and side projects are supposed to be fun.

Instead, I decided to add a post-build step that manipulated the Bankai-generated index.html directly. And because it's post-build, I don't have to use JavaScript! After a false start with Python 3's html.parser, I found pyquery - basically, jQuery for Python.

Without further ado (well, after a quick $ pip install pyquery in your venv of choice):

#!/usr/bin/env python
#
# instead of messing with bankai, let's do this post-build...

import sys
from pyquery import PyQuery as pq

head_elements = [
    '<meta name="description" content="Enter description here..." />',
    '<meta property="og:url" content="https://example.com" />',
    '<meta property="og:title" content="App Name" />',
    '<meta property="og:description" content="Enter description here..." />',
    '<meta property="og:image" content="https://example.com/assets/cover.jpg" />',
    '<meta name="twitter:card" content="summary" />',
    '<meta name="twitter:site" content="@twitterhandle" />',
    '<meta name="twitter:image" content="https://example.com/assets/cover.jpg" />',
    '<title>App Name</title>',
    # of course, we could add more here...
    ]

noscript = '<noscript>Your browser appears to have JavaScript disabled. Sorry, but we need JavaScript to run.<br/><br/>Please consider enabling JavaScript for this domain.<br/><br/>For more information on this application, browse to <a href="https://example.com/welcome/">https://example.com/welcome/</a>.</noscript>'

filename = sys.argv[1]

print('Reading original HTML from file: {}'.format(filename))
with open(filename, 'r') as fin:
    html = fin.read()
    d = pq(html)

    # remove title & meta-desc tags that bankai doesn't set correctly
    d('title').remove()
    # jQuery-style attribute-equals selector!
    d('meta[name="description"]').remove()

    # add new tags to head
    for he in head_elements:
        print('  Adding element to <head>:\n\t{}'.format(he))
        d('head').append(he)

    # add noscript to body
    print('  Adding <noscript> content to <body>')
    d('body').append(noscript)

with open(filename, 'w') as fout:
    fout.write('<!doctype html>\n')
    # pyquery/lxml was converting <script></script>'s into <script />, outer_html was necessary
    fout.write(d.outer_html())

print('Wrote updated HTML to file: {}'.format(filename))

To add this to the build process, I updated package.json. Chaining with && (rather than ;) means the tweak only runs if the Bankai build succeeds:

  "scripts": {
    "build": "bankai build index.js && ./bin/tweak-dom.py ./dist/index.html",

Now, npm run build gets me an index.html with the <head> tags I need in the right places, along with a useful <noscript> tag, ready to deploy. Like I said, probably not really how you're supposed to do things, but we're busy Getting Things Done™ here.

Posted on: Fri 05 April 2019
