Post-build DOM manipulation with pyquery

Modern Javascript toolchains have a habit of confusing me. Digging deep enough into package.json & node_modules I can sometimes figure out why something is behaving the way it is, but that's not my idea of fun.

In my current side project I'm using Choo, and actually quite enjoying it, as far as Javascript frameworks go. But I found that Bankai wasn't behaving as required when mounting my Choo app onto a non-default index.html (specifically, one that defined <meta> tags in the <head>), so I went looking for a way to change that.

Now, I could have made some changes to Bankai & addressed it there, but... it's part of a Javascript toolchain, it sounded a bit complicated, and side projects are supposed to be fun.

Instead, I decided to add a post-build step that manipulated the Bankai-generated index.html directly. And because it's post-build, I don't have to use Javascript! After a false start with Python 3's html.parser, I found pyquery - basically, jQuery for Python.

Without further ado (well, after a quick $ pip install pyquery in your venv of choice):

#!/usr/bin/env python
#
# broken <head>? instead of messing with bankai, let's do this post-build...

import sys
from pyquery import PyQuery as pq

head_elements = [
    '<meta name="description" content="Enter description here..." />',
    '<meta property="og:url" content="https://example.com" />',
    '<meta property="og:title" content="App Name" />',
    '<meta property="og:description" content="Enter description here..." />',
    '<meta property="og:image" content="https://example.com/assets/cover.jpg" />',
    '<meta name="twitter:card" content="summary" />',
    '<meta name="twitter:site" content="@twitterhandle" />',
    '<meta name="twitter:image" content="https://example.com/assets/cover.jpg" />',
    '<title>App Name</title>',
    # of course, we could add more here...
    ]

filename = sys.argv[1]

print('Reading original HTML from file: {}'.format(filename))
with open(filename, 'r') as fin:
    html = fin.read()
    d = pq(html)

    # remove title & meta-desc tags that bankai doesn't set correctly
    d('title').remove()
    # jQuery-style attribute-equals selector!
    d('meta[name="description"]').remove()

    # add new tags to head
    for he in head_elements:
        print('  Adding element to <head>:\n\t{}'.format(he))
        d('head').append(he)

with open(filename, 'w') as fout:
    fout.write('<!doctype html>\n')
    # pyquery/lxml was converting <script></script>'s into <script />, outer_html was necessary
    fout.write(d.outer_html())

print('Wrote updated HTML to file: {}'.format(filename))

To add this to the build process, I updated package.json:

  "scripts": {
    "build": "bankai build index.js && ./bin/tweak-dom.py ./dist/index.html",

Now, an npm run build gets me an index.html with the headers I need in the right places, ready to deploy. Like I said, probably not really how you're supposed to do things, but we're busy Getting Things Done™ here.
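
If you want some reassurance that the tags really did land in the right places, a quick check is possible with nothing but the standard library. This is only a sketch: HeadTagCollector and summarize_head are hypothetical helpers, not part of the script above, and the sample string stands in for the contents of ./dist/index.html.

```python
from html.parser import HTMLParser


class HeadTagCollector(HTMLParser):
    """Collect the start tags seen inside <head> (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.in_head = False
        self.head_tags = []  # (tag, attrs-dict) pairs, in document order

    def handle_starttag(self, tag, attrs):
        if tag == 'head':
            self.in_head = True
        elif self.in_head:
            self.head_tags.append((tag, dict(attrs)))

    def handle_endtag(self, tag):
        if tag == 'head':
            self.in_head = False


def summarize_head(html):
    """Return (number of <title> tags, number of meta-description tags) in <head>."""
    collector = HeadTagCollector()
    collector.feed(html)
    titles = sum(1 for tag, _ in collector.head_tags if tag == 'title')
    descriptions = sum(
        1 for tag, attrs in collector.head_tags
        if tag == 'meta' and attrs.get('name') == 'description'
    )
    return titles, descriptions


# Made-up sample; in practice you would read ./dist/index.html instead.
sample = (
    '<!doctype html>\n'
    '<html><head>'
    '<meta name="description" content="Enter description here...">'
    '<title>App Name</title>'
    '</head><body><script src="bundle.js"></script></body></html>'
)
print(summarize_head(sample))
```

Anything other than one title and one description would suggest the remove-then-append step didn't do what was intended.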

© Jamie Finnigan. Built using Pelican. Modified from theme by Giulio Fidente on github.