Giorgio Delgado

Soup

July 30, 2015

As I mentioned in a previous post, I was looking to switch from the big and bulky WordPress content management system to the lighter, more technical Jekyll static-file blogging platform.

This came about in part because the hosting service I was using to manage my WordPress site was frustratingly unreliable, so I saw switching hosting (now using GitHub Pages) as an opportunity for a complete revamp.

Initial Intentions

My initial post mentioned that Jekyll has a library for easily converting WordPress files into a Jekyll-ready format, but using it felt like a lost learning opportunity, which is why I started looking into Scrapy.

Scrapy would have crawled through my old blog's pages, searching for specific .class or #id selectors in order to pull out blog posts, pages, and media.

Scrapy No Worky

After reading through the Scrapy documentation, I was still unable to solve an issue with the lxml library. And frankly, the simplicity of what I was trying to achieve did not justify more effort on a problem that had already consumed a few hours too many.

While looking for solutions to this error I did, however, get closer to a solution. I decided to ditch Scrapy and do a site-wide XML export of my WordPress content instead. The XML export was disorganized to say the least, so it's not like I was cheating myself by skipping the web scraper. Beautiful Soup was the next library I tried, to parse the XML, extract the content I wanted, and output Jekyll-ready HTML files containing the required front matter.

I started extracting content but then found an even simpler solution via a code snippet.

More precisely ... this small piece:

from xml.dom import minidom


def convert(infile, outdir, authorDirs, categoryDirs):
    """Convert Wordpress Export File to multiple html files.

    Keyword arguments:
    infile -- the location of the Wordpress Export File
    outdir -- the directory where the files will be created
    authorDirs -- if true, create different directories for each author
    categoryDirs -- if true, create directories for each category
    """

    # First we parse the XML file into a list of posts.
    # Each post is a dictionary
    dom = minidom.parse(infile)

    blog = []  # list that will contain all posts

    for node in dom.getElementsByTagName('item'):
        post = dict()

        post["title"] = node.getElementsByTagName('title')[0].firstChild.data
        post["date"] = node.getElementsByTagName('pubDate')[0].firstChild.data
        post["author"] = node.getElementsByTagName(
            'dc:creator')[0].firstChild.data
        post["id"] = node.getElementsByTagName('wp:post_id')[0].firstChild.data

        if node.getElementsByTagName('content:encoded')[0].firstChild is not None:
            post["text"] = node.getElementsByTagName(
                'content:encoded')[0].firstChild.data
        else:
            post["text"] = ""

        # wp:attachment_url could be used to download attachments

        # Get the categories
        tempCategories = []
        for subnode in node.getElementsByTagName('category'):
            tempCategories.append(subnode.getAttribute('nicename'))
        categories = [x for x in tempCategories if x != '']
        post["categories"] = categories

        # Add post to the list of all posts
        blog.append(post)

What? No need for Beautiful Soup to parse XML? Python comes with an XML parsing library straight out of the box (or at least some versions do, from my understanding).

As an aside, if you haven't already realized, the Beautiful Soup library is where this project's name came from, even though I didn't end up using the library in the end. The name just sort of stuck.

The above code snippet was just a skeleton for what the end result would be. By the end of the project I had a fully working converter, soup.py.

So How Does It Actually Work?

Here's the repository.

soup.py iterates through each item XML element.

I then generalized a rule that was consistent across all posts: every blog post had both a non-empty title element and a non-empty content:encoded element. Whenever I found an item element containing those two, I saved its content and metadata as a dictionary and appended it to an array holding all posts.

Otherwise, if I came across an item tag containing a non-empty wp:attachment_url element, it meant the item held media metadata. Using urllib, I went ahead and downloaded the content that the metadata pointed to on the live site.
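For concreteness, here's a rough sketch of how that split between posts and attachments could look. The function name and the media directory are illustrative, not the exact code in soup.py, and it assumes Python 3's urllib.request (on Python 2 the same call lives at urllib.urlretrieve):

import os
from urllib.request import urlretrieve  # urllib.urlretrieve on Python 2

def classify_item(node, media_dir='media'):
    # Hypothetical helper: decide whether an <item> is a post or an attachment
    title_nodes = node.getElementsByTagName('title')
    content_nodes = node.getElementsByTagName('content:encoded')
    if (title_nodes and title_nodes[0].firstChild is not None
            and content_nodes and content_nodes[0].firstChild is not None):
        return 'post'  # non-empty title and body -> regular blog post

    attachment = node.getElementsByTagName('wp:attachment_url')
    if attachment and attachment[0].firstChild is not None:
        url = attachment[0].firstChild.data
        if not os.path.isdir(media_dir):
            os.makedirs(media_dir)
        # Pull the media file down from the live site
        urlretrieve(url, os.path.join(media_dir, os.path.basename(url)))
        return 'attachment'

    return 'other'  # anything else gets skipped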

Once I had iterated through the entire XML DOM, I began formatting.

From WordPress to Jekyll

I defined a function that processes the array of dictionaries containing messy post data and outputs it into a Jekyll-compatible directory structure.

Jekyll has a few rules about how posts must be formatted for them to render correctly:

- post file names must follow the "yyyy-mm-dd-this-is-the-title.md" format, although Jekyll also allows the .html extension, and I wasn't going to start converting my WordPress post HTML into Markdown.

- posts require front matter, metadata telling Jekyll how to structure your post (e.g. what tags, if any, it should have, which layout to use, etc.).
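To make those two rules concrete, here's a minimal sketch of how a file name and a front matter block could be built from the data pulled out of the XML. The helper names are hypothetical and the date parsing assumes the RFC 822 format found in pubDate; the actual helpers in soup.py may look different:

from datetime import datetime

def jekyll_filename(title, pub_date):
    # pub_date is assumed to look like 'Thu, 30 Jul 2015 12:00:00 +0000'
    date = datetime.strptime(pub_date[:25], '%a, %d %b %Y %H:%M:%S')
    slug = '-'.join(title.lower().split())  # naive slug: spaces become dashes
    return date.strftime('%Y-%m-%d') + '-' + slug + '.html'

def front_matter(title, categories, layout='post'):
    # Minimal YAML block Jekyll expects at the top of each file
    return '\n'.join([
        '---',
        'layout: ' + layout,
        'title: "' + title + '"',
        'categories: ' + ' '.join(categories),
        '---',
        '',
    ])

So a post titled "Soup" published on July 30, 2015 would end up as 2015-07-30-soup.html, with the YAML block prepended to its HTML body.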

So, one by one, post_create checks whether the current dictionary holds a regular blog post or a page, ignoring unnecessary pages ("Contact" and "Contact form 1") and leaving special pages ("About", "Forty-Eight") untouched so that I can give them different front matter.
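The filtering itself can be as simple as a lookup against two small sets. The names below are illustrative guesses at how post_create's checks might be organized, not the exact code:

IGNORED_PAGES = {"Contact", "Contact form 1"}   # never written out
SPECIAL_PAGES = {"About", "Forty-Eight"}        # get their own front matter

def choose_layout(title):
    # Hypothetical helper: decide which front matter (if any) a page gets
    if title in IGNORED_PAGES:
        return None      # skip the file entirely
    if title in SPECIAL_PAGES:
        return 'page'    # special pages use the page layout
    return 'post'        # everything else is a regular blog post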

This nifty line:

jekyll_dir = '_drafts/' if year == '-0001' else '_posts/'

routes each post to either the drafts folder or the posts folder depending on whether it was left as a draft or published on WordPress (WP marks drafts by setting their publish year to "-0001").
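Putting that together, the target directory can be resolved and created in a couple of lines. This is a hedged sketch with an assumed helper name, not the exact code from soup.py:

import os

def target_dir(year, outdir='.'):
    # Drafts carry the year '-0001', so they land in _drafts/; everything else in _posts/
    jekyll_dir = '_drafts/' if year == '-0001' else '_posts/'
    path = os.path.join(outdir, jekyll_dir)
    if not os.path.isdir(path):
        os.makedirs(path)  # make sure the folder exists before writing
    return path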

After the front matter is prepended and the file name is built, all that's left is to actually write the file to its intended directory.

Voila.

Liked this post? Consider buying me a coffee :)