This came about in part because the hosting service I was using to manage my Wordpress site was frustratingly unreliable, so I saw switching hosting (now using GitHub Pages) as an opportunity for a complete revamp.
My initial post mentioned that Jekyll has a library for easily transferring over files from WordPress to Jekyll-ready format but I sort of saw that as a lost learning opportunity, which is why I started looking into Scrapy.
Scrapy would have crawled trough my old blog's pages searching for specific
#id tags in order to pull out blog posts, pages, and media.
Scrapy No Worky
After reading through the Scrapy documentation, I was unable to solve an issue
lxml library. And frankly, the simplicity of what I was trying
to achieve did not justify the effort to figure out the solution to a problem
that had already consumed a few hours too many.
While looking for solutions to this error I did, however, start getting closer to the solution. I decided to ditch Scrapy and do site-wide XML export of my WordPress content. The XML export was disorganized to say the least, so it's not like I was cheating myself out by not using a web scraper. Beautiful Soup was the next library I attempted to use in order to parse the XML and extract my desired content and output jekyll-ready HTML files containing the required front matter.
I started extracting content but then found an even simpler solution via a code snippet.
More precisely ... this small piece:
def convert(infile, outdir, authorDirs, categoryDirs): """Convert Wordpress Export File to multiple html files. Keyword arguments: infile -- the location of the Wordpress Export File outdir -- the directory where the files will be created authorDirs -- if true, create different directories for each author categoryDirs -- if true, create directories for each category """ # First we parse the XML file into a list of posts. # Each post is a dictionary dom = minidom.parse(infile) blog =  # list that will contain all posts for node in dom.getElementsByTagName('item'): post = dict() post["title"] = node.getElementsByTagName('title').firstChild.data post["date"] = node.getElementsByTagName('pubDate').firstChild.data post["author"] = node.getElementsByTagName( 'dc:creator').firstChild.data post["id"] = node.getElementsByTagName('wp:post_id').firstChild.data if node.getElementsByTagName('content:encoded').firstChild != None: post["text"] = node.getElementsByTagName( 'content:encoded').firstChild.data else: post["text"] = "" # wp:attachment_url could be use to download attachments # Get the categories tempCategories =  for subnode in node.getElementsByTagName('category'): tempCategories.append(subnode.getAttribute('nicename')) categories = [x for x in tempCategories if x != ''] post["categories"] = categories # Add post to the list of all posts blog.append(post)
What? No need for Beautiful Soup to parse XML? Python comes with an XML parsing library straight out of the box (or at least some version do from my understanding).
As an aside, if you haven't already realized, the Beautiful Soup library is where the name came from for this project, even though I didn't end up using the library at the end of the name. It just sort of stuck.
The above code snippet was just a skeleton for what the end result would be. By the end of the project I had:
Added the ability to only extract posts of any sort (be they a standard blog post or a page post).
Added code to download and save any media content and save it to a Jekyll compatible directory.
Defined a function
post_createwhich handled the creation of a single post, be it a regular post or a page.
So How Does It Actually Work?
soup.py iterates through each
item XML element.
I then generalized a certain rule that was consistent across all posts, which
was all blog posts had both a non-empty XML element of
content:encoded. Once I found one such
item element that contained those
two elements within it, I proceeded to save the metadata to an array which
contained dictionaries holding all post content and metadata.
Otherwise, if I came across a
item tag containing a non-empty
wp:attachment_url element then it must mean that it contains media content
meta data. Using
urllib I went ahead and downloaded the content that the
metadata was pointing to on the live site.
Once I iterated through the entire XML DOM then I would begin formatting.
From WordPress to Jekyll
I defined a function that would handle the processing of an array of dictionaries containing messy post data and outputting the data into a Jekyll compatible directory structure.
Jekyll has a few rules with how posts must be formatted for them to be rendered correctly:
- posts must have "yyyy-mm-dd-this-is-the-title.md" format, although it does allow for .html file extensions. And I wasn't going to start converting my WordPress post HTML into MarkDown.
- posts require front matter, which is metadata telling Jekyll how to structure your post (i.e. what tags, if any, it should have, or the layout of the post, etc).
So one by one,
post_create checks whether the current dictionary contains a
regular blog post or a page post by ignoring unnecessary pages ("Contact" &
"Contact form 1") as well as not touching special pages ("About", "Forty-
Eight") so that I may pass them different font matter.
This nifty line:
jekyll_dir = '_drafts/' if year == '-0001' else '_posts/'
routes the posts to either the drafts folder or posts folder depending on whether they were set as draft or published on WordPress (WP identifies drafts by setting the publish date on drafts to "-0001").
After the posting of the front matter and the creation of the file extensions are done then all that's left to do is to actually write the file it's intended directory.