I've been working on a Python script that takes the contents of Te Ara and puts them into a compressed archive, after being inspired by how the entirety of the English Wikipedia's text and the Simple English Wikipedia's text can be downloaded as 18GB and 201MB archives respectively (as at March 2021), and wanting to have my very own copy of Aotearoa's history.
There were a variety of bugs along the way - I was missing calls to .replace() to handle some Unicode character conversions in my file path sanitisation function, and my browser and IDE were confusing me by showing the same representations for an em dash and a minus, which wasn't helping.
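When two characters render identically, printing their code points and Unicode names is one way to tell them apart. The following is a minimal sketch using only the standard library; the `describe` helper and the `LOOKALIKES` list are names made up for illustration, not part of the original script.

```python
import unicodedata

# Characters that look nearly identical in many fonts (illustrative list).
LOOKALIKES = ["-", "\u2212", "\u2013", "\u2014"]


def describe(text):
    """Return 'U+XXXX NAME' for each character in text (hypothetical helper)."""
    return [
        f"U+{ord(ch):04X} {unicodedata.name(ch, 'UNKNOWN')}"
        for ch in text
    ]


for ch in LOOKALIKES:
    # repr() shows the raw character; describe() shows what it really is.
    print(repr(ch), describe(ch)[0])
```

Running something like this over a suspicious title shows whether a dash is a HYPHEN-MINUS, a MINUS SIGN, an EN DASH or an EM DASH, regardless of how the browser or IDE draws it.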
I had implemented multiprocessing to save each article as a PDF once the sitemap had been scraped into a list, and all was running well at the time - it was 4x faster and blazing its way through them.
However, a seemingly small bug came up in the tail end of execution. It was giving a FileNotFoundError for ./section_places/southland/stewart_island/rakiura.pdf.
This confused me no end for 6 hours. At first I thought there was an invalid item in the list of scraped articles, and then I thought it was a race condition in how the multiprocessing interacted with filesystem changes. I started putting guard conditions in to wait for directory creation before moving on to saving the PDF, and when that didn't work I tried the Lock class that the multiprocessing library provides, but to no avail - and on top of that my print statements weren't showing and the debugger really wasn't helping much.
I tried filtering the scraped article list down to just the item for Stewart Island, and my print statements also started working then (something to do with the processes is my guess) - and it turned out that the article title had a forward slash in it (Stewart Island/Rakiura). Since I was using pathlib so that my paths could be POSIX compliant regardless of OS, it was interpreting Stewart Island as a directory containing the file, and that directory was never created before saving the PDF. I made one small change to my sanitisation function to replace the slash with an underscore and it was working again.
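The effect is easy to reproduce in isolation. This is only a sketch of the idea, not the actual sanitisation function from the script: `sanitise_segment` is a hypothetical stand-in, and the lowercasing and space-to-underscore rules are guessed from the path in the error message.

```python
from pathlib import PurePosixPath


def sanitise_segment(title):
    """Hypothetical sanitiser: make a title safe to use as ONE path segment."""
    cleaned = title.strip().lower()
    # Slashes would otherwise be read as directory separators.
    for bad in ("/", "\\", " "):
        cleaned = cleaned.replace(bad, "_")
    return cleaned


title = "Stewart Island/Rakiura"

# Without handling the slash, pathlib splits the title into two segments,
# so the first half is treated as a directory that nothing ever created.
raw = PurePosixPath("section_places", "southland", title.lower() + ".pdf")
print(raw.parts)

# With the slash replaced, the whole title stays in the file name.
safe = PurePosixPath("section_places", "southland", sanitise_segment(title) + ".pdf")
print(safe.parts)
```

The first path has four parts and the second has three, which is the whole bug in one comparison. PurePosixPath is used here so the demonstration behaves the same on any OS and never touches the filesystem.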
I feel like this highlighted the importance of 'small sanitisation' for me - that is, sanitisation that usual good-practice conventions don't treat as critical, the way they do correctly parameterising SQL statements that take user input on a web server, or escaping HTML to prevent XSS. It's the small cases that can often be quite difficult - you could have a non-standard specification to conform to with no public libraries to help you, and while data in this case is most likely not user-submitted, you might need to update your specifications each time your dataset changes. Hell, you might even have the odd wild goose chase through your code every now and then. Garbage in, garbage out and all that.
Stay safe, and keep on keeping those strings nice and clean.