Identifying textual inconsistencies in 30 years of news posts
An effort to build an opinionated super-Smartypants.
I’m working on a project that inherits about thirty years of blog posts, which, understandably, contain some typographic oddities that Smartypants won’t fix.
They also contain some more difficult stuff. The sort of things that fall somewhere between frivolous sub-editing tweaks and accessibility measures I really ought to take.
The project? Well, this is an expression of a problem, not a fix.
The business of taking a set of 1990s HTML files and a 15-year-old Wordpress database and turning them into a set of Markdown files has already stripped out the most esoteric bits of custom styling. For now at least, I’m using remark-react to turn Markdown into HTML, so I started stacking up unified plugins to fix easy things at build time.
Smartypants sorts out Windows users’ straight quotes and hyphens. There are several options for fixing broken Wordpress oEmbeds. And then there’s textr. Textr makes it all too easy to write plugins that regex every imaginable annoyance into submission.
Then that started to seem a bit stupid. Why replace several thousand ‘things that offend Will’ on every build when we could replace several thousand ‘things that offend Will’ once and save them to the source material?
So we’re going to build that. But first, I thought I’d catalogue all the things I want to fix. So here’s that.
Multiply, subtract and divide
- 2 x 2
- 2 X 2
- 2 * 2
+ 2 &times; 2 # -> 2 × 2
The multiply sign is ×; the lower-case x just looks like it. In this project’s context, we also have to contend with rowing boat designations, which should be a sleek 4×−, but are generally the less sleek 4x-.
Don’t forget that − (&minus;) isn’t a hyphen and ÷ (&divide;) is only a Google away.
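A minimal sketch of the multiplication rule as a plain TypeScript function, the sort of thing that could sit inside a textr plugin. The function name and pattern are assumptions for illustration, not code from this project. It only touches an x, X or * that sits directly between two digits, so 4x- is left alone (and needs its own rule), while something like a hex literal would be mangled.

```typescript
// Hypothetical helper (not from the original project): swap an x, X or *
// that sits between two digits for a real multiplication sign.
function fixMultiplication(text: string): string {
  // The lookahead leaves the right-hand digit unconsumed, so chains
  // like "2 x 3 x 4" are fixed in a single pass.
  return text.replace(/(\d)\s*[xX*]\s*(?=\d)/g, "$1 × ");
}
```

Calling `fixMultiplication("2 x 2")` or `fixMultiplication("2*2")` gives `2 × 2` in both cases.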
Sub- and superscripts
- 12 m<sup>2</sup>
+ 12 m&sup2; # -> 12 m²
<sup> HTML tags are unpredictable browser-styling botchery. System fonts have real superiors and inferiors now, so let’s use them.
We like 100 m² more than 100 m2.
Which brings us to typewriter butchery.
- 100 sq m
- 100 sqm
- 100 msq
- 100 m2
- 100 m^2
- 100 msquared
+ 100 m&sup2; # -> 100 m²
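The variants listed above can be folded into one pattern. This is a sketch under assumptions: the function name is made up, the alternatives cover only the spellings shown here, and the output uses an ordinary space before the unit.

```typescript
// Hypothetical helper: normalise typewriter-style square metres to "m²".
// The alternatives mirror the variants listed above and are not exhaustive.
function fixSquareMetres(text: string): string {
  return text.replace(
    // The trailing lookahead stops "sq m" from matching inside "sq miles".
    /(\d)\s*(?:sq\s?m|msq(?:uared)?|m\^2|m2|m<sup>2<\/sup>)(?![A-Za-z0-9])/g,
    "$1 m²"
  );
}
```

Every one of the six "before" forms, plus the `<sup>` version, comes out as `100 m²`.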
French punctuation in languages that aren’t French
- Is that all ?
+ Is that all?
If the language is French, it ought to be a non-breaking-space (&nbsp;) to keep the punctuation mark from becoming orphaned. (An outcome that sounds more tragic than it is.)
C’est tout ?
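A sketch of how the two cases might be handled together. The `lang` parameter is an assumption (it would have to come from each post's metadata), and the rule is limited to ? and ! because those are the marks least likely to produce false positives.

```typescript
// Hypothetical helper: treat the space before ? and ! according to language.
// `lang` is an assumed input, e.g. taken from a post's front matter.
function fixSpaceBeforePunctuation(text: string, lang: string): string {
  if (lang === "fr") {
    // French keeps the space but makes it non-breaking (U+00A0).
    return text.replace(/[ \u00A0]+([?!])/g, "\u00A0$1");
  }
  // Everywhere else, drop the space entirely.
  return text.replace(/(\S)[ \u00A0]+([?!])/g, "$1$2");
}
```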
Ampersands in paragraphs
- There are plots & counterplots.
+ There are plots and counterplots.
The ampersand belongs to dreadfully twee things like Mumford & Sons, not comprehensible text.
Single-word numerals in paragraphs
- “I have 1 or 2 things to attend to…”
+ “I have one or two things to attend to…”
- “She has 1,000,000,000,000 husbands.”
+ “She has a trillion husbands.”
One-thousand-two-hundred-and-thirty-fours in paragraphs are a little clunky, so I’ll permit the odd 1,234.
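A rough sketch of the small-number half of this rule. The word list, the cut-off at twenty and the function name are all assumptions. It deliberately ignores digits that are part of 1,234, 12.06 or 12:06, but it will still catch things it shouldn't (the 2 in '2 km', for instance), so its output is better treated as a list of candidates for review than as a blind replacement.

```typescript
// Hypothetical word list: index 0 holds "one", index 19 holds "twenty".
const SMALL_NUMBER_WORDS: string[] = [
  "one", "two", "three", "four", "five", "six", "seven", "eight", "nine", "ten",
  "eleven", "twelve", "thirteen", "fourteen", "fifteen", "sixteen",
  "seventeen", "eighteen", "nineteen", "twenty",
];

function spellOutSmallNumbers(text: string): string {
  // A standalone run of digits: not glued to a word, and not part of
  // something like 1,234, 12.06 or 12:06.
  const pattern = /(^|[^\w.,:])(\d+)(?![.,:]?\d|\w)/g;
  return text.replace(pattern, (match: string, before: string, digits: string): string => {
    // Leading zeros ("01") and anything above twenty are left as digits.
    const word: string | undefined =
      digits.charAt(0) === "0" ? undefined : SMALL_NUMBER_WORDS[Number(digits) - 1];
    return word === undefined ? match : before + word;
  });
}
```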
Underlining for emphasis on screens
In that context, underlined text means ‘link’ not ‘important’.
<u> is now the ‘Unarticulated Annotation’ element, so it’s explicitly not meant to be used to underline text or to indicate emphasis. For our purposes, swapping all <u>s for <em>s is not an unreasonable policy.
<b>s and uninformed <strong>s for headings
- <p><strong>Heading</strong></p>
- <p><b>Heading</b></p>
+ <h#>Heading</h#>
We use semantically appropriate heading tags for accessibility, and it’s pretty achievable to regex <p><b>Heading</b></p> and swap it for a hierarchically appropriate <h#>.
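A sketch of the regex half of that idea. Working out which heading level is hierarchically right needs knowledge of the surrounding document, so here `level` is simply passed in; both it and the function name are assumptions.

```typescript
// Hypothetical helper: promote a paragraph that contains nothing but bold
// text to a heading. `level` is supplied by the caller, not inferred.
function promoteBoldParagraphs(html: string, level: number): string {
  // \1 makes sure <b> closes with </b> and <strong> with </strong>.
  const pattern = /<p>\s*<(b|strong)>([^<]+)<\/\1>\s*<\/p>/gi;
  return html.replace(pattern, (_match: string, _tag: string, title: string): string =>
    "<h" + level + ">" + title.trim() + "</h" + level + ">"
  );
}
```

Paragraphs that merely contain some bold text, such as `<p>Some <b>bold</b> text</p>`, are left untouched.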
Spaces between numbers and units
- 36km
- 36 km
+ 36 km
This may seem especially nitpicky, but this particular oeuvre is full of these sorts of measurements and it’s desirable to make them more consistent.
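A sketch of a unit-spacing pass. The unit list is a short, hypothetical sample, and the separator is a parameter because the choice of space (ordinary, or non-breaking via "\u00A0") is a policy decision this sketch doesn't try to make.

```typescript
// Hypothetical unit list; a real corpus would need a longer, curated one.
const UNITS: string[] = ["km", "m", "cm", "mm", "kg", "g", "mph"];

function spaceBeforeUnits(text: string, separator: string = " "): string {
  // \b after the unit stops "m" from matching the start of "minutes".
  const pattern = new RegExp("(\\d)\\s*(" + UNITS.join("|") + ")\\b", "g");
  return text.replace(pattern, (_match: string, digit: string, unit: string): string =>
    digit + separator + unit
  );
}
```

`spaceBeforeUnits("36km")` returns `36 km`, and text that is already spaced correctly passes through unchanged.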
- i. first list item <br />
- ii. second list item <br />
- iii. third list item <br />
+ <ol type="i">
+   <li>first list item</li>
+   <li>second list item</li>
+   <li>third list item</li>
+ </ol>
On the page, the difference is between:
i. first list item
ii. second list item
iii. third list item
- first list item
- second list item
- third list item
Implied image captions
- <img src="thing.jpg" />
- <p>A picture of a thing.</p>
- <h4>A picture of a thing.</h4>
- <img src="thing.jpg" />
+ <figure>
+   <img src="thing.jpg" alt="A picture of a thing." width="…" />
+   <figcaption>A picture of a thing.</figcaption>
+ </figure>
- Full results are available <a>here</a>
+ <a>Full results are available here</a>
In the former example, screenreaders will give users just ‘here’ as link text, which isn’t much use. Arbitrarily expanding the anchor tag could be a limited, but useful, improvement – in this case to cover the whole sentence: ‘Full results are available here’.
Dates with a th (partially opinionated)
# Basic bad
- On 29th May 1999
# Worse
- On 29<sup>th</sup> May 1999
# Bad but wrong
- On 29 May 1999
# Correct
+ On <time datetime="1999-05-29">29 May 1999</time>
I may be on my own, but I think native English-speakers will infer that they’re meant to read ‘twenty-ninth’ without the ‘th’. Wrapping it in <sup> just makes it uglier, but automatically adding a <time> tag would be satisfying.
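A sketch of what that automation could look like for day-month-year dates written with full English month names. The function names are hypothetical, day values aren't validated, and the pass isn't idempotent: running it over text that already has `<time>` tags would wrap the inner date again.

```typescript
const MONTHS: string[] = [
  "January", "February", "March", "April", "May", "June",
  "July", "August", "September", "October", "November", "December",
];

// Left-pad a one- or two-digit number string to two digits.
function pad2(value: string): string {
  return ("0" + value).slice(-2);
}

// Hypothetical helper: strip the ordinal suffix (plain or in <sup>) and
// wrap the date in a <time> element with an ISO 8601 datetime attribute.
function wrapDates(text: string): string {
  const pattern = new RegExp(
    "\\b(\\d{1,2})(?:<sup>)?(?:st|nd|rd|th)?(?:</sup>)?\\s+(" +
      MONTHS.join("|") + ")\\s+(\\d{4})\\b",
    "g"
  );
  return text.replace(pattern, (_match: string, day: string, month: string, year: string): string => {
    const monthNumber = String(MONTHS.indexOf(month) + 1);
    const iso = year + "-" + pad2(monthNumber) + "-" + pad2(day);
    return '<time datetime="' + iso + '">' + Number(day) + " " + month + " " + year + "</time>";
  });
}
```

All three 'before' forms (29th, 29<sup>th</sup> and plain 29) come out as the same `<time>` element.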
Shorthand ordinal numerals in body text are a separate but similar issue.
- She finished in 1<sup>st</sup> place.
+ She finished in first place.
American English abbreviations (in languages that aren’t American English)
# e.g. Mister
- Mr. Bernard
+ Mr Bernard
# e.g. Saint
- St. Bernard
+ St Bernard
Round here, the dot in abbreviations indicates that letters have been cut from the end of the original word, so if the last letter of the abbreviation is the last letter of the word we’re abbreviating, there’s no need to indicate its absence with a full stop.
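A sketch of that rule with a small, hypothetical list of contractions. It can't tell an abbreviation's dot from a sentence-ending full stop, so 'He lives on Main St. The house…' would lose its full stop; the list and the lookahead would need tuning against the real corpus.

```typescript
// Hypothetical list of contractions that end on the final letter of the
// full word (Mister, Mistress, Doctor, Saint, Limited).
const CONTRACTIONS: string[] = ["Mr", "Mrs", "Dr", "St", "Ltd"];

function dropContractionDots(text: string): string {
  // Only drop the dot when whitespace follows, so "St.John" style typos
  // are left for a human to look at.
  const pattern = new RegExp("\\b(" + CONTRACTIONS.join("|") + ")\\.(?=\\s)", "g");
  return text.replace(pattern, "$1");
}
```

Truncations such as `Prof.` are not in the list and keep their dot.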
German capitalisation (in languages that aren’t German)
People like to capitalise nouns that feel important to them. I strongly doubt it’s worth trying to automate a correction to this.
- My Diesel Car has run out of Petrol.
+ My diesel car has run out of petrol.
+ My diesel Citroën has run out of petrol.
+ The pedant: My diesel Citroën has run out of diesel.
The goal here might be to automatically lowercase all mid-sentence words that aren’t proper nouns. And there’s some traction here: there are some okay lists of English proper nouns about.
Still, ‘only capitalise words that follow a full stop or appear in this list’ sounds like a recipe for a mess. Plus, in all the sensibly-sized lists I tried, the third or fourth proper noun I searched was missing. Even Wiktionary misses a few. At minimum though, I could isolate the worst offenders in this body of text and correct them.
The degree symbol
- 15C
- 15oC
- 15<sup>o</sup>C
+ 15&deg;C # -> 15°C
The degree symbol (°) is not easy to type: Alt + 248 on Windows, and ⌥ Option + ⇧ Shift + 8 on Mac. Still, it’s not a superscript lowercase o.
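A sketch covering the three 'before' forms for temperatures. The function name is made up and the pattern is intentionally naive: any digit immediately followed by C or F is treated as a temperature, so section numbers like 15C or hex values would be caught too.

```typescript
// Hypothetical helper: normalise 15C, 15oC and 15<sup>o</sup>C to 15°C.
// Assumes a digit directly followed by C or F is always a temperature.
function fixDegrees(text: string): string {
  return text.replace(/(\d)(?:<sup>o<\/sup>|o|°)?([CF])\b/g, "$1°$2");
}
```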
Let’s say we’re trying to express that someone finished a race in twelve minutes and six seconds.
- No: Finished in 12.06 /* 12.06 min = 12 min 3.6 sec */
# Maybe: Finished in 12:06 /* 12:06 pm? 12 hrs 6 mins? */
# Maybe: Finished in 0:12:06 /* ok… */
# Maybe: Finished in 12′ 6″ /* concise, but a bit archaic */
# Maybe: Finished in 12 m 6 s /* metres and seconds? */
# Maybe: Finished in 12 min 6 sec
+ Harrumph: Finished in 12 minutes 6 seconds /* where we started */
The colon definitively trumps the dot, but I’m not sure about this. So long as there’s enough context to disambiguate it, I’m happy with mm:ss (it helps if the minute value is greater than twenty-four: 57:02 isn’t the time of day).
If there’s not enough context, mm minutes ss seconds is the least ambiguous way to go, but it’s pretty wordy.
For some reason, Wordpress’s TinyMCE implementation had a massive blockquote button in its toolbar, which tempted editors into using it for things that weren’t really blockquotes.
<blockquote>Just an aside or some sort of emphatic note.</blockquote>
It definitely seems wrong to misuse the <blockquote> element, but some fairly respectable organisations do it, so maybe it’s not.
<aside>Just an aside or some sort of emphatic note.</aside>
Or, even better:
<details>
  <summary>Title</summary>
  Just an aside or some sort of emphatic note.
</details>
I have three fixes:
The other problem? ‘Proper’ blockquotes, with all the attributions baked in, are pretty syntax heavy. None of the ‘true’ blockquotes in this dataset use this syntax.
<figure>
  <blockquote cite="https://gutenberg.org/files/1289/1289-0.txt">
    <p>
      Then he went on. “I have no peace or rest for it. It calls to me, for many minutes together, in an agonised manner, ‘Below there! Look out! Look out!’ It stands waving to me. It rings my little bell—”
    </p>
  </blockquote>
  <figcaption>Charles Dickens, <cite>The Signalman</cite>, 1866</figcaption>
</figure>
With a bit of styling, that can work out nicely.
Well into Smartypants territory here, but here’s a limitation I hadn’t noticed. Smartypants will convert:
- As if he's bothered -- no one cares.
+ As if he’s bothered — no one cares.
That creates a ‘word word — word word’ situation, which you might see in books, where it looks good (though maybe it’s a bit try-hard on a screen).
There are two problems here. In this corpus, the input is almost never a double hyphen, and the desired output for British English is an N-dash, not an M-dash. So what I really want is:
- As if he's bothered - no one cares.
- As if he's bothered -- no one cares.
- As if he's bothered--no one cares.
+ As if he’s bothered – no one cares.
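A sketch of the dash rule on its own, leaving the apostrophes to Smartypants. The function name is an assumption. It only looks at spaces and tabs (not line breaks) around the hyphen so that Markdown list markers at the start of a line survive, and it would also convert a spaced subtraction like '5 - 3', which really wants a minus sign.

```typescript
// Hypothetical helper: turn spaced hyphens and double hyphens into a
// spaced en dash, the British English convention described above.
function fixDashes(text: string): string {
  return text
    // "word - word" and "word -- word" (same line only)
    .replace(/(\S)[ \t]+-{1,2}[ \t]+(?=\S)/g, "$1 – ")
    // "word--word" with no spaces at all
    .replace(/(\w)--(?=\w)/g, "$1 – ");
}
```

Ordinary hyphenated words such as `well-known` are not matched by either pattern.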
There are more coming.