Using Perl and Regular Expressions to Process HTML Files - Part 1

April 11th, 2008

Like many web content authors, over the past few years I’ve had many occasions when I’ve needed to clean up a bunch of HTML files that have been generated by a word processor or publishing package. Initially, I used to clean up the files manually, opening each one in turn and making the same set of updates to each one. This works fine when you only have a few files to fix, but when you have hundreds or even thousands to do, you can very quickly be looking at weeks or even months of work. A few years ago someone put me on to the idea of using Perl and regular expressions to perform this ‘cleaning up’ process.

Why write an article about Perl and regular expressions, I hear you say? Well, that’s a good point. After all, the web is full of tutorials on Perl and regular expressions. What I found though, was that when I was trying to find out how I could process HTML files, it was difficult to find tutorials that met my criteria. I’m not saying they don’t exist, I just couldn’t find them. Sure, I could find tutorials that explained everything I needed to know about regular expressions, and I could find plenty of tutorials about how to program in Perl, and even how to use regular expressions within Perl scripts. What I couldn’t find though, was a tutorial that explained how to open one or more HTML or text files, make updates to those files using regular expressions, and then save and close the files.

The Goal

When converting documents into HTML the goal is always to make a seamless transition from the source document (for example, a word processor document) to HTML. The last thing you need is for your content authors to be spending hours, or even days, fixing untidy HTML code after it has been converted.

Many applications offer excellent tools for converting documents to HTML and, in combination with a well designed cascading style sheet (CSS), can often produce perfect results. Sometimes though, there are little bits of HTML code that are a bit messy, usually caused by authors not applying paragraph tags or styles correctly in the source document.

Why Perl?

The main reason Perl is such a good language to use for this task is that it is excellent at processing text files, which, let’s face it, is all HTML files are. Perl is also the de facto standard for the use of regular expressions, which you can use to search for, and replace or change, bits of text or code in a file.
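To make the idea concrete, here is a minimal sketch of that open, update, save and close cycle. It is not the script from part 2. The clean-up rule (removing empty paragraphs such as <p>&nbsp;</p>) and the names clean_html and process_file are assumptions chosen purely for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical clean-up rule: remove empty paragraphs such as <p>&nbsp;</p>.
sub clean_html {
    my ($html) = @_;
    $html =~ s{<p>(?:\s|&nbsp;)*</p>\s*}{}gi;
    return $html;
}

# Read a file, apply the clean-up, and write the result back in place.
sub process_file {
    my ($path) = @_;

    open(my $in, '<', $path) or die "Cannot open $path: $!";
    my $html = do { local $/; <$in> };    # read the whole file at once
    close($in);

    $html = clean_html($html);

    open(my $out, '>', $path) or die "Cannot write $path: $!";
    print {$out} $html;
    close($out) or die "Cannot close $path: $!";
}

# Process every file named on the command line.
process_file($_) for @ARGV;
```

Saved as, say, cleanup.pl, this could be run as perl cleanup.pl page1.html page2.html. Because it overwrites each file in place, try it on copies first.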

What is Perl?

Perl (Practical Extraction and Report Language) is a general purpose programming language, which means it can be used to do anything that any other programming language can do. Having said that, Perl is very good at doing certain things, and not so good at others. Although you could do it, you wouldn’t normally develop a user interface in Perl as it would be much easier to use a language like Visual Basic to do this. What Perl is really good at is processing text. This makes it a great choice for manipulating HTML files.

What is a Regular Expression?

A regular expression is a string that describes or matches a set of strings, according to certain syntax rules. Regular expressions are not unique to Perl - many languages, including JavaScript and PHP, can use them - but Perl handles them better than any other language.
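As a small illustration of one pattern matching a whole set of strings, the sketch below tests a few candidate strings against a single regular expression. The pattern (an opening heading tag, <h1> through <h6>) and the helper name matching are assumptions made up for this example.

```perl
use strict;
use warnings;

# Illustrative pattern: an opening heading tag, <h1> through <h6>,
# with optional attributes, in either letter case.
my $heading = qr/<h[1-6](?:\s[^>]*)?>/i;

# Return only the candidate strings that the pattern matches.
sub matching {
    my (@candidates) = @_;
    return grep { $_ =~ $heading } @candidates;
}

my @found = matching('<h1>', '<H3 class="title">', '<p>', '<h7>', '<hr>');
print "$_\n" for @found;
```

Here only the first two candidates are printed. The others are rejected because they are not one of the six heading tags the pattern describes.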

In part 2, we’ll look at our first example Perl script.

About the Author: John Dixon is a web developer and technical author.

Go to http://www.computernostalgia.net to read and submit articles and photos relating to the history of the computer.

Go to http://www.dixondevelopment.co.uk to find out more about John’s work.
