Yes, I’m finally gearing up for the release of my html2text.php successor, dubbed Markdownify. I’m testing extensively and use the MDTest suite to find potential regressions etc. I really enjoy writing little CLI scripts in PHP; it just works like a charm.
Here’s an example of what my test suite currently looks like:
To the left is the original input (HTML), in the middle is the generated Markdown, and to the right is HTML again, this time generated from that Markdown by Michel Fortin’s PHP Markdown. The pretty colors mark the differences between the two HTML versions. I use PEAR Text_Diff for this, plus a little of my own code. But since all of the existing Text_Diff engines took ages on the Markdown documentation (~400 lines, AFAIR), I wrote a Text_Diff engine which uses [shell_exec](http://www.php.net/shell_exec)() and GNU diff. This is blazingly fast and works like a charm! You can get the source code over at pastebin.org. Also take a look at the feature request I made. Dunno if that was the right place for it…
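For illustration, here is a minimal sketch of the general idea: write both texts to temporary files, hand them to the external `diff` program via `shell_exec()`, and read back its output. This is not the engine from the pastebin and it does not plug into Text_Diff; `gnu_diff()` and `diff_hunk_headers()` are hypothetical names made up for this example, and the sketch assumes a `diff` binary is available on the PATH.

```php
<?php
/**
 * Diff two strings by handing them to the external `diff` program.
 * Hypothetical helper for illustration; assumes `diff` is on the PATH.
 */
function gnu_diff($from, $to)
{
    $dir = sys_get_temp_dir();
    $fromFile = tempnam($dir, 'diff_from_');
    $toFile = tempnam($dir, 'diff_to_');
    if ($fromFile === false || $toFile === false) {
        throw new RuntimeException('Could not create temporary files');
    }

    file_put_contents($fromFile, $from);
    file_put_contents($toFile, $to);

    // escapeshellarg() protects against odd characters in the temp paths.
    $cmd = 'diff ' . escapeshellarg($fromFile) . ' ' . escapeshellarg($toFile);
    $output = shell_exec($cmd);

    unlink($fromFile);
    unlink($toFile);

    // shell_exec() does not return a string when the command prints
    // nothing (identical inputs) or fails, so normalize that to ''.
    return is_string($output) ? $output : '';
}

/**
 * Pull the change commands (e.g. "2c2", "5a6,7") out of diff's
 * default "normal" output format.
 */
function diff_hunk_headers($diffOutput)
{
    preg_match_all('/^\d+(?:,\d+)?[acd]\d+(?:,\d+)?$/m', $diffOutput, $m);
    return $m[0];
}
```

Calling `gnu_diff("a\nb\nc\n", "a\nx\nc\n")` should give a single change hunk headed `2c2`, and two identical inputs give an empty string. The speed comes from letting the external program do the comparison instead of running it in PHP; the price is a dependency on `diff` being installed and on `shell_exec()` not being disabled.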
I’ve just released a second Markdownify beta with better PHP 4 support and some other small bug fixes. You can download it from SourceForge.
I’ve finally completed the Markdownify website. I’ve also released the first beta; here is the news text from SourceForge:
This is the first beta release of Markdownify - the HTML to Markdown converter for PHP.
It is very stable and should handle nearly all features of the Markdown and Markdown Extra syntax. Only two things are missing:
- “Markdown inside block elements” for Markdownify Extra
These two things will be added before the first “stable” release. Additionally, some performance improvements will hopefully be added.
You are encouraged to use this release in your web applications. Please let me know if you find any bugs. Also a code review by anyone would be very much appreciated!
Download it now
A few days ago I started a complete rewrite of html2text. It now uses a new htmlparser (also written by me) which should make the whole HTML cleanup process obsolete. The generic XML parser which is currently used dies on invalid XHTML. With my parser it should be possible to handle errors and parse HTML 4.01 documents without any regex magic beforehand.
You’ll hear more about this in about a week, as I’ll be on vacation until the 24th.