Strip quoted text from HTML emails

The context

For a project I’m currently working on, I needed to build a timeline like this one:

MicroFactures client messages

The idea: we fetch the user’s emails (sent and received) and display them in a tidy timeline (how the mail gets piped to my script may be covered later, in another post…).

The problem: when we show emails that are replies to another message, they contain the quoted text, which clutters the timeline.

So I started googling for ways to remove the quoted text, thinking it would be very easy and that some RFC would define exactly how quoted text should be identified.

I discovered two things:

  1. The only convention about HTML emails is this W3C Note which, apparently, is not really respected by everyone (anyone?).
  2. This is such a common problem that Google filed a patent for it.

So, after a lot of searching, I decided to write my own PHP solution which, for now, covers all my needs and might save you some time.

The solution

A script which handles the main email providers and clients: Gmail, Outlook, Yahoo, OS X Mail, Thunderbird and Roundcube… I know this is not an exhaustive list, but I will add to it regularly 😉

Let’s define a function which accepts a message body string as its argument:

<?php function strip_quotes_from_message($message_body) ?>

Each provider/client has its own way of marking up quoted text (no, really???), so we start by listing the elements they use:

<?php
$els_to_remove = [
  'blockquote', // Standard quote block tag
  'div.moz-cite-prefix', // Thunderbird
  'div.gmail_extra', 'div.gmail_quote', // Gmail
  'div.yahoo_quoted' // Yahoo
];
?>

Then we use an HTML parser to detect and delete them:
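Here is a minimal sketch of that step. It is not the code from the full script: it assumes PHP's built-in DOMDocument and DOMXPath instead of a third-party parser, it only understands `tag` and `tag.class` selectors, and `strip_known_quote_elements()` is a name made up for this example.

```php
<?php
// Sketch only: built on PHP's bundled DOM extension, not on the parser used in this post.
// strip_known_quote_elements() is a hypothetical helper name.
function strip_known_quote_elements(string $message_body, array $els_to_remove): string
{
    $doc = new DOMDocument();
    $previous = libxml_use_internal_errors(true); // email HTML is rarely valid markup
    // The meta tag makes libxml read the fragment as UTF-8.
    $doc->loadHTML(
        '<html><head><meta http-equiv="Content-Type" content="text/html; charset=utf-8"></head><body>'
        . $message_body . '</body></html>'
    );
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    foreach ($els_to_remove as $selector) {
        // Split "div.gmail_quote" into tag and class; the class part is optional.
        [$tag, $class] = array_pad(explode('.', $selector, 2), 2, null);
        $query = '//' . $tag;
        if ($class !== null) {
            // Match the class as a whole word inside the class attribute.
            $query .= '[contains(concat(" ", normalize-space(@class), " "), " ' . $class . ' ")]';
        }
        // Copy the matches to an array so removals cannot disturb the loop.
        foreach (iterator_to_array($xpath->query($query)) as $node) {
            if ($node->parentNode !== null) {
                $node->parentNode->removeChild($node);
            }
        }
    }

    // Serialise what is left inside <body>.
    $html = '';
    foreach ($doc->getElementsByTagName('body')->item(0)->childNodes as $child) {
        $html .= $doc->saveHTML($child);
    }
    return $html;
}

$els_to_remove = ['blockquote', 'div.moz-cite-prefix', 'div.gmail_extra', 'div.gmail_quote', 'div.yahoo_quoted'];
$reply = '<p>Thanks!</p><div class="gmail_quote">On Monday, Bob wrote:<blockquote>Hi</blockquote></div>';
echo strip_known_quote_elements($reply, $els_to_remove);
```

Matching the class as a whole word means something like `not_gmail_quote` is left alone, which a plain substring test on the attribute would get wrong.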

Now let’s handle Outlook:
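Outlook does not wrap the quoted message in a dedicated element, so one possible heuristic is to find the reply header and drop everything after it. The sketch below is an assumption on my side, not the code from the full script: it uses PHP's built-in DOMDocument, and the markers (`div#divRplyFwdMsg`, and the `border-top:solid #B5C4DF` / `#E1E1E1` separator above the From:/Sent: block) are guesses based on commonly seen Outlook markup. `strip_outlook_quote()` is a hypothetical name.

```php
<?php
// Sketch only: the markers below are assumptions about typical Outlook reply markup.
// strip_outlook_quote() is a hypothetical helper name.
function strip_outlook_quote(string $message_body): string
{
    $doc = new DOMDocument();
    $previous = libxml_use_internal_errors(true); // email HTML is rarely valid markup
    $doc->loadHTML(
        '<html><head><meta http-equiv="Content-Type" content="text/html; charset=utf-8"></head><body>'
        . $message_body . '</body></html>'
    );
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    // translate(@style, " ", "") strips spaces so "solid #B5C4DF" and "solid#B5C4DF" both match.
    $marker = $xpath->query(
        '//div[@id="divRplyFwdMsg"]'
        . ' | //div[contains(translate(@style, " ", ""), "border-top:solid#B5C4DF")]'
        . ' | //div[contains(translate(@style, " ", ""), "border-top:solid#E1E1E1")]'
    )->item(0); // first marker in document order, or null

    if ($marker !== null) {
        // Everything after the marker in document order is treated as quoted text.
        $quoted = iterator_to_array($xpath->query('following::node()', $marker));
        $quoted[] = $marker;
        foreach ($quoted as $node) {
            if ($node->parentNode !== null) {
                $node->parentNode->removeChild($node);
            }
        }
    }

    // Serialise what is left inside <body>.
    $html = '';
    foreach ($doc->getElementsByTagName('body')->item(0)->childNodes as $child) {
        $html .= $doc->saveHTML($child);
    }
    return $html;
}

$reply = '<div class="WordSection1"><p>My reply</p>'
    . '<div><div style="border:none;border-top:solid #B5C4DF 1.0pt;padding:3.0pt 0cm 0cm 0cm">'
    . '<p><b>From:</b> Bob</p></div></div><p>Old message</p></div>';
echo strip_outlook_quote($reply);
```

The wrapper that contained the marker stays behind as an empty `<div>`, and a message without any marker is returned unchanged.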

And now Roundcube:
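Roundcube puts the quoted message in a `blockquote`, which the generic step already removes, but it leaves an introduction line behind. A minimal sketch for removing that line, assuming it is a paragraph of the form "Le <date>, <sender> a écrit :" (my reading of the French locale, not something taken from the full script). It uses PHP's built-in DOMDocument, and `strip_reply_header()` is a hypothetical name.

```php
<?php
// Sketch only: the header format is an assumption about Roundcube's French locale.
// strip_reply_header() is a hypothetical helper name.
function strip_reply_header(
    string $message_body,
    string $pattern = '/^\s*Le .{1,200} a écrit[\s\x{00A0}]*:\s*$/su'
): string {
    $doc = new DOMDocument();
    $previous = libxml_use_internal_errors(true); // email HTML is rarely valid markup
    // The meta tag makes libxml read the fragment as UTF-8, which the /u pattern needs.
    $doc->loadHTML(
        '<html><head><meta http-equiv="Content-Type" content="text/html; charset=utf-8"></head><body>'
        . $message_body . '</body></html>'
    );
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    // Only innermost <p>/<div> blocks are tested, so a wrapper holding real content is never removed.
    $blocks = $xpath->query('//*[self::p or self::div][not(.//p or .//div)]');
    foreach (iterator_to_array($blocks) as $block) {
        if (preg_match($pattern, $block->textContent) === 1) {
            $block->parentNode->removeChild($block);
        }
    }

    // Serialise what is left inside <body>.
    $html = '';
    foreach ($doc->getElementsByTagName('body')->item(0)->childNodes as $child) {
        $html .= $doc->saveHTML($child);
    }
    return $html;
}

// French header (default pattern).
echo strip_reply_header('<p>Merci !</p><p>Le 2015-03-10 14:22, John Doe a écrit :</p>');
// English header: pass another pattern.
echo strip_reply_header(
    '<p>Thanks</p><p>On 2015-03-10 14:22, John Doe wrote:</p>',
    '/^\s*On .{1,200} wrote\s*:\s*$/su'
);
```

Because the pattern is anchored at both ends, a paragraph that merely starts with "Le" is kept.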

Note that we handle the French version of the Roundcube quote header, but this can easily be adapted to any other language.

Removing the quoted parts sometimes leaves empty tags behind… let’s remove them:
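A minimal sketch of this cleanup, again assuming PHP's built-in DOMDocument and DOMXPath rather than the parser used in the full script. `strip_empty_elements()` is a hypothetical name, and the choice to keep `br`, `hr` and `img` is mine.

```php
<?php
// Sketch only: strip_empty_elements() is a hypothetical helper name.
function strip_empty_elements(string $message_body): string
{
    $doc = new DOMDocument();
    $previous = libxml_use_internal_errors(true); // email HTML is rarely valid markup
    $doc->loadHTML(
        '<html><head><meta http-equiv="Content-Type" content="text/html; charset=utf-8"></head><body>'
        . $message_body . '</body></html>'
    );
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    // An element is "empty" when it has no child element and only whitespace text.
    // br, hr and img are meaningful without content, so they are kept.
    // normalize-space() does not strip non-breaking spaces, so <p>&nbsp;</p> is kept too.
    $query = '//body//*[not(self::br or self::hr or self::img)]'
        . '[not(*) and normalize-space(.) = ""]';

    // Removing a node can leave its parent empty, so repeat until nothing matches.
    do {
        $empty = iterator_to_array($xpath->query($query));
        foreach ($empty as $node) {
            if ($node->parentNode !== null) {
                $node->parentNode->removeChild($node);
            }
        }
    } while (count($empty) > 0);

    // Serialise what is left inside <body>.
    $html = '';
    foreach ($doc->getElementsByTagName('body')->item(0)->childNodes as $child) {
        $html .= $doc->saveHTML($child);
    }
    return $html;
}

echo strip_empty_elements('<div class="WordSection1"><p>My reply</p><div><span> </span></div></div><p></p>');
```

The loop matters for nested leftovers: the `<span>` goes in the first pass, and the `<div>` that held it only becomes empty, and removable, in the second.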

We can now return the cleaned message body:

<?php return $dom->root->innerHtml(); ?>

The complete code is available here.

Conclusion

I know that trying to handle every variant of quoted text is not the best solution, but for now it gives me pretty good results.

Do not hesitate to comment or share your solution…

