Using regular expressions to extract content

PHP provides a number of really neat regular expression functions. You can find the list of regex functions on the PHP site.

But the one that I’ve had most fun with is the preg_match_all() function which I’ve been using to do content extraction from an HTML page.

[Image: Baby Blocks]

I’m not going to explain what regular expressions (regex) are in this post. There are whole books on just this one topic alone; I would be crazy to think I can explain it all in just a few paragraphs. But in order to use the regex functions, you need to have a basic understanding of regular expressions.

If you think back to your childhood days, you may remember a toy where you match blocks to the holes with the corresponding shapes, like the one in the picture here. Well, regular expressions are very much like that toy, except that you have to define your own ‘shape’ (or pattern, as it’s known) and apply your content to it. Any text that matches the pattern will ‘fall’ through it.

Let’s say you have a block of text like the one below and you want to extract all the links from it. You can use preg_match_all() to do just that.

$content = "He's goin' everywhere, 
<a href=\"http://www.bjmckay.com\">B.J. McKay</a> and his 
best friend Bear. Rollin' down to 
<a href=\"http://www.dallas.net\">Dallas</a>, who's providin' 
my palace, off to New Orleans or who knows where.";

The pattern you want to look for is the link anchor pattern, like <a href="(something)">(something)</a>. The actual regular expression might look something like this:

$regex_pattern = "/<a href=\"(.*)\">(.*)<\/a>/";
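
One thing worth knowing about this pattern: (.*) is greedy, so it grabs as much text as it can. That is harmless for the sample text above, because each link sits on its own line and the dot does not match a newline by default, but it can misfire when two links share a line. The sketch below is only an illustration, assuming a made-up one-line string ($one_line is not part of the original example) and a lazy variant of the pattern that uses (.*?) instead.

```php
<?php
// Hypothetical input: the same two links, but on a single line.
$one_line = '<a href="http://www.bjmckay.com">B.J. McKay</a> and '
          . '<a href="http://www.dallas.net">Dallas</a>';

// Greedy version (the pattern from the post): (.*) runs on to the
// last "> and the last </a> it can find on the line.
preg_match_all("/<a href=\"(.*)\">(.*)<\/a>/", $one_line, $greedy);

// Lazy version (an assumed alternative): (.*?) stops at the first
// "> and the first </a>, so each link is matched separately.
preg_match_all("/<a href=\"(.*?)\">(.*?)<\/a>/", $one_line, $lazy);

echo count($greedy[0]), "\n"; // prints 1: both links swallowed as one match
echo count($lazy[0]), "\n";   // prints 2: one match per link
print_r($lazy[1]);            // the two URLs
```

The lazy version is still a simple sketch; like the original, it assumes href is the first attribute in the tag.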

Once you have your pattern, you pass $regex_pattern and $content to preg_match_all() like this:

preg_match_all($regex_pattern,$content,$matches);
print_r($matches);

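preg_match_all() stores all the matches in the array $matches. Index 0 holds the full matches, while indexes 1 and 2 hold the text captured by the first and second pair of parentheses in the pattern. If you output the array, you’ll see something like this: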

Array
(
    [0] => Array
        (
            [0] => <a href="http://www.bjmckay.com">B.J. McKay</a>
            [1] => <a href="http://www.dallas.net">Dallas</a>
        )
 
    [1] => Array
        (
            [0] => http://www.bjmckay.com
            [1] => http://www.dallas.net
        )
 
    [2] => Array
        (
            [0] => B.J. McKay
            [1] => Dallas
        )
)

From this array, $matches, you should be able to loop through and get the information you need.
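
As a rough sketch of that loop, using the sample text and pattern from above: the same index in $matches[1] and $matches[2] belongs to the same link, so you can walk one array and look up the other. The $links array and its 'url' and 'text' keys are names made up for this example.

```php
<?php
// Sample text and pattern as given in the post.
$content = "He's goin' everywhere,
<a href=\"http://www.bjmckay.com\">B.J. McKay</a> and his
best friend Bear. Rollin' down to
<a href=\"http://www.dallas.net\">Dallas</a>, who's providin'
my palace, off to New Orleans or who knows where.";
$regex_pattern = "/<a href=\"(.*)\">(.*)<\/a>/";

preg_match_all($regex_pattern, $content, $matches);

// $matches[1] holds the URLs and $matches[2] the link text.
// Pair them up by index into a hypothetical $links array.
$links = array();
foreach ($matches[1] as $i => $url) {
    $links[] = array('url' => $url, 'text' => $matches[2][$i]);
}

// Print one "text => url" line per link.
foreach ($links as $link) {
    echo $link['text'], ' => ', $link['url'], "\n";
}
```

If you would rather get one sub-array per link, preg_match_all() also accepts the PREG_SET_ORDER flag as a fourth argument, which groups the results by match instead of by capture group.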

I hope this has been useful to you. I know it doesn’t cover all the things this function can do, but for first-timers, it should be a simple look at a very powerful PHP function.

Incidentally, PHP also provides the function preg_match(). The difference is that preg_match() stops after the first match of the pattern, whereas preg_match_all() tries to find all matching instances within the content.
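
To illustrate that difference, here is a small sketch that runs both functions with the same pattern on a shortened, made-up version of the sample text. The variable names $first and $all are just for this example.

```php
<?php
// Shortened sample: two links, each on its own line.
$content = "<a href=\"http://www.bjmckay.com\">B.J. McKay</a>\n"
         . "<a href=\"http://www.dallas.net\">Dallas</a>";
$regex_pattern = "/<a href=\"(.*)\">(.*)<\/a>/";

// preg_match() stops at the first match and fills a flat array:
// [0] is the full match, [1] and [2] are the captured groups.
$found_one = preg_match($regex_pattern, $content, $first);

// preg_match_all() keeps going and returns the number of full matches.
$found_all = preg_match_all($regex_pattern, $content, $all);

echo $found_one, "\n"; // prints 1
echo $first[1], "\n";  // prints http://www.bjmckay.com
echo $found_all, "\n"; // prints 2
```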

Posted on April 22, 2008 at 7:03 pm by Eldee
In: PHP Tutorials

20 Responses


  1. Written by Dave Doyle on April 23, 2008 at 9:17 pm

    While I agree that regular expressions are powerful, is this such a great idea? This presupposes that the first attribute in an anchor tag is the href… which may not be the case ( it could be a class, or some industrious web designer put some onClick or style tags first ).

    I’m actually a Perl guy and the commonly accepted practice is that it’s bad to use a regex to parse HTML. I see there’s a PEAR package called XML_HTMLSax. Wouldn’t that be more appropriate?

    I know it seems a bit much to use a parser just to extract links, but you’re guaranteed to get what you expect.

  2. Written by webmaster on April 24, 2008 at 12:34 pm

    Hi Dave,

    Thanks for your comment. Agreed that it’s difficult to build a regex to parse all the different ways HTML can be written.

    This post is really more about using the regex function in PHP. The parsing of links is really just an example to show how it can be done.

    Thanks for letting me know about the XML_HTMLSax package. I didn’t know there’s a package like that. Learn something new every day! :)

  3. Written by Thomas Brown on May 14, 2008 at 11:22 pm

    It would be more practical to sit down and actually think of the regular expression you would need before preaching how good they are. You can thank me later but I threw this together for you and as far as I can tell it works to your needs.

    ((?#full link)\x3C/?a ?.*?href="((?#address).*?)".*?\x3E ?/?((?#text).*?)\x3c/?a\x3e)

    Enjoy.

    Ps. Never underestimate the power of grep Dave.

  4. Written by Thomas Brown on May 14, 2008 at 11:40 pm

    Another favour for the perl guy:

    #!perl

    ######################################
    ###### Link Shitter ######
    ###### Thomas Brown ######
    ######################################

    # This program takes an XML file as its input and
    # shits out all the links referenced in the page.

    use strict;
    use warnings;

    # Open file
    open (IN, $ARGV[0]) or die $!;

    my $doc = <IN>;

    # Find em and shit em
    my $line;

    while(<IN>){
    $line = $_;
    $line =~ s#(\x3C/?a ?.*?href="(.*?)".*?\x3E ?/?(.*?)\x3c/?a\x3e)##g;
    print $2 . "\n";
    }

    close IN;

  5. Written by Thomas Brown on May 14, 2008 at 11:47 pm

    It ate my IN’s haha

  6. Written by Thomas Brown on May 22, 2008 at 8:44 pm

    One other thing I should have told you guys about when I posted is the WWW::Mechanize library. It’s for Perl, but basically you can rip all the links in a page into an array, something like this:

    my $mech = WWW::Mechanize->new();
    $mech->get("http://www.askaboutphp.com/");
    my @links = $mech->links;

    Although this thread is about regex’s, just thought you might be interested based on the topic of this post.

  7. Written by Joe on January 2, 2009 at 8:53 pm

    When I get an RSS feed and display it on my page, I would like to extract the page contents when the user clicks on one of the links and display them on my own page. Any ideas on how to extract page contents given the target URL?
    Thanks in advance

  8. Written by webmaster on January 5, 2009 at 11:25 am

    Hi Joe,
    Take a look at the cURL library in PHP. It’s what I use to extract content from a page.
    http://www.php.net/manual/en/book.curl.php

  9. Written by Eric Dixon on May 14, 2009 at 7:51 pm

    This expression appears (to my relatively untutored eye) to be greedy – it matches everything from the first <a href=” to the last , and makes quite a mess. Works fine when you’ve only got one link in your text but is borked for multiple links.

    I replaced it, eventually, with /[^<]+/ instead which only provides the single <a href=”…. entry in a one-dimensional array but it does at least handle multiple links in one content string…

  10. Written by Eric Dixon on May 14, 2009 at 7:52 pm

    Argh, tags got processed. You might want to stop your comment posting from doing that so people can show code more easily.

  11. Written by webmaster on May 18, 2009 at 10:31 am

    hi Eric, sorry about that. Erm…, how do i stop that from happening in wordpress?
    thanks

  12. Written by Tim on May 28, 2009 at 5:47 am

    I am having a major problem, with my web site. I edited a php file, and now that page will not load. The rest of the site works fine, just that page. I have backup of those php files, so i reloaded that page, still does not work. Is there other files that are capable of corrupting when updating a php file.

    Thanks so much

    Tim

  13. Written by Tim on May 28, 2009 at 6:46 am

    thanks

    but i figured it out

  14. Written by nimtronican on June 23, 2009 at 5:29 pm

    Thanks a lot man… it was really helpful!

  15. Written by Denisa on September 4, 2009 at 6:54 pm

    Hi! I have a question… How can I extract only the links which have a specific title? Is it possible to do this? I want to obtain the link from this
    Next »
    so, only the a href with the title= “Go to Next Page”.
    Thank very much! Maybe you can help me.
    Denisa

  16. Written by Fahad on January 4, 2010 at 10:43 am

    Thank you man that helped me a lot

  17. Written by Mustafa Rampurawala on June 5, 2011 at 3:36 am

    well this looks to be quite old post, but just in case some one like me comes here searching for a preg_match_all code for extracting links, here is what i finally came up with.

    $regex = '#((?:(?!).)*)#i';

    $string = ""; //get your data from anywhere and any how you like

    preg_match_all($regex,$string,$matches);
    print_r($matches);

    This code properly extracts all the links, with the text between the tag.

    Hope its helpful to some one like me :)

    By the way thanks a lot to you all guys, this post was also a climbing step for me to achieve my goal.

  18. Written by Mustafa Rampurawala on June 5, 2011 at 3:40 am

    sorry the code is just messed up :( will try to post again with pre tags, if it works

    $regex = '#((?:(?!).)*)#i';
    
    $string = ''; // however you like get your string for input
    
    preg_match_all($regex,$string,$matches);
    print_r($matches);
    
  19. Written by Mustafa Rampurawala on June 5, 2011 at 3:42 am

    $regex = '#((?:(?!).)*)#i';

    hope this works,

  20. Written by Mustafa Rampurawala on June 5, 2011 at 3:44 am

    nope? unfortunately can’t get it to show the correct code, but in this code I have also taken care of the case where the href attribute is not right after the a tag; the code I have tested works great.

    Hope webmaster can fix this so that others may benefit from it
