
Parsing News: Where to Start

By Darrell Shifflett
Senior Editor, Linux-2000.org

Leeching news from other sites and turning it into HTML

Steps:

  1. First we will write a small bash script to get the news.
  2. Next we will write a Perl script to insert the news into MySQL.
  3. Finally, using PHP, we will make a simple, configurable table to display the news.

Step 1:

A simple bash script to get the news:

# --------------- // cut // ---------------
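#!/bin/sh
# Fetch the Slashdot headline file and feed it to the Perl importer.

# Stop here if ~/News is missing, so we never download into the wrong place.
cd "$HOME/News" || exit 1
rm -f ultramode.txt
wget http://slashdot.org/ultramode.txt || exit 1
"$HOME/newsbin/slashnews.pl" < ultramode.txt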

# --------------- // cut // ---------------

  • Name the above script 'slashnews'.
  • chmod 755 slashnews
  • mkdir News
  • mkdir newsbin
  • This script goes into your new ~/News directory.
  • Do all of this in your own home directory (/home/your_uid), ***NOT AS ROOT***.
  • The script can be run from cron, but no more often than once every 30 minutes.
  • slashdot.org clearly states this limit, and we don't want to abuse and lose the backend they are letting us use :-)
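
The setup steps above can be sketched as a few shell commands. This is only a sketch: it assumes you saved the fetch script as 'slashnews' in the current directory, and the hourly schedule in the crontab line is just one example that respects the 30-minute limit.

```shell
#!/bin/sh
# One-time setup sketch; run it as your normal user, not as root.
# Assumption: the fetch script was saved as ./slashnews.

# Create the two working directories under your home directory.
mkdir -p "$HOME/News" "$HOME/newsbin"

# Move the fetch script into ~/News and make it executable.
if [ -f slashnews ]; then
    mv slashnews "$HOME/News/slashnews"
    chmod 755 "$HOME/News/slashnews"
fi

# Example crontab entry (add it with 'crontab -e'). It runs at minute 0
# of every hour, which is well over 30 minutes between fetches:
# 0 * * * * $HOME/News/slashnews
```

The crontab line is left as a comment so that nothing gets scheduled until you add it yourself.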

Step 2:

Writing the slashnews.pl script. Simple? Well let's see :)

-------------------------// cut //----------------------------

#!/usr/bin/perl
#
# Read Slashdot's ultramode.txt on STDIN and load the stories into MySQL.

use strict;
use warnings;
use DBI;

# Open your MySQL connection.
my $dbh = DBI->connect("DBI:mysql:news:localhost", "user_id", "password")
    or die "$DBI::errstr\n";

# Empty the table so it only ever holds the current headlines.
$dbh->do("delete from slashdot") or die $dbh->errstr . "\n";

# Each story is a line of "%%" followed by these fields, one per line.
my @keys = ( "title", "link", "time", "author", "dept",
             "category",
             "numcomments", "storytype", "imagename" );

# The 0 fills the table's first column; the remaining columns follow @keys.
my $sql = "insert into slashdot values ( 0, "
        . join(", ", ("?") x @keys) . " )";
my $sth = $dbh->prepare($sql) or die $dbh->errstr . "\n";

my $haverecord = 0;

while (my $line = <STDIN>) {
    chomp $line;

    if ($haverecord) {
        my %record = ();

        # $line already holds the first field; read one more line for
        # each of the remaining fields.
        foreach my $key (@keys) {
            $record{$key} = $line;
            if ($key ne $keys[-1]) {
                $line = <STDIN>;
                $line = "" unless defined $line;   # file ended mid-record
                chomp $line;
            }
        }

        $haverecord = 0;

        print "Title: ", $record{title}, "\n";

        # Placeholders let DBI do the quoting for us.
        $sth->execute(@record{@keys});
    }
    elsif ($line =~ /^%%$/) {
        # A line containing only "%%" marks the start of the next story.
        $haverecord = 1;
    }
}

# Done, now let's clean up.
$dbh->disconnect;

exit 0;

# EOF

# <--//// script end ////-->

-------------------------// cut //----------------------------

  • Let's name the above script 'slashnews.pl'.
  • chmod 700 slashnews.pl
  • Since the password is in the script, we don't want the world reading it.
  • This goes in your new ~/newsbin directory.
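
Before pointing slashnews.pl at a real database, it can help to check that the downloaded file has the layout the Perl script expects: a line containing only "%%", then the title, then the link, and so on. The following is a minimal sketch of such a dry run. The function name list_titles is made up for this example, and it only prints the first two fields of each record.

```shell
#!/bin/sh
# Dry run: print "title -> link" for every story in an ultramode-style
# file, without touching MySQL. list_titles is a hypothetical helper name.
# It reads the files given as arguments, or stdin if there are none.

list_titles() {
    awk '
        /^%%$/  { n = 0; inrec = 1; next }   # "%%" starts a new record
        !inrec  { next }                     # skip text before the first "%%"
                { n++ }                      # count lines within the record
        n == 1  { title = $0 }               # field 1: title
        n == 2  { print title " -> " $0 }    # field 2: link
    ' "$@"
}

# Example:
#   list_titles "$HOME/News/ultramode.txt"
```

If the titles and links come out lined up correctly, the Perl script should see the same records in the same order.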