Screen Scrape POST Form Results Pages

nbomike
Hello. I want to scrape pages from a site that generates pages from form inputs. However, the URL of the results page (the page I want to scrape) is masked and is always the same. It looks something like this for every form input combination:

http://www.scrapedsite.com/these/are/results.htm

Is it even possible to scrape the individual results pages? Thanks.
Aug 28 '07 #1
Atli
Hi.

I am not entirely sure what you mean by scrape.
(as far as my English knowledge is concerned, that word doesn't really fit in that context... but then again, what do I know :P)
Aug 29 '07 #2
pbmods
Heya, Mike.

Are you trying to create a screenshot of a web page?

The URL http://www.scrapedsite.com/these/are/results.htm is most likely either being rewritten server-side, or else results.htm relies on _SESSION or _POST variables to produce the content.

You may have some success creating an HTTP stream context or using cURL to simulate a POST request, but if the site uses session variables, you may be out of luck.
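If a plain POST to the results URL turns out to be enough, a minimal sketch of the stream-context route might look like this (the field names are just placeholders, not the real form's fields):

$postData = http_build_query(array(
    'make' => 'Honda',    // hypothetical field names
    'year' => '2007',
));

$context = stream_context_create(array(
    'http' => array(
        'method'  => 'POST',
        'header'  => "Content-Type: application/x-www-form-urlencoded\r\n",
        'content' => $postData,
    ),
));

// Returns the response body as a string, or false on failure.
$html = file_get_contents('http://www.scrapedsite.com/these/are/results.htm', false, $context);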
Aug 29 '07 #3
nbomike
[Quoting pbmods' reply in #3.]
Thanks for the responses everyone. Once again, Atli is trying to be helpful. What I mean by scrape is that I want to grab the content on a webpage (the HTML), so that I can parse it and use the data somehow.

I want to grab the data from the results pages after using this form.

It's actually a series of forms (you click next to get to the next drop down list), so I'm thinking that they're using session variables. From what you said, it sounds like it's not possible to grab the exact URL. :/
Aug 29 '07 #4
Atli
Ahh ok. Now I see :)

I've looked at that form, and it looks like they are using sessions, or some other server-side mechanism, to pass the data between the forms. Unless it's hidden somewhere in the sizable amount of JavaScript.

Then again, this is an ASP.NET server, so I could be overlooking something.
Aug 29 '07 #5
nbomike
[Quoting Atli's reply in #5.]
Atli, so do you agree that there's no possibility of scraping the site?
Aug 29 '07 #6
Atli
[Quoting nbomike's question in #6.]
Yea I would have to agree. Sorry :/
Aug 30 '07 #7
nbomike
[Quoting pbmods' reply in #3.]
Just out of curiosity, why do you say that using session variables prevents me from scraping the site? Is it not possible to get session information for an external site?

[edit] Never mind, it makes sense to me now. I guess you can't, since that's all handled in the server-side script.
Aug 30 '07 #8
pbmods
Heya, Mike.

You could in theory send out a series of requests using cURL that simulate successive form submissions. You'd also have to handle the session cookie, which I'm not entirely sure how to go about, but I'm sure it's possible.

In theory, you should be able to simulate the set of HTTP requests that would ultimately result in receiving the output you are looking for.
Aug 30 '07 #9
nbomike
[Quoting pbmods' reply in #9.]
Thanks, pbmods. That gives me some hope. But wouldn't I have to know the name of the session variable that's being used? Is it possible to find this information without the source code?

I guess that's the next question. I'll do some research, but does anybody have any idea how to handle session information in this situation?
Aug 30 '07 #10
pbmods
Heya, Mike.

You wouldn't need to know the name of any session variables any more than you would if you visited the site in your web browser.

What I'm proposing would be (from the server's point of view) as if you visited the site using a browser. The only real difference is that your code would just be sending POST requests, and it would do nothing with the HTML that was returned to it until the very end.

For example, suppose there's a three-form process. Your script would send all the POST data to the remote server as if you had submitted the first form from a browser. Now if you were using a web browser, you would then see the second form. However, your code would just skip all that because you would program it to automatically know what goes in the second form.

Next, your script would send another POST request to the remote server as if you had submitted the second form. Making sure the session cookie gets passed along is important, because the remote server needs to treat both requests as if they came from the same 'browser'.

And finally, your script would send the third and final POST request as if you had submitted the third form in a browser. This time, instead of ignoring the response from the remote server, you would grab the data and process it as your scrape.

[EDIT: I suppose if you wanted to have some fun, you could write an HTML parser that could extract form inputs from the HTML that you get back after sending each POST request and dynamically build the next set of POST variables. While you're at it, you might as well just ask the owner of the remote server to set up a SOAP server :P]
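To make the idea concrete, here's a rough sketch of that three-step flow. The URLs, field names, and the post_form() helper are all made up for illustration; the point is that every step shares the same cookie jar so the remote server sees one continuous session:

function post_form($url, array $fields, $cookieJar)
{
    $curl = curl_init($url);
    curl_setopt_array($curl, array(
        CURLOPT_RETURNTRANSFER => true,
        CURLOPT_POST           => true,
        CURLOPT_POSTFIELDS     => http_build_query($fields),
        CURLOPT_COOKIEJAR      => $cookieJar,   // same jar for every step keeps the session alive
        CURLOPT_COOKIEFILE     => $cookieJar,
    ));
    $body = curl_exec($curl);
    curl_close($curl);
    return $body;
}

$jar = tempnam(sys_get_temp_dir(), 'jar');

// Steps 1 and 2: submit the first two forms and ignore the HTML that comes back.
post_form('http://www.scrapedsite.com/form/step1', array('year' => '2007'), $jar);
post_form('http://www.scrapedsite.com/form/step2', array('make' => 'Honda'), $jar);

// Step 3: this response is the results page you actually want to parse.
$resultsHtml = post_form('http://www.scrapedsite.com/form/step3', array('model' => 'Civic'), $jar);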
Aug 30 '07 #11
nbomike
[Quoting pbmods' reply in #11.]
Thanks once again for the input, pbmods. But I was wondering if I would really need the session data for what I'm trying to do. I'm basically trying to scrape every results page in that web app for cars sold in the US, which means I'd be hitting most of the drop-down combinations anyway. Since I'd have to cycle through each option list regardless, maybe I should just scrape the lists after each form submission, store them in an array, and then move on to the next list by posting to the static URL I see in the address bar when I click "Next". In that case, I wouldn't automatically post each form one after another. The pseudocode would look something like this:

for each make-year combination {
    post form
    if drop-down on next page exists {
        scrape all models from drop-down list
        for each model {
            post form
            ...
        }
    }
}

Sorry if that's difficult to read. What do you think?
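In PHP, I'm imagining the "scrape all models from drop-down list" step would look something along these lines (the select element's name, 'model', is just a guess, and warnings are suppressed because real-world markup is rarely valid):

function extract_options($html, $selectName)
{
    $doc = new DOMDocument();
    @$doc->loadHTML($html);

    $options = array();
    foreach ($doc->getElementsByTagName('select') as $select) {
        if ($select->getAttribute('name') !== $selectName) {
            continue;
        }
        foreach ($select->getElementsByTagName('option') as $option) {
            $options[] = $option->getAttribute('value');
        }
    }
    return $options;
}

// e.g. $models = extract_options($secondFormHtml, 'model');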
Aug 30 '07 #12
pbmods
I think you should ask the owner of the site you're scraping to integrate a SOAP server into his site's functionality :P

It sounds like you and I are on the same page. What you are describing more or less takes what I suggested a step further by actually processing the results of submitting the form; that is, when you simulate a submission of the first form, you are then processing the HTML of the second form that comes back and then making multiple submissions, one for each option in the select box.

One thing I would caution you on if you decide to do this:
- If the first form has 25 options and the second form has 10, that means you are sending *250* requests to that site *every time* somebody visits your page.
- If you get even a modest 10 hits per minute, that's *2,500* requests to the other site every minute, above and beyond their own traffic.

Aside from the fact that your users would have to wait for the other site to send 250 responses back to your server before their page loads, that's a pretty heavy strain on their server, which they have to pay for AND they're not getting any advertising revenue from!

My advice would be to either create a script that does this *once* and then caches all of the results (which would make your script run faster after the first execution anyway), or else work with the owner of the other site to find a way to solve this problem without putting a huge strain on his server and making your page take too long to load.
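If it helps, a bare-bones sketch of the run-once-and-cache idea; the cache path is arbitrary and scrape_everything() stands in for whatever actually does the scraping:

$cacheFile = dirname(__FILE__) . '/scrape_cache.dat';
$maxAge    = 365 * 24 * 60 * 60;    // one year, in seconds

if (is_file($cacheFile) && (time() - filemtime($cacheFile)) < $maxAge) {
    $results = unserialize(file_get_contents($cacheFile));   // cheap: reuse the cached scrape
} else {
    $results = scrape_everything();                          // expensive: only happens rarely
    file_put_contents($cacheFile, serialize($results));
}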
Aug 30 '07 #13
nbomike
[Quoting pbmods' reply in #13.]
Thanks for the thoughts. I think in my case, the issue you brought up won't be a problem, because I want to run this once a year. The results of the scrape are going to be inserted into MySQL tables once I parse the data. Basically, I'm going to store the results of the scrape on our site.
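For what it's worth, I'm picturing the storage step roughly like this (the DSN, credentials, table and column names are all invented):

$pdo  = new PDO('mysql:host=localhost;dbname=cars', 'db_user', 'db_pass');
$stmt = $pdo->prepare('INSERT INTO scraped_cars (year, make, model) VALUES (?, ?, ?)');

foreach ($parsedRows as $row) {          // $parsedRows is whatever the parsing step produced
    $stmt->execute(array($row['year'], $row['make'], $row['model']));
}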
Aug 31 '07 #14
pbmods
Ah. Well then.

You might find this to be of some use:
$data = http_build_query(array('var' => 'value', 'var2' => 'value2', 'you' => 'get the idea'));
$curl = curl_init('http://url.goes/right.here');

curl_setopt_array($curl, array(
    CURLOPT_HEADER         => true,
    CURLOPT_RETURNTRANSFER => true,
    CURLOPT_SSL_VERIFYPEER => 0,    // Skip SSL certificate checks
    CURLOPT_SSL_VERIFYHOST => 0,    //     (fine for testing, not for production).
    CURLOPT_HTTPHEADER     => array(
        'Content-type: application/x-www-form-urlencoded; charset=utf-8',
        'Content-length: ' . strlen($data)
    ),
    CURLOPT_POST           => true,
    CURLOPT_POSTFIELDS     => $data
));

$response = curl_exec($curl);

// CURLOPT_HEADER means headers and body come back together; split on the blank line between them.
$split = preg_split('#^\s+$#m', $response);

// cURL closes itself when the connection is not persistent.
return $split[1];
Not sure how to deal with session cookies... or maybe cURL takes care of that automatically...?
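A follow-up on the cookies: as far as I know, PHP's cURL doesn't persist them between separate handles on its own; the usual approach is to point both cookie options at the same file. A small sketch, with an arbitrary jar path:

$jar  = '/tmp/scrape_cookies.txt';              // arbitrary path for the cookie jar
$curl = curl_init('http://url.goes/right.here');
curl_setopt_array($curl, array(
    CURLOPT_RETURNTRANSFER => true,
    CURLOPT_COOKIEJAR      => $jar,             // cookies are written here when the handle closes
    CURLOPT_COOKIEFILE     => $jar,             // and sent back on later requests
));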
Aug 31 '07 #15
pbmods
This is a slightly more useful version that doesn't force itself to do more work than necessary (notice the absence of CURLOPT_HEADER):

$data = http_build_query(array('var' => 'value', 'var2' => 'value2', 'you' => 'get the idea'));
$curl = curl_init('http://url.goes/right.here');

curl_setopt_array($curl, array(
    CURLOPT_RETURNTRANSFER => true,
    CURLOPT_SSL_VERIFYPEER => 0,    // Skip SSL certificate checks
    CURLOPT_SSL_VERIFYHOST => 0,    //     (fine for testing, not for production).
    CURLOPT_HTTPHEADER     => array(
        'Content-type: application/x-www-form-urlencoded; charset=utf-8',
        'Content-length: ' . strlen($data)
    ),
    CURLOPT_POST           => true,
    CURLOPT_POSTFIELDS     => $data
));

$response = curl_exec($curl);
curl_close($curl);

return $response;
Oct 21 '07 #16
