Screen Scrape POST Form Results Pages

nbomike
Hello. I want to scrape pages from a site that generates pages from form inputs. However, the URL of the results page (the page I want to scrape) is masked and is always the same. It looks something like this for every form input combination:

http://www.scrapedsite.com/these/are/results.htm

Is it even possible to scrape the individual results pages? Thanks.
Aug 28 '07 #1
Atli
Hi.

I am not entirely sure what you mean by scrape.
(as far as my English knowledge is concerned, that word doesn't really fit in that context... but then again, what do I know :P)
Aug 29 '07 #2
pbmods
Heya, Mike.

Are you trying to create a screenshot of a web page?

The URL http://www.scrapedsite.com/these/are/results.htm is most likely either being rewritten server-side, or else results.htm relies on _SESSION or _POST variables to produce the content.

You may have some success creating an HTTP stream context or using cURL to simulate a POST request, but if the site uses session variables, you may be out of luck.
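If a plain POST to the results URL turns out to be enough, a minimal sketch of the stream-context route might look like this (the field names are just placeholders, not the real form's fields):

$postData = http_build_query(array(
    'make' => 'Honda',    // hypothetical field names
    'year' => '2007',
));

$context = stream_context_create(array(
    'http' => array(
        'method'  => 'POST',
        'header'  => "Content-Type: application/x-www-form-urlencoded\r\n",
        'content' => $postData,
    ),
));

// Returns the response body as a string, or false on failure.
$html = file_get_contents('http://www.scrapedsite.com/these/are/results.htm', false, $context);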
Aug 29 '07 #3
nbomike
[Quoting pbmods' reply in #3.]
Thanks for the responses everyone. Once again, Atli is trying to be helpful. What I mean by scrape is that I want to grab the content on a webpage (the HTML), so that I can parse it and use the data somehow.

I want to grab the data from the results pages after using this form.

It's actually a series of forms (you click next to get to the next drop down list), so I'm thinking that they're using session variables. From what you said, it sounds like it's not possible to grab the exact URL. :/
Aug 29 '07 #4
Atli
Ahh ok. Now I see :)

I've looked at that form, and it looks like they are using sessions, or some other server-side mechanism, to pass the data between the forms. Unless it's hidden somewhere in the sizable amount of JavaScript.

Then again, this is an ASP.NET server, so I could be overlooking something.
Aug 29 '07 #5
nbomike
[Quoting Atli's reply in #5.]
Atli, so do you agree that there's no possibility of scraping the site?
Aug 29 '07 #6
Atli
[Quoting nbomike's question in #6.]
Yea I would have to agree. Sorry :/
Aug 30 '07 #7
nbomike
[Quoting pbmods' reply in #3.]
Just out of curiosity, why do you say that using session variables prevents me from scraping the site? Is it not possible to get session information for an external site?

[edit] Never mind, it makes sense to me now. I guess you can't, since that's all handled in the server-side script.
Aug 30 '07 #8
pbmods
Heya, Mike.

You could in theory send out a series of requests using cURL that simulate successive form submissions. You'd also have to handle the session cookie, which I'm not entirely sure how to go about, but I'm sure it's possible.

In theory, you should be able to simulate the set of HTTP requests that would ultimately result in receiving the output you are looking for.
Aug 30 '07 #9
nbomike
[Quoting pbmods' reply in #9.]
Thanks, pbmods. That gives me some hope. But wouldn't I have to know the name of the session variable that's being used? Is it possible to find this information without the source code?

I guess that's the next question. I'll do some research, but does anybody have any idea how to handle session information in this situation?
Aug 30 '07 #10
pbmods
Heya, Mike.

You wouldn't need to know the name of any session variables any more than you would if you visited the site in your web browser.

What I'm proposing would be (from the server's point of view) as if you visited the site using a browser. The only real difference is that your code would just be sending POST requests, and it would do nothing with the HTML that was returned to it until the very end.

For example, suppose there's a three-form process. Your script would send all the POST data to the remote server as if you had submitted the first form from a browser. Now if you were using a web browser, you would then see the second form. However, your code would just skip all that because you would program it to automatically know what goes in the second form.

Next, your script would send another POST request to the remote server as if you had submitted the second form. Making sure the session cookie gets passed along is important, because the remote server needs to treat both requests as if they came from the same 'browser'.

And finally, your script would send the third and final POST request as if you had submitted the third form in a browser. This time, instead of ignoring the response from the remote server, you would grab the data and process it as your scrape.

[EDIT: I suppose if you wanted to have some fun, you could write an HTML parser that could extract form inputs from the HTML that you get back after sending each POST request and dynamically build the next set of POST variables. While you're at it, you might as well just ask the owner of the remote server to set up a SOAP server :P]
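To make the idea concrete, here's a rough sketch of that three-step flow. The URLs, field names, and the post_form() helper are all made up for illustration; the point is that every step shares the same cookie jar so the remote server sees one continuous session:

function post_form($url, array $fields, $cookieJar)
{
    $curl = curl_init($url);
    curl_setopt_array($curl, array(
        CURLOPT_RETURNTRANSFER => true,
        CURLOPT_POST           => true,
        CURLOPT_POSTFIELDS     => http_build_query($fields),
        CURLOPT_COOKIEJAR      => $cookieJar,   // same jar for every step keeps the session alive
        CURLOPT_COOKIEFILE     => $cookieJar,
    ));
    $body = curl_exec($curl);
    curl_close($curl);
    return $body;
}

$jar = tempnam(sys_get_temp_dir(), 'jar');

// Steps 1 and 2: submit the first two forms and ignore the HTML that comes back.
post_form('http://www.scrapedsite.com/form/step1', array('year' => '2007'), $jar);
post_form('http://www.scrapedsite.com/form/step2', array('make' => 'Honda'), $jar);

// Step 3: this response is the results page you actually want to parse.
$resultsHtml = post_form('http://www.scrapedsite.com/form/step3', array('model' => 'Civic'), $jar);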
Aug 30 '07 #11
nbomike
[Quoting pbmods' reply in #11.]
Thanks once again for the input, pbmods. But I was wondering if I would really need the session data for what I'm trying to do. I'm basically trying to scrape every results page in that web app for cars sold in the US, which means I'd be hitting most of the drop-down combinations anyway. Since I'd have to cycle through each option list regardless, maybe I should just scrape the lists after each form submission, store them in an array, and then move on to the next list by posting to the static URL I see in the address bar when I click "Next". In that case, I wouldn't automatically post each form one after another. The pseudocode would look something like this:

for each make-year combination {
    post form
    if drop-down on next page exists {
        scrape all models from drop-down list
        for each model {
            post form
            ...
        }
    }
}

Sorry if that's difficult to read. What do you think?
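In PHP, I'm imagining the "scrape all models from drop-down list" step would look something along these lines (the select element's name, 'model', is just a guess, and warnings are suppressed because real-world markup is rarely valid):

function extract_options($html, $selectName)
{
    $doc = new DOMDocument();
    @$doc->loadHTML($html);

    $options = array();
    foreach ($doc->getElementsByTagName('select') as $select) {
        if ($select->getAttribute('name') !== $selectName) {
            continue;
        }
        foreach ($select->getElementsByTagName('option') as $option) {
            $options[] = $option->getAttribute('value');
        }
    }
    return $options;
}

// e.g. $models = extract_options($secondFormHtml, 'model');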
Aug 30 '07 #12
pbmods
I think you should ask the owner of the site you're scraping to integrate a SOAP server into his site's functionality :P

It sounds like you and I are on the same page. What you are describing more or less takes what I suggested a step further by actually processing the results of submitting the form; that is, when you simulate a submission of the first form, you are then processing the HTML of the second form that comes back and then making multiple submissions, one for each option in the select box.

One thing I would caution you on if you decide to do this:
- If the first form has 25 options and the second form has 10, that means you are sending *250* requests to that site *every time* somebody visits your page.
- If you get even a modest 10 hits per minute, that's *2,500* requests to the other site every minute, above and beyond their own traffic.

Aside from the fact that your users would have to wait for the other site to send 250 responses back to your server before their page loads, that's a pretty heavy strain on their server, which they have to pay for AND they're not getting any advertising revenue from!

My advice would be to either create a script that does this *once* and then caches all of the results (which would make your script run faster after the first execution anyway), or else work with the owner of the other site to find a way to solve this problem without putting a huge strain on his server and making your page take too long to load.
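If it helps, a bare-bones sketch of the run-once-and-cache idea; the cache path is arbitrary and scrape_everything() stands in for whatever actually does the scraping:

$cacheFile = dirname(__FILE__) . '/scrape_cache.dat';
$maxAge    = 365 * 24 * 60 * 60;    // one year, in seconds

if (is_file($cacheFile) && (time() - filemtime($cacheFile)) < $maxAge) {
    $results = unserialize(file_get_contents($cacheFile));   // cheap: reuse the cached scrape
} else {
    $results = scrape_everything();                          // expensive: only happens rarely
    file_put_contents($cacheFile, serialize($results));
}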
Aug 30 '07 #13
nbomike
[Quoting pbmods' reply in #13.]
Thanks for the thoughts. I think in my case, the issue you brought up won't be a problem, because I want to run this once a year. The results of the scrape are going to be inserted into MySQL tables once I parse the data. Basically, I'm going to store the results of the scrape on our site.
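For what it's worth, I'm picturing the storage step roughly like this (the DSN, credentials, table and column names are all invented):

$pdo  = new PDO('mysql:host=localhost;dbname=cars', 'db_user', 'db_pass');
$stmt = $pdo->prepare('INSERT INTO scraped_cars (year, make, model) VALUES (?, ?, ?)');

foreach ($parsedRows as $row) {          // $parsedRows is whatever the parsing step produced
    $stmt->execute(array($row['year'], $row['make'], $row['model']));
}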
Aug 31 '07 #14
pbmods
Ah. Well then.

You might find this to be of some use:
$data = http_build_query(array('var' => 'value', 'var2' => 'value2', 'you' => 'get the idea'));
$curl = curl_init('http://url.goes/right.here');

curl_setopt_array($curl, array(
    CURLOPT_HEADER         => true,
    CURLOPT_RETURNTRANSFER => true,
    CURLOPT_SSL_VERIFYPEER => 0,    // Skip SSL certificate checks
    CURLOPT_SSL_VERIFYHOST => 0,    //     (fine for testing, not for production).
    CURLOPT_HTTPHEADER     => array(
        'Content-type: application/x-www-form-urlencoded; charset=utf-8',
        'Content-length: ' . strlen($data)
    ),
    CURLOPT_POST           => true,
    CURLOPT_POSTFIELDS     => $data
));

$response = curl_exec($curl);

// CURLOPT_HEADER means headers and body come back together; split on the blank line between them.
$split = preg_split('#^\s+$#m', $response);

// cURL closes itself when the connection is not persistent.
return $split[1];
Not sure how to deal with session cookies... or maybe cURL takes care of that automatically...?
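A follow-up on the cookies: as far as I know, PHP's cURL doesn't persist them between separate handles on its own; the usual approach is to point both cookie options at the same file. A small sketch, with an arbitrary jar path:

$jar  = '/tmp/scrape_cookies.txt';              // arbitrary path for the cookie jar
$curl = curl_init('http://url.goes/right.here');
curl_setopt_array($curl, array(
    CURLOPT_RETURNTRANSFER => true,
    CURLOPT_COOKIEJAR      => $jar,             // cookies are written here when the handle closes
    CURLOPT_COOKIEFILE     => $jar,             // and sent back on later requests
));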
Aug 31 '07 #15
pbmods
This is a slightly more useful version that doesn't force itself to do more work than necessary (notice the absence of CURLOPT_HEADER):

$data = http_build_query(array('var' => 'value', 'var2' => 'value2', 'you' => 'get the idea'));
$curl = curl_init('http://url.goes/right.here');

curl_setopt_array($curl, array(
    CURLOPT_RETURNTRANSFER => true,
    CURLOPT_SSL_VERIFYPEER => 0,    // Skip SSL certificate checks
    CURLOPT_SSL_VERIFYHOST => 0,    //     (fine for testing, not for production).
    CURLOPT_HTTPHEADER     => array(
        'Content-type: application/x-www-form-urlencoded; charset=utf-8',
        'Content-length: ' . strlen($data)
    ),
    CURLOPT_POST           => true,
    CURLOPT_POSTFIELDS     => $data
));

$response = curl_exec($curl);
curl_close($curl);

return $response;
Oct 21 '07 #16
