Recently I developed a strange wear pattern on my car’s front passenger-side tire (that’s the front right tire for those driving on the “wrong” side) – the outer surface of the tire was worn almost completely bald while the inside tread was fresh and meaty. This was rather odd, as I had recently had the car re-aligned with zero toe and maximum negative camber, which comes out to around -1.6 degrees on both sides. Later I realized that as a result of this slight adjustment, I had been pushing the car a lot more aggressively while driving to and from work. The catch, as I retraced my steps, is that my drive home from work is almost entirely left turns. Coupled with the fact that I had recently been pushing the car very hard through those left turns, it made sense that I had worn out just the outer edge of that one tire.
Since the car is AWD, I didn’t have the option of just replacing that one tire, as a difference in tread depth between tires can be harmful to the AWD system. Now comes the difficult task of finding a single Bridgestone Potenza RE960AS Pole Position tire in 225/45/17 so it can be shaved down to match the others and mounted.
After some searching I came upon the website BestUsedTires.com – but searching for “Potenza 960” yielded the tire in every size but the one I needed. Now, since I’m lazy, how could I automate this search and have my computer notify ME, instead of having to manually check every day for the tire I needed?
First, let’s take a look at how the website handles searches – I viewed the source code of the website and looked for the “form” HTML tag. Quickly I found this snippet:
<form name="search_form" action="search-exec/" method="post">
From here we know that the applicable action script is http://www.bestusedtires.com/search-exec/ and that the website passes the search variables to it with the POST method. Next, we need to find out what data the website passes to the action script. In the source code, I searched for “input” tags and found this block:
<input type="hidden" name="nodecode" value="true">
<input name="search_term" type="text" id="Search2" size="25">
<input type="hidden" name="do" value="search">
Now we know the variables that have to be passed to /search-exec/ – they are “nodecode”, “search_term”, and “do”. We now have enough information to begin writing a shell script that will automatically run the search for me and report back with its findings.
Fire up your Linux shell – we will be using the lwp-request toolkit to process our request. If you don’t have it installed, you can install it on Debian-based distributions like Ubuntu by running the following command:
sudo apt-get install libwww-perl
lwp-request changes its behavior based on the name it is called by: if you run it as GET, it sends a GET request and returns the HTML body; if you run it as POST, it sends POST variables and returns the HTML body; if you run it as HEAD, it returns the response headers. If you check the documentation with man lwp-request, you will see that you pass variables to a POST request via STDIN. Let’s create a simple echo statement we can pipe into lwp-request.
HTML form variables are passed in the format VariableName=VariableValue and are chained together with & (ampersands) – this is the same for GET and POST requests. The variable values must also be URL encoded, so instead of “this is a value” they need to be “this%20is%20a%20value”. In our case, since I am searching for a Potenza 960 in 225/45/17, our variables will be such:
search_term=potenza%20960%20225/45/17
do=search
nodecode=true
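As an aside, you don’t have to hand-encode the search term. Assuming python3 is available on your box (an assumption on my part – nothing in this recipe requires it), the shell can produce the encoding for you:

```shell
# URL-encode the search term from the shell (python3 here is an assumption,
# not part of the original lwp-request recipe); safe="/" keeps the slashes literal
python3 -c 'from urllib.parse import quote; print(quote("potenza 960 225/45/17", safe="/"))'
# prints: potenza%20960%20225/45/17
```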
String them all together, pass them through “echo”, and you have this:
echo "search_term=potenza%20960%20225/45/17&do=search&nodecode=true"
Now, take the output of that echo statement and pass it to lwp-request called as POST:
echo "search_term=potenza%20960%20225/45/17&do=search&nodecode=true" | POST http://www.bestusedtires.com/search-exec/
You should now get the HTML body of the resultant search page! If you dig through this source code, you can determine where the website starts to dump the search results. The goal here is to find a specific marker in the large string of the HTML body that you can grep for, so you can tell whether your search came back with anything. Try running the search once with a term you know will produce a hit, and once with a term you know will not. If you dig around some, you will see that results with hits have this specific line in the resultant HTML body:
<table width="100%"><tr><td align="center" valign="top">
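To sanity-check the marker without hammering the site, you can run the same grep against a canned string. The HTML below is a made-up stand-in for a results page, not the site’s actual output:

```shell
# Simulate a hit: a fake body containing the marker line (the body text is illustrative only)
body='<p>junk</p><table width="100%"><tr><td align="center" valign="top">some tire listing</td></tr></table>'
echo "$body" | grep -c '<table width="100%"><tr><td align="center" valign="top">'
# prints: 1

# A body without the marker matches nothing
echo '<p>Sorry, no results</p>' | grep -c '<table width="100%"><tr><td align="center" valign="top">'
# prints: 0
```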
Perfect! Now we need to pipe the output of our earlier echo-and-POST command into grep to search for this specific marker. You will need to escape the double quotes, because quotes have a special meaning on the Linux command line. It’s simple, really: just put a backslash (\) before each quote you want to escape. You should have this now:
echo "search_term=potenza%20960%20225/45/17&do=search&nodecode=true" | POST http://www.bestusedtires.com/search-exec/ | grep "<table width=\"100%\"><tr><td align=\"center\" valign=\"top\">"
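If the escaping looks odd, here is a tiny standalone illustration (the string itself is arbitrary):

```shell
# Backslash-escaped double quotes are passed through to the program literally
echo "a \"quoted\" word"
# prints: a "quoted" word
```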
Since the bit at the end appears in the response body only when the search gets a hit, the grep returns nothing if there isn’t a match for the tire that we’re searching for. How handy – now we can feed this into a short if statement that notifies us whenever the website finds anything. Since our command conveniently outputs nothing when the search comes back empty, all we have to do is test whether the returned string is empty or not. In other words, the if statement tests true only when the search comes back with a result; otherwise, it skips the if body and peacefully exits without a peep. Here we will also introduce the backtick (`).
The backtick indicates to the Linux shell that whatever is within the backticks is to be executed as a command, and the output of that command is substituted in place of the backticked expression. For example, take this short command:
echo `date`
The shell runs the command date, which of course returns the current system time, and substitutes its output in. In our script, we want to hand the output of our search command to an if statement and test whether that output is empty. If the search returns any characters at all, we drop into the if body and execute whatever is inside. We will use the -n test, which is true if the string immediately following it is non-empty. Since that string will be our backticked search command, the shell will check whether the search came back with anything. Your command will look like this:
if [[ -n `echo "search_term=potenza%20960%20225/45/17&do=search&nodecode=true" | POST http://www.bestusedtires.com/search-exec/ | grep "<table width=\"100%\"><tr><td align=\"center\" valign=\"top\">"` ]];
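If the -n test still feels opaque, here is a minimal illustration with nothing site-specific in it:

```shell
# -n is true when the backtick-substituted output is non-empty...
if [[ -n `echo hello` ]]; then echo "search returned something"; fi
# prints: search returned something

# ...and false when the command prints nothing at all
if [[ -n `true` ]]; then echo "unreachable"; else echo "search came back empty"; fi
# prints: search came back empty
```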
Now, all that is left is to decide what you want your script to perform if the test case comes back positive! The possibilities here are pretty endless; I have it set to email me upon a successful search, so my complete script to search for a used tire that matches my criteria and then email me when it does find this match is as follows:
if [[ -n `echo "search_term=potenza%20960%20225/45/17&do=search&nodecode=true" | POST http://www.bestusedtires.com/search-exec/ | grep "<table width=\"100%\"><tr><td align=\"center\" valign=\"top\">"` ]]; then echo "A match for a Bridgestone Potenza RE960AS in 225/45/17 has been located on http://www.bestusedtires.com" | mail firstname.lastname@example.org; fi;
Since my email is pushed to my phone, this has the effect of immediately notifying me when a tire that I want is available. For painless automagic searches, simply add this script as a cronjob to have it search every hour. One caveat: the percent sign (%) has a special meaning in crontab entries (it is translated to a newline), so every % must be escaped with a backslash. The entire cron line would be as follows:
0 * * * * if [[ -n `echo "search_term=potenza\%20960\%20225/45/17&do=search&nodecode=true" | POST http://www.bestusedtires.com/search-exec/ | grep "<table width=\"100\%\"><tr><td align=\"center\" valign=\"top\">"` ]]; then echo "A match for a Bridgestone Potenza RE960AS in 225/45/17 has been located on http://www.bestusedtires.com" | mail email@example.com; fi;
And that’s it! This script will happily run every hour at the top of the hour and should it find a result, it will send an email with the contents “A match for a Bridgestone Potenza RE960AS in 225/45/17 has been located on http://www.bestusedtires.com” – thus reminding me to check the website and purchase my tire.
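If the one-liner offends your sensibilities, the same logic reads better as a small script file. This is just a sketch – the function name is my own invention, and it assumes the same lwp-request (POST) and mail commands used above:

```shell
#!/bin/bash
# check_tire: POST the saved search, grep for the hit marker, and mail on a match.
# The search string, URL, marker, and message are the same ones derived above;
# the function name and email address are placeholders.
check_tire() {
    local search='search_term=potenza%20960%20225/45/17&do=search&nodecode=true'
    local url='http://www.bestusedtires.com/search-exec/'
    local marker='<table width="100%"><tr><td align="center" valign="top">'
    if [[ -n $(echo "$search" | POST "$url" | grep "$marker") ]]; then
        echo "A match for a Bridgestone Potenza RE960AS in 225/45/17 has been located on http://www.bestusedtires.com" \
            | mail email@example.com
    fi
}
```

Save it as, say, tiresearch.sh with a line calling check_tire at the bottom, and the cron entry shrinks to 0 * * * * /path/to/tiresearch.sh – with no percent-escaping worries, since the encoded string now lives inside the script rather than the crontab.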
Yep, that’s ugly all right. Does the job, and that’s all there is to say for it.
Not to be mean, but this is rather ugly. A smoother implementation could be built with wget (which is already in most distros) and a bit of any given scripting language to digest the page… grep works fine, especially in simple cases, but most hacks of this nature will require a great deal more complexity in deciphering the page.
Can you show how to apply wget to this example?
if [ `wget -O - --post-data="search_term=potenza%20960%20225/45/17&do=search&nodecode=true" http://www.bestusedtires.com/search-exec/ | grep -c "…"` -gt 0 ]; then …
Or, you could use curl, like all the 1337 hackers do:
curl -d "search_term=potenza%20960%20225/45/17&do=search&nodecode=true" http://www.bestusedtires.com/search-exec/
Good job, easily understandable for the basic concepts of scraping a site, and this knowledge can easily be built on for more elegant or complex solutions.
Or you could just quit driving like a retard.
I agree with Derp. Stop being a weasel, go find a track and rent some time instead of being a danger on the street.
I think that this little trick was wonderfully explained for those who are very noob on Linux (that’s my case)..
Can you really rent tires? Maybe rent-to-own for poor people. I guess that is getting off topic. Nice post Andrew. Just because it is not elegant doesn’t mean it is not useful.
you could just use curl, and maybe grep/awk to check for the results.
the curl man page is here : http://curl.haxx.se/docs/manpage.html
the relevant flags are:
-d / --data-ascii to send the POST body, and --data-urlencode if you want curl to URL encode the string (such as the search term) for you.
I pasted the command needed here (don’t know what your comment section will do to shell code): https://gist.github.com/755175
I tend to use curl over wget, because curl returns the output to the shell by default, which allows you to grep it directly.
you might be interested in YQL: http://developer.yahoo.com/yql/ it allows you to query internet like it would be a database.
Other things you might be interested in: Google search alerts – you set up a query and are notified via email when there are results – and also Yahoo Pipes, which is a “custom RSS engine” – you can edit, modify, and filter RSS channels and websites, run them through external services, etc.
Yeah – not widely publicized, but Google Alerts is the idiot’s way to do this. It has no hack value, but it gets the job done easily.
There are a million ways to solve the problem. Personally, I would have put the whole thing into a separate script file rather than trying to fit everything onto one line. In addition, I would have wanted the script to email me a copy of the result set, html encoded of course.
Nice work, otherwise.
WebHarvest is also a neat tool for related purposes: