How to Block Bad Bots and Spiders Using .htaccess Modifications
Bots are extremely prevalent online. As of 2012, bot traffic exceeded human traffic across the internet, meaning that on average, more than 50% of the visitors to your website are robots, not humans.
Bots serve a range of purposes, and not all of them are good. Some bots, such as those run by Google and Bing, crawl and index your pages. If you block Googlebot, it can no longer access your site, your pages will eventually be dropped from Google's index, and your site won't be listed.
Other bots have more specialized uses. Some are designed to crawl e-commerce sites in search of deals; they cross-reference every shop they find that carries the same product, so that a price-comparison site can show the item's price at a range of stores. Some sellers use bots like these to make sure they stay at the top of those lists. This is why many Amazon listings slowly creep down by a few cents a day: competing sellers out-list one another by lowering prices a cent or two at a time.
Other bots are less benign. Spam bots browse blogs looking for comment systems they can exploit. Comment forms without authentication or captchas can be filled in automatically, leaving spam comments that build link juice for spam sites, entice clicks from inexperienced internet users, or damage a site's SEO.
Hacker bots crawl the web examining site infrastructure. They probe domains for common URLs like /admin.htm, looking for sites running a default CMS installation where defaults like usernames and passwords were never changed. They seek out vulnerable sites: low-risk targets they can access and exploit. They may harvest user or admin information, or simply report URLs back to the bot's owner. They may even be programmed to take over a site for their own purposes.
Other malicious bots originate from computer viruses. The virus infects a user's computer and, often invisibly in the background, uses that machine's internet connection to do whatever the virus's author wants. Most commonly this means hammering a target URL in a DDoS attack, with the aim of taking the site down or stressing the server enough that hackers can get in through a vulnerability in the software. You can learn more about the top antivirus software available, including the difference between AVG and Avast.
Scraper bots are also malicious. They behave like search spiders, but instead of feeding content into a search engine index, they copy it in bulk. Content, scripts, media, and more are downloaded to the spammer's server, then spun into new text or simply pasted wholesale onto their spam websites. To them, it's all a disposable resource to harvest and throw away once it's no longer useful.
Clearly, there's a lot wrong with bots. Beyond their intended purpose, though, they all have one other effect: strain on your server. A bot may access your site in a stripped-down, lightweight way, as search engine bots usually do, but it is still accessing your site. It still downloads content, makes requests to your server, and generally consumes resources.
In extreme cases, this can take a website down. There are reports of sites that were crushed by Google's crawler alone, though Google is usually clever enough not to do that. And with the immense volume of bot traffic circulating on the web, there's plenty to fight.
And that's before you get to the data-analysis problems that follow. It's a real challenge to filter bots out of Google Analytics so that the data you examine reflects actual humans rather than software.
There are two ways to stop bots from accessing your site. The first is the robots.txt file; the other is the .htaccess file.
As you may have guessed from the title of this article, I'm going to focus on the second. First, though, let's discuss robots.txt. What exactly is a robots.txt file?
The robots.txt file is a plain text file you place in the root directory of your web server. Its purpose is to give instructions to bots that want to access your site; for instance, you can use it to deny access to specific bots, or to all bots. So why not just use it?
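For reference, a minimal robots.txt looks something like this (the bot name "BadBot" is a made-up example; real files use the user agent names of actual crawlers):

```
# Let Googlebot crawl everything
User-agent: Googlebot
Disallow:

# Ask a hypothetical bad crawler to stay out entirely
User-agent: BadBot
Disallow: /

# Ask everyone else to stay out of /private/
User-agent: *
Disallow: /private/
```

Each `User-agent` group applies to bots matching that name, and `Disallow` lists the paths they are asked not to crawl. Note the word "ask": as the next paragraph explains, compliance is entirely voluntary.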
The problem with robots.txt is that it only gives instructions to bots. If a bot chooses not to obey it, meaning its creator programmed it to ignore robots.txt, there's nothing the file can do. It's like leaving your front door open with a sign that reads "robbers keep out." If the thief decides to ignore the sign, nothing stops him from walking through the door.
The .htaccess file, by contrast, is a configuration file for the Apache web server software. It's much more like a security guard at the front gate, actively stopping potential robbers. Except that in this scenario, the guard can only stop people who arrive from RobberHome, wear a shirt that reads "I'm a robber," or otherwise identify themselves.
This means the .htaccess file can block most bots, but not all of them. In particular, botnet bots, hijacked computers belonging to regular users, are usually unblockable, because they are normal user machines running the same software regular users run; blocking them would block humans too. For the majority of bots, though, the .htaccess file is ideal.
Be aware that the .htaccess file only works if your web server runs Apache. If you're using Nginx, Lighttpd, or one of the more niche servers, you'll need to find out how that software blocks bots.
Identifying Bots to Block
Before we begin, a warning: be careful when blocking bots via the .htaccess file. One typo and you can end up blocking the entire internet. You don't want that.
The first thing to do is back up your current .htaccess file. If you make a mistake that blocks traffic you didn't mean to block, you can restore the old file to revert the changes until you figure out what went wrong.
The next thing to do is find your server's access logs. With Apache, you'll use a unix/linux command to read the log file; you can learn how to access it in this article.
The access log records every request to your server in detail. It shows the IP address that connected, the client's hostname when known, the user ID of the machine if it authenticated, the time of the request, the request itself and whether it came over HTTP or HTTPS, the status code the server returned, and the size of the object requested. It is likely to be a massive file.
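A single line in Apache's default "combined" log format looks something like this (the address, date, and file here are illustrative):

```
127.0.0.1 - frank [10/Oct/2023:13:55:36 -0700] "GET /index.html HTTP/1.1" 200 2326 "http://example.com/" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
```

Reading left to right: client IP, identd name, authenticated user, timestamp, request line, status code, response size in bytes, referrer, and finally the user agent string. That last field is the one you'll lean on most when hunting bots.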
The log file will contain entries for all your regular users as well as every bot that accesses the site. Some bots, like Google's, identify themselves in their user agent details. Malicious bots sometimes identify themselves too, but usually just have traits that mark them as non-human. They might use an outdated browser version known to be exploitable. They might come from known spam domains or IP addresses.
This article does a good job of helping you work out which log entries are bad bots and which are good bots or human users.
In general, if a bot only visits your site once a month, you don't need to worry about it. You can block it if you like, but it won't save you any time or effort. Your main targets should be the bots that hit the site persistently and have a noticeable effect on your site's performance.
Be very careful when blocking by IP address or range. It's easy to see a lot of bots coming from, say, 168.*.*.*, with various numbers in place of the stars, and think "I could block them all! Block the whole /8 range!" The problem is that a /8 range in IPv4 contains 16,777,216 IP addresses, many of which may be in use for legitimate purposes. You can block a huge amount of legitimate traffic with one overly broad rule.
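To make the difference in scope concrete, here's how the same Deny directive (Apache 2.2's mod_authz_host syntax; the addresses are made up) scales with the CIDR suffix:

```
# Blocks exactly one address
Deny from 168.12.45.7

# Blocks 256 addresses (168.12.45.0 through 168.12.45.255)
Deny from 168.12.45.0/24

# Blocks 16,777,216 addresses -- almost certainly too broad
Deny from 168.0.0.0/8
```

The rule of thumb: start with single addresses, and only widen to a range when your logs show abuse coming from across that whole range.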
Most entries in a good .htaccess file won't block by IP address at all, simply because IP addresses are too easy to change through proxy services. Most will use user agent names, persistent IP addresses belonging to bots that don't bother to change them, and domains commonly used to host spambots or hacker tools.
Using the .htaccess File
There are three methods we'll use to block bots in the .htaccess file. The first and most common blocks by user agent. This is generally safe, because normal users won't have a bot's user agent string.
In your .htaccess file, you first need a line that reads "RewriteEngine On". This ensures that the rewrite lines that follow will actually be processed rather than ignored.
Next, add "RewriteCond %{HTTP_USER_AGENT}" lines as their own entries. Each one matches against the visitor's user agent. There are two options here: you can pack tons of different user agents into one line separated by pipes, or add one agent per line. For example (reconstructed from the garbled original; the flags follow the pattern explained below):

```
RewriteCond %{HTTP_USER_AGENT} 12soso|192\.comagent|1noonbot|3de_search2 [NC,OR]
RewriteCond %{HTTP_USER_AGENT} Acunetix [NC,OR]
RewriteCond %{HTTP_USER_AGENT} BlackWidow [NC,OR]
```
Either way is fine. In the first case, you'll want to start a new RewriteCond line every 500 or so entries, because the longer a single line gets, the harder it is for Apache to parse. Splitting it into smaller entries adds clutter, but it's also more readable. Either method works.
Note the two bits at the end of each line. The NC and OR parts are rewrite flags. NC means "nocase", making the entry case-insensitive, so "12soso" and "12Soso" are treated the same. OR means "this or that": the bot is blocked if it matches any one of the entries in the list, as opposed to "AND", which would require it to match all of them.
After your list of bots, you need to specify the rewrite rule. The RewriteCond lines are the first half of a two-part clause: "if the user agent matches this, then...". The RewriteRule is the second half: what happens. Add "RewriteRule .* - [F,L]" on its own line.
What this does is answer any request coming from a matched user agent with a 403 Forbidden response instead of the page. The [F] flag means "forbidden", and the [L] flag means "last", telling Apache to apply the rule immediately rather than after the rest of the .htaccess file has been processed.
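Putting the pieces together, a minimal user-agent block looks like this (the bot names are just examples; substitute the agents you actually see in your logs):

```
RewriteEngine On

# Match any of these user agent substrings, case-insensitively
RewriteCond %{HTTP_USER_AGENT} Acunetix [NC,OR]
RewriteCond %{HTTP_USER_AGENT} BlackWidow [NC,OR]
RewriteCond %{HTTP_USER_AGENT} 1noonbot [NC]

# Matched requests get a 403 Forbidden instead of the page
RewriteRule .* - [F,L]
```

Notice that the final RewriteCond deliberately omits the OR flag; conditions are chained together, and a stray trailing OR is a classic source of rules that fire when they shouldn't.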
The other two methods are blocking by HTTP referrer and blocking by IP address.
To block by HTTP referrer, start the line with "RewriteCond %{HTTP_REFERER}", add the domain of the abusive referrer, such as www1.free-social-buttons\.com, and finish with the flags. Add the RewriteRule line afterwards. You'll end up with something like:

```
RewriteCond %{HTTP_REFERER} www4\.free-social-buttons\.com [NC]
RewriteRule ^.* - [F,L]
```
Finally, you can simply block by IP address. If you notice one particular IP address that is especially harmful, hitting your site a hundred times an hour or more, you can block it. Write "Deny from *.*.*.*", where the stars are the four parts of the IP. It will look like "Deny from 220.127.116.11", possibly with a /28 or some other suffix at the end to block a range.
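One caveat worth knowing: "Deny from" is Apache 2.2 syntax (mod_authz_host). On Apache 2.4 the equivalent uses the Require directive from mod_authz_core, which is enabled by default. A sketch of both styles, reusing the example address above:

```
# Apache 2.2 style
Order Allow,Deny
Allow from all
Deny from 220.127.116.11

# Apache 2.4 style
<RequireAll>
    Require all granted
    Require not ip 220.127.116.11
</RequireAll>
```

Apache 2.4 ships a compatibility module (mod_access_compat) that keeps the old directives working, so "Deny from" often still functions, but mixing the two styles in one config is a common source of confusing access errors.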
If this all seems too complicated, you can take a shortcut and use the block lists other people have put together. I've come across two that I recommend: first, the list posted to Pastebin by HackRepair.com; second, the list from Tab Studio.
Whenever you add blocks to your .htaccess file, be sure to test access to your site through several different methods afterward. If you're blocked in a way you shouldn't be, something is wrong, and you'll need to fix the entry.