Among the leading issues affecting websites today are spam and ever-increasing bot-generated traffic. It is now estimated that nearly 50% of all the world’s internet traffic is generated by bots. This traffic degrades the experience for your visitors, and may have a negative impact on your online business.
We have previously posted about the 5 reasons why bot traffic is bad for your business website, and I have personally highlighted the issues caused by referer spam and provided instructions on how to defend against referer spam by making use of your Apache .htaccess file.
If you are completely new to using .htaccess, you may find this blog post extremely useful, as I will discuss the basics of the .htaccess file and how it can help to eliminate referer spam.
What is .htaccess?
.htaccess is a configuration file used by web servers running the Apache Web Server software. It alters the configuration of the web server to enable or disable additional functionality and features that Apache has to offer.
Apache is an extremely powerful and flexible web server, and comes packed with excellent features. However, not all of this functionality is enabled straight out of the box, mostly because not every website needs all of it. It is usually the responsibility of the webmaster to enable and configure the features they need.
These facilities may include basic redirect functionality, for instance when a 404 File Not Found error occurs, or more advanced functions such as content password protection or image hotlink prevention.
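For instance, a custom 404 page can be configured with a single directive (the file path here is an illustrative assumption, not a required location):

```apache
# Serve this page whenever a request results in a 404 error
ErrorDocument 404 /errors/not-found.html
```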
Why is it called .htaccess ?
.htaccess files were first used to control user access on a per-directory basis, using a subset of Apache’s httpd.conf settings directives. They allowed system administrators to restrict access to individual directories to users with a name and password specified in an accompanying .htpasswd file. .htaccess files are still used for this, however they are now also used for a number of other things!
Where is the .htaccess file?
In theory, any folder or directory on your web server could have an .htaccess file. However, typically most websites will have at least one in the main (a.k.a. root) folder, i.e. public_html, and one in each subdirectory.
Why can’t I find my .htaccess file?
If you are accessing your website’s files and folders using cPanel or another common file manager, note that file names beginning with a dot (.) are regarded as hidden files, so they are typically not visible by default.
Your FTP client or file manager should have a setting for “show hidden files.” This will be in different places in different programs, but is usually in “Preferences”, “Settings”, or “Folder Options.” Sometimes you’ll find it in the “View” menu.
What if I don’t have an .htaccess file?
If you have created your website using a Content Management System (CMS) like WordPress or Drupal, the chances are that the application will have provided an .htaccess file. In that case, ensure you have activated “show hidden files” (or the equivalent) in your FTP client or file manager.
Most web hosts will create an .htaccess file automatically for you, so you should usually have one. If you still can’t find an .htaccess file, it is a good idea to confirm that your web server is actually Apache, because it could well be an alternative web server like nginx or even IIS, and these web servers do not make use of .htaccess files.
If you have confirmed that your web server is Apache, and that all hidden files are shown, and you still cannot find an .htaccess file, then don’t worry: they are very easy to create.
- Start a new file in a plain text editor.
- Save it in ASCII format (not UTF-8 or anything else) as .htaccess.
- Make sure that it isn’t htaccess.txt or something like that. The file should have only the name .htaccess with no additional file extension.
- Upload it to the appropriate directory via FTP or your browser-based file manager.
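If you have shell access to your server, a minimal sketch of the creation steps looks like this:

```shell
# Create an empty .htaccess file; "touch" avoids editors that
# might silently append a .txt extension.
touch .htaccess

# Dotfiles are hidden from a plain "ls"; the -a flag reveals them,
# so this confirms the file really exists with the right name.
ls -a | grep '^\.htaccess$'
```

From there you can add directives with any plain text editor.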
IP Blacklisting and IP Whitelisting
You can use .htaccess to block users from a specific IP address (black-listing). This is useful if you have identified individual users from specific IP addresses which have caused problems.
You can also do the reverse, blocking everyone except visitors from a specific IP address (white-listing). This is useful if you need to restrict access to only approved users.
Black-listing by IP
To block specific IP addresses, simply use the following directive, with the appropriate IP addresses:
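A minimal sketch (the IP addresses here are illustrative placeholders; substitute the addresses you want to block):

```apache
order allow,deny
deny from 192.168.44.201
deny from 789.56.4.
allow from all
```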
The first line states that the allow directives will be evaluated first, before the deny directives. This means that allow from all will be the default state, and then only those matching the deny directives will be denied.
If this were reversed to order deny,allow, then the last thing evaluated would be the allow from all directive, which would allow everybody, overriding the deny statements.
Notice the third line, which has deny from 789.56.4. — that is not a complete IP address. This will deny all IP addresses within that block (any that begin with 789.56.4).
You can include as many IP addresses as you like, one on each line, with a deny from directive.
White-listing by IP
The reverse of blacklisting is white-listing — restricting everyone except those you specify.
As you may guess, the order directive has to be reversed, so that everyone is first denied, but then certain addresses are allowed.
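A minimal sketch (the allowed address is an illustrative placeholder):

```apache
order deny,allow
deny from all
allow from 192.168.44.201
```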
Domain names instead of IP addresses
You can also block or allow users based on a domain name. This can help block people even as they move from IP address to IP address. However, it will not work against people who can control their reverse-DNS IP address mapping.
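A sketch, using the reserved example.com domain as a stand-in for the domain you want to block (note this relies on reverse-DNS lookups being enabled on the server):

```apache
order allow,deny
deny from example.com
allow from all
```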
This works for subdomains, as well — in the previous example, visitors from xyz.example.com will also be blocked.
Block Users by Referrer
A referrer is the website that contains a link to your site. When someone follows a link to a page on your site, the site they came from is the referrer.
This doesn’t just work for clickable hyperlinks to your website, though. Pages anywhere on the internet can link directly to your images (hotlinking), using your bandwidth, and possibly infringing on your copyright, without providing any benefit to you in terms of traffic. They can also hotlink to your CSS files, JS scripts, or other resources.
Most website owners are okay with this when it happens just a little, but sometimes this sort of thing can turn into abuse.
Additionally, sometimes actual in-text clickable hyperlinks are problematic, such as when they come from hostile websites.
For any of these reasons, you might want to block requests that come from specific referrers.
To do this, you need the
mod_rewrite module enabled. This is enabled by default for most web hosts, but if it isn’t (or you aren’t sure) you can usually just ask your hosting company. (If they can’t or won’t enable it, you might want to think about a new host.)
.htaccess directives that accomplish referrer-based blocking rely on the mod_rewrite engine.
The code to block by referrer looks like this:
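A sketch of the directives (the domain names are illustrative placeholders):

```apache
RewriteEngine on
RewriteCond %{HTTP_REFERER} somehackydomain\.com [NC,OR]
RewriteCond %{HTTP_REFERER} anotherhackydomain\.com [NC,OR]
RewriteCond %{HTTP_REFERER} 3rdhackydomain\.com [NC]
RewriteRule .* - [F]
```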
Let’s explain what is going on here.
The first line,
RewriteEngine on, alerts the parser that a series of directives related to rewrite is coming.
The next three lines each block one referring domain. The part you would need to change for your own use is the domain name (somehackydomain) and extension (.com).
The backward-slash before the .com is an escape character. The pattern matching used in the domain name is a regular expression, and the dot means something in RegEx, so it has to be “escaped” using the back-slash.
The NC in the brackets specifies that the match should not be case sensitive. The OR is a literal “or”, and means that there are other rules coming. (That is — if the URL is this one or this one or this one, follow this rewrite rule.)
The last line is the actual rewrite rule. The [F] means “Forbidden.” Any requests with a referrer matching the ones in the list will fail, and deliver a
403 Forbidden error.
Blocking Bots and Web Scrapers
One of the more annoying aspects of managing a website is discovering that your bandwidth is being eaten up by non-human visitors — bots, crawlers, web scrapers. These are programs that are designed to pull information out of your site, usually for the purpose of republishing it as part of some low-grade SEO operation.
There are, of course, legitimate bots, such as those from the major search engines. But the rest are like pests that just eat away at your resources and deliver no value to you whatsoever.
There are several hundred known bots. You will never be able to block all of them, but you can keep the activity down to a dull roar by blocking as many as you can.
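As a sketch, bad bots can be turned away by matching their User-Agent strings with mod_rewrite, in the same way as the referrer-based blocking described above (the bot names below are examples of commonly blacklisted scrapers; extend the list as needed):

```apache
RewriteEngine on
RewriteCond %{HTTP_USER_AGENT} ^BlackWidow [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^EmailSiphon [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^WebZIP [NC]
RewriteRule .* - [F]
```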
These are just some of the advanced features you can make use of in Apache to reduce bot and referer spam traffic to your website. Yet we have only scratched the tip of the iceberg; there are many more features available.
Editing your .htaccess file for the first time can give you a sudden feeling of immense power over your web hosting environment. You may even start to feel like a god amongst mere mortals.
Unfortunately, this power can go to your head, and you may find yourself using the
.htaccess file in ways that aren’t really the best.
As with most things in IT, there is always more than one way to do things, and there are often better alternative approaches. It may be worth exploring other areas of your web stack to make these amendments.
Hosting Provider Configuration
Your hosting provider may elect to place the directives you would place in an .htaccess file into the httpd.conf file instead, which is a configuration settings file for the entire server.
Similarly, PHP settings more properly belong in the
php.ini file, and most other languages have similar configuration setting files.
Placing directives further upstream, in the httpd.conf, php.ini, or other language-specific configuration file, allows those settings to be “baked in” to the web server’s parsing engine. With .htaccess, the directives have to be checked and interpreted with every single request.
If you have a low traffic site with only a handful of
.htaccess directives, this isn’t a big deal. But if you have a lot of traffic, and a lot of directives, the performance lag can really add up.
Unfortunately, many shared hosting providers do not allow customers to access the php.ini files, forcing users to rely on the slower .htaccess file. This is a double penalty compared to custom VPS configurations, because shared hosting is also generally low-powered. It is one of the reasons that a site with traffic of 10,000+ hits a day should probably be on a VPS plan instead of a shared hosting plan.
If you are using a good Content Management System (CMS) such as WordPress or Drupal, some of the things you might do in an .htaccess file — such as redirecting URLs or blocking IP addresses — can be done from inside the application. For instance, if you want to block referer spam in WordPress, you may be better off making use of our free Stop Web Crawlers plugin, available in the WordPress Plugin Directory.
This works in conjunction with the .htaccess file, with the application programmatically adding directives. It is usually best to accomplish these tasks from inside the application rather than editing the .htaccess file yourself: you are less likely to introduce bugs and incompatible directives if you use a well-tested, open source plugin.
Gary Woodfine’s unique background in business ownership, marketing, software development and business development ensures that he can offer optimum business consultancy services across a wide spectrum of business challenges.