Introduction to Squid

This article will cover basic Squid configuration and explain a few fundamental concepts about cache hierarchies.

Before we can start configuring, we need to install the Squid proxy. For Debian, download the latest Squid .deb available from the web section. For Red Hat, download the latest .rpm available from the Red Hat Contrib network. Install these packages as usual.

To compile Squid from source, get the latest tarball from the Squid homepage and follow the included instructions. At the time of writing, the latest version is 2.2PATCH4.

The location of the configuration file depends on how you installed Squid. If you installed a package (either .deb or .rpm), it should be /etc/squid.conf. If you compiled from source, it will be $SQUID_ROOT/etc/squid.conf (usually /usr/local/squid/etc/squid.conf). The configuration file provided is possibly the best reference, as all directives are well commented.

One important directive that is often overlooked is ``cache_mgr''. This specifies who to contact if there is a problem with the proxy, and is shown on any error messages. Set this to an email address that is checked regularly (preferably by more than one person), as this ensures any problems with the proxy can be reported to the correct person.
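
As a minimal sketch, the entry might look like the following. The address is a hypothetical placeholder, not one taken from a real site:

	# hypothetical address; use a mailbox that is actually monitored
	cache_mgr proxy-admin@foo.com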

There are a few ports that you can set: the HTTP port, the ICP port, and the HTCP port. The HTTP port is the port on which the proxy listens for requests from clients. ICP, the Internet Cache Protocol, is used for querying other proxies, for example to ask whether they have a particular object in their cache. HTCP, the Hyper Text Caching Protocol, is a protocol for discovering caches, managing them and monitoring their activity, and offers much more functionality than ICP.

Set these with the http_port (a common choice is 8080), icp_port (commonly 3130) and htcp_port directives. Note that http_port accepts multiple ports for Squid to listen on, which can be useful when other proxies use your proxy as a parent.
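
A sketch of the port settings, using the common values mentioned here. The HTCP port number is an assumption (4827 is the usual default), HTCP support may need to be enabled when Squid is compiled, and the single-line form for multiple HTTP ports is assumed for this generation of Squid:

	# two listening ports on one line (syntax assumed for Squid 2.x;
	# check the comments in your own squid.conf, as it may differ)
	http_port 8080 3128
	icp_port 3130
	# assumed value for illustration
	htcp_port 4827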

One important concept that must be understood is that of parents and siblings. A sibling is a cache that your proxy queries, when it receives a request for a URL, to see whether the sibling has a copy of the object. The sibling replies with either ``Yes, I have it'' or ``No, I don't have it''. The proxy then decides whether to retrieve the object from a sibling or to fetch it from the source directly. A parent is a proxy that your proxy asks to fetch the object on its behalf when none of the siblings has a copy, rather than fetching it directly.

Squid can talk to other proxies, treating each one either as a parent (generally a proxy with a better network connection) or as a sibling. To set this up, add a line of the following form:

	cache_peer hostname type http_port icp_port <options>

For example, for an imaginary domain foo.com whose ISP runs a parent proxy at proxy.bar.com on port 8080, the line would be as follows:

	cache_peer proxy.bar.com parent 8080 3130

There are several options you can set with the cache_peer directive which influence how these proxies are used. Use ``proxy-only'' to specify that objects fetched from this peer are not to be saved locally. This is useful when talking to a ``local'' parent proxy, or if you have multiple proxies. To set a weighting for a proxy, use ``weight=n''. The higher the weighting, the more often the proxy will be chosen, which is useful for load balancing over proxies of differing capabilities. If the parent proxy requires authentication (for example, at a college), use ``login=user:password''. Obviously this is only useful if your proxy is a personal or workgroup proxy.
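
To illustrate how these options combine, here is a sketch with hypothetical peers. The hostnames proxy2.bar.com and proxy.college.edu, the weights and the credentials are all made up for the example:

	# two parents of differing capacity; the heavier weight is favoured
	cache_peer proxy.bar.com parent 8080 3130 weight=2
	cache_peer proxy2.bar.com parent 8080 3130 weight=1 proxy-only
	# a parent that requires authentication (placeholder credentials)
	cache_peer proxy.college.edu parent 8080 3130 login=user:password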

Another useful feature is specifying which domains are handled by which proxies, which helps if some proxies have better connectivity to certain areas. This is set up using the cache_peer_domain directive:

	cache_peer_domain cache.foo.org .edu

This means that cache.foo.org is only queried for requests in the .edu domain.

It is also possible to change the neighbour type depending on the domain name. For example, you can have a parent proxy that is treated as a sibling for certain domains, or vice versa.

	neighbor_type_domain cache.bar.org sibling .net .com

Squid uses the ``cache_dir'' directive to specify the directories used for storing cached ``objects''.

For performance reasons, Squid writes each object (eg, http://url.to.this/article) that it considers worth caching as a numbered file. (As a side note, the first ``cache_dir'' directive found in the squid.conf file is also used as the place to store a separate hash index file of all the objects stored in the cache.)

The basic format of this directive is:

	cache_dir /directory/to/use X Y Z

where:

	X - The size (in megabytes) of this cache_dir. Squid does
	    not take into account file system overheads, so the actual
	    space used (as shown by 'du') may be larger than this value.
	Y - Number of first level directories (acceptable default is 16)
	Z - Number of second level directories (acceptable default is 256)
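
For instance, a 100 MB cache using the default directory counts could be sketched as follows. The path /var/spool/squid and the size are assumptions chosen for illustration:

	# 100 MB cache, 16 first level and 256 second level directories
	cache_dir /var/spool/squid 100 16 256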

Multiple instances of this directive can be given, either to spread the cache across multiple drives or to get the most out of an existing file system layout.

However, try to avoid specifying multiple cache_dir entries that are on the same physical drive. Doing so can cause excessive thrashing of the drive heads as Squid tries to spread the load evenly across the cache_dirs (the drive is told to access first one side of the disk, then the other). The same applies to multiple drives on limited I/O channels (eg, IDE).
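
A sketch of spreading the cache across two drives, assuming hypothetical mount points /cache1 and /cache2 that each sit on their own physical disk:

	# one cache_dir per physical drive (paths and sizes are illustrative)
	cache_dir /cache1 500 16 256
	cache_dir /cache2 500 16 256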

Another essential setting is where the logs are stored. The logs keep information about who is accessing what, and which objects are being stored. It is useful to watch these logs for any obvious problems, and to analyse them with programs such as Calamaris.

	cache_access_log /var/log/squid/access.log
	cache_log /var/log/squid/cache.log
	cache_store_log /var/log/squid/store.log

To protect the clients, there are a couple of useful options. You can hide the user agent string, which identifies the browser the client is using. Additionally, you can control whether Squid adds a header containing the IP address of the client. By using both of these as follows, you can reduce the amount of information that the proxy passes on to the server.

	anonymize_headers deny From Referer Server
	anonymize_headers deny User-Agent WWW-Authenticate Link
	forwarded_for off
	fake_user_agent none

This basic Squid configuration should be enough to get your proxy up and running. There are many more options than are covered here, and the default Squid config file documents many of them. The next article will cover proxy authentication and delay pools.

Brad Marshall is a Systems Administrator for Plugged In Software, a software development company based in Brisbane, Australia. He has been using Linux for more than 5 years, and has a BSc majoring in Computer Science.