Imperva CTO Amichai Shulman:
Let me start by saying that I’m not a big fan of back and
forth argumentative discussions taking place in the blogosphere. However, the
religious rage that erupted over the past couple of weeks with respect to our
paper, Assessing the Effectiveness of Antivirus
Solutions, compels me to provide some response.
To avoid dragging the reader through a lengthy text
full of complex arguments, I’ll take this backwards
(kind of like the “Backwards Episode” from Seinfeld) and
start each point with its bottom line. The overall bottom line is that many people have
questioned the core aspects of our research: choice of malware samples and
method of sample evaluation. However, even among those who have questioned our
methodology, there seems to be a consensus around our conclusions – that
standard AV solutions have reached the point of diminishing returns and
organizations must shift their investments towards other solutions that protect
organizations from the effects of infection. I have to assume that if our methodology
leads us in a logical way to conclusions that are so widely accepted, it
can’t be all that wrong.
Criticism #1: Sampling
The first part of the criticism targeted our choice of
malware samples. Let me again put forward the bottom line – our critics
basically claim that our results are so different from theirs because the
method we used to collect the samples is incorrect. Let me put this in
different words – if attackers choose malware the way AV vendors instruct them
to, detection rates become blissful. If attackers choose malware in a different
manner, you’re toast.
Poor sampling would be a fair argument to make if we used
some mysterious technique for collecting malware that can only be applied by
high end criminal gangs. That is, of course, not the case. We used Google
searches with terms that get us close to sporadic malware repositories in
publicly accessible web pages. We salted that with some links we obtained
through sporadic searches in soft-core hacker forums. We did focus on Russian-language forums, but I
do not believe that this is controversial.
Meanwhile, the “cream of the crop” was supplied by some links we took
from traffic obtained through anonymous proxies. All this collection work was
done by unbiased people, those who are NOT in the business of hacking nor
employed by antivirus companies.
Moreover if we inspect the claim made by antivirus vendors
with respect to what is the “right” set of malware samples, it actually
supports our finding. They claim that at the sample volume they are
dealing with, 100K per day, they achieve higher than 90% detection (98%
according to one report). Even at 98%, that leaves 2,000 undetected malware samples out of
100K. How hard do you think it is for an attacker (and I intentionally did not
add the term “skilled”) to get his hands on a couple of those 2,000 undetected
samples? I should add that, of the samples we collected and tested, we included
in our statistics only those that were eventually detected
by a large enough number of AV products, and that none of them was brand-new
malicious code – rather, they were all variations and instances of existing
malware.
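To make the arithmetic behind the 2,000 figure concrete, here is a minimal sketch in Python. The function name `missed_samples` is hypothetical, and the inputs are simply the figures quoted above (100K samples per day, 90% and 98% detection), not new measurements.

```python
def missed_samples(daily_samples: int, detection_rate: float) -> int:
    """Return how many samples go undetected at the given detection rate."""
    # Undetected share is (1 - detection_rate); round to absorb float error.
    return round(daily_samples * (1 - detection_rate))


if __name__ == "__main__":
    # Figures quoted in the text: 100K samples per day, 90% and 98% detection.
    for rate in (0.90, 0.98):
        missed = missed_samples(100_000, rate)
        print(f"{rate:.0%} detection -> {missed} undetected samples per day")
```

Even at the more favorable of the two quoted rates, the undetected pool is measured in thousands of samples per day, which is the point of the question above.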
Criticism #2: Using VirusTotal
The second part of the criticism touches on our use of VirusTotal.com (VT) as a
tool for conducting an experiment related to AV effectiveness. We recognize the
limitations of using VT, and described those limitations in our paper. However, bottom line first – we are not the
first to publish comparative studies of AV efficiency or to publish some
analysis of AV efficiency based on VT. We drew explicit conclusions that are
not put in technical terms but in plain business terms – organizations should
start shifting their budgets to other solutions for the malware infection
problem.
The first and foremost statement made by critics is “you
should not have used VT because they say so.” Again, here’s the bottom line – we
have used VT in a prudent and polite way. We did not use undocumented features,
we did not subvert APIs and we did not feed it with data with the purpose of
subverting results of AV vendor decisions (which is an interesting experiment
on its own). So basically, our wrongdoing with respect to VT is the way we
interpreted the results and the conclusions we drew from them – objecting to
that can only be called “thought police”. This is of course before mentioning
the fact that various recent reports and publications have been using VT for
the same purpose (including Brian Krebs). I know that VT does not claim or
purport to be an anti-malware detection tool and that VT is not intended to be
used as an AV replacement. However, they cannot claim to only be a collection
tool for the AV industry with results provided per sample being completely
meaningless. I must add that having an upload / get results API further disproves
that claim. I deeply regret being dragged into this debate with VT since I
truly value their role in the anti-malware community and have the utmost respect
for their contribution to improving AV detection techniques and malware
research.
One of the most adamant arguments against the validity of VT
as a measure of effectiveness is that it uses the command-line versions of AV
products and that their configuration may not be ideal. I’d like to quote:
- VirusTotal uses command-line versions: that
also affects execution context, which may mean that a product fails to detect
something it would detect in a more realistic context.
- It uses the parameters that AV vendors
indicate: if you think of this as a (pseudo)test, then consider that you’re
testing vendor philosophy in terms of default configurations, not objective
performance.
- Some products are targeted for the gateway:
gateway products are likely to be configured according to very different
presumptions to those that govern desktop product configuration.
- Some of the heuristic parameters employed are
very sensitive, not to mention paranoid.
Regarding the first point, I
personally do appreciate the potential difference between a command-line
version of an AV tool and other deployed versions. However, in terms of
signatures and reputation heuristics, I don’t really get it. I’d love to see AV
vendors explain that difference in detail, and in particular point out which
types of malware are detected by their other versions but not by their
command-line version, and why. I am certainly willing to accept that
our results would have been somewhat different had we tested an installed
version of the product rather than the command-line version. However, I do
think that they are a good approximation. If AV vendors claim that this is
far from true, I’d really like to see the figures. Is the command-line version 10%,
50% or 90% less effective than the product?
I don’t see the point in the second argument. Are they
really claiming that VT configuration is not good because it is the recommended
vendor configuration?
As for the third argument, this is really puzzling.
According to this, we should have experienced a high ratio of false positives,
rather than the high ratio of false negatives that we have observed in
practice.
Quoting again:
VirusTotal is self-described as a TOOL, not a SOLUTION: it’s a highly collaborative
enterprise, allowing the industry and users to help each other. As with any
other tool (especially other public multi-scanner sites), it’s better suited to
some contexts than others. It can be used for useful research or can be misused
for purposes for which it was never intended, and the reader must have a
minimum of knowledge and understanding to interpret the results correctly. With
tools that are less impartial in origin, and/or less comprehensively
documented, the risk of misunderstanding and misuse is even greater.
Again, the writer agrees that VT is indeed a tool that can
be used for research as long as results are correctly interpreted. Yes, it is
possible that we’ve misinterpreted the results. If that is your opinion then
argue with our interpretation of the results. Unfortunately most critics chose
not to do so, but rather argued that we used the wrong tools.
Epilogue
I could continue; however, I think that I’ve addressed the main criticism
against our work and shown that most of it is immaterial. I would
like to see a livelier debate around our interpretation of the results and the
conclusion – AV solutions attempting to prevent infection have reached a point
of diminishing returns and are thus providing attackers with a large enough
window of opportunity time-wise and device-wise to penetrate organizations and
remain undetected for extremely long periods. It does not mean that we have to throw
AV solutions away; it just means that we need to start shifting some of the
money towards solutions that detect and prevent the effects of infection.