Still Don't Like Our AV Study? A Response to the Critics
Imperva CTO Amichai Shulman:
Let me start by saying that I'm not a big fan of back-and-forth argumentative discussions taking place in the blogosphere. However, the religious rage that erupted over the past couple of weeks over our paper, Assessing the Effectiveness of Antivirus Solutions, compels me to respond.
To avoid dragging the reader through a lengthy text full of complex arguments, I'll take this backwards (think of the "Backwards Episode" of Seinfeld). The bottom line is that many people have questioned the core aspects of our research: the choice of malware samples and the method of sample evaluation. However, even among those who have questioned our methodology, there seems to be a consensus around our conclusions – that standard AV solutions have reached the point of diminishing returns and that organizations must shift their investments towards other solutions that protect them from the effects of infection. I have to assume that if our methodology leads, in a logical way, to conclusions that are so widely accepted, it can't be all that wrong.
Criticism #1: Sampling
The first part of the criticism targeted our choice of malware samples. Let me again put the bottom line first – our critics basically claim that our results are so different from theirs because the method we used to collect the samples is incorrect. Let me put this in different words: if attackers choose malware the way AV vendors instruct them to, detection rates are blissful. If attackers choose malware in any other manner, you're toast.
Poor sampling would be a fair argument if we had used some mysterious collection technique available only to high-end criminal gangs. That is, of course, not the case. We used Google searches with terms that led us to sporadic malware repositories on publicly accessible web pages. We salted that with links obtained through casual searches in soft-core hacker forums. We did focus on Russian-language forums, but I do not believe that this is controversial. Meanwhile, the "cream of the crop" was supplied by links we took from traffic obtained through anonymous proxies. All of this collection work was done by unbiased people – people who are NOT in the business of hacking and are not employed by antivirus companies.
Moreover, if we inspect the claim made by antivirus vendors about what the "right" set of malware samples is, it actually supports our findings. They claim that across the sample volume they deal with – 100K per day – they achieve higher than 90% detection (98% according to one report). That is, they miss 2,000 malware samples out of every 100K. How hard do you think it is for an attacker (and I intentionally did not add the term "skilled") to get his hands on a couple of those 2,000 undetected samples? I should add that the samples we included in our statistics – out of all the samples we collected and tested – are those that were eventually detected by a large enough set of AV products, and that none of them was brand-new malicious code; rather, they were all variations and instances of existing malware.
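The arithmetic above is worth making explicit. A quick sanity check of the vendors' own figures (100K new samples per day, 98% detection, as cited in the text):

```python
# Back-of-the-envelope check of the vendors' own numbers:
# ~100K new samples per day at a claimed 98% detection rate.
samples_per_day = 100_000
detection_rate = 0.98

# Samples that slip through every single day.
missed_per_day = round(samples_per_day * (1 - detection_rate))
print(missed_per_day)  # 2000
```

Even at the most optimistic detection rate the vendors cite, an attacker who samples from the daily stream has thousands of undetected candidates to choose from.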
Criticism #2: Using VirusTotal
The second part of the criticism concerns our use of VirusTotal.com (VT) as a tool for conducting an experiment on AV effectiveness. We recognize the limitations of using VT and described them in our paper. However, bottom line first – we are not the first to publish comparative studies of AV efficiency, or to publish analysis of AV efficiency based on VT. We drew explicit conclusions that are put not in technical terms but in plain business terms: organizations should start shifting their budgets to other solutions for the malware infection problem.
The first and foremost statement made by critics is "you should not have used VT because they say so." Again, here's the bottom line – we used VT in a prudent and polite way. We did not use undocumented features, we did not subvert APIs, and we did not feed it data with the purpose of subverting the results of AV vendor decisions (which would be an interesting experiment on its own). So basically, our wrongdoing with respect to VT is the way we interpreted the results and the conclusions we drew from them – and objecting to that deserves no term other than "thought police." This is before mentioning the fact that various recent reports and publications have used VT for the same purpose (including Brian Krebs). I know that VT does not claim or purport to be an anti-malware detection tool and that VT is not intended to be used as an AV replacement. However, they cannot claim to be merely a collection tool for the AV industry whose per-sample results are completely meaningless – and offering an upload / get-results API further disproves that claim. I deeply regret being dragged into this debate with VT, since I truly value their role in the anti-malware community and have the utmost respect for their contribution to the improvement of AV detection techniques and malware research.
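For readers unfamiliar with the upload / get-results API mentioned above, here is a minimal sketch of how a VirusTotal v2 file report can be fetched and interpreted. The endpoint and the `response_code`/`positives`/`total` fields come from VT's public v2 API documentation; the sample report dict below is illustrative data, not a result from the study:

```python
import json
import urllib.parse
import urllib.request

VT_REPORT_URL = "https://www.virustotal.com/vtapi/v2/file/report"

def fetch_report(api_key: str, resource: str) -> dict:
    """Fetch a VT v2 scan report for a sample by hash (requires an API key)."""
    params = urllib.parse.urlencode({"apikey": api_key, "resource": resource})
    with urllib.request.urlopen(f"{VT_REPORT_URL}?{params}") as resp:
        return json.load(resp)

def detection_ratio(report: dict) -> float:
    """Fraction of engines flagging the sample, using VT's positives/total fields."""
    if report.get("response_code") != 1:  # 1 means the sample is known to VT
        raise ValueError("sample not present in VirusTotal")
    return report["positives"] / report["total"]

# Illustrative report in the v2 response format (made-up numbers):
sample_report = {"response_code": 1, "positives": 3, "total": 46}
print(f"{detection_ratio(sample_report):.1%}")  # 6.5%
```

A low `positives`/`total` ratio for a sample is exactly the kind of per-sample signal the study aggregated; whether that ratio approximates real-world desktop detection is the point under debate in the criticism discussed next.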
One of the most adamant arguments against the validity of VT as a measure of effectiveness is that it uses the command-line versions of AV products, whose configuration may not be ideal. I'd like to quote:
- VirusTotal uses command-line versions: that also affects execution context, which may mean that a product fails to detect something it would detect in a more realistic context.
- It uses the parameters that AV vendors indicate: if you think of this as a (pseudo)test, then consider that you’re testing vendor philosophy in terms of default configurations, not objective performance.
- Some products are targeted for the gateway: gateway products are likely to be configured according to very different presumptions to those that govern desktop product configuration.
- Some of the heuristic parameters employed are very sensitive, not to mention paranoid.
Regarding the first point, I personally do appreciate the potential difference between the command-line version of an AV tool and its other deployed versions. However, in terms of signatures and reputation heuristics I don't really get it. I'd love to see AV vendors explain that difference in detail, in particular pointing out which types of malware are not detected by their command-line versions but are detected by their other versions, and why. I am certainly willing to accept that our results would have been somewhat different had we tested an actually installed version of the product rather than the command-line version. However, I do think they are a good approximation. If AV vendors claim that this is far from true, I'd really like to see the figures. Is the command-line version 10%, 50% or 90% less effective than the product?
I don’t see the point in the second argument. Are they really claiming that VT configuration is not good because it is the recommended vendor configuration?
As for the third argument, this is really puzzling. If it were true, we should have experienced a high ratio of false positives, rather than the high ratio of false negatives that we observed in practice.
VirusTotal is self-described as a TOOL, not a SOLUTION: it’s a highly collaborative enterprise, allowing the industry and users to help each other. As with any other tool (especially other public multi-scanner sites), it’s better suited to some contexts than others. It can be used for useful research or can be misused for purposes for which it was never intended, and the reader must have a minimum of knowledge and understanding to interpret the results correctly. With tools that are less impartial in origin, and/or less comprehensively documented, the risk of misunderstanding and misuse is even greater.
Again, the writer agrees that VT is indeed a tool that can be used for research, as long as the results are correctly interpreted. Yes, it is possible that we've misinterpreted the results. If that is your opinion, then argue with our interpretation of the results. Unfortunately, most critics chose not to do so, arguing instead that we used the wrong tools.
I could continue; however, I think I've addressed the main criticism of our work and shown that most of it is immaterial. I would like to see a livelier debate around our interpretation of the results and our conclusion: AV solutions that attempt to prevent infection have reached a point of diminishing returns and are thus providing attackers with a window of opportunity – in both time and devices – large enough to penetrate organizations and remain undetected for extremely long periods. This does not mean that we have to throw AV solutions away; it means that we need to start shifting some of the money towards solutions that detect and prevent the effects of infection.