
I analysed bot traffic with the latest security techniques. Here's what I found.

08-28-2017 11:59 AM #1 caurmen (Administrator)

Over the last couple of weeks, based on this article about detecting headless browsers and this linked article about detecting PhantomJS, I've been doing a bunch of research into bot traffic.

What started out as a "let's write a better bot blocker" project turned into more of an open-ended investigation into the kind of bot traffic we see on popular traffic sources.

What's the code behind fraudulent traffic? How sophisticated is it? And how can you detect or block it?

Even in this early-stage research, which was conducted on a single popular traffic source, I found some pretty surprising results. I initially tested using very low bids in a few countries including the USA, which I know from experience is a good way to get lots and lots of bot placements. Subsequently I verified my tests using a very high bid on premium traffic.

All tests were conducted on wifi traffic to reduce false positives from click loss, and because wifi traffic tends to have more bots.

TL;DR Summary

Sophisticated bot traffic appears to be very, very rare.
The single most effective simple test for bot traffic was "can it parse Javascript".
"Does it even download my landing page" is another effective test.
navigator.languages appears to detect some probably-bot traffic.

Findings: Point By Point

Sophisticated Headless Browsers are nowhere to be seen

I implemented most of the techniques in the above article, including WebGL vendor detection, navigator.languages checking, missing JS functions, suspiciously fast alert box closing and of course useragent checking.

I didn't have much hope that useragent checking would work, because that's a really obvious thing to spoof, but some of the other checks would be an absolute pain to bypass.

The results surprised me: in tens of thousands of hits, I didn't detect a single definitive instance of either headless Chrome or PhantomJS! I also saw precisely zero bots rapidly closing alert boxes. This applied both to cheap traffic and high-end premium traffic.

Interesting. But it gets more interesting...

Here's the source code for the most common bot I saw...

After the initial surprising results above, I thought a bit about what could be causing them.

What's the simplest "bot" one could possibly use, and how could I detect it?

Well, at the absolute most basic level, all a bot really needs to do is spoof a useragent and send a request to the URL provided by the traffic source. It doesn't need to do anything with the HTML it receives - it doesn't even need to parse it. That would mean that it'll show up on your tracker as a hit - but it won't even get past the tracker's redirect.

The "source code" for a bot like that would look like this:

Code:

curl -A "UserAgentString" http://url.com

Seriously, that's it.

So how do we detect that bot? Well, the simplest way is to drop a straight-up HTML meta redirect in before the main landing page. With a total size of about 150 bytes and sitting on the same IP and domain as the main lander, the additional connect and load time for that should be absolutely negligible, so any real browser should get through it and to the main pop.
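
As a rough illustration of how small such a redirect page can be, here is a minimal sketch that builds one in Python and prints its size. The function name and the /lander.html path are placeholders I've made up for the example, not anything from the original test setup.

```python
import html


def build_meta_redirect_page(lander_url: str) -> str:
    """Return a tiny HTML page that forwards the visitor via a meta refresh."""
    # Escape the URL so it is safe inside a double-quoted HTML attribute.
    safe_url = html.escape(lander_url, quote=True)
    # content="0;url=..." asks the browser to redirect immediately.
    return (
        "<!DOCTYPE html><html><head>"
        f'<meta http-equiv="refresh" content="0;url={safe_url}">'
        "</head><body></body></html>"
    )


# Hypothetical lander path on the same domain as the redirect page.
page = build_meta_redirect_page("/lander.html")
print(len(page.encode("utf-8")), "bytes")
```

A client that only fires a request at the tracker URL never follows this redirect, so it never requests the lander itself.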

Depending on the country, between 25% and 66% of all the cheap pop traffic I threw into that funnel didn't make it past the meta redirect! Some of that may have been click loss, but I also tested this in first-world countries on wifi connections only, where the click loss should be close to zero.

Practically speaking, this is actually great news. Using web server or CDN logs, it's very easy to tell if actual requests for the HTML of your landing page have occurred. If those spectacularly fail to match up to the hits recorded in your tracker, that's pretty solid evidence that the traffic wasn't really humans - or even real web browsers - at all.
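
One way that comparison could be sketched, assuming the web server writes common/combined-format access logs and that the placement ID is passed to the lander in a query parameter. The parameter name, the lander path and the sample numbers below are all hypothetical.

```python
import re
from collections import Counter
from urllib.parse import parse_qs, urlsplit

# Matches the request part of a common/combined-format access log line.
REQUEST_RE = re.compile(r'"GET (?P<target>\S+) HTTP/[^"]*"')


def count_lander_fetches(log_lines, lander_path="/lander.html", param="placement"):
    """Count GET requests for the lander, grouped by placement parameter."""
    fetches = Counter()
    for line in log_lines:
        match = REQUEST_RE.search(line)
        if not match:
            continue
        url = urlsplit(match.group("target"))
        if url.path != lander_path:
            continue
        placement = parse_qs(url.query).get(param, ["unknown"])[0]
        fetches[placement] += 1
    return fetches


def missing_fetch_rate(tracker_hits, fetches):
    """Share of tracker hits per placement with no matching lander request."""
    return {
        placement: max(0.0, 1 - fetches.get(placement, 0) / hits)
        for placement, hits in tracker_hits.items()
        if hits > 0
    }


# Made-up sample data: the tracker recorded 4 hits on A and 1 hit on B.
log = [
    '203.0.113.5 - - [28/Aug/2017:11:59:00 +0000] "GET /lander.html?placement=A HTTP/1.1" 200 512',
    '203.0.113.9 - - [28/Aug/2017:11:59:02 +0000] "GET /lander.html?placement=B HTTP/1.1" 200 512',
    '203.0.113.9 - - [28/Aug/2017:11:59:03 +0000] "GET /favicon.ico HTTP/1.1" 404 0',
]
rates = missing_fetch_rate({"A": 4, "B": 1}, count_lander_fetches(log))
print(rates)
```

A placement where most tracker hits have no matching lander request is the kind of mismatch described above.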

Most bots don't even parse Javascript

The most startling finding from my initial test wasn't that I didn't detect any headless browsers. It was that, even though none of the headless browser tests fired, most visitors still didn't make it through my testing page at all!

The reason for that's pretty simple: my testing page entirely relied on Javascript-based redirects. If the visitor wasn't a browser with Javascript enabled (like, you know, 99.99% of all real human visitors ever), it couldn't get through the page.

Of the low-bid traffic that I tested, between 38% and 50% of all traffic that passed the meta refresh test, above, failed to parse a Javascript-based redirect. In combination, the two tests filtered out up to 80% of all the traffic I saw!
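
To make the arithmetic explicit: the second percentage only applies to traffic that survived the first test, so the two rates multiply rather than add. A small sketch, using 60% and 50% as illustrative figures picked from within the ranges quoted above (they are not measurements for any one country):

```python
def combined_filter_rate(meta_fail_rate: float, js_fail_rate: float) -> float:
    """Fraction of all traffic removed by the two tests applied in sequence.

    js_fail_rate is measured only on traffic that already passed the meta test.
    """
    survivors = (1 - meta_fail_rate) * (1 - js_fail_rate)
    return 1 - survivors


print(round(combined_filter_rate(0.60, 0.50), 2))  # 0.8
```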

Notably, on premium traffic I saw far fewer visitors who failed either this or the meta redirect test. That matches perfectly with what I'd expect and implies that the test is genuinely detecting bot traffic.

As a side note, no visitor at all successfully parsed a noscript statement in my code. That's a good test for browsers which genuinely have scripts disabled - they should parse the noscript block fine. So what I was seeing were almost certainly bots, not a sudden flood of paranoid people with Javascript disabled.
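
Put together, the tests so far sort each tracked click into one of a few buckets. The sketch below shows that logic, assuming you can record per-click events somewhere; the event names are invented for the example.

```python
def classify_click(events: set) -> str:
    """Label one tracked click by how far it got through the test funnel."""
    if "page_fetched" not in events:
        # The tracker saw a hit, but nothing ever requested the test page.
        return "never_fetched_page"
    if "js_redirect_followed" in events:
        # Executed the script, as a normal browser would.
        return "ran_javascript"
    if "noscript_beacon" in events:
        # Rendered the noscript block: a real browser with scripts disabled.
        return "scripts_disabled"
    # Downloaded the HTML but neither ran the script nor rendered noscript.
    return "fetched_but_no_js"


print(classify_click({"page_fetched"}))
```

The last bucket is the interesting one: a client that fetches the HTML but does nothing with it, which is not how a browser with scripts disabled behaves.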

False Positives To Watch Out For

So most bots are dumb as a sack of bricks.

I continued to test, and did find one test which appeared to filter out a few bots. Testing navigator.languages resulted in a small number of detections of users who appeared to have no language set in their browser (which should never happen unless you're running a headless browser - a bot, in other words).

Those appeared to be entirely limited to people who were detected as using the Samsung browser. However, plenty of Samsung devices showed up with navigator.languages set normally too. Aha! A sophisticated bot in the wild!

...Not so fast. More investigation shows that all visitors using Samsung Browser version 5 show up as not having a language set in their browser. This appears to be a genuine bug with the browser, not evidence of fraudulent traffic.

As this is one of the bot detection techniques mentioned by a lot of security researchers, this is worth knowing. Don't assume those visitors are bots.

On a related note, I dumped another technique mentioned in the above articles before even getting to testing. navigator.plugins.length works very well to detect Chrome Headless and PhantomJS - but thanks to a security update in Firefox, it also detects ALL Firefox browsers as bots. So don't use that one.
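
A sketch of how a server-side check might encode both caveats, assuming the page reports the value of navigator.languages back to the server somehow (that transport is not shown and is an assumption). The function name is made up; the "SamsungBrowser/<version>" token is how that browser identifies itself in its useragent.

```python
import re

SAMSUNG_RE = re.compile(r"SamsungBrowser/(\d+)")


def languages_check_is_suspicious(user_agent: str, languages) -> bool:
    """Flag an empty navigator.languages, except for the known Samsung case."""
    if languages:
        # A non-empty language list looks like a normal browser.
        return False
    match = SAMSUNG_RE.search(user_agent)
    if match and int(match.group(1)) == 5:
        # Samsung Browser 5 reports no languages, so don't count it as a bot.
        return False
    return True


# Note: navigator.plugins.length is deliberately not used as a signal here,
# because it flags all Firefox browsers as bots.
samsung_ua = (
    "Mozilla/5.0 (Linux; Android 7.0; SAMSUNG SM-G930F Build/NRD90M) "
    "AppleWebKit/537.36 (KHTML, like Gecko) SamsungBrowser/5.4 "
    "Chrome/51.0.2704.106 Mobile Safari/537.36"
)
print(languages_check_is_suspicious(samsung_ua, []))
```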

Practical Outcomes And Next Steps

So what does this mean in practice?

Well, it means that it's possible to inject a filter for most bot traffic before your main lander. I'll write up a separate how-to on this, but in short, just stick a simple page containing nothing but a Javascript redirect before your lander. Bot traffic of the kind I'm describing won't make it past that redirect, and will be easy to spot.
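
Pending that how-to, here is one possible shape for such a filter page, built in Python. It also includes the noscript fallback mentioned earlier, so that real browsers with scripts disabled can be told apart from clients that ignore the page entirely. Both URLs and the function name are placeholders, and this is only a sketch, not the exact page used in the tests above.

```python
import html
import json


def build_js_filter_page(lander_url: str, beacon_url: str) -> str:
    """Return a page whose only job is to redirect via JavaScript."""
    # json.dumps yields a quoted string literal that is valid JavaScript for
    # ordinary URLs; both URLs are assumed to be trusted, fixed values.
    js_target = json.dumps(lander_url)
    beacon = html.escape(beacon_url, quote=True)
    return (
        "<!DOCTYPE html><html><head>"
        f"<script>window.location.replace({js_target});</script>"
        "</head><body>"
        # Only browsers with scripting disabled render the noscript block.
        f'<noscript><img src="{beacon}" width="1" height="1" alt=""></noscript>'
        "</body></html>"
    )


# Hypothetical paths on the same domain as the filter page.
page = build_js_filter_page("/lander.html", "/noscript.gif")
print(page)
```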

This research also implies that, currently, there's not much point using industry-standard techniques to try to filter out sophisticated bots running headless browsers. Either the bot runners are so good that they evade all these checks, or (more likely in my opinion) 95%+ of the bot traffic out there is, as previously mentioned, dumb as a sack of bricks, and isn't using anything like as sophisticated as a current-generation headless browser setup.

I'd be interested to look at any premium placements that people have discovered which appear to be bot or otherwise weird traffic. Have you found any placements that pass all simple bot checks, but have unnaturally tiny conversion rates even across huge traffic samples? If so, let me know below, and I may run specific checks on them to see if there are more complex bots at work there.

And I'd be very interested to hear from other people who have done tests or implemented filters for bot traffic. What sort of bot traffic have you seen out there? Are there more sophisticated bots in the wild, and if so, how have you detected them?

Let me know below!

08-28-2017 01:47 PM #2 rolandb

Thanks so much @caurmen, very valuable stuff! Good to know the bots are still dumb, for non-premium placements at least.

08-28-2017 02:50 PM #3 caurmen (Administrator)

TBH I'd love to find some less-dumb bots in the wild so I could study them! Any suggestions for places (placements etc) to look gratefully received.

08-29-2017 09:43 AM #4 thethrone (Member)

Have a look at this: https://www.whiteops.com/hubfs/Resou...eration_WP.pdf

I've personally seen various very advanced bots around, though none like the one above (they even created their own HTTP library).

There are probably more of these types of bots out in the wild.

08-29-2017 09:55 AM #5 caurmen (Administrator)

@thethrone - Very interesting, thanks!

Where have you seen those more advanced bots, if you don't mind me asking? I'm wondering if they're more common on some forms of traffic source than others.

08-29-2017 11:38 AM #6 thethrone (Member)

@Caurmen

I used to do consultancy work for a cyber security company. Most of what I found were sophisticated traffic-generating bots, some cloaked affiliate links and cookie stuffing. Others would just sell fake traffic on shady Russian traffic exchanges etc.

Most of it was targeting Google, YouTube and Instagram, and then a broad range of "fake/hacked" blogs and websites.

I will see if they still have some of the research available, then I can send you a detailed list.

08-29-2017 02:24 PM #7 bachroll (Member)

Interesting thread, and very useful for trying to stop bots from different traffic sources. This brings me to something I read not long ago in another forum, in Spanish. I think the user is also a member of STM, and he is very helpful in giving advice to other people. Basically, the system is a first test to eliminate bad placements that are sending bot traffic. The example he gave used PopAds, but it could also work with other networks.

I haven’t tried this method yet, but maybe you already know what it’s all about (maybe it doesn’t even work anymore). The idea is to create a campaign using the “PrimeSpot only” selection.

We also have to bid very low in order to get the traffic that “nobody wants”. This means that we are getting the placements where the bot traffic is coming from and therefore we can use the “Exclude Websites” after running this campaign for about three hours. The budget is up to you but the more invested the more bad placements that will be found. In fact, in the method that I read in this Spanish forum we can repeat the process several times increasing the bid and ruling out more bot placements.

Once we have a big list of excluded Website IDs, we can create a new campaign including the list and start bidding high, as we would usually do. The risk of this is that we can rule out good placements. As I said, this is not my method and I don’t want to get credit for it. I will be trying this system in my next campaign, so I hope it works. What do you think?

08-31-2017 10:47 AM #8 caurmen (Administrator)

I'd recommend pairing that approach with some kind of bot detection if you're going to do it.

Even very low-bid placements, in my testing, sometimes appeared to have quite high rates of non-bot traffic. Of course there's a variety of other things that could be going on to make them less valuable, but I'd still be cautious about a blanket "exclude all cheap placements" approach.

08-31-2017 11:13 AM #9 thethrone (Member)

This is another Bot, with a very specific purpose: https://krebsonsecurity.com/2017/08/...-intimidation/

08-31-2017 12:31 PM #10 bbrock32 (Administrator)

Great case study Caurmen!

One question though I've been asking myself.

Why is it so important to detect bots if at the end of the day you just have two options:

1 - If a publisher is profitable, no matter if it's bot or not, keep it active.
2 - If a publisher is losing money, no matter if it has 0% bot traffic, you have to pause it.

So I think a better way to optimize publishers is just to focus on cost vs revenue.

The only scenario where I would see bot detection being useful is on sources where you can actually ask for refunds.

08-31-2017 01:44 PM #11 ervin (Senior Member)

Originally Posted by bbrock32

I think it is important to detect bots because you can then build bot-lists, and you can save money and time needed to block those again on your next campaigns.

That said, it may not necessarily be a good move to put a badly performing publisher on the same list as bots, because maybe it was the offer, the lander or something else that made it unprofitable in that specific case.

09-01-2017 10:09 AM #12 caurmen (Administrator)

@bbrock32 - that's a great question, and one I've thought about a lot.

To my mind, there are two really valuable uses of bot detection:

1) Early detection of placements that are very unlikely (or worse) to convert profitably because of the amount of bot traffic. Bot testing is ridiculously fast and cheap compared to the conventional approach of waiting for statistical significance on a placement, so it can be a huge money-saver.

For example: if your team is testing a source/geo combination that currently has 1000 placements, and 100 of those are so high-bot-traffic that they're spectacularly unlikely to ever generate positive ROI (say, above 85% bot traffic at the same bid as everything else), it'll cost around 200x - 300x your payout to filter all of those out by waiting for conversion results. However, your team can get solid bot testing results with 100 or so impressions on that placement, which (broad strokes here) is likely to be between 1/8th and 1/16th of the cost. (Depending on your methodology you could go even lower than that, but I'm being conservative here and assuming that the bot traffic isn't 100% consistent.)
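
To put rough numbers on that, here is the same estimate as a small calculation. It only restates the multipliers above (200x - 300x payout for the conversion-based approach, and 1/8th to 1/16th of that for bot testing); the $2 payout is a made-up figure for illustration.

```python
def filtering_cost_range(payout: float):
    """Return (conversion_based, bot_test_based) costs as (low, high) tuples."""
    # Filtering the 100 bad placements by waiting for conversion results.
    conversion = (200 * payout, 300 * payout)
    # Bot testing at between 1/16th and 1/8th of that cost.
    bot_test = (conversion[0] / 16, conversion[1] / 8)
    return conversion, bot_test


conversion, bot_test = filtering_cost_range(payout=2.0)
print(conversion)  # (400.0, 600.0)
print(bot_test)    # (25.0, 75.0)
```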

That's a pretty big money saver any time you're testing a new geo or traffic source, and it's still useful on a maintenance-mode campaign if you can run the bot test with minimal ROI drop, because of the daily addition of new fraudulent placements to the exchanges you're using.

Heck, depending on the traffic source, it's even worth running a test like this just to filter out the 100% bot placements. They're definitely not going to convert, and you can run the test with even less traffic. If a $10 spend helps you eliminate 30 high-traffic placements that are just pure 100% fraud right off the bat, that's a pretty good investment.

2) Knowing what percentage of bot traffic your high-value placements have can be really valuable for further optimisation, as it gives you much more accurate information on what's really going on down there.

If you know, for example, that you've got a huge volume, borderline-profitable placement with 60% bot traffic, you have information that you wouldn't have otherwise: you know that placement's human visitors are super-high-value. At that point, you can start looking for ways to eliminate the bot traffic from your bidding. Are there useragents it always uses or never uses? Does it ebb and flow by time of day? Can you narrow down the IP block that the bots are coming from and then eliminate that from your bidding? If you're buying at a large enough scale (as I know you are) you can potentially even talk to the traffic source directly about incorporating better bot filtering into their technology.
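
A minimal sketch of how the time-of-day and IP block questions could be explored, assuming you already have per-hit records with a bot flag from your own tests. The tuple layout is an assumption, the sample addresses come from documentation ranges, and grouping by /24 only makes sense for IPv4.

```python
import ipaddress
from collections import Counter


def summarise_bot_hits(hits):
    """Count bot-flagged hits by hour of day and by IPv4 /24 network.

    `hits` is an iterable of (hour, ip, is_bot) tuples.
    """
    by_hour, by_network = Counter(), Counter()
    for hour, ip, is_bot in hits:
        if not is_bot:
            continue
        by_hour[hour] += 1
        # strict=False lets us pass a host address and get its /24 network.
        network = ipaddress.ip_network(f"{ip}/24", strict=False)
        by_network[str(network)] += 1
    return by_hour, by_network


# Made-up sample hits.
sample = [
    (3, "198.51.100.7", True),
    (3, "198.51.100.200", True),
    (14, "203.0.113.4", False),
]
by_hour, by_network = summarise_bot_hits(sample)
print(by_hour.most_common(1), by_network.most_common(1))
```

The same grouping works for useragent strings if you add them to the records.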

Likewise, you can have your team look at the placement elsewhere and see if it's still producing that much bot traffic. Perhaps it's available on a bunch of other traffic sources but some of them have better bot filtering (and that implies that their other placements are better filtered too). Or they can check its bot percentage in other geos - I've noticed that bot traffic on placements varies widely by geo. You may be able to find another geo where that high-quality traffic has far fewer bots mixed in.

Finally, as you do more testing and run more campaigns, knowing what bot percentage each placement/campaign/source was running is a good way to further optimise. You may, for example, be able to conclude that 99% of placements with more than 40% bot traffic just don't hit profitability on Traffic Source A - and that may well be the only identifying factor all those placements have in common. You can tell your team to bot test and proactively eliminate those whenever they run a new campaign, giving you a close-to-unbeatable competitive advantage over someone coming in naively who doesn't have that information and thus has to eliminate those placements the expensive way.
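
A rule like that is easy to automate once you have bot counts per placement. The sketch below uses the 40% threshold from the example above and a minimum of 100 hits before judging a placement; the data layout, function name and placement IDs are hypothetical.

```python
def placements_to_block(stats, max_bot_share=0.40, min_hits=100):
    """Return placement IDs whose measured bot share exceeds the threshold.

    `stats` maps placement ID -> (bot_hits, total_hits).
    """
    blocked = []
    for placement, (bot_hits, total_hits) in stats.items():
        if total_hits < min_hits:
            # Not enough traffic yet to judge this placement.
            continue
        if bot_hits / total_hits > max_bot_share:
            blocked.append(placement)
    return sorted(blocked)


# Made-up test results for three placements.
stats = {"site_101": (90, 120), "site_102": (10, 150), "site_103": (30, 40)}
print(placements_to_block(stats))  # ['site_101']
```

The right threshold will differ by traffic source, so treat 40% as a starting point to validate against your own campaign data.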

(A minor additional note on this - I've seen some success in the past identifying huge placements that absolutely won't convert, and that barring a sophisticated test I'd just have assumed were bot-ridden hellholes and auto-blocked. I recall Grindr was like this: on general campaigns it converted horrifically, but bot-testing it revealed it had an almost zero bot count. Low-priced traffic (because it doesn't convert easily), huge volume, and very high quality in terms of being actual humans - it was a super-valuable source if you targeted campaigns directly to it. But without a bot test, it just looks like another bot placement.)

---

Other than those two uses (and the third use you mention of "ask for a refund", which in my experience is easier the more solid and technically-grounded your bot test is), I'd agree a lot of people get too hung up on bots. There are going to be bots in your traffic - it's the cost of doing business in 2017. If the placement's profitable despite the fact half the clicks you're paying for are just curl on a loop, it's still a profitable placement.

I think of bot testing as being just like any other source of tracking information: it gathers information, and that's it. Sometimes - in fact, quite often - you can use that information to predict behaviour, and make cutting decisions based on that.

12-08-2019 03:03 PM #13 guo19921230 (Member)

Hello, excuse me, I would like to ask a question, almost 2020, do you have a better way to filter the bot traffic? Thank you!

12-08-2019 07:09 PM #14 matuloo (Legendary Moderator)

Several trackers now have this incorporated, so you can check the level of suspicious traffic directly in the tracker and break it down by placement... Voluum, Binom and I think RedTrack too, they all have it.

Do you happen to use any of these ones?

12-10-2019 08:39 AM #15 redtrack (Member)

RedTrack has a Fraud Report feature. You can get an overview of it here:
https://stmforum.com/forum/showthrea...-fraud-traffic

or pm me for a demo

Aksana, RedTrack.io team
skype live:a.rudovich

12-11-2019 10:50 AM #16 voluum (Veteran Member)

Originally Posted by guo19921230

Hello, excuse me, I would like to ask a question, almost 2020, do you have a better way to filter the bot traffic? Thank you!

Since this was posted we've come up with our native bot test. Honeypot is designed to lure bots into clicking fake links and count traffic that they generated separately (just like this one) but is much easier to set up as it requires minimal technical knowledge.

I wrote a separate post about it:

The EASY way run a bot test — Voluum Honeypot

@vortex included a lesson on doing a bot test in her 40-day tutorial, you might want to take a look at her post too (she collected all important links and instructions in one place):

Day 28: Doing a Bot Test

Karolina
