Higgs Hunters Talk

Bias in simulations

  • davidbundy77 by davidbundy77

    Sadly I don't have any quantitative data to back this up, but it seems to me that the simulations are more likely to have off-center vertices than the real data, and also that the vertices in the simulations are more likely to have large numbers of tracks than those in the real data. When I see an image like this one, I keep thinking "Oh, that's a good one, bet it's a simulation." And it is!

    The conclusion is that the simulations are, presumably deliberately, sampling from a different set of interactions than the real data. The real data in this project is only a tiny, specially selected subset of all the data from the LHC, and it seems that the simulated interactions are in some sense even more special. Is it possible to briefly explain for non-physicists how the simulated data are produced and how they differ from the real data?

    Posted

  • brownfox by brownfox

    Andy Haas wrote a blog post on this topic a little while back: http://blog.higgshunters.org/2014/12/17/tell-me-the-truth/

    In essence, the simulations assume that a Higgs will decay to a new type of neutral particle, which will exist for long enough to move off centre before decaying. So it is hardly surprising that you get more OCVs (off-centre vertices) in the simulations.

    Cheers
    Steve

    Posted

  • davidbundy77 by davidbundy77 in response to brownfox's comment.

    Thanks for reminding me of that link. The explanation is very helpful for understanding the purpose of the simulations.

    This does mean that if we continue to find lots of interesting events in the simulated data that do not show up in the real data, then the simulation model is wrong, which at least rules out some parameter combinations.

    There is, however, the danger that if the simulations are too far removed from the real data, they might become predictable and then be ignored by volunteers, which would defeat the purpose of the simulations.

    Posted

  • DZM by DZM admin

    I don't know about anyone else, but I've found some fairly cool OCVs in real data that I thought were simulations until the "simulation!" message didn't pop up.

    So hopefully they're not getting too predictable!

    Posted

  • brownfox by brownfox in response to davidbundy77's comment.

    I think 'wrong' is a bit misleading. The main purpose of the simulations is to find out how good we are at spotting new stuff, so the simulations have to be set up as if that new stuff really exists. No, it doesn't reflect reality, but then it was never intended to.

    You're right that there's a danger if the simulations are too far removed from real data though. Perhaps it would be better to have some of the simulations doing 'ordinary' decays as well as some new ones. And a smaller percentage of simulations would help too (!)

    Posted

  • DZM by DZM admin

    The percentage should drop in about a week (or a little more); that's when we're projecting that we'll have new data!

    Posted

  • davidbundy77 by davidbundy77

    I tried predicting the simulations. From a run of 20 simulations I managed to predict 13. Most of those I missed were the boring ones. I also thought that two real events were simulations.

    This doesn't really prove anything. In many ways the simulations look quite realistic. In particular, they often have messy tracks, artifacts and other weird stuff just like the real ones, which makes me wonder where all these things come from.

    Posted

  • davidbundy77 by davidbundy77

    Now that we can (probably) identify the simulations in Talk, the discrepancy between the number of OCVs in simulations and the real data is even clearer. It seems that there are lots of simulations of things that don't exist.

    Posted

  • DZM by DZM admin

    Maybe it's just simulations of things that are really rare. In Worm Watch Lab, the egg-laying stuff is only seen once every 30-50 videos. Unfortunately, a fair bit of science can be sorting through a whole lotta nothing to find the good stuff...

    Posted

  • andy.haas by andy.haas scientist

    Thanks for the interesting thread, davidbundy77.

    Yes, the rate of events like the ones simulated must indeed be small. As I wrote about in this blog post:
    http://blog.higgshunters.org/2014/12/06/what-youre-seeing-on-higgshunters/,
    there are only about 50 selected Z+Higgs events expected in the 2012 data that you are looking at on HiggsHunters. We know, since we have already seen the Higgs decay some fraction of the time, that the fraction of exotic decays like the ones we simulate here cannot be more than about 20%. So there are at most ~10 events like the ones we simulate in the HiggsHunters data, if the process is real.

    Given that there are ~50,000 events of data on HiggsHunters, that means that to see the new signal events above the false-positive noise, it would be good to have a false-positive rate (the chance of marking a data event with no real signal in it as having signal in it) of less than 1 in 5,000!
    So just as important as (or even more important than!) properly identifying the simulated events is not marking the data events as having signal (unless there's real signal in there - in which case the problem is different!).
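
    For anyone curious where those figures come from, here is a minimal back-of-the-envelope sketch of the arithmetic in Python (just an illustration plugging in the approximate numbers quoted above, not actual analysis code):

        # Rough check of the numbers quoted above (approximate figures, not exact analysis values).
        expected_zh_events = 50        # selected Z+Higgs events expected in the 2012 data
        max_exotic_fraction = 0.20     # exotic decays can make up at most ~20% of Higgs decays
        total_data_events = 50_000     # roughly the number of data events on HiggsHunters

        max_signal_events = expected_zh_events * max_exotic_fraction
        print(f"At most ~{max_signal_events:.0f} exotic-decay events in the data set")

        # To keep false positives below the expected signal, the per-event
        # false-positive rate should be smaller than signal / total data:
        target_fp_rate = max_signal_events / total_data_events
        print(f"Target false-positive rate: about 1 in {1 / target_fp_rate:.0f}")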

    Don't be too worried, however, if your personal rate of misidentification is much larger than 1 in 5,000. We will have all sorts of tricks to reduce the overall misclassification rate on an event once we combine all your responses. Recall that an event is classified by many people, so we can average and throw out outliers, etc. There are also 3 views of the same event. Also, the "magnitude" of the OCV(s) matters - 4 OCVs with 10+ tracks each is impressive, so hopefully you won't mark too many of those in data (again, unless there's real signal!). Finally, the position of the OCVs matters - we know to expect a much larger rate of fake OCVs in data where the detector tracking sensors and electronics are located, due to interactions of particles with them - these OCVs can be discarded later on (and are also a nice source of calibration for you!).
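
    To give a feel for what that combination step might look like, here is a purely hypothetical toy sketch in Python (the function name, grid size and vote threshold are made up for illustration; this is not the actual HiggsHunters pipeline). It pools every volunteer's vertex marks for one event and keeps only vertices that several independent people agree on:

        from collections import Counter

        def combine_marks(volunteer_marks, min_votes=3, grid=5):
            """Toy aggregation of volunteer vertex marks for one event.

            volunteer_marks: one list of (x, y) positions per volunteer.
            Marks are snapped to a coarse grid so that nearby clicks count as
            agreement, and a vertex is kept only if at least `min_votes`
            volunteers marked it. Illustrative only - not the real pipeline.
            """
            votes = Counter()
            for marks in volunteer_marks:
                # Each volunteer votes at most once per grid cell.
                cells = {(round(x / grid), round(y / grid)) for x, y in marks}
                votes.update(cells)
            return [(cx * grid, cy * grid) for (cx, cy), n in votes.items() if n >= min_votes]

        # Example: five volunteers mark roughly the same vertex; one also marks a stray point.
        marks = [
            [(101, 52)], [(99, 50)], [(102, 49)], [(100, 51), (300, 200)], [(98, 52)],
        ]
        print(combine_marks(marks))   # the lone outlier at (300, 200) is thrown out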

    Posted

  • DZM by DZM admin

    Interesting look into the overall methodology, @andy.haas ... thank you! 😃

    Posted

  • davidbundy77 by davidbundy77 in response to andy.haas's comment.

    Thank you for this detailed and very informative answer. I now have a much better idea of how you are using our classifications in combination with the simulated data.

    Let us hope that your estimate of the number of exotic decays is good; otherwise there might not be any such events in this data set.

    Posted