YOLOv7: Trainable Bag-of-Freebies | HN Mirror


	YOLOv7: Trainable Bag-of-Freebies (arxiv.org)
	92 points by groar 1432 days ago

7 comments

sriku 1432 days ago

A rather tangential comment - this paper is an example of how NOT to write an abstract. An abstract is expected to tell me what new piece of knowledge I can learn by reading more. The content of this abstract is only 20% of what a real abstract should be .. the first half of the first sentence is almost all that's needed (it could include which archs it beats). The rest of the abstract needs to cover this (perhaps one sentence each) -

1. Intro - a note on the overall problem domain - object detection in this case, and a bit zoomed in to the DL space.
2. Related work - work so far in the domain .. without criticizing it.
3. Problem statement - what is the knowledge gap in the related work this paper is talking about.
4. Solution - how did we address the gap.
5. Validation - how do we claim our solution addressed the gap it was intended to address.

This paper's abstract covers only the last part and sporadically a bit of 2. What I want to know from this abstract is "what is the new learning in the yolov7 arch?"

Perhaps the bigger picture here is that it points to metrics chasing as a proxy for a "research agenda" in the ML community.

yeldarb 1428 days ago

We summarized the high level improvements here: https://blog.roboflow.com/yolov7-breakdown/

kylevedder 1432 days ago

Probably the most interesting trick from the paper is using the head as a soft supervisor for earlier layers of the network, with the intuition being that if the earlier layers learn to imitate the higher capacity later layers, it frees up the capacity of the later layers to better learn the residual and provides more dense supervisory signal.
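
To make the "soft supervisor" idea concrete, here is a minimal distillation-style sketch. It assumes the simplest reading of the comment above: the later (lead) head's class probabilities are treated as fixed soft targets for an earlier (auxiliary) head. The function names and the toy logits are made up for illustration, and this is not the paper's label assignment procedure, which as far as I can tell also involves the ground truth.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def soft_cross_entropy(student_logits, teacher_probs):
    """Cross-entropy of the student's distribution against soft teacher targets."""
    student_probs = softmax(student_logits)
    return -sum(t * math.log(s) for t, s in zip(teacher_probs, student_probs))

# Hypothetical class logits for one candidate box.
lead_logits = [2.0, 0.5, -1.0]       # later, higher-capacity head
aux_logits_good = [1.8, 0.4, -0.9]   # early head that roughly imitates the lead head
aux_logits_bad = [-1.0, 0.5, 2.0]    # early head that disagrees with the lead head

# The lead head's output is used as a constant target (no gradient would flow into it).
soft_targets = softmax(lead_logits)

loss_good = soft_cross_entropy(aux_logits_good, soft_targets)
loss_bad = soft_cross_entropy(aux_logits_bad, soft_targets)
print(loss_good, loss_bad)  # the imitating head gets the lower loss
```

The soft targets carry a full distribution per prediction rather than a single hard label, which is one way to read "more dense supervisory signal".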

lostmsu 1430 days ago

Yes, but to my surprise the "compound scaling" provides 3x more improvement in their ablation study. Also, I don't understand Table 8 in their ablation study for aux heads, specifically: why does it have different base benchmark values from Tables 6 and 7?

squarefoot 1432 days ago

As someone who got only his feet wet with OpenCV like 20 years ago, so basic shape recognition and no AI involved, what reading/software, etc. would you suggest to catch up and play with current technology without being inundated by theory that I'm sure I couldn't grasp?

montanalow 1432 days ago

Go to huggingface.com and start with some of the tutorials. The operational/engineering skill sets alone are all you need to treat modern ML models like any other black box API/SDK.

intpx 1431 days ago

They call it ‘Tasks’

https://huggingface.co/tasks

Tempest1981 1432 days ago

https://huggingface.co (no 'm')

synergy20 1432 days ago

Went there and there is lots of stuff indeed, but I failed to find anything related to "operational/engineering skill sets"?

mdda 1431 days ago

To just play with something : https://huggingface.co/spaces/nateraw/yolov6 (There's an images tab, and some samples below).

If you go to the associated code, you'll see that it needs a 'backbone', 'neck' etc. What is a backbone? Questions that arise directly from the code will lead you towards good blog articles, etc. https://huggingface.co/spaces/nateraw/yolov6/blob/main/yolov...

OTOH, you could go and have a look at (for instance) the Stanford vision courses for a more 'theoretical' approach. But the code itself is often a solid guide to what's going on (the frameworks used for Deep Learning map well onto what's being discussed in blogs/lectures/papers).

bj-rn 1431 days ago

MS put up some courses on github: https://microsoft.github.io/ML-For-Beginners

https://microsoft.github.io/AI-For-Beginners/

bigdict 1432 days ago

Start with theory you're sure you could grasp. Understand how convolutions work and that covers a good chunk of theory.

Here's a good resource: https://eli.thegreenplace.net/2018/depthwise-separable-convo....
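
For anyone coming from classic OpenCV filtering, a plain 2D convolution is the same sliding-window operation. The sketch below is a bare-bones version with no padding, stride 1, and a single channel, written as the cross-correlation that deep learning frameworks usually call "convolution". The `conv2d` helper and the toy image are illustrative, not taken from the linked article.

```python
def conv2d(image, kernel):
    """Slide `kernel` over `image` (lists of lists), no padding, stride 1."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    out = []
    for i in range(out_h):
        row = []
        for j in range(out_w):
            acc = 0
            # Weighted sum of the patch under the kernel.
            for di in range(kh):
                for dj in range(kw):
                    acc += image[i + di][j + dj] * kernel[di][dj]
            row.append(acc)
        out.append(row)
    return out

# Toy image: dark on the left, bright on the right.
image = [
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
]
# 1x2 kernel that responds where brightness increases from left to right.
kernel = [[-1, 1]]

edges = conv2d(image, kernel)
print(edges)  # [[0, 1, 0], [0, 1, 0], [0, 1, 0]]
```

In a trained network the kernel values are learned rather than hand-picked, but the arithmetic per output pixel is the same.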

isoprophlex 1432 days ago

Github repo mentions "teaser: Yolov7-mask" showing segmentation as well. Highly relevant to my interests. Sadly I can't easily discern any other info on this topic.

Does anyone know any more, maybe?

hwers 1432 days ago

What are you using it for, if you can share? I’ve thought about training some of these and releasing the weights, but I’ve never found a reason they’d really be useful personally, so it never really happened.

isoprophlex 1432 days ago

I'm working on a computer vision pipeline that relies heavily on segmentation to detect objects in video feeds. We capture about 6 hours of video each day. So being somewhat close to real time with our processing rate is important ...

anewpersonality 1432 days ago

We should stop calling it YOLO after the creator quit machine learning.

isoprophlex 1432 days ago

Especially hilarious considering some other people ALSO jumped on the "we made an object detector so let's call it YOLOvX" wagon and released...

Something called YOLOv7.

https://github.com/jinfagang/yolov7

DonHopkins 1431 days ago

Looking forward to the cat detector in YOLOv9.

binibus 1431 days ago

Why? For me at this point YOLO means a family of detectors that in a single pass propose a bounding box per pixel and filters them with some clustering algorithm. When I see YOLOfoo I know what kind of architecture to expect. A more descriptive name like YOLO-tricks instead of YOLOvX would be nice though.
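
The filtering step mentioned above is commonly done with greedy non-maximum suppression (NMS). The sketch below assumes that choice (the comment only says "some clustering algorithm"), with boxes given as `(x1, y1, x2, y2)` and a made-up IoU threshold of 0.5.

```python
def iou(a, b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def nms(boxes, scores, iou_threshold=0.5):
    """Greedy NMS: keep the best box, drop lower-scored boxes that overlap it too much."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        if all(iou(boxes[i], boxes[j]) <= iou_threshold for j in keep):
            keep.append(i)
    return keep

# Two near-duplicate proposals for one object, plus one separate object.
boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (50, 50, 60, 60)]
scores = [0.9, 0.8, 0.7]

kept = nms(boxes, scores)
print(kept)  # [0, 2]: the second box overlaps the first (IoU about 0.68) and is dropped
```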

SrslyJosh 1432 days ago

> the highest accuracy 56.8% AP among all known real-time object detectors with 30 FPS or higher

Yikes. It's not clear to me if that's the upper limit on accuracy or a limit imposed by requiring that it run at 30 FPS, but still...yikes.

JustFinishedBSG 1432 days ago

It's clearly the latter and I don't see why it would be "yikes". Real time detectors are useless if "real time" means 1fps.

SrslyJosh 1432 days ago

What good is speed if the accuracy isn't significantly better than a coin flip?

From the paper:

> For example, multi-object tracking [94, 93], autonomous driving [40, 18], robotics [35, 58], medical image analysis [34, 46], etc.

LOL, these are all great use cases for a model with < 60% accuracy!

stavros 1432 days ago

> What good is speed if the accuracy isn't significantly better than a coin flip?

Because distinguishing an object as belonging to one class out of a thousand with 50% accuracy doesn't mean it's a coin flip. You'd need a thousand-sided coin. Random chance in that case is 0.1%, which makes 50% way, way better.

kalenx 1431 days ago

The only issue with this comment is that it is _not_ what AP means for object detection... https://www.v7labs.com/blog/mean-average-precision

This is definitely not a coin flip, actually somehow close to what a human would produce, IMHO.
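
To show why AP is not a per-image accuracy, here is a simplified sketch of average precision for a single class at a single IoU threshold. It assumes each detection has already been matched and marked as a true or false positive. COCO-style AP, which I believe is what the quoted 56.8% refers to, additionally averages over classes and several IoU thresholds, so this is only the core of the calculation. The function name and the toy detections are made up.

```python
def average_precision(detections, num_ground_truth):
    """Area under the precision-recall curve.

    detections: list of (confidence, is_true_positive) pairs.
    num_ground_truth: number of real objects of this class.
    """
    ranked = sorted(detections, key=lambda d: d[0], reverse=True)
    tp = fp = 0
    precisions, recalls = [], []
    for _, is_tp in ranked:
        if is_tp:
            tp += 1
        else:
            fp += 1
        precisions.append(tp / (tp + fp))
        recalls.append(tp / num_ground_truth)

    # Make precision non-increasing as recall grows (the usual PR "envelope").
    for i in range(len(precisions) - 2, -1, -1):
        precisions[i] = max(precisions[i], precisions[i + 1])

    ap, prev_recall = 0.0, 0.0
    for p, r in zip(precisions, recalls):
        ap += (r - prev_recall) * p
        prev_recall = r
    return ap

# Three real objects; the detector finds two of them and raises two false alarms.
detections = [(0.9, True), (0.8, False), (0.7, True), (0.6, False)]
ap = average_precision(detections, num_ground_truth=3)
print(round(ap, 3))  # 0.556
```

The score penalizes both missed objects and false alarms across all confidence thresholds, so there is no natural "coin flip" baseline at 50%.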

elbigbad 1432 days ago

It’s more nuanced than this. It’s not “look at this image and tell me yes or no if there’s a car in it”; it’s more like “tell me where all the cars are in this image, if any.” We use this a lot, and by ramping up recall we can handle some interesting use cases.

I assure you it’s highly useful in the real, real world.

nerdponx 1432 days ago

And that's why nobody actually uses it for those things, at least not yet. Don't forget that advancement is often incremental, and that in this case advancement has actually been somewhat fast. YOLOv3 came out in 2018.

IncRnd 1432 days ago

In YOLOv7, YOLO and v7 don't go well together. No, not at all. YOLO normally means "You Only Live Once", and v7 means it's lived at least six times before this.

While the author likely didn't have that intention, that's what came across.

Even for YOLO meaning "You Only Look Once" YOLO and v7 do not go together well.

gchq-7703 1432 days ago

YOLO in this case stands for "You Only Look Once".

DonHopkins 1431 days ago

YAML originally stood for "Yet Another Markup Language" until somebody pointed out that it wasn't actually a markup language, so they retro-named it "YAML Ain't Markup Language".

IncRnd 1432 days ago

Yes.

The point I was making is that YOLO and v7 don't go well together, and that is true for either meaning of YOLO.

Dayshine 1432 days ago

Huh? It means that the approach is to only process the input image frame once, i.e. "look". And this is the 7th implementation of that algorithm.

It's not as if this is named "the final algorithm v7"