Auto rotate frames containing faces that are not vertical to fix crappy insight face bug. #364

Merged (18 commits) on Dec 12, 2023

Conversation

@LatentLoser commented Dec 7, 2023

This PR is kind of huge and contains both a big feature upgrade and several major bug fixes. I'm not planning on making any major changes to this PR to help get it merged, so it will just be as-is, but I still wanted to share the code in the hope it helps the roop community move past this stupid bug in insightface, which can't deal properly with faces that don't appear vertically in frames.

I developed this weeks ago; from memory, the bug fixes in it include:

  • a permissions issue on Linux that was preventing me from creating facesets
  • the 'open image folder' action being broken on Linux
  • not being able to process videos longer than 9999 frames
  • a video file not saving due to the wrong variable being used

The feature: Auto-rotating frames containing faces that are not vertical

The feature upgrade itself auto-rotates frames containing a face that is not vertically oriented, so that the face is vertical during the faceswap, then rotates the frame back to its original orientation afterwards. Normally, if you try to perform a faceswap with the insightface model on a video where someone is lying down or doing a confused-dog pose with their head tilted sideways, the faceswap fails and generates horrible garbled results.

The options you need to select to use this feature:

  • Select face selection for swapping -> Single face frames only [auto-rotate]
  • Select video processing method -> In-Memory Processing
  • Action on no face detected -> Skip Frame

How the bug in insightface works

The insightface model returns the correct location for the bounding box of the face, but it tends to badly misplace the 2d landmarks within it, and this is why the faceswap goes horribly wrong and becomes garbled. A number of users in the community have naturally happened upon the solution of simply rotating the video before rooping it, then rotating it back once it has been rooped. This works, but it's a giant pain in the ass to prepare when such a scene occurs partway through a video. Hence I had a look at whether this process could be automated, and to my astonishment it actually can be.

How the algorithm to fix it works:

  • You work out if the face in the frame is horizontal (or thereabouts) by checking whether the bounding box is wider than it is tall.
  • Then you use the positions of the 2d landmarks for the forehead and the chin to check which way the face is oriented, which tells you whether to rotate the frame clockwise or anti-clockwise.
  • Then you translate the bounding box to where you expect the new bounding box to be once the frame has been rotated.
  • Then you detect all faces in the rotated frame. To figure out which one is the face you intend to swap, you compute the intersection over union (IoU) between the translated bounding box (your estimate of where the face will appear in the rotated frame) and the bounding boxes of all the detected faces; the one with the largest IoU is the one you're trying to swap (see the sketch after this list).
  • Now that you have a reference to the face in the rotated frame you can do the faceswap, and once you're done you just rotate the frame back.
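For illustration, here's a minimal sketch of the translate-and-match steps (hypothetical helper names, not the PR's actual code); it assumes (x1, y1, x2, y2) boxes in pixel coordinates:

```python
def rotate_bbox_clockwise(bbox, frame_height):
    """Map a bbox into the coordinate space of the frame rotated 90° clockwise."""
    x1, y1, x2, y2 = bbox
    # Under a 90° clockwise rotation, a point (x, y) maps to (H - y, x).
    return (frame_height - y2, x1, frame_height - y1, x2)

def iou(a, b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union else 0.0

def find_rotated_target(estimated_bbox, detected_faces):
    """Pick the detected face whose bbox best overlaps our estimate."""
    return max(detected_faces, key=lambda face: iou(estimated_bbox, face.bbox))
```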

Notes

This works almost perfectly, with a few caveats, and could almost certainly be further improved; I just didn't care to at the time. The main caveat is that it sometimes fails to detect the face in a frame, which results in the face flickering back and forth in parts of the final video. The results I got were pretty usable, but to make them basically perfect by default, what I did was select the option to skip frames where no face was detected. This works near perfectly, but results in some artifacts where the video looks a little choppy in parts, and of course the output won't contain any frames without a face, which may be undesirable, but that's just the trade-off at this point.

I'm sure this can be vastly improved to get near-perfect results on basically any video without having to prepare the video in any way, shape, or form. I don't really like working in Python and find untested, dynamically typed codebases frustrating in general, and this codebase isn't well factored enough to easily make the subsequent changes needed to really take it to the next level, so I kind of ran out of patience and put it down at "good enough", but it could be improved to the point where the problem is basically completely solved.

Some ideas to push this much further:

There are basically two sub-problems remaining to solve here.

  • The first is that the model sometimes fails to detect a face in the rotated frame despite one actually existing, which leads to situations where 99.9% of the frames are processed correctly but you get a little bit of flickering.
  • The second is that while skipping these frames generally (though not always) results in smooth-looking video, the downside is that you're also forced to skip all the frames that genuinely don't contain a face, which is sub-optimal.

I think to solve these problems you probably need to shift away from attempting to process each frame in isolation: keep track of what processing action took place on each frame, then do a second pass at the end that applies heuristics to look for cases where the model has likely made a mistake. For example, for each frame that was skipped, check whether the frames before and after contained a face that was rotated and swapped; if they did, assume it was skipped by mistake and interpolate the missing frame instead (a rough sketch follows).
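What that second pass could look like, as a hypothetical sketch (not code from this PR), assuming a per-frame action label was recorded during the first pass:

```python
import cv2

SWAPPED, SKIPPED = "swapped", "skipped"

def second_pass(frames, actions):
    """frames: list of BGR numpy arrays; actions: per-frame labels from pass one."""
    for i in range(1, len(frames) - 1):
        if (actions[i] == SKIPPED
                and actions[i - 1] == SWAPPED
                and actions[i + 1] == SWAPPED):
            # Neighbours were swapped fine, so this skip was probably a spurious
            # detection failure: patch it by blending the neighbouring frames.
            frames[i] = cv2.addWeighted(frames[i - 1], 0.5, frames[i + 1], 0.5, 0)
    return frames
```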

In theory, if you develop a reliable algorithm to detect and fix this bug in insightface, that would let you build a pipeline that produces a dataset for training a model that augments insightface and fixes this problem without requiring all the code here.

Further to that, if someone takes this and turns it into a ComfyUI node, that would allow us to build a workflow where we feed in a video, have it rooped reliably, then apply an img2img pass using a LoRA of our rooped subject to greatly enhance the quality of the face. We could then theoretically pump massive volumes of video through that to build a dataset for training a better model that can do high-resolution faceswaps. Probably, anyway? I'm not a machine learning engineer, but it appears that's how this sort of thing works.

        rotated_bbox = self.rotate_bbox_clockwise(original_face.bbox, frame)
        frame = rotate_clockwise(frame)
        target_face = self.get_rotated_target_face(rotated_bbox, frame)
    else:
Review comment:
a # here too?

@LatentLoser (Author):

Was a little confused when I saw this comment because I thought I'd removed this. Turns out I forgot that one last commit. Fixed it now.

@Oil3 commented Dec 7, 2023

I tried, sure does reduce flickering. Impressive

@LatentLoser (Author)

> I tried, sure does reduce flickering. Impressive

Right! Wish I had more energy to improve it further, but it's pretty cool.

@Oil3 commented Dec 8, 2023

> I tried, sure does reduce flickering. Impressive
>
> Right! Wish I had more energy to improve it further, but it's pretty cool.
You had a brilliant idea, and impressively, it basically ran straight out of the box!!

Have you ever thought about what happens if we change the original refacer's white square masking to a green one? Wouldn't it be much easier to detect? There's so much overexposed and natural white everywhere, and green almost nowhere. That thing: `mask_h_inds, mask_w_inds = np.where(img_matte==255)` (toy illustration below).
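As a toy illustration of the idea (the green-matte variant is an assumption, not the project's actual masking code):

```python
import numpy as np

GREEN = np.array([0, 255, 0])

def matte_indices_white(img_matte):
    # current approach as quoted above: single-channel white matte
    return np.where(img_matte == 255)

def matte_indices_green(img_matte_bgr):
    # green-matte variant: require an exact saturated-green match on all
    # three channels, which real footage almost never contains
    return np.where((img_matte_bgr == GREEN).all(axis=-1))
```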

I have an issue with something silly that might take you less time to solve than to read this.
I and some others need a checkbox to start the browser automatically: `.launch(inbrowser=True)` in /ui/main.py.
Writing this, I think I've got it though (rough sketch below).
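For what it's worth, a minimal sketch of that tweak (assuming the UI is a Gradio Blocks object; Gradio's launch() does accept inbrowser=True):

```python
import gradio as gr

with gr.Blocks() as ui:
    gr.Markdown("roop ui")

# inbrowser=True opens the default browser automatically on startup
ui.launch(inbrowser=True)
```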

@LatentLoser (Author)

Interesting note about the white square masking box. I don't know how the actual faceswap works, as I didn't need to read that part of the code to achieve this, but what you say makes sense on an intuitive level, since that's exactly why they use green screens in video production. Can I ask what problems you've encountered? How do you know when you're encountering such a problem and that this is the cause of it? I may have been encountering some of those issues too and just not been able to spot it.

Another case that really annoys the heck out of me is frames where the subject is looking downwards and it's just not able to reliably match up the shape of the face/head in a way that makes sense. In theory it's possible to just drop all of those shitty frames if you train a classifier that takes the detected face as input and spits out a "yeah, this isn't going to work, so just skip the frame" rather than doing a low-quality swap. To be honest, that's the approach I could have taken in this PR too, but I haven't trained my own custom model that isn't dreambooth, so I just tried to work it out from heuristics on the data coming back from the inferenced face detection. Maybe the same tactic would work for this case too: for example, if the landmarks that run from the forehead to the chin are too squished together vertically, you can assume the subject is looking down and just skip the frame (rough sketch below). Hmm...
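A rough sketch of that heuristic (hedged: assumes insightface-style face objects with a bbox and 106 2d landmarks; the 0.6 cutoff is made up):

```python
import numpy as np

def looks_downward(face, min_ratio=0.6):
    """Flag a face whose vertical landmark spread is squished relative to its box."""
    ys = np.asarray(face.landmark_2d_106)[:, 1]   # y coordinates of the landmarks
    bbox_height = face.bbox[3] - face.bbox[1]
    return (ys.max() - ys.min()) / bbox_height < min_ratio
```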

@Oil3 commented Dec 8, 2023

I'm trying to better the experience and results on Mac. drawthings.ai (a Mac Stable Diffusion app that's really good) can leverage the GPU, CPU, and Neural Engine, adds its own Metal optimizations, manages memory without mindlessly dumping everything into a flexibly-sized swap partition limited only by available disk (an issue with Python sometimes), and isn't CoreML-exclusive while still offering a temporary CoreML model conversion...
I think there is a lot of room for improvement! It's been a couple of months and I'm starting to fix some stuff now.
The latest was simply changing a function inside the Python scripts to unlock MPS for the whole model (MPS in clip2seg was blocked by PyTorch's bicubic interpolation, forcing CPU, but it runs fine on MPS using bilinear; sketch below).
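Something like this (a sketch of the workaround described above, not the exact patch; bicubic upsampling is unimplemented on the MPS backend in some PyTorch versions, while bilinear runs fine):

```python
import torch.nn.functional as F

def resize_mps_friendly(x, size):
    # mode="bicubic" errors or falls back to CPU under MPS in some PyTorch
    # builds; bilinear keeps the whole op on the GPU
    return F.interpolate(x, size=size, mode="bilinear", align_corners=False)
```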

The white rectangle struck me because I used to have an issue with a visible square, and also because the majority of my videos kind of have a white background, so I found it silly. Side note: the processor puts a 1px dark border around the swap area to delimit it.
I also found this silly, because around the face there is usually dark hair.

The problems I find are systemic when the top of the face is out of view, like when there's a close-up and the top of the forehead goes out of frame: no face detected.
A dirty fix for this is to add a thick border to the whole video (one-liner below).
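That dirty fix is essentially a one-liner with OpenCV (the border size here is arbitrary):

```python
import cv2

def pad_frame(frame, border=64):
    # pad black on every side so a cropped forehead falls back inside the frame
    return cv2.copyMakeBorder(frame, border, border, border, border,
                              cv2.BORDER_CONSTANT, value=(0, 0, 0))
```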

I think detection should focus on the eyes and nose rather than the overall shape, and then use some sort of multi-point trapezoid correction to morph source onto target (or vice versa), focusing mostly on the eyes, nose, and mouth.
As a side note, I don't understand why faceswapping is so complex.
Computers are computers; there is no intelligence, nor consciousness. They don't learn, they just crunch data mindlessly, following a mathematical multilinear relationship between sets of algebraic objects related to some vector space.
They just do what a human tells them to do. The best employee. If it says no, a human says sudo.
All this to come back to your case and why I think we need more possibilities for manual correction, and tools to facilitate doing the work in parts.
Instead of dropping the frames, maybe keep them for manual review, where we could guide the shape by hand and perform basic camera movements, rotation/zoom.
Manual masks too: the processor looking for the face might have a hard time if there are a bunch of things in the way, like hair, an arm, or glasses.
A manual mask that removes the crap, then reapplies its own face, and then the source face, maybe.

You managed the double 90° rotations, so why not a full 360° search? Maybe there is something that allows for a quick complete rotation. Or add some entropy: do a few rotations and take the best (toy sketch below).
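A toy sketch of the "try a few rotations, keep the best" idea (hedged: `detect_faces` and `.det_score` follow insightface conventions but are assumptions here):

```python
import cv2

ROTATIONS = [None, cv2.ROTATE_90_CLOCKWISE, cv2.ROTATE_180,
             cv2.ROTATE_90_COUNTERCLOCKWISE]

def best_rotation(frame, detect_faces):
    """Try all four 90° rotations; return the one whose top detection scores highest."""
    best_rot, best_score = None, -1.0
    for rot in ROTATIONS:
        view = frame if rot is None else cv2.rotate(frame, rot)
        faces = detect_faces(view)
        if faces:
            score = max(face.det_score for face in faces)
            if score > best_score:
                best_rot, best_score = rot, score
    return best_rot
```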

And did I mention needing a human and multiple steps?
Faceswap everything that is "easy" for the model, leaving the rest for manual review and tweaking.
With a built-in GUI.
Same for masking: manual masking first, then shape detection for the eventual other frames?
Background removal and background replacement.

I wrote more than I expected

@C0untFloyd (Owner)

Interesting discussion you got here and a nice sounding PR. I'm hoping to find the time for a merge this week.

@Legendaryl123

> Interesting discussion you got here and a nice sounding PR. I'm hoping to find the time for a merge this week.

I was reading the PR with intense fascination! C0unt, you made one heck of an amazing workspace. I use it very often and it's very impressive. I wish I had the skill to work with some of the PRs I've read around this project. I managed to get the TensorRT provider working on my RTX 4050 by making some adjustments; I just get the occasional black box instead of a face. But I'm trying my best to learn and follow along with you guys. Thank you very much for looking out for my SSDs and providing an in-memory processing option.

    return frame
    if roop.globals.no_face_action == 2:
    if roop.globals.no_face_action == skip_frame:
        # This only works with in-mem processing, as it simply skips the frame.
Review comment:
can we duplicate the previous frame in lieu of using the unprocessed frame?

@LatentLoser (Author):

I think the conclusion I came to when originally thinking about this question was that you could, and it would work well for a subset of cases, but it would likely result in other scenarios where the video appears frozen for long stretches because this happens many times in a row. Think about the behavior when there's no face in the frame: is that the desired behavior under that circumstance?

Alternatively, you could add that as an option. But it's a quick and dirty fix with limited utility that adds complexity to the code, making it even harder to work with than it already is.

I think what's really needed is to track what actions were taken during processing, and use that info to do a second pass, where the second pass employs heuristics that solve known edge cases.

Ideally you don't want the user to have to navigate an increasingly complex web of options in order to process their video successfully, you just kind of want them to hit process and have it work.

@C0untFloyd C0untFloyd changed the base branch from main to dev December 12, 2023 20:07
@C0untFloyd C0untFloyd merged commit a55b78d into C0untFloyd:dev Dec 12, 2023
@C0untFloyd (Owner) commented Dec 16, 2023

@LatentLoser I finally had the time to look into your PR and I have some questions:

  1. Why the extra imutils dependency for rotation, when three lines above your new method there is a method doing exactly that using numpy?
  2. Why can't this be done for every found face in an image? What I mean is: cut out the bounding box of the found face, rotate it like you do with the whole frame, do the faceswap, rotate it back, and paste it into the resulting image. This would have the advantage that no post-processing would have run before the final comparison (see the next question).
  3. Is rotation even needed? I did some limited testing and the swap quality itself wasn't different, rotated or not. Do you perhaps have some example images to check it out? Perhaps swapping the RetinaFace detector with a better one, e.g. YuNet, would be enough. I however suspect the code making a huge difference is this:
                        # check if the face matches closely the face we intended to swap it to
                        # if it doesn't, it's probably insightface failing and returning some garbled mess, so skip it
                        cosine_distance = compute_cosine_distance(swapped_face.embedding, input_face.embedding)
                        if cosine_distance >= self.options.face_distance_threshold:
                            num_faces_found = 0
                            return num_faces_found, frame

This will skip the swapped face if it is very different from the reference face. However, this is likely to fail quite often, especially with faces looking sideways, even more so if the target face morphology is very different, and finally when the face similarity value used is the strict default. Another drawback: the similarity test is performed after all of the post-processing, and all of the enhancers change quite a lot of the face identity. IMO there should be at least a second setting just for this, or some percentage like 20% tolerance relative to the regular one (sketch below). While testing I had perfectly swapped faces where the similarity value was > 1.0, which usually means it is a different person, so it would be skipped by your code.
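For example, something like this (a sketch of the suggested tolerance, with hypothetical names):

```python
def passes_identity_check(cosine_distance, base_threshold, tolerance=0.2):
    # allow the post-swap check some slack (e.g. 20%) beyond the regular
    # threshold, since enhancers shift the embedding after the swap
    return cosine_distance < base_threshold * (1.0 + tolerance)
```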
UPDATE: I found an image where it indeed made a difference. The landmarks from approximately the forehead to the nose were offset by about 20 pixels down and to the left, obviously confusing the faceswap model.

Enough nitpicking. I'm happy when people try to improve this actively, thanks again!

@LatentLoser (Author)

That dependency might just be left over from an earlier attempt to accomplish the same thing, and I missed cleaning it up in the PR?

The rotation is the thing that makes the massive difference; it's a well-known bug in insightface. It's particularly obvious if you have a face at a 90-degree angle, as the swapper often severely mangles the face, generating some real nightmare fuel, and this is due to the 2d landmark positions getting all screwed up. If you haven't naturally happened upon videos that contain this problem, you can simulate it with any video you have by first rotating it with ffmpeg (e.g. ffmpeg -i in.mp4 -vf "transpose=1" out.mp4 rotates it 90° clockwise) so that the face appears horizontally, and then running it with and without the autorotation fix.

As for the similarity comparison, I'm not sure I really tested how much of a difference it actually made, but it seemed like a logical thing to do. If it causes problems elsewhere then it's probably fine, or better, to remove it. It's the autorotation that does all the heavy lifting in terms of improvement.

As for the actual rotation part, yeah, it could potentially be done differently; this is just the way I thought of at the time, and it worked, so I was happy. It could almost certainly be improved or expanded, but what's there at least works and makes a massive difference.

C0untFloyd added a commit that referenced this pull request Jan 8, 2024
- New auto rotation of horizontal faces, fixing bad landmark positions (expanded on PR #364)
- Simple VR Option for stereo Images/Movies, best used in selected face mode
- Added RestoreFormer Enhancer - https://github.com/wzhouxiff/RestoreFormer
- Bumped up package versions for onnx/Torch etc.