Inquiry about Automated Tagging (and downloading multiple GB of images from Derpibooru)

dualreason

I’d like to get some feedback/discussion, preferably from the staff, regarding an idea I posted about recently:  
Forum Post  
I’m toying with the idea of standing up a machine learning algorithm to help auto-tag images as they are posted. I’d like to know if this has already been attempted by anyone else, or if it presents any issues (e.g., I imagine downloading a multi-GB, possibly multi-TB, training data set will be needed at some stage).
 
Additionally, is this the preferred forum for developer discussion/troubleshooting?
Derpy Whooves

I’m not on the dev team, but downloading a representative image set that’s already been tagged to train your algorithm, and to compare to human taggers, shouldn’t be any kind of problem. People download gigabytes of images by hand every day, and people do regular grabs of new images on the site, so I imagine you’d be able to grab enough images for a good base set without any problems.
 
I would suggest, however, that you choose images that have high scores to train your algorithm; they have a better chance of having been tagged correctly, or of having been reported if they were tagged incorrectly. If an image hasn’t been voted on by at least a couple hundred people, there isn’t any guarantee that it was ever tagged correctly to start with.
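
For illustration, the kind of pull I mean might look like this - a rough, untested sketch; the endpoint path, the score.gte filter, and the response fields are my best understanding of the public JSON API, so double-check them against the API docs:

```python
# Rough sketch: collect (image URL, tag list) pairs for high-scoring,
# presumably well-vetted images via the public JSON search API.
# Endpoint path, query syntax, and field names are best guesses --
# verify them against the published API documentation.
import requests

SEARCH_URL = "https://derpibooru.org/api/v1/json/search/images"

def fetch_high_score_samples(min_score=100, pages=5):
    samples = []
    for page in range(1, pages + 1):
        resp = requests.get(SEARCH_URL, params={
            "q": f"score.gte:{min_score}",  # only well-voted images
            "page": page,
            "per_page": 50,
        })
        resp.raise_for_status()
        for image in resp.json()["images"]:
            samples.append((image["representations"]["medium"], image["tags"]))
    return samples
```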
dualreason

@Derpy Whooves  
I would hope it would be a non-issue, but I imagine it may still appear on the admins’ radar. E.g., a naive approach would be that once I had a candidate build I was ready to train, I’d download all the relevant images over a short period of time, possibly resulting in large bandwidth usage from a single IP address…
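
The obvious mitigation on my end would be to pace the grab instead of bursting it - something like the sketch below, where the one-second delay is an arbitrary placeholder rather than any official rate limit:

```python
# Pace bulk downloads with a fixed delay between requests so the
# training-set grab doesn't look like a burst from a single IP.
# The delay value is an arbitrary placeholder, not an official limit.
import time
import requests

def download_paced(urls, delay_seconds=1.0):
    for url in urls:
        resp = requests.get(url)
        resp.raise_for_status()
        filename = url.rsplit("/", 1)[-1]
        with open(filename, "wb") as f:
            f.write(resp.content)
        time.sleep(delay_seconds)  # spread requests out over time
```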
 
Duly noted; I imagine there will be several considerations like that to make along the way. Simply cutting the data set by score may introduce some sampling bias, so I’ll need to look at this more closely once I get into it.
 
My real question is: has anyone tried this (or something similar) before for Derpibooru, as far as you’re aware? If there’s existing work out there I can build upon, I’d prefer that to starting from scratch.
Derpy Whooves

If you would like to make sure it’s ok, you could talk to the dev team in Discord (link is on the Contact page) or write to ops@derpibooru.org, and then you could work directly with the folks who would be most likely to notice any activity against the API. But since we made the API publicly available, we’ve gotten used to fairly large accesses from single points, so if you’re just downloading a few hundred images for test cases, that won’t even show up.
 
My real question is: has anyone tried this (or something similar) before for Derpibooru, as far as you’re aware? If there’s existing work out there I can build upon, I’d prefer that to starting from scratch.
 
No - no one has tried this that I’m aware of. And to be honest it would break our rules to apply the tagging on the real site.
 
Any kind of mass tagging (which this would definitely be) needs to be coordinated with the tagging team before you do it. We have some mass tagging tools today, but they are currently only available to staff. So, please do not run the results of your algorithm against this site.
 
Instead, please create a sandbox of the site yourself from the GitHub and run against that until you get all the bugs out and it’s ready for production. Once you have it working, you could invite some of the team to the sandbox to verify the results before talking about moving it to production.
 
But any kind of mass tagging would need to be cleared with the tagging team on staff before it happens.
dualreason

If you would like to make sure it’s ok, you could talk to the dev team in Discord
 
Ok, I don’t have a Discord account, but if that’s the preferred medium of communication, I’ll create one.
 
And to be honest it would break our rules to apply the tagging on the real site.
 
Ok, this is getting at the kind of issues I’m looking for - what rule would this be at odds with? I’m looking at rule #2: “If an image is not adequately tagged or sourced, please correct it yourself”.
 
Any kind of mass tagging (which this would definitely be)
 
To clarify: I’m using the existing site as a training set. The algorithm itself would run only against newly posted images (this is separate and distinct from the mods’ mass-tagging tools).
 
Instead, please create a sandbox of the site yourself from the GitHub and run against that until you get all the bugs out and it’s ready for production. Once you have it working, you could invite some of the team to the sandbox to verify the results before talking about moving it to production.
 
Sounds like pretty standard practice. I think it may be complicated for others to stand up whatever solution I come up with, so I imagine I’d just write up some kind of report of my results (e.g., a list of accuracies per tag). I’d favor a fail-open architecture for something like this: if the algorithm is not sufficiently confident in assigning a given tag, it simply would not act.
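
Concretely, the fail-open step would look something like this sketch (the threshold values are illustrative placeholders, not tuned numbers):

```python
# Fail-open tagging: only emit tags whose predicted probability clears
# a per-tag confidence threshold; anything uncertain is left untouched
# for human taggers. Thresholds here are illustrative, not tuned.
def confident_tags(tag_probabilities, thresholds=None, default=0.95):
    thresholds = thresholds or {}
    return [
        tag for tag, p in tag_probabilities.items()
        if p >= thresholds.get(tag, default)
    ]

# Example: only "safe" clears its threshold, so the other tags are skipped.
print(confident_tags(
    {"safe": 0.99, "oc": 0.70, "twilight sparkle": 0.91},
    thresholds={"twilight sparkle": 0.97},
))
```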
Derpy Whooves

@dualreason  
Any kind of bulk change made without checking with staff first, especially when it results from a script, is against Rule #2, but it sounds like the tool you’re talking about won’t be writing to the site, so that wouldn’t apply.
 
My guess is that the real challenge you’ll run into will be determining if your algorithm was “right”. That’s why I suggested running it against well-reviewed images, like featured images, that have been looked at by enough eyes that there is some sense that they are “tagged correctly”.
 
Personally, if your algorithm proves to be able to ID artists, or correctly ID characters, that would be very handy. I know that Google can, for the most part, detect the difference between an image of people standing around and a frame from a porn scene, and it would be nice if we had something that could periodically flag images that were rated incorrectly for review, so I wish you luck.
 
But, for sure, talk to the dev and the tagging teams before actually programmatically applying any of those algorithmically determined tags in the production database.
dualreason

it sounds like the tool you’re talking about won’t be writing to the site, so that wouldn’t apply.
 
Not during development, but the end goal would be automated upload of new tags.
 
My guess is that the real challenge you’ll run into will be determining if your algorithm was “right”.
 
The traditional approach is to train an algorithm on only 90% of the data set, then use the remaining 10% for evaluation. I’ll also take a look at filtering by popularity as you’ve suggested - I’ll need to dig into the ML tools more, but I’ve seen options in the past for weighting; it may be possible to weight popular work more heavily without over-fitting to it.
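
As a rough sketch of both ideas together - assuming image features, per-tag labels, and per-image scores are already loaded, and with a log-scaled weighting that is purely illustrative:

```python
# 90/10 train/test split plus score-based sample weights, so popular
# (better-vetted) images count more during training without letting a
# handful of very high scores dominate. Weighting scheme is illustrative.
import numpy as np
from sklearn.model_selection import train_test_split

def split_and_weight(X, y, scores, test_size=0.1, seed=42):
    X_tr, X_te, y_tr, y_te, s_tr, _ = train_test_split(
        X, y, scores, test_size=test_size, random_state=seed
    )
    # Gentle, log-scaled boost; pass as sample_weight= when fitting.
    weights = np.log1p(np.clip(s_tr, 0, None)) + 1.0
    return X_tr, X_te, y_tr, y_te, weights
```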
 
Personally, if your algorithm proves to be able to ID artists, or correctly ID characters, that would be very handy. I know that Google can, for the most part, detect the difference between an image of people standing around and a frame from a porn scene, and it would be nice if we had something that could periodically flag images that were rated incorrectly for review, so I wish you luck.
 
ID’ing individual artists I expect will be challenging (a very high-dimensional problem compared with broad categories like “safe” vs. “explicit”), though identifying more general art styles of some kind may be attainable. Character ID I would hope would work - the challenge there, I expect, will arise when separating canon characters from OCs.
 
I believe TensorFlow came out of Google in some way (it’s one of the tools I’ll be looking at), so I expect some of this work will be quite close to Google’s operations.
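
For a sense of what that would look like, a first-cut tagger in TensorFlow/Keras would be a multi-label classifier: one sigmoid output per tag, so each tag is an independent yes/no. A minimal sketch (input resolution, layer widths, and the tag count are placeholders):

```python
# Minimal multi-label tagger sketch in Keras: sigmoid outputs with
# binary cross-entropy make each tag an independent yes/no decision.
# Input resolution, layer widths, and NUM_TAGS are placeholders.
import tensorflow as tf

NUM_TAGS = 1000  # placeholder tag-vocabulary size

model = tf.keras.Sequential([
    tf.keras.Input(shape=(256, 256, 3)),
    tf.keras.layers.Conv2D(32, 3, activation="relu"),
    tf.keras.layers.MaxPooling2D(),
    tf.keras.layers.Conv2D(64, 3, activation="relu"),
    tf.keras.layers.GlobalAveragePooling2D(),
    tf.keras.layers.Dense(NUM_TAGS, activation="sigmoid"),  # one score per tag
])
model.compile(
    optimizer="adam",
    loss="binary_crossentropy",
    metrics=[tf.keras.metrics.Precision(), tf.keras.metrics.Recall()],
)
```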
 
But, for sure, talk to the dev and the tagging teams before actually programmatically applying any of those algorithmically determined tags in the production database.
 
To follow up on this, I dropped a line in the Discord chat. I got a suggestion to look at E621 for additional training data, and someone mentioned a potential contact I can follow up with later who may have already started something like this. There’s a bit of skepticism about how effective this will be, but those in the pegasite-discussion thread seemed quite supportive, as you have been as well, and I appreciate that.
 
I’ll probably have to talk with the staff about this in more detail in the future anyway: I see a way to upload an image, but I don’t immediately see an API endpoint for adding tags to existing images.
Derpy Whooves

My guess is that the real challenge you’ll run into will be determining if your algorithm was “right”.
The traditional approach is to train an algorithm on only 90% of the data set, then use the remaining 10% for evaluation. I’ll also take a look at filtering by popularity as you’ve suggested - I’ll need to dig into the ML tools more, but I’ve seen options in the past for weighting; it may be possible to weight popular work more heavily without over-fitting to it.
 
Hmmm … I haven’t expressed myself well.
 
You are teaching an algorithm to tag images.
 
How will you know if the tagging is “right”? By looking at the existing human/community-sourced tagging?
 
That’s not necessarily right.
 
If you selected any random image on the site, there is a good chance that it is NOT tagged correctly, or in any way that we would want other images tagged.
 
That’s why I suggested highly rated images, because those are more likely to have been vetted by the community and staff.
 
There’s a bit of skepticism about how effective this will be, but those in the pegasite-discussion thread seemed quite supportive, as you have been as well, and I appreciate that.
 
Believe me, I think everyone would love this. But every week we may spend hours fixing tagging done by humans who were absolutely convinced that they were tagging correctly.
 
I would love a tool that could flag possibly mistagged images, but this is crowd-sourced, and the community is perpetually asking itself whether it is even doing ‘right’ correctly. So, right now, figuring out what “right” is can sometimes take days, even for individual images. Basically, I’m afraid the algorithm is going to look at the tags on the site and grow up to be a new “Tay” - just repeating the same mistakes that it sees humans making.
 
So … I am enthusiastic that this is something that might become possible. But I am also worried that it will be no better than another human. Then again, as with all automation, it would be no better than a human but working 24 hours a day, and that’s useful. Even if it’s wrong.
 
It’s finding that balance - between “useful” and “not right enough, because it’s learning from wrong examples” - that is the cause for concern.
 
And today there aren’t more than a hundred people on the site who I trust to be ‘right’ for all tags, and only a handful who I look to for ‘right answers’ when there’s disagreement or confusion over how an image should be tagged.
 
It’s kind of like figuring out how to tell a robot how to weld a car the right way. If you have it teach itself by watching humans, it will still fuck up 20% of its welds every Friday afternoon.
dualreason

For reference: >>1758960
 
That’s an excellent link; I’ll definitely start my research there. 90% accuracy seems a tad low for what I’m hoping for (perhaps that’s in part just due to a limited training set size), but it’s good to see that that much is achievable.
 
@Derpy Whooves  
That’s not necessarily right.
 
Well, garbage in, garbage out - I don’t presuppose this thing is going to be better than a human. I get your point; I regularly browse the site with a vote threshold applied, and I’ll probably start with something similar in the initial stages of development. However, I don’t want to throw away good data just because it has a low score - e.g., I don’t want the algorithm to only ever see cel-shaded inputs because sketches score low. I expect this project will require a non-trivial amount of experimentation with how to segment and preprocess the input data, in addition to different models for different tags.
 
There are various ways to check correctness aside from just comparing with the input data. I can do spot checks (e.g., if it’s getting less than 90% correctness, I should see misses within a dozen images). I can also compare it against a straight noise algorithm - if the algorithm is no better at predicting a tag than random guessing, chances are it’s not capturing any meaningful information.
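
That noise-baseline check is cheap to make precise: per tag, the model has to beat the best constant guess implied by the tag’s prevalence, or it hasn’t learned anything. A minimal sketch with synthetic data:

```python
# Per-tag sanity check: compare model accuracy against the best
# constant guess given the tag's prevalence. A model that can't beat
# that baseline has captured nothing about the tag. Data is synthetic.
import numpy as np

def beats_prevalence_baseline(y_true, y_pred):
    prevalence = y_true.mean()                   # how often the tag applies
    baseline = max(prevalence, 1 - prevalence)   # always-guess-majority accuracy
    return (y_true == y_pred).mean() > baseline

rng = np.random.default_rng(0)
y_true = rng.integers(0, 2, 200)
print(beats_prevalence_baseline(y_true, y_true))                   # True: perfect model
print(beats_prevalence_baseline(y_true, rng.integers(0, 2, 200)))  # almost surely False
```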
 
a new “Tay”
 
What is “Tay”?
 
I am also worried that it will be no better than another human
 
Well, I’m going for a consistent human that looks at every upload the moment it appears. I don’t mean for this to outperform a human, just to save a real human some work tagging images (namely, the images I have been tagging myself these past few months).
 
people on the site who I trust to be ‘right’ for all tags
 
I’ve seen a few tags drift in implementation over time, probably due to different people cycling in and out of the tagging community.
 
it will still fuck up 20% of its welds every Friday afternoon
 
Haha, I needed that laugh.