could we mirror reddit?

Communist@lemmy.ml · 1 year ago

could we mirror reddit?

7heo@lemmy.ml · edit-2 1 year ago

What’s stopping us from using the api to post all of reddit here in a massive one-time merger?

A lot. I’ll get into details at the end of my answer.

Obviously the Lemmy devs would have to do it, but would there be legal issues? I think it would solve most of the problems with Lemmy, really.

No, they would not. And also, they aren’t going to. But it’s not impossible to do for someone else either.

I’d be willing to donate money to such a project. How much storage could it really take, it’s almost all text, the images are usually hyperlinks that could stay links.

A lot of the content on reddit is video these days. And if you leave the content on i.redd.it, they will have a very easy time blocking you. If you don’t, even just images can be insanely big these days. I’m talking 10MB per image. Most users don’t know, and don’t care.

So now, the actual details:

I have actually given this question a lot of thought, because from the day I registered on this server I have been considering making a specialized Lemmy instance to provide access to reddit content (I even have a pretty cool domain name in mind). My idea would be a unidirectional mirror, sub per sub, with the copy done by scrapers. I didn’t want to rely on any API of any sort on reddit’s side, because we all know how that is going… If scrapers are blocked by reddit (which would only really happen if they require accounts to see anything there, like pinterest), it would still be possible for all users of said instance to use a browser addon (using their own reddit account) to scrape the content in a hidden tab, more or less as they are visiting the instance.

But that’s not the hard part.

In a nutshell, there are two hard parts.

1. The infrastructure necessary to “mirror” reddit, even slightly, is ginormous. I’m talking multiple top of the line, 2023 servers. Those go for at least 10k a pop, and can go higher. That’s counting some storage, but realistically probably not enough. I think the costs of acquiring said hardware would be in the hundreds of thousands.
2. The cost of running that infrastructure would also be non-negligible. Not even considering electricity, this is going to require quite a bit of development, operations, SRE, management, and security. And then there’s the networking. I’m not sure how federation is done in the fediverse, whether content is proxied or referred, etc., but either way, given the volume of information coming from reddit, even a simple referral would easily eat the majority of the federated instances’ bandwidth. Which would potentially result in increased costs to them too.

Which brings us to the last point I’m gonna make, and this one is about the community. Such a project would put an enormous strain on the community. It would probably lead all the instances to de-federate the reddit-like one, for all the reasons listed above, but also because many of the long term users (especially admins) of Lemmy seem very opinionated about reddit, its users, and its content. Meaning that instead of having a bunch of instances and a bunch of people all over the world working on getting the content off reddit and de-centralizing it, it would be one giant instance, bleeding money, and being the easiest target ever for a C&D, or even a direct lawsuit. Not sure reddit would have much legal ground to stand on, depending on where you run your instance, but they would surely try anyway.

Communist@lemmy.ml · 1 year ago

What if we did it JUST ONCE?

We just need the content that’s already on there to massively improve things. We don’t need a computer constantly running and updating the content, we really just need it over here once to match reddit, then we can replace it.

7heo@lemmy.ml · edit-2 1 year ago

Let’s just do some quick math.

From the list here, I counted 1985 subreddits. That’s from 2018. That was five (5!) years ago (yeah. I know).

Let’s assume those subreddits have on average only 10000 posts each (which is really much, much lower than the reality, trust me).

Now, let’s assume that each of those posts takes up around 1MB of storage (that’s a lot of text, but not a lot of rich media. And there’s a whoooole lot of rich media on reddit, so that’s a very conservative estimate).

Even with those absolutely lower-than-reality numbers, it would still give 20TB. That very easily fits on a single hard drive, right?

Except… Unless you work at reddit, you’re going to have to copy this over the network. Meaning that even with a 1Gbps peering link to reddit (which isn’t gonna happen, they just won’t let you), it would still take 160000 seconds (20TB is 160 terabits, at one gigabit per second). Which is about two days of uninterrupted reddit scraping. At a full, constant 1Gbps.

Realistically, you can expect to be scraping at around 100Mbps at best, and with interruptions. That’s already changing the time it would require to about two and a half weeks at the very best. And that’s not considering the, again, absolutely ridiculously low numbers I chose.
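
For anyone who wants to plug in their own numbers, here is a rough Python sketch of the back-of-the-envelope math above. The constants are just the guesses from this comment (1985 subreddits, 10000 posts each, 1MB per post, using decimal units), and the function names are made up for illustration. It assumes a perfectly saturated link with zero overhead, which will not happen in practice.

```python
# Back-of-the-envelope estimate only; every constant is an assumption
# taken from the comment above, not a measured value.
SUBREDDITS = 1985              # count from the 2018 list
POSTS_PER_SUBREDDIT = 10_000   # deliberately low guess
BYTES_PER_POST = 1_000_000     # 1 MB per post (decimal), deliberately low

SECONDS_PER_DAY = 86_400


def total_bytes(subreddits: int, posts_each: int, bytes_per_post: int) -> int:
    """Total storage needed, in bytes."""
    return subreddits * posts_each * bytes_per_post


def transfer_seconds(size_bytes: int, link_bits_per_second: float) -> float:
    """Time to move size_bytes over a fully saturated link, with no overhead."""
    return size_bytes * 8 / link_bits_per_second


size = total_bytes(SUBREDDITS, POSTS_PER_SUBREDDIT, BYTES_PER_POST)
print(f"storage: {size / 1e12:.2f} TB")

for label, bps in [("1 Gbps", 1e9), ("100 Mbps", 1e8)]:
    secs = transfer_seconds(size, bps)
    print(f"{label}: {secs:,.0f} s, about {secs / SECONDS_PER_DAY:.1f} days")
```

The unrounded total is 19.85TB rather than 20TB, so the times come out slightly under the rounded figures: a bit under two days at 1Gbps, and a bit over 18 days at 100Mbps. Bump `BYTES_PER_POST` or `POSTS_PER_SUBREDDIT` to something more realistic and it gets ugly fast.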

Ah, and the list I linked? It’s without the NSFW material. Which can easily be 50MB+ a video. Of which there are tens of thousands.