Fastest way in C# to iterate through ALL Guids Possible

-58 votes
1871 views
71 comments

EDIT: I know this might sound stupid, but please stop voting the question down. The point has been made; there is no need to keep voting it down. In fact, I'm pretty sure this has the most negative votes of any SO question ever.

I have access to one of the top 50 fastest computers in the world. The task I am trying to complete is for a job that has to get done. While it might not make sense on an ordinary computer, it makes sense with the hardware and capabilities I have. Please understand that this will be done; the only question is how fast.

As the title says.

I have made a console application to iterate through every possible Guid. That is 2^128 values.

I want to start with 00000000000000000000000000000000 and end with ffffffffffffffffffffffffffffffff.

Is this the fastest way to do so? I'm looking for help on the most efficient code possible, because I know it will take forever and a day to get done.

I'm looking for any and all help to speed this puppy up.

Why, you ask?

I am screen scraping web content. Sadly, the content is not keyed by integers from 0-999999; it is keyed by Guids instead. I want to hit all the Guids on the site to see if there is content residing there.

Here is my main program.

static void Main(string[] args)
{
    Console.WriteLine("Guid to start");
    string start = Console.ReadLine();

    string start1 = "00000000000000000000000000000000";
    string iteration = start;

    using (StreamWriter outfile = new StreamWriter(@"c:\temp\scanner\guids.txt"))
    {
        for (Double k = 0; k < 5316911983139663491615228241121400000d; k++)
        {
            // 'site' and 'extension' are fields holding the URL prefix and
            // suffix; their declarations were not shown in the original post.
            Console.WriteLine(site + iteration + extension);
            outfile.WriteLine(iteration);
            iteration = NextLetter(iteration);

            try
            {
                // do work on the Guid
            }
            catch
            {
                outfile.WriteLine("broke");
                Console.WriteLine("broke");
            }
        }
        Console.ReadLine();
    }
}

Here is the method to increment the Guid.

public static string NextLetter(string letter)
{
    const string alphabet = "abcdefghijklmnopqrstuvwxyz0123456789";
    if (!string.IsNullOrEmpty(letter))
    {
        char lastLetterInString = letter[letter.Length - 1];

        // if the last letter in the string is the last letter of the alphabet
        if (alphabet.IndexOf(lastLetterInString) == alphabet.Length - 1)
        {
            // replace it with the first letter of the alphabet and carry into the rest of the string
            return NextLetter(letter.Substring(0, letter.Length - 1)) + alphabet[0];
        }
        else
        {
            // replace the last letter in the string with the next letter of the alphabet
            return letter.Remove(letter.Length - 1)
                         .Insert(letter.Length - 1, alphabet[alphabet.IndexOf(lastLetterInString) + 1].ToString());
        }
    }
    // return the first letter of the alphabet
    return alphabet[0].ToString();
}
asked by SpoiledTechie.com
Apr 5 '12 at 13:36Z
edited May 5 '12 at 15:59Z
[19] Scott M.: for the love of StackExchange WHY?!
[0] SpoiledTechie.com: I am screen scraping web content. Sadly, the content is not keyed by integers from 0-999999; it is keyed by Guids instead. I want to hit all the Guids on the site to see if there is content residing there.
[7] John Koerner: GUIDs are just hex representations of numbers, so you don't need to go through Z, just through F.
[0] SpoiledTechie.com: @John, so I don't need to hit the letters g-z?
[19] GazTheDestroyer: You're going to hit a web server with 2^128 requests?!
[0] Brendan: I'm with @ScottM. Surely there is a better solution than this. Can you explain in more depth what you are trying to achieve?
[1] aboveyou00: This question has no practical answer. See stackoverflow.com/questions/1705008/… for the reason why.
[0] SkonJeet: @GazTheDestroyer - why would he be?
[3] Damien_The_Unbeliever: Get out a calculator. enter 2^128. Divide by what you might consider an insane number of operations per second (i.e. 1000000000). Divide by 60. Divide by 60. Divide by 24. Divide by 365. That's how many years your iteration will take, at the insane rate.
[0] SpoiledTechie.com: I posted an Edit in the text of the post. @BrendanMcKenzie
[3] L.B: If I am not wrong, at 1M requests/sec it would take 10790283070806014188970529 years
[3] spender: While double can store numbers as large as the one you present, the significand is only accurate to about 52 bits, so it's unlikely to represent the exact value you chose (5316911983139663491615228241121400000).
[0] GazTheDestroyer: @SkonJeet: He's scraping a website for potential guid hits.
[1] spender: In fact, 5316911983139663491615228241121399999d == (5316911983139663491615228241121399999d + 1d) is true. Your loop is going to behave unexpectedly. Start swotting up on floating-point representations too. It's clear that you're missing something.
[0] SkonJeet: @GazTheDestroyer - so could scrape once and test the hits 'locally'? Not that it matters.
[0] JMarsch: There are so many combinations, and given the lag that you are going to encounter hitting a web site, I doubt that you will be able to solve it this way, and get responses in any reasonable time frame. (and that's assuming that the host doesn't see your traffic and block you)
[0] SpoiledTechie.com: Please read the reason for reopening the question.
[2] Brian Gideon: Wow, -11? That is definitely punitive. Of course there is no easy way to enumerate all possible GUIDs, but that does not make this question any less legit. Stargazer712 provided a perfectly reasonable answer.
[0] SpoiledTechie.com: @BrianGideon not if you read why I can actually accomplish such a task.
[5] Brian Gideon: @SpoiledTechie.com: The numbers don't add up. Even if you harnessed all the computing power in the world it would still take a really long time and a lot of electricity. Still, your question was legit and well written and does not deserve a negative score, let alone a -11!
[1] LukeH: @SpoiledTechie.com: You cannot accomplish this task. (Unless there's something you haven't told us. For example, if you're only dealing with a known, small-ish subset of all possible guids.) Even if you could process a quadrillion guids per second, it'll still take you almost 11 quadrillion years to get through all the possibilities (and that's ignoring the power and storage issues mentioned by Brian and Raymond).
[0] SpoiledTechie.com: @BrianGideon Thanks.
[5] psubsee2003: I'm in agreement with @BrianGideon on this. With a -11 score, this ranks as one of the lowest scoring questions on SO with a c# tag, while a similar but impossible to solve question (and linked to this question) has 309 upvotes. I think we need to cut @SpoiledTechie.com a break
[3] spender: Why not use a Regex to find GUIDs in the page? Unless I'm mistaken, it seems that's what you really want to achieve.
[3] phoog: @SpoiledTechie.com Look at it from the other direction: you ask for the fastest way "to hit all the guids on the site to see if there is content that resides there". You should set a performance goal, and improve the performance of your algorithm until you meet that goal. What's the longest acceptable duration for this task? A minute? An hour? A day, week, month, year? Let's assume it's a year. If so, you would have to generate and check 10,790,283,070,806,014,188,970,529,155 guids each millisecond.
[5] phoog: See how long this takes to run: uint x = 0; do { x++; } while (x != 0);. That was 32 bits; next, time the 64-bit version: ulong x = 0; do { x++; } while (x != 0); How long does that take to run on your "one of the 50 fastest computers in the world"? It should be about 4 billion times longer than the 32-bit version. The analogous loop with 96 bits should take 4 billion times longer than that, and with 128 bits, another 4 billion times longer.
[0] SpoiledTechie.com: @spender Sadly, I would if I could, but the website that contains this information is no longer viewable to the public. The data now lives only on the server, hidden from anyone being able to scrape it the way a normal website would allow. That's why I have to take guesses at it, hence the Guid problem.
[51] Eric Lippert: You might have access to one of the fifty fastest computers in the world, but what you lack is an intuition about the size of large numbers, as all these comments and answers are pointing out. Basically you are saying "I need to put the entire contents of the Pacific ocean into this swimming pool", and when people point out that this is impossible, your response is to say that you have one of the fifty largest pools in the world, and a great hose. Guids have been deliberately designed to be in practice impossible to enumerate, so don't try. Find a different solution to your problem.
[0] SpoiledTechie.com: thanks @phoog, I'll give it a go.
[0] SpoiledTechie.com: @EricLippert I've tried other methods, and sadly this is what I am stuck with. Contacting the server's owner has already been tried with no solution. I consider it dead data, but it's also required data to have.
[3] LukeH: @SpoiledTechie.com: If that's really the case then the most cost-effective solution will probably be to contact the server's owner again, ask them how much money it would take for them to hand over the data, and then pay up!
[0] GazTheDestroyer: @SpoiledTechie.com: Re your edit. Have you actually read any of the answers? What you are asking is an utter impossibility even if you do have "access to one of the top 50 fastest computers in the world". Reread the answers and comments and try and grasp the numbers. Apart from anything else, the speed of your computer is irrelevant when hitting someone else's web server.
[0] SpoiledTechie.com: @GazTheDestroyer I have and I appreciate your comments. I am looking for other ways to get access to this data, it just isn't possible yet.
[14] Eric Lippert: The fact that the server owners chose to identify their content with guids is an indication that they do not want you to be able to solve this problem. And frankly, if I were them, I would be writing anti-screen-scraping software that detects when a certain number of requests for non-existing resources have occurred, treat such as an attack, and deny service to that client. The server is holding all the high cards here; they have the content you want and the ability to stop you from scraping it.
[1] SpoiledTechie.com: @EricLippert While I do agree with you, I don't think you're helping me answer the problem at hand. You're just adding meta-commentary to what should be a discussion of the problem. Thank you though.
[33] Eric Lippert: Of course I'm not helping you solve the problem at hand. First, because the "solution" you've identified is impossible, and therefore helping you achieve it is a non-starter; and second, because you are attempting to mount an attack against a server. I'm trying to talk you out of doing that, because (1) success is impossible, (2) it is morally wrong to even try, and (3) it is probably illegal to try. My recommendations are (1) immediately stop trying to attack the server, and (2) talk to a lawyer to see whether you've already committed a crime or tort.
[0] SpoiledTechie.com: @EricLippert There are no crimes being committed. As I stated before, I am trying to find other ways, but it seems impossible at this moment.
[9] Eric Lippert: @SpoiledTechie.com: I am not a lawyer, and I suspect that you are not either. Screen scraping has been found by courts in many jurisdictions to be a form of the "trespass to chattels" tort. My advice is to (1) immediately stop attacking the server, and (2) consult a lawyer who specializes in the application of tort law to computer systems in order to determine what your legal situation actually is.
[12] cHao: And this is why the question is at -12 (-13 now; grats. I was trying not to get involved). Not because the question is bad, not because the solution is "don't do that", but because, according to the OP, everyone here who pointed out the mathematical impossibility is somehow wrong just because he has a really fast computer or something.
[3] John Rasch: Hmm, think of a ridiculously stupid and impossible question that anyone with a half a brain knows is impossible and post to a high-traffic website using a username which happens to be his website? Why is this question not deleted as spam again?
[0] cHao: And the site gets ad revenue too, no less? Good point...
[0] SpoiledTechie.com: I am not trying to spam. I am just trying to find the best possible solution.
[2] Brian Gideon: @JohnRasch: Do keep in mind that a similar question was posted with a score of 309. And one look at SpoiledTechie.com's profile makes it pretty clear that this user is a valuable member of the community. Vilifying this user seems unreasonable especially at the cost of another -100 drop in reputation.
[32] Raymond Chen: @SpoiledTechie.com It doesn't matter how fast your computer is. The limiting factor is how fast the Web server is. Suppose the Web server is amazingly fast and can handle a million hits per second. It will take 10^25 years for it to serve up 2^128 pages. (And even longer for you to download the results.) And the administrators of the server will probably notice after the first few centuries.
[2] John Rasch: @Brian - how is that not further proof? That question is 100% meant to garner views just like this one, because the answer is obvious to anyone who thinks about the question for more than 1 second.
[4] Raymond Chen: @BrianGideon you do realize that the reason the other page got a score of +309 was not because it was a good question. It was because people started giving funny answers.
[0] SpoiledTechie.com: @JohnRasch Sir, I am not looking to garner views or to increase my reputation in any way, shape or form. I am just trying to have a question solved. Even if you don't think it's possible, that doesn't mean it's not plausible.
[1] Brian Gideon: @RaymondChen: Yeah, I know. The top answer is awesome :) I voted for it. Still, it's a double standard. And I suspect vilifying valuable members of the community is probably not acceptable on SO. I hope I'm not wrong.
[4] Michael Petrotta: @Brian, I think most folks commenting here are responding to how unanswerable this question is, rather than trying to vilify the OP. Some of the answers and comments are pretty tart, but I think a lot of that is a response to how possibly unethical and illegal the OP's proposed solution is.
[4] phoog: @SpoiledTechie.com Try a spatial analogy: an astronomically large sheet of paper, 0.1 mm thick. Fold it 32 times: it's now nearly 430 km thick, the distance from New York City to Washington, DC. Fold it 32 more times; it's now about 0.2 light-years thick, well beyond the outer reaches of the solar system. Fold it 32 more times; it's 837,460,949 light years thick (roughly 8000 times the diameter of the Milky Way galaxy). Fold it 32 more times... wait, you probably can't, because at the 6th additional fold, its thickness exceeds the size of the observable universe.
[0] Brian Gideon: @MichaelPetrotta: Agreed. Though, flagging this question as spam (which could result in yet another -100 reputation drop) because of the OP's username seems unreasonable to me especially since the user seems to be a valuable member of the community. No?
[0] Brian Gideon: @SpoiledTechie.com: Look on the bright side. You are probably the only person to have several famous questions and also the lowest-scoring question in the c# tag.
[0] SpoiledTechie.com: @BrianGideon To tell you the truth, I was never worried about reputation. I just ask hard questions, I guess. And over time, I imagine this question will become famous as well. It seems to have garnered enough response that it has the potential to be. The question still has to be answered, and while my current solution might not be the fastest in the world, it will still get the job done, and that's all I want.
[0] Michael Petrotta: @Brian, agreed. It's not spam, and shouldn't be flagged as such.
[3] Michael Petrotta: @SpoiledTechie, it won't get the job done, but you might learn something trying. Please do pay attention to the pain you'll be putting the web site through, and the possible legal ramifications discussed here.
[7] LukeH: @SpoiledTechie.com: This is not a hard question to answer, and it has been answered ad nauseam. The answer is this: no matter how much you tweak your algorithm, and no matter how powerful your hardware, you will not get the job done because the laws of physics won't allow it (taking into account the current state-of-the-art in computing, power-generation, storage etc).
[1] Greg D: -1 for obvious troll.
[0] SpoiledTechie.com: @MichaelPetrotta I will. Thanks.
[0] Brian Gideon: @SpoiledTechie.com: Here is what I would do. Completely reword your question. This time focus on ways of identifying the GUIDs in the web content or on the specific problem at hand. The digital lynch mob crucified you because of your proposed approach. You may also need to justify that the screen scraping is authorized. As long as it is not done with malicious intent or to gain an advantage over normal users, and there is no robots.txt forbidding it, it is probably okay. The fact that you already notified the site is in your favor here.
[1] Brian Gideon: @SpoiledTechie.com: One approach I would consider is to ask them to provide an index.html (or whatever) that contains valid GUIDs. If they are unwilling to do that then that probably means you have not been given permission to access their content with a bot.
[0] SpoiledTechie.com: @BrianGideon The sites admins are totally unresponsive... So I don't think it will be possible to get any type of content from them.
[7] Brian Gideon: @SpoiledTechie.com: I'm afraid your options are disappearing rather quickly then. I would notify your supervisors that there is no technical way of accomplishing this task. This a "your people are going to have to talk with their people" kind of issue now.
[4] Anthony Pegram: What you want to do is wait a couple of years, use one of the 50 fastest computers then. That will cut your time in half! Come to think of it, in a few centuries, you might be able to do this near instantaneously. Just be patient.
[3] Mooing Duck: I just estimated that if Google used all of their servers to attempt this, they could complete it in as little as 24,874,442,000,000,000,000,000,000,000 years!
[1] sehe: @MooingDuck I think you need a bigger mantissa in your number representation, The trailing 000,000,000,000,000,000,000 looks way unrealistic this way
[4] dbruning: People are not seeing the bigger issue here. *** WHAT IF HE SUCCEEDS??? *** He may end up using up ALL the GUIDs in the world. Where would that leave the rest of us, without any fresh GUIDs?
[1] minitech: It would be much faster to just try to crack the password of the server's root user :) Not that I'm recommending this, of course, and it would still take years. Assuming your IP doesn't get banned. Maybe the site admins would talk to you after that, though.
[0] thirtydot: I'm sorry, but I had to vote this down. Your question reminded me of this: stackoverflow.com/questions/5508110/… :)
[0] Alan: Considering the amount of resources and time needed to generate (let alone use) even a fraction of the GUIDs (with there being no way to predetermine which ones will actually be valid), it would be more effective to hire someone to track down the content owner's physical address and pay them a visit in person and offer them any amount of money to hand over the data. In fact, even buying the data center would be more economical than attempting to solve this by brute forcing GUIDs (over a network, no less!).
[0] Alan: Here's a more reasonable alternative: figure out what sites may have linked to the pages with the GUIDs in question and try to track down links rather than making random guesses. In any case, try to get a feeling of the economical importance of the content in question. If it is critical enough and no other option seems viable, consider the cost (resources, time, etc) of re-creating that content. It's guaranteed to be less (i.e. faster and cheaper) than the cost of solving the GUID problem (which for all practical intents and purposes you should simply consider infinite).
[0] juancn: "... I am not able rightly to apprehend the kind of confusion of ideas that could provoke such a question." —Charles Babbage, Passages from the Life of a Philosopher[3]
This was originally posted on Stack Exchange, but it has been deleted.

5 Answers

32 votes
9 comments

Ok, let's start with a math / computer science lesson:

Let's say you have a 3 GHz machine. That's 3e+9 clock cycles every second. You have 2^128 iterations to go through.

If we were to trim your code down to 1 clock cycle per iteration (impossible btw), it would take (2^128 / 3e9) = 1.13e+29 seconds to complete.

That is 3.15e+25 hours, and 3.59e+21 years. Better get started soon--you're going to be staring at your console for a LONG time.
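
For anyone who wants to double-check those figures, here is a quick sanity check using System.Numerics.BigInteger, so the arithmetic is exact rather than rounded through a double:

    using System;
    using System.Numerics;

    class TimeEstimate
    {
        static void Main()
        {
            BigInteger total = BigInteger.Pow(2, 128); // all possible Guids
            BigInteger perSecond = 3000000000;         // 1 iteration per cycle at 3 GHz (wildly optimistic)

            BigInteger seconds = total / perSecond;
            BigInteger hours = seconds / 3600;
            BigInteger years = hours / (24 * 365);

            Console.WriteLine(seconds); // ~1.13e+29
            Console.WriteLine(hours);   // ~3.15e+25
            Console.WriteLine(years);   // ~3.59e+21
        }
    }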

Bottom line: the problem is not your algorithm. It's the problem itself. I recommend you look for a better way.

answered by Stargazer712
Apr 5 '12 at 13:43Z
[0] SpoiledTechie.com: That's why I would like a better way to do things. I know the time needed and I need to speed this up.
[11] Stargazer712: @SpoiledTechie.com, you need to broaden your thinking. You cannot speed it up. The speed is not the problem; it's your definition of the problem itself.
[0] Damien_The_Unbeliever: @SpoiledTechie.com - we're telling you that there's no possible way to enumerate all of the guids in your lifetime, let alone something as slow as requesting each one from a web server.
[52] Raymond Chen: Look at it this way: Even if you could generate the GUIDs infinitely fast, your output file is going to be 2^128*16 = 2^132 bytes in size. That is around 10^27 terabytes. One terabyte of storage weighs around 500 grams. The mass of the earth is 10^24 kilograms, so before you run this program, you will need to acquire 500 earths and convert them all to hard drives.
[0] SpoiledTechie.com: Please read why this question should be reopened.
[0] LukeH: @SpoiledTechie.com: Unless you're only dealing with a small-ish known subset of all the possible guids then you're asking something that's impossible with current technology (no matter how powerful it is): Even if you could process a quadrillion guids per second, it'll still take you almost 11 quadrillion years to get through all the possibilities.
[0] SpoiledTechie.com: @LukeH it's possible with a GPU in 6 minutes: it.slashdot.org/story/11/01/13/2024237/…
[2] LukeH: @SpoiledTechie.com: Presumably he's brute-forcing against the WPA passwords, not just iterating through every possible 160-bit number. (The article mentions testing 400,000 possible passwords per-second; if he was only testing that many 160-bit numbers per-second he'd be waiting a really long time for any success.)
[12] phoog: @SpoiledTechie.com Suppose (using LukeH's figures) that you can process a quadrillion guids a second per thread, and you have a massively parallel environment allowing you to run one billion threads at the same time. Your program would still take 11,000 years to run. Using Stargazer712's more likely figures, with, say 100,000,000 processors in parallel, and you're still looking at 4e+13 years, nearly 3000 times longer than the age of the universe (1.4e+10: en.wikipedia.org/wiki/Age_of_the_universe).
12 votes
3 comments

If you look at the GUID spec you will see that they are constructed from specific sub-fields. If you can take a look at some of the GUIDs in your data, you might be able to get an idea of what those sub-fields contain, and you should be able to shorten your search time. For example, one of the sub-fields is supposed to be a "spatially unique node identifier" (usually the MAC address of the computer that generated the GUID). This is probably invariant across your data set (or possibly contains distinct values from a fairly small set), so that's 48 bits you don't have to iterate through. Likewise, other sub-fields are created from timestamps, so it should be possible to put upper and lower limits on those, thereby limiting your search space. Even if the GUIDs are truly random, there are still some fields (variant and version) that will remain fixed for all GUIDS generated by the same system, so that's at least seven bits you don't have to wade through.
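
As a rough illustration of the above (this assumes the site's Guids follow the standard RFC 4122 layout, which you would want to confirm against the known samples), here is how you could pull the version, variant, and node fields out of a Guid:

    using System;

    class GuidInspector
    {
        static void Main()
        {
            // Substitute one of the Guids you already have from the site.
            Guid sample = Guid.NewGuid();
            byte[] b = sample.ToByteArray();

            // Guid.ToByteArray() stores the first three fields little-endian,
            // so the high byte of time_hi_and_version lands in b[7]:
            int version = b[7] >> 4;  // high nibble = version (1..5)
            int variant = b[8] >> 5;  // top bits of clock_seq_hi_and_reserved

            Console.WriteLine($"version={version}, variant bits={Convert.ToString(variant, 2)}");

            // For version-1 (time-based) Guids, b[10..15] is the node (MAC) field.
            Console.WriteLine("node: " + BitConverter.ToString(b, 10, 6));
        }
    }

If every sample shares the same version, variant, and node values, the search space collapses accordingly.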

answered by TMN
Apr 5 '12 at 15:42Z
[0] asawyer: Except he would have no guarantee of any of this, or any information on how the Guids were generated.
[2] TMN: Presumably OP could examine a handful of GUIDs he already has to see what variant/version they are and verify that they follow the rules. If not then, yeah, all bets are off, but if so then it should help to cut down the search space quite a bit.
[0] SpoiledTechie.com: I have about 70 Guids already from the source. It's been noted that they have over 500,000 Guids though.
3 votes
2 comments

Drop the Console output; that makes it slow. Create a measuring test bench and get some variations running.

Chunk up the 'problem' space and let multiple threads run over different chunks.

Measure what the performance bottlenecks are (my guess would be either RAM throughput or disk throughput) and speed that up.

Can't you change the Guid into a byte array and simply add 1 at the correct index, carrying as needed? That would avoid a whole lot of string copying and speed things up, as sketched below.
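
A minimal sketch of that byte-array increment (note this treats the 16 bytes as a plain big-endian counter; it deliberately ignores Guid's mixed-endian internal layout, which doesn't matter if all you need is to visit every value once):

    using System;

    static class GuidCounter
    {
        // Adds 1 to a 16-byte counter in place, carrying from the last byte.
        // Returns false once the counter wraps past ff...ff.
        public static bool Increment(byte[] bytes)
        {
            for (int i = bytes.Length - 1; i >= 0; i--)
            {
                if (++bytes[i] != 0)
                    return true;  // no carry needed, done
                // this byte rolled over to 0x00: carry into the next one
            }
            return false; // every byte was 0xff: wrapped around
        }

        static void Main()
        {
            byte[] counter = new byte[16];
            Increment(counter);
            Console.WriteLine(BitConverter.ToString(counter)); // 00-00-...-00-01
        }
    }

No strings are allocated per step, which removes the biggest cost in the NextLetter approach.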

answered by CodingBarfield
Apr 5 '12 at 13:41Z
[10] L.B: Yes, using these hints, OP can shorten the execution time by a few centuries :)
[0] SpoiledTechie.com: Can you give an example of a measuring test bench? I am unfamiliar with that? Also another example of converting the guid to a byte array?
2 votes
8 comments

Not to try to feed the fire on this madness ;), but you could use this as an opportunity to work with the Task Parallel Library: break the problem into pieces and run it with multiple threads on a multi-core machine. (One way to do it would be to break the whole range (000... to zzz...) into a number of pieces equal to the number of logical cores, and have multiple threads working on it, as sketched below.)
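
A minimal illustration of that chunking idea with Parallel.For (the inner 120-bit loop is elided, for reasons the other answers make clear):

    using System;
    using System.Threading.Tasks;

    class ChunkedScan
    {
        static void Main()
        {
            // One chunk per value of the top byte; the TPL spreads the
            // chunks across the available logical cores.
            Parallel.For(0, 256, top =>
            {
                byte[] counter = new byte[16];
                counter[0] = (byte)top; // this worker owns the slice starting with 'top'
                // ... iterate the remaining 15 bytes of this slice here ...
            });
        }
    }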

Edit

Or, if you want to get even more tricky about it, work out a MapReduce algorithm for it and solve it on the Amazon cloud: http://en.wikipedia.org/wiki/MapReduce

answered by JMarsch
Apr 5 '12 at 13:42Z
[0] SpoiledTechie.com: that makes sense. Could you provide a code snippet using the Task Parallel Library?
[0] L.B: Yes, using these hints, OP can shorten the execution time by a few centuries :)
[1] JMarsch: @L.B. I have to agree -- this would be a fun academic exercise, trying to optimize the heck out of a parallel task, but I don't think you can really solve this in a reasonable time frame. (Although, there is a counterpoint -- a while back, some hackers were able to break WPA security by using the Amazon cloud as a supercomputer; they broke it in 6 minutes with brute force, massively parallelized using the Amazon compute cloud. it.slashdot.org/story/11/01/13/2024237/…)
[0] SpoiledTechie.com: @JMarsch Please read the Reopening reason for this question and you will see I do have access to that kind of computing.
[1] JMarsch: If you really need to do this, and you have access to the resources, your best bet is to make the problem as parallelizable as possible. The best parallel solution is one with no shared state (so no need to lock resources between threads). Read up on MapReduce (it's all about that strategy -- it's how Google runs one search across 1,000 machines). I still think you are up against a problem we don't have the hardware to solve, but to stand a chance, you need to really build an understanding of parallel computing.
[0] JMarsch: Here's another really good intro to Map Reduce: joelonsoftware.com/items/2006/08/01.html
[0] sehe: Ok, you want to develop the algorithm as simply as possible and scale across different machines, not cores. If you must utilize multiple cores, you want multi-process, not multi-thread. This is because you cannot afford to have to debug and improve the program. You cannot afford to 'restart' the operation, any more than you can expect to complete it.
2 votes
1 comment

This is the wrong approach to incrementing GUIDs. Use the BigInteger class (in System.Numerics) with a 16-byte value. (Or write your own number class/struct that uses two ulong fields. It's not too hard, but it would be a lot easier in a native language with access to the carry flag.) Then use the Guid constructor that takes 16 bytes to create the GUID.

Also, I don't believe that Double has the necessary precision to increment the number you gave it by one. Did you verify that? Your NextLetter method is awful. Ditch it. And don't print anything to screen until you've found an answer.
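
A minimal sketch of that suggestion, using .NET 4's System.Numerics.BigInteger (the loop is capped for the demo; note that Guid's mixed-endian layout means the printed strings won't tick over like a simple hex odometer, but every value is still produced exactly once):

    using System;
    using System.Numerics;

    class BigIntegerGuids
    {
        static void Main()
        {
            BigInteger counter = 0;

            for (int i = 0; i < 5; i++) // demo only; the real bound would be 2^128
            {
                // BigInteger.ToByteArray() is little-endian and variable-length,
                // so copy it into a fixed 16-byte buffer (excess sign byte ignored).
                byte[] raw = counter.ToByteArray();
                byte[] bytes = new byte[16];
                Array.Copy(raw, bytes, Math.Min(raw.Length, 16));

                Console.WriteLine(new Guid(bytes));
                counter += 1; // exact integer increment; no double-precision rounding
            }
        }
    }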

answered by Brannon
Apr 5 '12 at 13:49Z
[0] SpoiledTechie.com: Do you have a more efficient method than the NextLetter method?