Wednesday, March 7, 2018

Parallel external merge sorting for GitHub code base


Imagine having access to 25 million repositories hosting over 1 billion commits each year. These repositories span diverse languages and platforms, constituting an immense collection of knowledge. The statistics I just mentioned pertain to GitHub's data for the year 2017.

One immediate benefit of this vast collection is the potential to find existing solutions for your own projects. Instead of starting from scratch every time, you can leverage the wealth of shared code to gain valuable insights.

However, searching for the right solution on GitHub can be a challenge. So, how can you effectively navigate this vast repository?

One approach is to extract domain-specific subjects from the code and then filter projects accordingly, focusing only on those relevant to your specific domain. But here's the catch: source code corpora contain an overwhelming amount of text. With data sets reaching terabytes, not every sorting algorithm can handle data of this size efficiently.

This is where Dzmitry Huba's contribution comes in. He has described a Parallel External Merge Sort algorithm, designed for sorting data sets that do not fit into the memory of a single small computer. Thanks to this algorithm, you can parse code files and analyze the frequency of the words used within them, identifying the most commonly used words efficiently.

As a first step, it helps to count the words in each file and see which terms dominate. The snippet below walks a directory tree, tokenizes every matching file, and writes a word-frequency list sorted by descending count.

using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;
using System.Text.RegularExpressions;

class WordCounter
{
    // args[0] – root directory, args[1] – search pattern (e.g. "*.cs"), args[2] – output file
    public static void Main(string[] args)
    {
        if (!Directory.Exists(args[0]))
            return;

        File.Delete(args[2]);
        var words = new List<string>();

        foreach (var file in Directory.GetFiles(args[0], args[1], SearchOption.AllDirectories))
        {
            string txt = File.ReadAllText(file);

            // Replace every non-alphanumeric character (punctuation, brackets,
            // operators, …) with a space, then split on whitespace to get the words.
            txt = new Regex("[^a-zA-Z0-9]").Replace(txt, " ");
            words.AddRange(txt.Split(
                new[] { ' ', '\t' },
                StringSplitOptions.RemoveEmptyEntries));
        }

        // Group identical words and order the groups by descending frequency.
        // Aggregating once, after the loop, avoids re-writing the growing list
        // on every file.
        var word_query = words
            .GroupBy(w => w)
            .OrderByDescending(g => g.Count());

        File.WriteAllText(args[2], string.Join(Environment.NewLine,
            word_query.Select(g => string.Format("{0}:{1}", g.Key, g.Count()))));
    }
}
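The counter above assumes the whole word list fits in memory. When it does not, an external merge sort of the kind the post refers to splits the input into chunks that do fit in RAM, sorts each chunk, spills it to a temporary file, and then performs a k-way merge of the sorted runs. Below is a minimal sketch of that idea under my own naming (the class `ExternalSort` and method `SortLines` are hypothetical, not Dzmitry Huba's API); the per-chunk sorts in phase 1 are the part that can be parallelized.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;

static class ExternalSort
{
    // Sorts the lines of 'input' into 'output' while holding at most
    // 'chunkSize' lines in memory at a time.
    public static void SortLines(string input, string output, int chunkSize)
    {
        // Phase 1: cut the input into sorted runs on disk.
        // (Each run could be sorted by a separate worker in a parallel variant.)
        var runs = new List<string>();
        var buffer = new List<string>(chunkSize);
        foreach (var line in File.ReadLines(input))
        {
            buffer.Add(line);
            if (buffer.Count == chunkSize) runs.Add(WriteRun(buffer));
        }
        if (buffer.Count > 0) runs.Add(WriteRun(buffer));

        // Phase 2: k-way merge — repeatedly emit the smallest head line
        // among all runs, advancing only the run it came from.
        var readers = runs.Select(r => new StreamReader(r)).ToList();
        var heads = readers.Select(r => r.ReadLine()).ToList();
        using (var writer = new StreamWriter(output))
        {
            while (true)
            {
                int min = -1;
                for (int i = 0; i < heads.Count; i++)
                    if (heads[i] != null &&
                        (min < 0 || string.CompareOrdinal(heads[i], heads[min]) < 0))
                        min = i;
                if (min < 0) break;              // every run is exhausted
                writer.WriteLine(heads[min]);
                heads[min] = readers[min].ReadLine();
            }
        }
        foreach (var r in readers) r.Dispose();
        foreach (var r in runs) File.Delete(r);  // clean up the temporary runs
    }

    static string WriteRun(List<string> buffer)
    {
        buffer.Sort(StringComparer.Ordinal);
        var path = Path.GetTempFileName();
        File.WriteAllLines(path, buffer);
        buffer.Clear();
        return path;
    }
}
```

Sorting the word list this way puts identical words on adjacent lines, so the frequency counts can afterwards be computed in a single streaming pass instead of an in-memory `GroupBy`.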

By utilizing Dzmitry Huba's parser and the power of Parallel External Merge Sorting, you can now extract valuable insights from the vast pool of code available on GitHub.

