Confusion regarding side-effects section of Google's MapReduce Research Paper | Computer Science: Theory and Application
- Confusion regarding side-effects section of Google's MapReduce Research Paper
- What online courses provide 100% up-to-date materials and are worth paying for?
- Back of the envelope estimation hacks
- Made another tutorial that teaches you how to build a real-time International Space Station tracker using JavaScript! Very well explained!
- Seeking Professional Advice - Is CS degree needed for design/development of websites and apps?
- Best current methods for clustering high-dimensional data on the fly, with additional data being added to the set?
- Non-Tech Talking in Tech Talk. What do you say?
- Nearest-Neighbor on Massive Datasets
- What are the hardware requirements for building and understanding simple AI?
- Latest from Microsoft Mixed Reality & AI Lab researchers: great applications for mixed reality. State of the art in 3D model fitting!
- Uber’s take on JVM tuning
- An Illustrated Data Structures Cheat Sheet with Working Code
- SAS vs R vs SPSS
- Agile Management & Methodology
- MS Excel Shortcuts
Confusion regarding side-effects section of Google's MapReduce Research Paper Posted: 02 Aug 2020 01:32 AM PDT I'm reading Google's MapReduce paper, released several years ago, and I have a question about section 4.5 (Side-effects). What I understand is that if a worker fails while writing the output of a map task, that task is re-run on another worker. In general, that would cause problems for a non-deterministic program: one reduce worker may already have read the original output, while another reduce worker reads the different output produced by the re-run. So why is there a separate mention of the case where multiple output files are produced? Following is a snippet of the section:
I'm trying to wrap my head around this but I'm getting really confused. If anyone could help me out or give me a push in the right direction, I'd be really grateful. Thanks [link] [comments] |
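Not an answer to the question itself, but it may help readers to see what the paper means when it says application-level side-effect files should be made atomic and idempotent. Below is a minimal sketch of the write-to-a-temp-file-then-rename pattern that section 4.5 points to; the function name and record format are made up for illustration, not taken from the paper or from any MapReduce implementation.

```python
import os
import tempfile

def write_side_effect_file(records, final_path):
    """Write an auxiliary output file atomically and idempotently.

    The data is first written to a private temporary file in the same
    directory, then renamed to its final name only once it is complete.
    If the worker dies mid-write, no partially written file becomes
    visible, and a re-executed task simply produces the same final name
    again, so readers never observe a half-finished file.
    """
    dir_name = os.path.dirname(final_path) or "."
    fd, tmp_path = tempfile.mkstemp(dir=dir_name)
    try:
        with os.fdopen(fd, "w") as f:
            for rec in records:
                f.write(rec + "\n")
        os.replace(tmp_path, final_path)  # atomic rename on POSIX filesystems
    except BaseException:
        os.unlink(tmp_path)  # clean up the partial temp file on failure
        raise
```

With a single output file, the framework's atomic rename of the reduce output already gives this guarantee; the separate discussion in the paper concerns tasks that emit multiple auxiliary files, where there is no cross-file two-phase commit.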
What online courses provide 100% up-to-date materials and are worth paying for? Posted: 01 Aug 2020 03:35 PM PDT I need your honest recommendation/advice on the following brands. Thank you. (1) Lynda/LinkedIn (2) Linux Academy (3) Pluralsight (4) O'Reilly, and (5) egghead. Note: which of them provide a certificate after completing the course? [link] [comments] |
Back of the envelope estimation hacks Posted: 02 Aug 2020 03:56 AM PDT |
Made another tutorial that teaches you how to build a real-time International Space Station tracker using JavaScript! Very well explained! Posted: 02 Aug 2020 02:49 AM PDT You can read the tutorial here on my blog: https://thecodingpie.com/post/build-a-real-time-iss-tracker-using-javascript/ Live real-time ISS tracker made with JavaScript. I tried my best to break this tutorial into small steps so that any beginner can understand it. Hope you like it :) As always, any feedback is appreciated... [link] [comments] |
Seeking Professional Advice - Is CS degree needed for design/development of websites and apps? Posted: 02 Aug 2020 02:12 AM PDT |
Best current methods for clustering high-dimensional data on the fly, with additional data being added to the set? Posted: 01 Aug 2020 01:23 PM PDT Years ago I did early research work in various forms of unsupervised learning, but I've been away from this area for a long time. I now have an application for some old work I did in this area, but I'm trying to find what the state of the art is now. So: I have M instances of N-dimensional vectors (most likely between 10 and 20+ dimensions). I have no a priori idea how many data points there are, but I know the set will grow over time. I'm looking to find the clusters in this data, though I can't predict how many clusters there are or how they might overlap, so I can't pre-set a number of categories. I want the algorithm to be able to figure this out on the fly, and continue re-figuring as new data points are added to the set. I also want to be able to identify a new data point's categorical cluster, and quickly find other instances near it in N dimensions, whether in its cluster or not, with a minimum of checking individual instances. My go-to for this (being ancient) is an evolutionary variant of Kohonen's LVQ3 algorithm, but I've toyed with K-means as well. Is this a known/solved problem? Are there different/better algorithms used for this now? And, if this isn't the place for a question like this, what's a good subreddit for discussing it? (I asked this in a weekly discussion thread in /r/dataisbeautiful as well; no responses thus far.) Thanks. [link] [comments] |
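Not a definitive recommendation, but as one starting point for "clusters on the fly without a preset k": streaming methods such as BIRCH (which supports incremental fitting in scikit-learn) or online variants of density-based clustering address exactly this, and approximate nearest-neighbor indexes (e.g., FAISS or Annoy) handle the "quickly find nearby instances" part. The sketch below is a toy leader-style clusterer in NumPy that illustrates the basic shape of an incremental, threshold-based approach; the class name and the distance threshold are illustrative assumptions, not a recommended production design.

```python
import numpy as np

class LeaderClusterer:
    """Toy incremental clustering: a point joins the nearest existing
    cluster if it lies within `threshold` of that cluster's centroid,
    otherwise it starts a new cluster. No cluster count is fixed up front,
    and points can be added one at a time as the dataset grows."""

    def __init__(self, threshold):
        self.threshold = threshold
        self.centroids = []   # running mean of each cluster
        self.counts = []      # number of points assigned to each cluster

    def add(self, x):
        x = np.asarray(x, dtype=float)
        if self.centroids:
            dists = np.linalg.norm(np.array(self.centroids) - x, axis=1)
            best = int(np.argmin(dists))
            if dists[best] <= self.threshold:
                # update the running mean of the winning cluster
                self.counts[best] += 1
                self.centroids[best] += (x - self.centroids[best]) / self.counts[best]
                return best
        # no existing cluster is close enough: start a new one
        self.centroids.append(x.copy())
        self.counts.append(1)
        return len(self.centroids) - 1

    def nearest_cluster(self, x):
        """Index of the cluster whose centroid is closest to x."""
        dists = np.linalg.norm(np.array(self.centroids) - np.asarray(x, dtype=float), axis=1)
        return int(np.argmin(dists))

# Example usage on synthetic 12-dimensional data.
clusterer = LeaderClusterer(threshold=4.0)
for vec in np.random.randn(1000, 12):
    clusterer.add(vec)
print(len(clusterer.centroids), "clusters formed")
```

The obvious trade-off is that the result depends on the threshold and on arrival order; BIRCH-style clustering-feature trees are the more principled refinement of the same idea.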
Non-Tech Talking in Tech Talk. What do you say? Posted: 01 Aug 2020 04:50 PM PDT What are some words that are not definitively technical in the way constructor or parameter are (words basically found only in programming), but that come up while talking about programming anyway? Words that a layman might not hear otherwise? For example, I find I use "syntax" often, but I could probably teach someone to code without it. I can't say the same for a word like "variable". I also remember, from about a decade ago, that the first word my CS teacher ever defined for us was "jargon", which also fits. Can you think of any? [link] [comments] |
Nearest-Neighbor on Massive Datasets Posted: 01 Aug 2020 12:32 PM PDT In a previous article, I introduced an algorithm that can cluster a few hundred thousand N-dimensional vectors in about a minute or two, depending upon the dataset, by first compressing the data down to a single dimension. The impetus for that algorithm was thermodynamics, specifically clustering data expanding about a point, e.g., a gas expanding in a volume. That algorithm doesn't work for all datasets, but it is useful in thermodynamics, and probably object tracking as well, since it lets you easily identify the perimeter of a set of points. Below is a full-blown clustering algorithm that can nonetheless handle enormous datasets efficiently. Specifically, attached is a simple classification example consisting of two classes of 10,000 ten-dimensional vectors each, for a total of 20,000 vectors. The classification task takes about 14 seconds, running on an iMac, with 100% accuracy. In addition to clustering the data, the classification algorithm generates a compressed representation of the dataset, which in turn allows the use of the nearest-neighbor method, an extremely efficient prediction method that is, in many real-world cases, mathematically impossible to beat in terms of accuracy. That said, even though nearest-neighbor is extremely efficient, it could easily start to get slow with a dataset of this size, since you are still comparing an input vector to the entire dataset. This method of clustering therefore allows you to use nearest-neighbor on enormous datasets, simply because the classification process generates a compressed representation of the entire dataset. In the specific case attached below, the dataset consists of 20,000 vectors, and the compressed dataset fed to the nearest-neighbor algorithm consists of just 4 vectors. Classification predictions occurred at a rate of about 8,000 predictions per second, with no errors at all over all 20,000 vectors. https://derivativedribble.wordpress.com/2020/08/01/nearest-neighbor-on-massive-datasets/ [link] [comments] |
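The post links to the full write-up; the sketch below is not that algorithm, just a toy illustration of the core idea it describes: compress the training set to a handful of representative vectors, then run nearest-neighbor against those representatives instead of against all 20,000 points. The synthetic data, the per-class-centroid compression, and the accuracy printed here are illustrative assumptions, not the author's method or results.

```python
import numpy as np

def compress(X, y):
    """Collapse each class to its centroid, giving a tiny 'compressed'
    training set with one representative vector per class."""
    classes = np.unique(y)
    reps = np.array([X[y == c].mean(axis=0) for c in classes])
    return reps, classes

def predict(reps, classes, queries):
    """Nearest-neighbor classification against the representatives only."""
    # pairwise distances, shape (n_queries, n_representatives)
    d = np.linalg.norm(queries[:, None, :] - reps[None, :, :], axis=2)
    return classes[np.argmin(d, axis=1)]

# Two well-separated classes of 10,000 ten-dimensional vectors each,
# loosely mirroring the setup described in the post.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (10_000, 10)), rng.normal(5, 1, (10_000, 10))])
y = np.array([0] * 10_000 + [1] * 10_000)

reps, classes = compress(X, y)
accuracy = (predict(reps, classes, X) == y).mean()
print(accuracy)  # close to 1.0 on this easy synthetic data
```

Each prediction here compares against 2 representatives instead of 20,000 training points, which is where the speedup in this style of approach comes from; the cost is that accuracy now depends entirely on how well the representatives summarize the data.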
What are the hardware requirements for building and understanding simple AI? Posted: 01 Aug 2020 01:25 PM PDT I don't know if this is the right place to post, but I was hoping you could answer my question. (To clarify: I mean hardware for a PC.) [link] [comments] |
Latest from Microsoft Mixed Reality & AI Lab researchers: great applications for mixed reality. State of the art in 3D model fitting! Posted: 01 Aug 2020 10:57 AM PDT |
Uber’s take on JVM tuning Posted: 01 Aug 2020 10:23 AM PDT |
An Illustrated Data Structures Cheat Sheet with Working Code Posted: 01 Aug 2020 05:24 AM PDT |
SAS vs R vs SPSS Posted: 01 Aug 2020 06:09 AM PDT |
Agile Management & Methodology Posted: 01 Aug 2020 06:08 AM PDT |
MS Excel Shortcuts Posted: 01 Aug 2020 06:09 AM PDT |