That message was from 3 months ago and I am sure things have gotten better.
1) As far as I understand, a Spark DataFrame can point to on-disk data (i.e. Hive etc.), but since a degree of the query processing is performed in memory (parts of the scatter), I can still run into issues depending on how much data I am pulling in from the disk-based database and on the specifics of the operations I am running.
2) Care has to be taken that the physical operations are capable of representing the full space of logical operations.
For instance, consider this hypothetical operation:
a = load table from src 1 (it has 1000 rows)
b = load table from src 2 (it has 1000 rows)
a2 = a.filter(some condition 1)
b2 = b.filter(some condition 2)
# a2 and b2 coincidentally have the same number of rows (say... 400 rows each)
combine columns in (a2, b2)
# i.e. a new table with 400 rows, but with both the columns of a and the columns of b. This is not a join.
# In a way it is equivalent to "add sequential row numbers to a2", "add sequential row numbers to b2", "join them and sort".
This operation is somewhat non-relational and messy/inefficient to express on a SQL relational backend (at least I can't think of a nice way to represent it relationally; I am very much willing to be corrected on this). I can think of ways to do it by creating intermediate tables with implicit row numbers, but it is not so clear how to do this cleanly; see the sketch below.
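To make that concrete, here is a minimal sketch of the row-number emulation, written against an SFrame-style API (the import path and the add_row_number/join/sort names are my assumptions here, not a statement of what any particular backend exposes):

from sframe import SFrame   # assumed import; graphlab/turicreate expose the same class

a = SFrame({'x': list(range(1000))})        # load table from src 1 (1000 rows)
b = SFrame({'y': list(range(1000, 2000))})  # load table from src 2 (1000 rows)

a2 = a[a['x'] < 400]     # some condition 1 -> 400 rows
b2 = b[b['y'] >= 1600]   # some condition 2 -> coincidentally also 400 rows

# "combine columns": pair rows up positionally, not by any key.
# Emulated by materializing explicit row numbers and joining on them.
a2 = a2.add_row_number('row_id')
b2 = b2.add_row_number('row_id')
combined = a2.join(b2, on='row_id').sort('row_id')

On a SQL backend the same trick is roughly ROW_NUMBER() OVER () on each side followed by a join, which works, but it is a detour for what is logically just "glue these columns side by side".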
Also, because "columns are first class", we allow arbitrary modifications of columns (the SFrame behaves more like a dictionary of columns than a table), which we find much more natural. For example, normalizing a column:
sf['y'] = sf['y'] / sf['y'].sum()
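A couple more lines in the same spirit (a minimal sketch; 'sf', the column names, and methods like apply/mean/column_names are illustrative assumptions, not a spec):

import math
from sframe import SFrame   # assumed import

sf = SFrame({'y': [1.0, 2.0, 3.0, 4.0]})
sf['y'] = sf['y'] / sf['y'].sum()                            # the normalization from above
sf['y_zscore'] = (sf['y'] - sf['y'].mean()) / sf['y'].std()  # derive a column from other columns
sf['y_log1p'] = sf['y'].apply(lambda v: math.log(v + 1))     # arbitrary per-element transform
print(sf.column_names())                                     # columns come and go like dict keys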
Yes, the first hypothetical operation is somewhat synthetic, but in a way, designing a logical abstraction to support multiple backends also means that the logical abstraction can contain only the lowest common denominator of all backends.
3) Direct inference can be done meaningfully only if the JSON schema is well defined, i.e. every row has a dictionary with the same keys/values. What we have instead is a CSV parser (which, as it turns out, can read JSON quite well), native dictionary and list types (both of which are "schema-free" and support arbitrarily recursive types), and operators that let you do bulk manipulation on these types (unpacking dictionaries/lists, packing them, stacking them, etc.). This is somewhat more flexible and lets us handle some really, really irregular JSON schemas we have encountered.
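As a rough illustration of what I mean by unpack/stack on schema-free dictionaries (method names per the SFrame-style API as I recall it; treat the exact signatures as assumptions):

from sframe import SFrame   # assumed import

# each row carries a schema-free dict; keys do not have to agree across rows
sf = SFrame({'meta': [{'a': 1, 'b': 2},
                      {'a': 5},
                      {'b': 7, 'c': [1, 2, 3]}]})

wide = sf.unpack('meta')   # dict keys become columns; missing keys become None
tall = sf.stack('meta', new_column_name=['key', 'value'])   # one output row per (key, value) pair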
(One of the stranger ones is a JSON object where one of the string fields is itself JSON escaped as a string. This is easy to convert with what we have, since sf['string_column'].astype(dict) performs a full "JSON-like" parse on each cell. Really, our parser handles a superset of JSON.)
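For that nested case, the workflow looks roughly like this (a sketch; beyond astype(dict), the unpack calls and the column_name_prefix argument are assumptions about the API):

from sframe import SFrame   # assumed import

# raw JSON text where the "extra" field is itself JSON escaped as a string
raw = SFrame({'line': ['{"user": "alice", "extra": "{\\"clicks\\": 3}"}',
                       '{"user": "bob", "extra": "{\\"clicks\\": 7}"}']})

parsed = SFrame({'rec': raw['line'].astype(dict)}).unpack('rec', column_name_prefix='')
parsed['extra'] = parsed['extra'].astype(dict)   # the same cast peels the second layer
flat = parsed.unpack('extra', column_name_prefix='extra')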
For now I think our SFrame has a somewhat richer API surface area (at least for interesting types), though your DataFrame has nicer connectors as well as distributed execution. In any case, we have also added a query planner/optimizer, and this allows us to look into retargetable backends too :-) So real execution architecture differences may start to fade over time.