News in category ‘Syndication’

Will Oracle kill MySQL?

Jul 28, 2010

I get asked this question often. It came up again recently at a NYTECH executive breakfast with RedHat CIO Lee Congdon.
The short answer is No.
There is clear evidence that in the short to medium term Oracle will continue to promote and enhance MySQL. Some of these indicators include:

EU 10 point commitment in December 2009 – See Oracle Makes Commitments to Customers, Developers and Users of MySQL
MySQL Conference April 2010 – Opening keynote by Edward Screven State of the Dolphin
Oracle Magazine Jul/Aug 2010 – Interview with Edward Screven Open for Business.

It is clear from these sources that Oracle intends to incorporate MySQL into its Oracle Backup and Security Vault products, a step that is both practical and necessary. There is also a clear mention of focusing on the Microsoft platform, an indicator that SQL Server is in their sights without them actually saying it.
What is unknown is exactly how and when features will be implemented. Also important is how much these may cost the end user. Oracle is in the business of selling, and it now sells an entire H/W and S/W stack. They also have a complicated pricing model of different components within product offerings. I assume this will continue. There are already two indications: InnoDBbackup included for Enterprise Backup (from the April keynote) and the 5.1 enterprise split. (Note: while this split may have existed prior to Oracle, it is now more obvious.)
MySQL can never be seen as drawing away from any Oracle sales of the core entry level database product. It is likely Oracle will provide a SQL syntax compatibility layer for MySQL within 2 years; however, I’m sure it will be a commercial add-on. Likewise, I would suspect a PL/SQL lite layer within 5 years, but again at a significant cost to offset the potential loss of sales in the low end of the server market. There continues to be active development in the MySQL Enterprise Monitor, MySQL Workbench and MySQL Connectors, which is all excellent news for users.
Moving forward, how long will this ancillary development of free tools continue? What will happen to the commercial storage engine, OEM and licensing model after the 5 year commitment? How will the MySQL ecosystem survive? There is active development in the Percona, MariaDB and Drizzle forks; however, unless all players that want to provide a close MySQL compatible solution work together, progress will continue to be a disappointing, disjointed approach. The 2011 conference season will also see a clear line with competing MySQL conferences in April scheduled at the same time: the O’Reilly MySQL conference in Santa Clara, California and the Oracle supported(*) Collaborate 2011 in Orlando, Florida.
I have a number of predictions on what Oracle ME MySQL may look like in 5 years; however, this is a topic for a personal discussion.

PHP for Android, PHP 6 canceled, APC in PHP 5.4

Jul 26, 2010

By Manuel Lemos
On this episode of the Lately in PHP podcast, Manuel Lemos and Ernani Joppert comment on the launch of the PHP for Android project and the consequences for the PHP market.

They also talk about the cancellation of PHP 6 and the inclusion of features planned for PHP 6 in PHP 5.4, like the integration of the APC cache extension in the main PHP distribution bundle.

Some of the most interesting classes nominated for the May edition of the PHP Programming Innovation Award are also discussed, like the PDF text extract, PHP duplicate files finder, Fast Fourier Transform and splx_graph.

JavaOne News Update 1

Jul 26, 2010


An update on some recent news on JavaOne 2010. As you know, JavaOne San Francisco is Sep 19-23, 2010. The Official page has links to the Registration Page and the Online Catalog. News updates include:

• A surprisingly useful & manageable Catalog-as-tweets via @javaoneconf
• Availability of Schedule Builder (post)
• Open enrollment in Java University (post)
• Announcement of dates for JavaOne Brazil and JavaOne China (post).
• The day before there is a MySQL Sunday!
• And, the Duke Awards submissions page seems to still be active.

Also, this year will be the 15th anniversary for Java, and the 5th for GlassFish.  Don’t know if there will be a BDay party for Java; still hoping we can put something together for GlassFish, we will see!

More related news is tagged JavaOne.

Rails on PostgreSQL: Pivotal Labs Talk – Scaling a Rails App with Postgres

Jul 24, 2010

I’m slowly catching up with my podcast backlog and came across a Pivotal Labs talk from May 2009. In this talk Josh Susser and Damon McCormick present on Scaling a Rails App with Postgres. It’s a little dated now – this talk was given when PostgreSQL 8.4 was in beta – but, still, lots of good stuff. Here are some notes:

  • They started with an existing Rails app with lots of data, so they had some constraints – not greenfield development.
  • Around the 5-6 minute mark there’s a good discussion of PostgreSQL’s query optimizer and how it analyzes a table’s data distribution. One takeaway (mentioned around 16:20) is to run vacuum more often on a particular table if there are a lot of writes.
  • 10:00 How to set STATISTICS for a particular table.
  • 11:00 Using partial indexes.
  • 14:00 Indexing on expressions.
  • 18:10-23:00 A nice discussion of the EXPLAIN output.
  • 23:45 Here they talk about wide columns. I’ve seen this in MySQL as well, where splitting text data out into a separate table yielded some good speedups.
  • 26:10 Some discussion of pg_bench.
  • 35:30 How long does it take to add an index to large tables? They saw times of up to an hour for tables with millions of rows.
  • 36:30 Clustering your data in order to get PostgreSQL to write it more efficiently.
  • 37:30-48:00 A thorough discussion of partitioning tables via table inheritance. They used an ActiveRecord model (39:23) with a bunch of utility methods. They also had a cron job to periodically create new partitions. At 45:15 they make a nice distinction between using partial indexes and partitions – one advantage is that a partition’s indexes can be different from its parent’s indexes. At 49:00 they mention maybe doing a plugin, not sure if that happened.
  • 52:00 Some discussion of full text search via tsearch.
  • 53:00 PostgreSQL’s lack of built in replication outside of WAL shipping, Slony, etc. Thank goodness 9.0 will address this!
  • 54:00 Some props to Engine Yard on their PostgreSQL support.

Good stuff all around, and thanks to Pivotal for posting these great talks!

Ketan Padegaonkar: Code Complexity Visualization for Ruby

Jul 21, 2010

Only Valid Measure of Code Quality


Image from http://www.osnews.com/story/19266/WTFs_m

WTF implies lack of clarity. Clear code is easier to understand, easier to maintain and easier to extend.

Announcing saikuro_treemap – an easy-to-set-up tool that generates complexity treemaps of Ruby code.

See a demo for yourself.

Complexity Visualization of Rake

Estimating Replication Capacity

Jul 21, 2010

It is easy for MySQL replication to become a bottleneck even when the master server is not seriously loaded, and the more cores and hard drives the servers get, the larger the difference becomes, as long as replication remains a single-threaded process. At the same time, it is a lot easier to optimize your system when replication runs normally: if you need to add or remove indexes or make other schema changes, you will probably be looking at methods involving replication if you can’t take your system down. So here comes the catch in many systems: we find the system needs optimization when replication can’t catch up, yet the optimization process we’re going to use relies on replication being functional and able to catch up quickly.
So the question becomes: how can we estimate replication capacity, so that we can deal with replication load before the slave is unable to catch up?
Knowing replication capacity is needed not only when you plan to use replication to perform system optimization; it is needed in other cases too. For example, in a sharded environment you may need to schedule downtime or set an object read-only to move it to another shard. It is much nicer if this can be planned in advance rather than done on an emergency basis when the slave(s) are unable to catch up and the application is suffering because of stale data. This especially applies to Software as a Service providers, which often have very strict SLAs with their customers and can have a lot of data per customer, so a move can take a considerable amount of time.
So what is replication capacity? I call replication capacity the ability to replicate the master load. If replication is able to replicate 3 times the write load from the master without falling behind, I will call it a replication capacity of 3. When used in the context of applying binary logs (for example, point-in-time recovery from backup), a replication capacity of 1 means you can replay 1 hour’s worth of binary logs within 1 hour. I will call “replication load” the inverse of replication capacity: this is basically the percentage of time the replication thread was busy replicating events versus staying idle.
Note that you can speak about idle replication capacity, when the box does not do anything else, as well as loaded replication capacity, when the box serves its normal load. Both are important. You care about idle replication capacity when you have no load on the slave and need it to catch up, or when restoring from backup; the loaded replication capacity matters during normal operation.
So we have defined what replication capacity is. There are, however, no tools which can tell us directly what the replication capacity is for a given system. It also tends to float depending on the load, similar to the loadavg metric. Here are some of the ways to measure it:
1) Use the “UserStats” functionality from the Google patches, which is now available in Percona Server and MariaDB. This is probably the easiest and most accurate approach, but it does not work in Oracle MySQL Server. Set userstat_running=1 and run the following query:

mysql> SELECT * FROM information_schema.user_statistics WHERE user="#mysql_system#" \G
*************************** 1. row ***************************
USER: #mysql_system#
TOTAL_CONNECTIONS: 1
CONCURRENT_CONNECTIONS: 0
CONNECTED_TIME: 446
BUSY_TIME: 74
CPU_TIME: 0
BYTES_RECEIVED: 0
BYTES_SENT: 63
BINLOG_BYTES_WRITTEN: 0
ROWS_FETCHED: 0
ROWS_UPDATED: 127576
TABLE_ROWS_READ: 4085689
SELECT_COMMANDS: 0
UPDATE_COMMANDS: 119127
OTHER_COMMANDS: 89557
COMMIT_TRANSACTIONS: 90259
ROLLBACK_TRANSACTIONS: 0
DENIED_CONNECTIONS: 1
LOST_CONNECTIONS: 0
ACCESS_DENIED: 0
EMPTY_QUERIES: 0
1 row in set (0.00 sec)

In this case CONNECTED_TIME is 446 seconds, out of which the replication thread was busy (BUSY_TIME) for 74 seconds, which means the replication capacity is 446/74 ≈ 6.
You normally would not want to measure it from the start, but rather take the difference in these counters every 5 minutes or at another interval of your choice.
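The interval-based measurement can be sketched in a few lines of Ruby. This is only an illustration under stated assumptions: the `replication_capacity` helper and the hash keys are hypothetical names, the first sample uses the counters from the output above, and the second sample is made up. How the counters are fetched from the server is deliberately left out.

```ruby
# Replication capacity from two samples of the user_statistics counters.
# Each sample is a hash holding the CONNECTED_TIME and BUSY_TIME values
# (in seconds) read from the #mysql_system# row.
def replication_capacity(previous, current)
  elapsed = current[:connected_time] - previous[:connected_time]
  busy    = current[:busy_time] - previous[:busy_time]
  return Float::INFINITY if busy.zero? # thread was idle for the whole interval
  elapsed.to_f / busy
end

# Counters from the sample output above, measured from the start.
since_start = replication_capacity(
  { connected_time: 0, busy_time: 0 },
  { connected_time: 446, busy_time: 74 }
)
puts format("capacity since start: %.2f", since_start)

# Hypothetical second sample taken 300 seconds later (made-up numbers).
interval = replication_capacity(
  { connected_time: 446, busy_time: 74 },
  { connected_time: 746, busy_time: 174 }
)
puts format("capacity over the last 5 minutes: %.2f", interval)
```

Working on deltas rather than absolute counters is what lets the figure follow the load as it changes during the day.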
2) Use the full slow query log and mk-query-digest. This method is great for one-time execution, especially as it also gives you the list of queries which load replication the most. However, it works only with statement-level replication. You need to set long_query_time=0 and log_slave_slow_statements=1 for this method to work.
Get the log file, which will include all queries the MySQL server ran with their times, and run mk-query-digest with a filter to only check queries from the replication thread:
mk-query-digest slow-log --filter '($event->{user} || "") =~ m/[SLAVE_THREAD]/' > /tmp/report-slave.txt
In the report you will see something like this as a header:

# 475s user time, 1.2s system time, 80.41M rss, 170.38M vsz
# Current date: Mon Jul 19 15:12:24 2010
# Files: slow-log
# Overall: 1.22M total, 1.27k unique, 558.56 QPS, 0.37x concurrency ______
# total min max avg 95% stddev median
# Exec time 819s 1us 92s 669us 260us 120ms 93us
# Lock time 28s 0 166ms 23us 49us 192us 25us
# Rows sent 4.27k 0 325 0.00 0 1.04 0
# Rows exam 30.88M 0 1.28M 26.48 0 3.07k 0
# Time range 2010-07-19 14:35:53 to 2010-07-19 15:12:22
# bytes 350.99M 5 1022.34k 301.01 719.66 5.75k 124.25
# Bytes sen 1.94M 0 9.42k 1.67 0 110.38 0
# Killed 0 0 0 0 0 0 0
# Last errn 34.11M 0 1.55k 29.26 0 185.83 0
# Merge pas 0 0 0 0 0 0 0
# Rows affe 875.19k 0 17.55k 0.73 0.99 25.61 0.99
# Rows read 2.20M 0 14.83k 1.88 1.96 24.68 1.96
# Tmp disk 4.15k 0 1 0.00 0 0.06 0
# Tmp table 14.19k 0 2 0.01 0 0.14 0
# Tmp table 8.30G 0 2.01M 7.12k 0 117.75k 0
# 0% (5k) Filesort
# 0% (5k) Full_join
# 0% (7k) Full_scan
# 0% (10k) Tmp_table
# 0% (4k) Tmp_table_on_disk

There is a lot of interesting information you can find in this header, but in relation to replication capacity you can get the replication load, which is the same as the “concurrency” figure (0.37x). The concurrency as reported by mk-query-digest is the sum of query execution time divided by the time range the log file covers. In this case, as we know there is only one replication thread, it will be the same as the replication load. This gives us a replication capacity of 1/0.37 = 2.70.
This method should work with the original MySQL Server in theory, though I have not tested it. In some versions log_slave_slow_statements was unreliable, and you may also need to adjust the regular expression to match the user name the replication thread uses.
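To make the arithmetic concrete, here is a small Ruby sketch that recomputes the concurrency figure from the "Exec time" total and the "Time range" line of the header above. The variable names are mine, and the timestamps are treated as UTC only because the difference between them is all that matters. With the unrounded load the capacity comes out slightly lower than the 2.70 obtained from the rounded 0.37.

```ruby
# Figures copied from the report header above.
exec_time_seconds = 819.0
range_start = Time.utc(2010, 7, 19, 14, 35, 53)
range_end   = Time.utc(2010, 7, 19, 15, 12, 22)

# Wall-clock span covered by the log, in seconds.
covered_seconds = range_end - range_start

# One replication thread, so concurrency equals replication load.
replication_load = exec_time_seconds / covered_seconds
replication_capacity = 1.0 / replication_load

puts format("load %.2f, capacity %.2f", replication_load, replication_capacity)
```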
3) Processlist polling. This method is simple: the slave thread has a different status in SHOW PROCESSLIST depending on whether it is processing a query or simply waiting. By polling the processlist frequently (for example, 10 times a second) we can compute the approximate percentage of time the thread was busy versus idle. Of course, running the processlist very aggressively can add overhead, especially on a busy system with a lot of connections.

mysql> SHOW processlist;
+--------+-------------+-----------+------+---------+------+-----------------------------------------------------------------------+------------------+
| Id     | User        | Host      | db   | Command | Time | State                                                                 | Info             |
+--------+-------------+-----------+------+---------+------+-----------------------------------------------------------------------+------------------+
| 801812 | system user |           | NULL | Connect | 2665 | Waiting for master to send event                                      | NULL             |
| 801813 | system user |           | NULL | Connect |    0 | Has read all relay log; waiting for the slave I/O thread to update it | NULL             |
| 802354 | root        | localhost | NULL | Query   |    0 | NULL                                                                  | SHOW processlist |
+--------+-------------+-----------+------+---------+------+-----------------------------------------------------------------------+------------------+
3 rows in set (0.00 sec)
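A rough sketch of the bookkeeping side of this method, in Ruby. It assumes the State values of the SQL thread have already been collected by some polling loop (not shown), and it treats any state other than the "Has read all relay log" one from the output above as busy. The helper name and the sample data are hypothetical, and "Updating" is just a stand-in for whatever state the thread reports while applying an event.

```ruby
# State the SQL thread reports when it has nothing left to apply
# (taken from the processlist output above).
IDLE_STATE_PREFIX = "Has read all relay log"

# samples: State strings of the SQL thread, one per poll.
# Returns the estimated replication load (fraction of polls where it was busy).
def estimated_load(samples)
  return 0.0 if samples.empty?
  busy = samples.count { |state| !state.start_with?(IDLE_STATE_PREFIX) }
  busy.to_f / samples.size
end

# Made-up samples: 10 polls, the thread was applying events in 4 of them.
idle_state = "Has read all relay log; waiting for the slave I/O thread to update it"
samples = [idle_state] * 6 + ["Updating"] * 4

busy_fraction = estimated_load(samples)
capacity = busy_fraction.zero? ? Float::INFINITY : 1.0 / busy_fraction
puts format("load %.2f, capacity %.2f", busy_fraction, capacity)
```

The estimate is only as good as the number of samples, which is the reason for polling many times per second.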

4) Slave catch-up/binlog application method. We can take a spare server with backups restored on it and apply binary logs to it. If 1 hour’s worth of binary logs applies in 10 minutes, we have a replication capacity of 6. The challenge, of course, is having a spare server around, and it is quite labor intensive. At the same time, it can be a good measurement to take during backup recovery trials, when you’re doing this activity anyway. This way you can also measure “cold” vs “hot” replication capacity, as well as how long replication warmup takes. It is very typical for servers with a cold cache to perform a lot slower than when they are warmed up. Measuring times for each binary log separately should give you these numbers.
A less intrusive process, which can be done in production (especially if you have a slave which is used for backups, reporting, etc.), is to stop replication for some time and then see how long it takes to catch up. If you paused replication for 10 minutes and it took 5 minutes to catch up, your replication capacity is 3 (not 2), because you had to process not only the events for the outstanding 10 minutes but also those for the 5 minutes it took to catch up. The formula is (Time_Replication_Paused+Time_Took_To_Catchup)/Time_Took_To_Catchup.
So how much replication capacity do you need in a healthy system? It depends on many things, including how fast you need to be able to recover from backups and how much your load varies. A lot of systems have special requirements on the time it takes to warm up too (there are different things you can do about that as well). First, I would measure replication capacity at 5-minute intervals (or something similar), because it tends to vary a lot. Then I would suggest ensuring the loaded replication capacity is at least 3 during peak load and 5 during normal load. This applies to normal operational load: if you push heavy ALTER TABLE statements through replication, they will surely bring your replication capacity down for their duration.
One more thing about these methods: methods 1, 2 and 3 work well only if replication capacity is above 1, that is, the system is caught up. If it is less than 1, meaning the master writes more binary logs than the slave can process, they will show a number close to 1. Method 4, however, will work even if replication can’t ever catch up: if 1 hour’s worth of binary logs takes 2 hours to apply, your replication capacity is 0.5.
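The two formulas of method 4 are easy to get wrong by one term, so here they are as a short Ruby sketch. The function names are hypothetical; the inputs are the figures used in the text.

```ruby
# Binlog application: apply a known span of binary logs and time it.
def capacity_from_binlog_apply(binlog_span_minutes, apply_minutes)
  binlog_span_minutes.to_f / apply_minutes
end

# Catch-up: pause replication, then time how long the slave needs to catch up.
# During the catch-up the slave also has to process the events that keep arriving,
# which is why the catch-up time appears in the numerator as well.
def capacity_from_catchup(paused_minutes, catchup_minutes)
  (paused_minutes + catchup_minutes).to_f / catchup_minutes
end

puts capacity_from_binlog_apply(60, 10)   # 1 hour of binlogs applied in 10 minutes
puts capacity_from_binlog_apply(60, 120)  # 1 hour of binlogs takes 2 hours to apply
puts capacity_from_catchup(10, 5)         # paused for 10 minutes, caught up in 5
```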

Entry posted by peter

Database Architectures & Performance

Jul 20, 2010

For decades the debate between shared-disk and shared-nothing databases has raged. The shared-disk camp points to the laundry list of functional benefits such as improved data consistency, high-availability, scalability and elimination of partitioning/replication/promotion. The shared-nothing camp shoots back with superior performance and reduced costs. Both sides have a point.

First, let’s look at the performance issue. RAM (average access time of 200 nanoseconds) is considerably faster than disk (average access time of 12,000,000 nanoseconds). Let me put this 200:12,000,000 ratio into perspective. A task that takes a single minute in RAM would take 41 days on disk. So why do I bring this up?

Shared-Nothing: Since the shared-nothing database has sole ownership of its data – it doesn’t share the data with other nodes – it can operate in the machine’s local RAM, only writing infrequently to disk (flushing the data to disk). This makes shared-nothing databases very fast.

Shared-Disk: Cannot rely on the machine’s local RAM, because every write by one node must be instantly available to the other nodes, to ensure that they don’t use stale data and corrupt the database. So instead of relying on local RAM, all write transactions must be written to disk. This is where the 1 minute to 41 days ratio above comes into play and kills performance of shared-disk databases.

Let’s look at some of the ways databases can utilize RAM instead of disk to improve performance:

Read Cache: Databases typically use the RAM as a fast read cache. Upon reading data from the disk, this data is stored in the read cache so that subsequent use of that data is satisfied from RAM instead of the disk. For example, upon reading a person’s name from disk, that name is stored in the cache for fast access. The database wouldn’t need to read that name from disk again until that person’s name is changed (rare), or that RAM space is reused for a piece of data that is used more frequently. Read cache can significantly improve database performance. BOTH shared-disk and shared-nothing databases can exploit read cache. The shared-disk database just needs a system to either invalidate or update the data in read cache when one of the nodes has made a change. This is pretty standard in shared-disk databases.

Background Writing: Writing data to the disk is by far the most time consuming process in a write transaction. During the transaction, that portion of the data is locked, meaning it is unavailable for other functions. So, if you can move the writing of the data outside of the transaction – write the data in the background – you get faster transactions, which means less locking contention, which means faster throughput. SHARED-NOTHING can exploit this performance enhancement, since each server owns the data in its RAM. However, shared-disk databases cannot do this because they need to share that updated data with the other database nodes in the cluster. Since the local node’s cache is not shared, in a shared-disk database, the only option is to use the shared disk to share that data across the nodes.

Transactional Cache: The next step in utilizing RAM instead of disk is to use it in a transactional manner. This means that the database can make multiple changes to data in RAM prior to writing the final results to disk. For example, if you have 100 widgets, you can store that inventory count in RAM, and then decrement it with each sale. If you sell 23 widgets, then instead of writing each transaction to disk, you update it in RAM. When you flush this data to disk, it results in a single disk write, writing the inventory number 77, instead of writing each of the 23 transactions individually to disk. SHARED-NOTHING can perform transactions on data while it is in RAM. Once again, shared-disk databases cannot do this because you might have multiple nodes updating the inventory. Since they cannot look into each other’s local RAM, they must once again write each transaction to disk.

As you can see, shared-nothing databases have an inherent performance advantage. The next blog post will address how modern shared-disk databases address these performance challenges.
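The widget example can be made concrete with a toy Ruby sketch. Everything here is hypothetical: `FakeDisk` only counts writes and stands in for a real storage layer, and the two `sell_*` functions are simplified models of the two behaviours described above, not how any particular database implements them. The last lines redo the 200:12,000,000 arithmetic.

```ruby
# Counts simulated disk writes; a stand-in for a real storage layer.
class FakeDisk
  attr_reader :writes, :value

  def initialize(value)
    @value = value
    @writes = 0
  end

  def write(value)
    @value = value
    @writes += 1
  end
end

# Every sale is written to disk, as described for shared-disk nodes.
def sell_write_through(disk, sales)
  sales.times { disk.write(disk.value - 1) }
end

# Transactional cache: update the count in RAM, flush once at the end.
def sell_with_cache(disk, sales)
  cached = disk.value
  sales.times { cached -= 1 }
  disk.write(cached)
end

shared_disk = FakeDisk.new(100)
sell_write_through(shared_disk, 23)

shared_nothing = FakeDisk.new(100)
sell_with_cache(shared_nothing, 23)

puts "write-through: #{shared_disk.value} widgets, #{shared_disk.writes} disk writes"
puts "cached:        #{shared_nothing.value} widgets, #{shared_nothing.writes} disk write"

# The RAM vs disk ratio from the text: one minute in RAM, scaled to disk, in days.
ratio = 12_000_000 / 200
days_on_disk = ratio / 60.0 / 24.0
puts "1 minute in RAM is roughly #{days_on_disk.round(1)} days on disk"
```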

Tomasz Wegrzanowski: We need syntax for talking about Ruby types

Jul 20, 2010

Koteczek by kemcio from flickr (CC-NC)

All this is about discussing types in blog posts, documentation etc. None of that goes anywhere near actual code (except possibly in comments). Ruby never sees that.

Statically typed languages have all this covered, and we need it too. Not static typing of course – just an expressive way to talk about what types things are – as plain English fails here very quickly. As far as I know nothing like that exists yet, so here’s my proposal.

This system of type descriptions is meant for humans, not machines. It focuses on the most important distinctions, and ignores details that are not important, or very difficult to keep track of. Type descriptions should only be as specific as necessary in given context. If it makes sense, these rules should be violated.

In advance I’ll say I totally ignored all the covariance / contravariance / invariance business – it’s far too complicated, and getting too deeply into such issues makes little sense in a language where everything can be redefined.

Basic types

Types of simple values can be described by their class name, or any of its superclasses or mixins. So some ways to describe type of 15 would be Fixnum (actual class), Integer (superclass), Comparable (mixin), or Object (superclass all the way up).

In context of describing types, everything is considered an Object, and existence of Kernel, BasicObject etc. is ignored.

So far, it should all be rather obvious. Examples:

  • 42 – Integer
  • Time.now – Time
  • Dir.glob("*") – Enumerable
  • STDIN – IO
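These descriptions can be checked against a running interpreter with `is_a?`, which accepts the actual class, any superclass, or any mixin. A minimal sketch (note that on current Ruby versions the actual class of 42 is Integer, since Fixnum was later folded into it):

```ruby
# Each value can be described by its class, a superclass, or a mixin.
checks = {
  "42 is an Integer"            => 42.is_a?(Integer),
  "42 is Comparable"            => 42.is_a?(Comparable),
  "42 is an Object"             => 42.is_a?(Object),
  "Time.now is a Time"          => Time.now.is_a?(Time),
  'Dir.glob("*") is Enumerable' => Dir.glob("*").is_a?(Enumerable),
  "STDIN is an IO"              => STDIN.is_a?(IO)
}

checks.each { |description, result| puts "#{description}: #{result}" }
```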

nil and other ignored issues

nil will be treated specially – as if it was of every possible type. nil means absence of value, and doesn’t indicate what type the value would have if it was present. This is messy, but most explicitly typed languages follow this path.

Distinction between situations that allow nils and those that don’t will be treated as all other value range restrictions (Integer must be positive, IO must be open for writing etc.) – as something outside the type system.

For cases where nil means something magical, and not just absence of value, it should probably be mentioned.

Checked exceptions and related non-local exits in Ruby would be a hopeless thing to even attempt. There’s syntax for exceptions and catches used as control structures if they’re really necessary.

Booleans

We will also pretend that Boolean is a common superclass of TrueClass and FalseClass.

We will also normally ignore distinction between situations where real true/false are expected, and situations where any object goes, but acts identically to its boolean conversion. Any method that acts identically on x and !!x can be said to take Boolean.

On the other hand, if some values are treated differently than their double negation, that’s not really Boolean and it deserves a mention. Especially if nil and false are not equivalent – like in Rails’s #in_groups_of (I don’t think Ruby stdlib ever does things like that).

Duck typing

If something quacks like a Duck convincingly enough, it can be said to be of type Duck, it being object’s responsibility that its cover doesn’t get blown.

In particular, Ruby uses certain methods for automatic type conversion. In many contexts objects implementing #to_str like Pathnames will be treated as Strings, objects implementing #to_ary as Arrays, #to_hash as Hashes, and #to_proc as Procs – this can be used for some amazing things like Symbol#to_proc.

This leads to a big complication for us – C code implementing Ruby interpreter and many libraries is normally written in a way that calls these conversion functions automatically, so in such contexts Symbol really is a Proc, Pathname really is a String and so on. On the other hand, in Ruby code these methods are not magical, and such conversions will only happen if explicitly called – for them Pathname and String are completely unrelated types. Unless Ruby code calls C code, which then autoconverts.

Explicitly differentiating between contexts which expect a genuine String and those which expect either that or something with a valid #to_str method would be highly tedious, and I doubt anyone would get it exactly right.

My recommendation would be to treat everything that autoconverts to something as if it subclassed it. So we’ll pretend Pathname is a subclass of String, even though it’s not really. In some cases this will be wrong, but it’s not really all that different from subclassing something and then introducing incompatible changes.

This all doesn’t extend to #to_s, #to_a etc – nothing can be described as String just because it has to_s method – every object has to_s but most aren’t really strings.
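A short sketch of these autoconversions in action. The `Label` and `Pair` classes are hypothetical and exist only for this illustration; the built-in behaviour being exercised is that String#+ accepts anything with #to_str, Array#+ accepts anything with #to_ary, and & on a Symbol goes through Symbol#to_proc.

```ruby
# Hypothetical class, used only for illustration; it does not subclass String.
class Label
  def initialize(text)
    @text = text
  end

  # Implicit conversion: "treat me as a String".
  def to_str
    @text
  end

  # Explicit conversion: every object has one, it proves nothing.
  def to_s
    "#<Label #{@text}>"
  end
end

# Hypothetical pair that autoconverts to an Array via #to_ary.
class Pair
  def initialize(left, right)
    @left = left
    @right = right
  end

  def to_ary
    [@left, @right]
  end
end

label = Label.new("world")
greeting = "hello " + label            # String#+ calls #to_str
combined = [0] + Pair.new(1, 2)        # Array#+ calls #to_ary
sizes = %w[Hello world !].map(&:size)  # & calls Symbol#to_proc

puts greeting
puts combined.inspect
puts sizes.inspect
puts label.is_a?(String)               # a Ruby-level check still says false

# An object with only #to_s is not accepted where a String is required.
begin
  "hello " + Object.new
rescue TypeError => e
  puts "TypeError: #{e.message}"
end
```

The last `is_a?` line is the complication described above: the built-in methods accept `Label` as a String, while Ruby code asking about its class does not.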

Technical explanation of to_str and friends

This section is unrelated to post’s primary subject – skip if uninterested.

Ruby uses special memory layout for basic types like strings and arrays. Performance would be abysmal if string methods had to actually call Ruby code associated with whatever [] happened to be redefined to for every character – instead they ask for a certain C data structure, and access that directly (via some macros providing extra safety and convenience to be really exact).

By the way, this is a great example of C being really slow – if Ruby was implemented on a platform with really good JIT, it could plausibly have every single string function implemented in terms of calls to [], []=, size, and just a few others, with different subclasses of String providing different implementations, and the JIT compiler inlining all that to make it really fast.

It would make it really simple to create class representing a text file, and =~ /regexp/ that directly without reading anything more than required to memory, or maybe even gsub! it in a way that would read it in small chunks, saving them to another file as soon as they’re ready, and then renaming in one go. All that without regexp library knowing anything about it all. It’s all just my fantasy, I’m not saying any such JIT actually exists.

Anyway, strings and such are implemented specially, but we still want these types to be real objects, not like what they’ve done in Java. To make it work, all C functions requiring access to underlying storage call a special macro which automatically calls a method like to_str or to_ary if necessary – so such objects can pretend to be strings very effectively. For example, if you alias method to_str to path on File, code like system File.open("/bin/hostname") will suddenly start working. It really makes sense only for things which are “essentially strings” like Pathname, URI, Unicode-enhanced strings, proxies for strings in third party libraries like Qt etc.

To complicate things further objects of all classes inheriting from String automatically use String’s data representation – and C code will access that, never calling to_str. This leaves objects which duck type as Strings two choices:

  • Subclass String and every time anything changes update C string data. This can be difficult – if you implement a URI and keep the query part as a hash instance variable – you need to somehow make sure that your update code gets run every time that hash changes – like by not exposing it at all and only allowing query updates via your direct methods, or wrapping it in a special object that calls you back.
  • Don’t subclass String, define to_str the way you want. Everything works – except your class isn’t technically a String so it’s not terribly pretty OO design.

You probably won’t be surprised that not subclassing is the more popular choice. As it’s all due to technical limitations not design choices, it makes sense to treat such objects as if they were properly subclassed.
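The difference between the two choices can be seen in a small experiment. Both classes below are hypothetical; the point is only that for a real String subclass the built-in String#+ reads the underlying string data and never calls the overridden #to_str, while for a non-subclass it has to go through #to_str.

```ruby
# Hypothetical subclass: #to_str is overridden, but built-in string methods
# never call it, because the object already carries String's own data.
class Shouty < String
  def to_str
    upcase
  end
end

# Hypothetical non-subclass with the same #to_str behaviour.
class ShoutyProxy
  def initialize(text)
    @text = text
  end

  def to_str
    @text.upcase
  end
end

from_subclass = "say " + Shouty.new("hi")      # uses the underlying string data
from_proxy    = "say " + ShoutyProxy.new("hi") # goes through #to_str

puts from_subclass
puts from_proxy
```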

Pussy by tripleigrek from flickr (CC-SA)

Collections

Back to the subject. For collections we often want to describe types of their elements. For simple collections yielding successive elements on #each, syntax for type description is CollectionType[MemberType]. Examples:

  • [42.0, 17.5] – Array[Float]
  • Set["foo","bar"] – Set[String]
  • 5..10 – Range[Integer]

When we don’t care about collection type, only about element types, descriptions like Enumerable[ElementType] will do.

Syntax for types of hashtables is Hash[KeyType, ValueType] – in general collections which yield multiple values to #each can be described as CollectionType[Type1, Type2, ..., TypeN].

For example {:foo => "bar"} is of type Hash[Symbol, String].
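The notation is meant for people, but a toy helper shows how mechanically the descriptions follow from the values. This `describe` function is hypothetical and not part of the proposal: it falls back to Object when element classes differ or the collection is empty, and it assumes a Ruby version where small integers report Integer as their class.

```ruby
require "set"

# Name of the class shared by all values, or "Object" if they differ
# (or if there are no values at all).
def common_class(values)
  classes = values.map(&:class).uniq
  classes.size == 1 ? classes.first.name : "Object"
end

# Hypothetical helper: builds CollectionType[MemberType] or Hash[KeyType, ValueType].
def describe(collection)
  if collection.is_a?(Hash)
    "Hash[#{common_class(collection.keys)}, #{common_class(collection.values)}]"
  else
    "#{collection.class.name}[#{common_class(collection.to_a)}]"
  end
end

puts describe([42.0, 17.5])
puts describe(Set["foo", "bar"])
puts describe(5..10)
puts describe({ :foo => "bar" })
puts describe([1, "two"])
```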

This is optional – type descriptions like Hash or Enumerable are perfectly valid – and often types are unrelated, or we don’t care.

Not every Enumerable should be treated as collection of members like that – File might technically be File[String] but it’s usually pointless to describe it this way. In 1.8 String is Enumerable, yielding successive lines when iterated – but String[String] makes no sense (no longer a problem in 1.9).

Classes other than Enumerable like Delegator might need type parameters, and they should be specified with the same syntax. Their order and meaning depends on particular class, but usually should be obvious.

Literals and tuples

Ruby doesn’t make distinction between Arrays and tuples. What I mean here is a kind of Array which shouldn’t really be treated as a collection, and in which different members have unrelated type and meaning depending on their position.

Like method arguments. It really wouldn’t be useful to say that every method takes Array[Object] (and an optional Proc) – types and meanings of elements in this array should be specified.

Syntax I want for this is [Type1, Type2, *TypeRest] – so for example Hash[Date, Integer]’s #select passes [Date, Integer] to the block, which should return a Boolean result, and then returns either Array[[Date, Integer]] (1.8) or Hash[Date, Integer] (1.9). Notice double [[]]s here – it’s an Array of pairs. In many contexts Ruby automatically unpacks such tuples, so Array[[Date,Integer]] can often be treated as Array[Date,Integer] – but it doesn’t go deeper than one level, and if you need this distinction it’s available.
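A quick sketch of the #select example and of the one-level unpacking, with made-up data. On Ruby 1.9 and later the result is a Hash, as stated above.

```ruby
require "date"

# Hash[Date, Integer], with made-up numbers.
visits = { Date.new(2010, 7, 20) => 3, Date.new(2010, 7, 21) => 0 }

# The block receives a [Date, Integer] pair, unpacked into two parameters,
# and returns a Boolean.
busy_days = visits.select { |_date, count| count > 0 }
puts busy_days.class

# Array[[Date, Integer]]: an Array of pairs, unpacked the same way.
pairs = visits.to_a
pairs.each { |date, count| puts "#{date}: #{count}" }

# Unpacking stops after one level: the inner array arrives intact.
nested = [[1, [2, 3]]]
unpacked = nested.map { |first, rest| [first, rest] }
puts unpacked.inspect
```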

Extra arguments can be specified with *Type or ... which is treated here as *Object. If you want to specify some arguments as optional suffix their types with ? (the most obvious [] having too many uses already, and = not really fitting right).

In this syntax [*Foo] is pretty much equivalent to Array[Foo], or possibly Enumerable[Foo] (with some duck typing) – feel free to use that if it makes things clearer.

Basic literals like true, false, nil stand for themselves – and for entire TrueClass, FalseClass, NilClass classes too as they’re their only members. Other literals such as symbols, strings, numbers etc. can be used too when needed.

To describe keyword arguments and hashes used in similar way, syntax is {Key1=>Type1, Key2=>Type2} – specifying exact key, and type of value like {:noop=>Boolean, :force=>Boolean}.

It should be assumed that keys other than those listed are ignored, cause an exception, or are otherwise not supported. If other keys are meaningful, mark that with ... like this: {:query=>String, ...}. Subclasses often add extra keyword arguments, and this issue is ignored.
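Here is a small sketch of a method whose options hash would be described as {:noop=>Boolean, :force=>Boolean}. The method describe_removal and the constant ALLOWED_KEYS are hypothetical names invented for this example, and raising on unknown keys is just one of the behaviours the notation allows.

```ruby
# Hypothetical method with type
#   (String, {:noop=>Boolean, :force=>Boolean}?) -> String
ALLOWED_KEYS = [:noop, :force]

def describe_removal(path, options = {})
  # Keys other than those listed in the type are rejected here.
  unknown = options.keys - ALLOWED_KEYS
  raise ArgumentError, "unsupported options: #{unknown.inspect}" unless unknown.empty?

  action = options[:noop] ? "would remove" : "removing"
  mode = options[:force] ? " (forced)" : ""
  "#{action} #{path}#{mode}"
end

puts describe_removal("a.txt")
puts describe_removal("a.txt", :noop => true, :force => true)
```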

Functions

Everything so far was just a prelude to the most important part of any type system – types for functions. The syntax I’d propose is: ArgumentTypes -> ReturnType (=> being already used by hashes).

I cannot decide if blocks should be specified in Ruby-style notation or a function notation, so both &{|BlockArgumentTypes| BlockReturnType} and &(BlockArgumentTypes->BlockReturnType) are valid. & is necessary, as blocks are passed separately from normal arguments, however strong the temptation to reuse -> and let the context disambiguate might be.

Blocks that don’t take any arguments or don’t return anything can drop that part, leaving only something like &{|X|}, &{Y}, &{}, or in more functional notation &(X->), &(Y), &().

Because of all the [] unpacking, using [] around arguments, tuple return values etc. is optional – and just like in Ruby () can be used instead in such contexts.

If a function doesn’t take any arguments, or returns no values, these parts can be left out – leaving perhaps as little as ->.

Examples:

  • In the context of %w[Hello world !].group_by(&:size), method #group_by has type Array[String]&{|String| Integer}->Hash[Integer,Array[String]]
  • Time.at has type Numeric -> Time
  • String#tr has type [String, String] -> String
  • On a collection of Floats, #find would have type Float?&(Float->Boolean)->Float
  • Function which takes no arguments and returns no values has type []->nil
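The calls behind these signatures can be tried directly. This is only a sketch with made-up sample values, annotated with the types each call involves.

```ruby
# group_by: the block maps String -> Integer, and the result groups
# the original Strings under each Integer key.
by_size = %w[Hello world !].group_by(&:size)
p by_size

# Time.at: Numeric -> Time
epoch = Time.at(0).utc
p epoch.year

# String#tr: [String, String] -> String
p "hello".tr("el", "ip")

# find on a collection of Floats: the block maps Float -> Boolean.
# When nothing matches, Ruby returns nil rather than a Float.
floats = [1.5, 2.5, 3.5]
p floats.find { |x| x > 2.0 }
p floats.find { |x| x > 10.0 }
```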

If you really need to specify exceptions and throws, you can add raises Type or throws :kind after the return value. Use this only for control-structure exceptions, not for actual error exceptions. It might actually be useful if actual data gets passed around.

  • Find.find has type [*String]&(String->nil throws :prune)->nil
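A minimal sketch of what the throws :prune part refers to. The helper visible_paths and the directory name "skip" are assumptions made for this example; Find.find and Find.prune are from Ruby's standard find library.

```ruby
require "find"

# Hypothetical helper: String -> Array[String]
# Collects every path under root, except directories named "skip"
# and anything beneath them.
def visible_paths(root)
  found = []
  Find.find(root) do |path|
    # Find.prune throws :prune, which Find.find catches in order to
    # skip the current path and everything beneath it.
    Find.prune if File.basename(path) == "skip"
    found << path
  end
  found
end
```

Called as visible_paths("."), this returns the paths under the current directory with any "skip" subtree left out.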

A standalone Proc can be described as (ArgumentTypes->ReturnType), just as with the notation for functions. There is no ambiguity between Proc arguments and block arguments, as blocks are always marked with &.

Type variables and everything else

In addition to names of real classes, any name starting with an uppercase letter should be considered a type. Unless it’s specified otherwise in context, all such unknown names should be considered type variables with a big forall quantifier in front of it all.

Examples:

  • Enumerable[A]#partition has type &(A->Boolean)->[Array[A], Array[A]]
  • Hash[A,B]#merge has type Hash[A,B]&(A,B,B->B)->Hash[A,B]
  • Array[A]#inject has either type B&(B,A->B)->B or &(A,A->A)->A. This isn’t just a usual case of missing argument being substituted by nil – these are two completely different functions.
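The two forms of #inject can be seen side by side in this small sketch. The helper names total_length and longest are made up for illustration.

```ruby
# Uses inject as B &(B, A -> B) -> B, with A = String and B = Integer:
# the explicit initial value has a different type from the elements.
def total_length(words)
  words.inject(0) { |sum, word| sum + word.size }
end

# Uses inject as &(A, A -> A) -> A, with A = String:
# the first element seeds the accumulator, so everything stays a String.
# (On an empty Array this form returns nil.)
def longest(words)
  words.inject { |best, word| word.size > best.size ? word : best }
end

p total_length(%w[a bb ccc])
p longest(%w[a ccc bb])
```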

To specify that multiple types are allowed (usually implying that behaviour will be different, otherwise there should be a superclass somewhere, or we could treat it as common duck typing and ignore it) join them with |. If there’s ambiguity between this use and block arguments, parenthesize. It binds more tightly than ,, so it only applies to one argument. Example:

  • String#index in 1.8 has type (String|Integer|Regexp, Integer?)->Integer (and notice how I ignored Fixnums here).

For functions that can be called in multiple unrelated ways, just list them separately – | and parentheses will work, but they are usually top level, and not needed anywhere deeper.

If you want to specify type of self, prefix function specification with Type#:

  • #sort has type like Enumerable[A]#()&(A,A->1|0|-1)->Array[A]

To specify that something takes range of values not really corresponding to a Ruby class, just define such extra names somewhere and then use like this:

  • File#chown has type (UnixUserId, UnixGroupId)->0 – with UnixUserId and UnixGroupId being pretend subclasses of Integer, and 0 being the literal value actually returned.

To specify that something needs particular methods, just make up a pretend mixin like Meowable for #meow.

Any obvious extensions to this notation can be used, like this:

  • Enumerable[A]#zip has type (Enumerable[B_1], *Enumerable[B_i])->Array[A, B_1, *B_i] – with the intention, understood from context, that the B_i will be different for each argument. (I don’t think any static type system handles cases like this one reasonably – most require a separate case for each supported tuple length, and you cannot use arrays if you mix types. Am I missing something?)
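To see the mixed-type tuples that #zip produces, here is a quick sketch with made-up values, where A is Integer, B_1 is String and B_2 is Symbol.

```ruby
numbers = [1, 2]
letters = %w[a b]
symbols = [:x, :y]

# Each resulting tuple has the shape [Integer, String, Symbol].
rows = numbers.zip(letters, symbols)
p rows

# If an argument is shorter than the receiver, the missing slots are nil.
p numbers.zip(["a"])
```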

The End

Well, what I really wanted to do was talk about Ruby’s collection system, and how 1.9 doesn’t go far enough in its attempts at fixing it. And without a notation for types, talking about higher-order functions that operate on collections quickly turns into a horrible mess. So I started with a brief explanation of the notation I wanted to use, and then I figured I might as well do it right and write something that will be reusable in other contexts too.

Most discussion of type systems concerns issues like safety and flexibility, which don’t concern me at all, and limit themselves to type systems usable by machines.

I need types for something else – as statements about data flow. A type signature like Enumerable[A]#()&(A->B)->Hash[A,B] doesn’t tell you exactly what such a function does, but it narrows the set of possibilities extremely quickly. What it describes is a function which iterates over a collection in order while building a Hash to be returned, using the collection’s elements as keys, and values returned by the block as values. Can you guess the function I was thinking about here?

Now a type like that is not a complete specification – a function that returns an empty hash fits it. As does one which skips every 5th element. And one that only keeps entries with unique block results. And for that matter also one that sends your email password to NSA – at least assuming it returns that Hash afterwards.
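To make that concrete, here is a sketch of two different methods that both fit Enumerable[A]#()&(A->B)->Hash[A,B]. Both names are hypothetical and neither is claimed to be the function hinted at above.

```ruby
# The "obvious" reading: elements become keys, block results become values.
def hash_by(collection)
  result = {}
  collection.each { |element| result[element] = yield(element) }
  result
end

# Fits the same signature, but silently skips every 5th element.
def hash_by_skipping_fifth(collection)
  result = {}
  collection.each_with_index do |element, index|
    next if (index + 1) % 5 == 0
    result[element] = yield(element)
  end
  result
end

p hash_by([1, 2, 3]) { |x| x * x }
p hash_by_skipping_fifth(1..6) { |x| x.to_s }
```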

It was still pretty useful. How about some of those?

  • Hash[A,B] -> Hash[B, Array[A]]
  • Hash[A,B] &(A,B->C) -> Hash[A,C]
  • Hash[A, Hash[B,C]] -> Hash[[A,B], C]
  • Hash[A,B] &(A,B->C) -> Hash[C, Hash[A,B]]
  • Enumerable[Hash[A,B]] &(A,B,B->B) -> Hash[A,B]
  • Hash[A,Set[B]] -> Hash[Set[A], Set[B]]

Even these short snippets should give a pretty good idea what these are all about.
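For instance, two of the signatures above admit fairly natural implementations. This is only one possible reading of each; the method names are invented for this sketch.

```ruby
# One reading of Hash[A,B] -> Hash[B, Array[A]]:
# invert the hash, collecting all keys that share a value.
def keys_by_value(hash)
  result = {}
  hash.each { |key, value| (result[value] ||= []) << key }
  result
end

# One reading of Hash[A, Hash[B,C]] -> Hash[[A,B], C]:
# flatten a nested hash into a hash keyed by [outer, inner] pairs.
def flatten_keys(nested)
  result = {}
  nested.each do |outer, inner|
    inner.each { |inner_key, value| result[[outer, inner_key]] = value }
  end
  result
end

p keys_by_value({ :a => 1, :b => 2, :c => 1 })
p flatten_keys({ :x => { 1 => "a", 2 => "b" }, :y => { 1 => "c" } })
```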

That’s it for now. Hopefully it won’t be long until that promised 1.9 collections post.

linux.conf.au 2011 CFP Open!

Jul 15, 2010

Head on over to http://lca2011.linux.org.au/ and check it out!
You’ve got until August 7th to put in a paper, miniconf, poster or tutorial.
Things I’d like to see come from my kinda world:

topics on running large numbers of machines
latest in large scale web infrastructure
latest going on in the IO space: (SSD, filesystems, SSD as L2 cache)
Applications of above technologies and what it means for application performance
Scalable and massive tcp daemons (i.e. Eric should come talk on scalestack)
exploration of pain points in current technologies and discussion on ways to fix them (from people really in the know)
A Hydra tutorial: starting with stock Ubuntu lucid, and exiting the tutorial with some analysis running on my project.
Something that completely takes me off guard and is awesome.

I’d love to see people from the MySQL, Drizzle and Rackspace worlds have a decent presence. For those who’ve never heard of/been to an LCA before: we reject at least another whole conference worth of papers. It’s the conference on the calendar that everything else moves around.

OpenSQL Camp Europe: Time to cast your votes!

Jul 14, 2010


If you wonder why there hasn’t been an update from me for quite a while: I just returned from two months of paternity leave, in which I actually managed to stay away from the PC most of the time. In the meantime, I’ve officially become an Oracle employee and there are a lot of administrative things to take care of… But it feels good to be back!

During my absence, Giuseppe and Felix kicked off the Call for Papers for this year’s European OpenSQL Camp, which will again take place in parallel to FrOSCon in St. Augustin (Germany) on August 21st/22nd. We’ve received a number of great submissions, now we would like to ask our community about your favourites!

Basically it’s “one vote per person per session” and you can cast your votes in two ways, either by twittering @opensqlcamp or via the opensqlcamp mailing list. The procedure is outlined in more detail on this wiki page.

As we need to finalize the schedule and inform the speakers, the voting period will close this coming Sunday, 18th of July. So don’t hesitate, cast your votes now! Based on your feedback we will compile the session schedule for this year’s camp. Thanks for your help!