
by cccybernetic 968 days ago

I'm glad there's growing competition, but GPT-4 still outperforms it. Here's a recent question I asked:

Could you please provide me with a query for my PostgreSQL table? It consists of three columns: 'id,' 'teaser,' and 'full_text,' all of which are of type 'text.' Most 'full_text' entries are articles around 1000 words in length. I'd like to update the 'teaser' column for each row with the first approximately 200 words from the 'full_text' column. How can I achieve this?

Phind:

  CREATE OR REPLACE FUNCTION get_first_n_words(text, integer)
  RETURNS text AS $$
  DECLARE
     word_count integer := 0;
     word_start integer := 1;
     word_end integer;
  BEGIN
     WHILE word_count < $2 AND word_start <= length($1) LOOP
         word_end := position(' ' in $1 from word_start) + 1;
         IF word_end = 0 THEN
             word_end := length($1) + 1;
         END IF;
         word_count := word_count + 1;
         word_start := word_end;
     END LOOP;
     RETURN substring($1 from 1 for word_start - 1);
  END;
  $$ LANGUAGE plpgsql;

GPT4:

  UPDATE your_table_name
  SET teaser = (
      SELECT STRING_AGG(word, ' ') 
      FROM (
          SELECT SPLIT_PART(full_text, ' ', i) AS word
          FROM generate_series(1, 200) AS i
      ) AS words
      WHERE word <> ''
  )
  WHERE full_text IS NOT NULL;
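
Either answer can be sanity-checked against a small reference outside the database. The following is a minimal sketch in Python, assuming a "word" is any whitespace-separated token; the function names and sample strings are hypothetical and come from neither model. The second function roughly mirrors the idea of taking the first N single-space-delimited fields and then dropping empty ones, which can return fewer than N words when the text contains consecutive spaces.

```python
def first_n_words(text: str, n: int = 200) -> str:
    """Return the first n whitespace-separated words, joined by single spaces."""
    words = text.split()  # split() with no argument collapses runs of whitespace
    return " ".join(words[:n])


def first_n_fields_nonempty(text: str, n: int = 200) -> str:
    """Take the first n single-space-delimited fields, then drop the empty ones."""
    fields = text.split(" ")[:n]  # empty fields still use up positions 1..n
    return " ".join(f for f in fields if f != "")


# Hypothetical sample row: a 1000-word "article" made of numbered tokens.
article = " ".join(f"w{i}" for i in range(1, 1001))
teaser = first_n_words(article, 200)
print(len(teaser.split()))  # 200

# A double space makes the two approaches disagree.
sample = "a  b c"
print(first_n_words(sample, 2))            # a b
print(first_n_fields_nonempty(sample, 2))  # a
```

Comparing a few rows of the database output against `first_n_words` is a quick way to see whether a query treats repeated whitespace the way you expect.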

4 comments

rushingcreek 968 days ago

Running with "Ignore Web Context" enabled can improve performance for design tasks like this. I just got a more plausible answer: https://www.phind.com/search?cache=f0fkv5mxscwvagxgkuwnwgtl. Consistency is something we're working on.

cccybernetic 968 days ago

Thanks for sharing, you're right - that does improve performance!

ta8645 968 days ago

How do you enable "Ignore Web Context"? I don't see that option anywhere on the page you linked, am I just being blind?

rushingcreek 968 days ago

It's in the model dropdown under the search bar.

raducu 968 days ago

You mean "Ignore Search Results"?

riku_iki 968 days ago

One example is not enough to draw performance conclusions.

cccybernetic 968 days ago

Obviously not. Perfectly reasonable to share anecdotes though.

Also, I ran a few different tests, and every GPT-4 response was superior, but I didn't want to clutter my comment with queries and code.

Wytwwww 968 days ago

There is a performance conclusion in the title though.

riku_iki 968 days ago

That conclusion is based on a benchmark with many examples across different tasks.

emptysongglass 968 days ago

That conclusion is based on their benchmarks. I'm not interested in those. I'm interested in community benchmarks, like those we're seeing in the comments. Lo and behold, GPT-4 is still king. The claims of any company should be taken with exactly a pinch of salt.

riku_iki 968 days ago

That benchmark (HumanEval) is a public benchmark built by others.

PoignardAzur 967 days ago

That kind of benchmark is a lot more reliable for models published before the benchmarks; models published afterwards have more opportunity to "study to the test". That's especially a concern when a company explicitly uses its score on that benchmark as a marketing point.

spmurrayzzz 968 days ago

AFAIK they haven't released the dataset they fine-tuned on, so we can't be 100% sure there wasn't benchmark contamination. Agree that we definitely need more than N=1 to challenge the performance claims, but I still think it's valid to call it out given how much benchmark gaming we've seen in this space.

riku_iki 968 days ago

I think you can raise a contamination claim against every public benchmark result nowadays: models are trained on TBs of data crawled from the internet, and there is no guarantee the benchmark has not leaked in some way.

spmurrayzzz 968 days ago

With respect to the pretraining data, it's true that we're probably SOL there in terms of verification. But for fine-tuning, they could still publish the dataset and see if others can reproduce their results as well as audit for contamination.

If we're comparing benchmark deltas between different fine-tuned variants that share the same base models, that seems like the bare minimum we should expect to come along with performance claims.
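
For illustration, one common way such an audit is done is to look for n-gram overlap between the fine-tuning data and the benchmark items. The following is a minimal sketch in Python, not any vendor's actual procedure; the function names, the n-gram length, and the sample strings are all hypothetical.

```python
def ngrams(text: str, n: int) -> set:
    """Return the set of n-token windows in text (whitespace tokenization)."""
    tokens = text.split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def flag_contaminated(train_docs, benchmark_items, n=8):
    """Return benchmark items sharing at least one n-gram with the training docs."""
    train = set()
    for doc in train_docs:
        train |= ngrams(doc, n)
    return [item for item in benchmark_items if ngrams(item, n) & train]


# Made-up data: the first benchmark item overlaps the training document.
train_docs = ["blog post: write a function that returns the sum of two numbers quickly"]
benchmark_items = [
    "write a function that returns the sum of two numbers a and b",
    "reverse a linked list in place without extra memory",
]
flagged = flag_contaminated(train_docs, benchmark_items, n=4)
print(flagged)
```

A check like this only catches near-verbatim leakage; paraphrased or translated benchmark items would pass it, which is part of why publishing the dataset matters.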

Wytwwww 968 days ago

From what I understand it's a single test suite? Of course I don't really mind the clickbait title that much, it's hard to attract attention otherwise.

riku_iki 968 days ago

I think it is a valid criticism that the HumanEval benchmark is not completely representative; they also say so in the post.

amelius 968 days ago

Depends on the claims made.

gardenhedge 967 days ago

With some simple clarification I got this

UPDATE your_table SET teaser = substring(full_text from '(\S+\s*){1,200}')
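
One caveat worth checking before running the one-liner: per the PostgreSQL documentation, when the pattern passed to substring(string from pattern) contains parentheses, the function returns the text that matched the first parenthesized subexpression rather than the whole match. With a repeated capturing group, that may be only the last word matched instead of the full teaser; a non-capturing group, `(?:\S+\s*){1,200}`, is the usual way around it. The Python sketch below illustrates the same distinction with the `re` module. It is an analogy only (Python's regex engine is not PostgreSQL's), and the sample text is made up.

```python
import re

text = "one two three four five"
n = 3  # stand-in for 200, to keep the output readable

# Repeated capturing group: group(1) holds only the LAST repetition.
capturing = re.match(r"(\S+\s*){1,%d}" % n, text)
print(repr(capturing.group(1)))      # 'three '

# Non-capturing group: the whole match covers the first n words.
non_capturing = re.match(r"(?:\S+\s*){1,%d}" % n, text)
print(repr(non_capturing.group(0)))  # 'one two three '
```

Testing the query on a single row with a SELECT before running the UPDATE would show which of the two behaviours you actually get.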

nofunsir 968 days ago

I really dislike article teasers and "read more" buttons. Now I know it's intentional clipping of the corresponding articles.